Web Crawler Basics: Study Notes

Author: Mered1th | Published 2018-01-13 22:07


1. Simple Crawler Architecture

Figure: simple crawler architecture

Figure: run flow

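URL manager: maintains two collections, the URLs waiting to be crawled and the URLs already crawled. It supports these operations:

Add a new URL to the to-crawl set
Check whether a URL to be added is already in the container
Check whether any URLs remain to be crawled
Get a URL to crawl
Move a URL from the to-crawl set to the crawled set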
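
The five operations above can be sketched as a small class. This is only a minimal sketch, assuming the two collections are plain Python sets; the class name UrlManager and its method names are made up for illustration and do not come from any library.

```python
class UrlManager(object):
    """In-memory URL manager backed by two sets (illustrative names)."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Skip empty URLs and URLs that are already in either set.
        if not url or url in self.new_urls or url in self.old_urls:
            return False
        self.new_urls.add(url)
        return True

    def add_new_urls(self, urls):
        # Add several URLs at once; duplicates are silently skipped.
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Take one pending URL and move it to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because both collections are sets, the "is it already in the container" check is a constant-time membership test, and a URL can never be queued twice.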

2. URL Manager

Implementation options:

In memory
    Python in-memory sets
        URLs to crawl: set()
        Crawled URLs: set()
Relational database
    MySQL
        urls ( url, is_crawled )
Cache database
    redis
        URLs to crawl: set
        Crawled URLs: set
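
For the relational option, here is a rough sketch of the urls ( url, is_crawled ) table in use. It uses Python's built-in sqlite3 as a stand-in for MySQL, an assumption made only so the example runs without a database server. The class and method names are illustrative, and INSERT OR IGNORE is SQLite syntax.

```python
import sqlite3


class SqlUrlManager(object):
    """URL manager backed by a urls(url, is_crawled) table (illustrative)."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, is_crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def add_new_url(self, url):
        # The primary key plus INSERT OR IGNORE skips URLs already stored.
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.conn.commit()

    def has_new_url(self):
        row = self.conn.execute(
            "SELECT 1 FROM urls WHERE is_crawled = 0 LIMIT 1").fetchone()
        return row is not None

    def get_new_url(self):
        # Pick one pending URL and flag it as crawled.
        row = self.conn.execute(
            "SELECT url FROM urls WHERE is_crawled = 0 LIMIT 1").fetchone()
        if row is None:
            return None
        self.conn.execute(
            "UPDATE urls SET is_crawled = 1 WHERE url = ?", (row[0],))
        self.conn.commit()
        return row[0]
```

A single is_crawled flag replaces the two separate sets: "moving" a URL from to-crawl to crawled becomes an UPDATE on one row.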

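3. Web Page Downloader (urllib2)

Concept: a tool that downloads the web page at a given URL from the internet to the local machine.

Figure: web page downloader

Web page downloaders in Python:

urllib2: official base module in Python 2 (split into urllib.request and urllib.error in Python 3)
requests: third-party package, more powerful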

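urllib2 download, method 1: request the URL directly

import urllib2

# Request the URL directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means success
print response.getcode()

# Read the content
cont = response.read()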

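urllib2 download, method 2: add data and an HTTP header

import urllib
import urllib2

url = 'http://www.baidu.com'

# Create a Request object
request = urllib2.Request(url)
# Add form data; add_data() takes one url-encoded string and turns the request into a POST
request.add_data(urllib.urlencode({'a': '1'}))
# Add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')
# Send the request and get the response
response = urllib2.urlopen(request)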
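
The notes use Python 2's urllib2. For readers on Python 3, the same request can be sketched with urllib.request. The helper name build_request is an assumption made for this example, and the code only builds the request; it does not send it.

```python
import urllib.parse
import urllib.request


def build_request(url, params, user_agent="Mozilla/5.0"):
    """Build a request carrying form data and a User-Agent header."""
    # The request body must be bytes, so url-encode the dict and encode it.
    data = urllib.parse.urlencode(params).encode("ascii")
    request = urllib.request.Request(url, data=data)
    request.add_header("User-Agent", user_agent)
    return request


request = build_request("http://www.baidu.com", {"a": "1"})
print(request.get_method())  # a request that carries data is sent as POST
print(request.data)
```

Passing the result to urllib.request.urlopen(request) would send it, which needs network access.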

urllib2 download, method 3: add a handler for special scenarios (here, cookies)

# -*- coding: cp936 -*-
import urllib2, cookielib

# Create a cookie container
cj = cookielib.CookieJar()

# Create an opener that handles cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Install the opener into urllib2
urllib2.install_opener(opener)

# Access the page with the cookie-enabled urllib2
response = urllib2.urlopen("http://www.baidu.com/")

urllib2 example code

#coding:utf8
import urllib2, cookielib

url = "http://www.baidu.com/"

print "Method 1"
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print "Method 2"
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print "Method 3"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
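
As a rough Python 3 counterpart to method 3, the sketch below builds a cookie-aware opener with http.cookiejar and urllib.request. The function names build_cookie_opener and download are made up for this example. Unlike the code above, it keeps the opener local instead of installing it globally.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, cookie_jar); the opener stores cookies in the jar."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    return opener, cookie_jar


def download(url, opener, timeout=10):
    """Fetch url through the opener; return (status_code, body_bytes)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(request, timeout=timeout) as response:
        return response.getcode(), response.read()


opener, cookie_jar = build_cookie_opener()
```

Calling download("http://www.baidu.com/", opener) needs network access; afterwards, iterating over cookie_jar lists whatever cookies the server set.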

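4. Web Page Parser (BeautifulSoup)

Web page parser: a tool that extracts valuable data from a web page.

Figure: web page parser

Four kinds of web page parsers in Python:

1. Regular expressions: fuzzy matching on the page as a string
2. html.parser: structured parsing
3. BeautifulSoup: structured parsing
4. lxml: structured parsing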
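
To make the difference between string matching and structured parsing concrete, here is a small sketch that pulls the same links out of a snippet both ways, using only the standard library (Python 3). The snippet and the class name LinkCollector are invented for the example. Note that html.parser reports tags as events rather than building a full tree.

```python
import re
from html.parser import HTMLParser

HTML = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister">Elsie</a>, '
    '<a class="sister" href="http://example.com/lacie">Lacie</a>'
    '</p>'
)

# 1) Regular expression: treats the page as a plain string.
regex_links = re.findall(r'href="([^"]+)"', HTML)


# 2) html.parser: reacts to each start tag and reads its attributes.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only <a> tags count; attrs is a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed(HTML)
collector.close()

print(regex_links)
print(collector.links)
```

The regex would also match an href on any other tag, or one inside a comment, while the parser version only looks at real a tags. That is the practical meaning of "fuzzy" versus "structured".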

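Structured parsing: the whole page is loaded into a DOM (Document Object Model) tree, which is then traversed node by node.

Figure: DOM tree

Install BeautifulSoup: pip install beautifulsoup4

Beautiful Soup Documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/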

BeautifulSoup example code

# -*- coding: cp936 -*-
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'Get all links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'Get the link to lacie'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'Regex match'
link_node = soup.find('a', href=re.compile(r"ill"))
print link_node.name, link_node['href'], link_node.get_text()

print 'Get the p node with class "title"'
p_node = soup.find('p', class_="title")
print p_node.name, p_node.get_text()

5. Example Crawler

Crawler workflow:

Decide on the target
Analyze the target: URL format, data format, page encoding
Write the code
Run the crawler
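
The notes stop before the actual crawler, so the following is only a sketch of how the pieces could fit together (Python 3), not the author's code. The downloader is passed in as a function so the loop can be tried without network access. FAKE_SITE, parse_links and crawl are hypothetical names, and links are extracted with a simple regex instead of BeautifulSoup to keep the example dependency-free.

```python
import re
from urllib.parse import urljoin


def parse_links(page_url, html):
    """Return absolute URLs of every href found in html (regex-based)."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return {urljoin(page_url, href) for href in hrefs}


def crawl(root_url, download, max_pages=10):
    """Crawl from root_url; download(url) returns HTML text or None."""
    new_urls, old_urls = {root_url}, set()   # to-crawl set and crawled set
    visited = []
    while new_urls and len(visited) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        html = download(url)
        if html is None:          # download failed, skip this page
            continue
        visited.append(url)
        # Queue only links that have not been crawled yet.
        new_urls |= parse_links(url, html) - old_urls
    return visited


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

pages = crawl("http://example.com/", FAKE_SITE.get)
print(sorted(pages))
```

To crawl a real site, replace FAKE_SITE.get with a function that downloads the page and decodes it using the page encoding found during target analysis.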


Article link: https://www.haomeiwen.com/subject/yochoxtx.html
