Python 3 Web Crawlers

Author: 碎念枫子 | Published: 2019-12-13 18:46


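If you want a crawler framework for Python 3, you can use Scrapy; see the Scrapy tutorial for documentation.
This article focuses on two libraries: urllib and requests.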

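I. requests

requests is a third-party library, so you have to install it yourself (for example with pip install requests). Compared with urllib, its API is more concise and easier to use.

1 GET/POST methods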

import requests
# GET without parameters
r = requests.get('https://httpbin.org/get')
# GET with parameters, passed via params
data = {'key': 'value'}
r = requests.get('https://httpbin.org/get', params=data)
# POST with form parameters, passed via data
r = requests.post('https://httpbin.org/post', data={'key': 'value'})

2 Custom request headers

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
}
r = requests.get('https://httpbin.org/get', headers=headers)

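3 Response content

r.content: the response body as raw bytes, typically used for binary files
r.text: r.content decoded to a string, with the encoding chosen automatically
r.json(): parses a JSON response body; raises an exception if the body cannot be parsed

Request parameters (get/post/delete/put/patch/head/options all map to the requests function of the same name and accept these parameters)

url: the URL to request
params: query-string parameters (GET parameters)
data: request body parameters (POST form data)
json: also a request body parameter, sent as JSON; the server must accept JSON data
headers: dictionary of request headers
cookies: cookies to send (dict or CookieJar)
files: files to upload
auth: HTTP authentication credentials
timeout: how long to wait for a response, in seconds
allow_redirects: whether to follow redirects
proxies: proxy settings
verify: whether to verify the server certificate
stream: if False (the default), the whole response body is downloaded immediately
cert: path to the client certificate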

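4 Session

A Session persists parameters and cookies across requests.
This is especially useful for pages that require a login: with a session you do not have to log in again for every request.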

s = requests.Session()
s.cookies = requests.cookies.cookiejar_from_dict({'key': 'value'})

r = s.get('https://httpbin.org/cookies')
print(r.text)
# ===== output =====
{
  "cookies": {
    "key": "value"
  }
}

A Session can also supply default values, which individual requests may override

s = requests.Session()
s.headers.update({'h1':'val1', 'h2':'val2'})
r = s.get('https://httpbin.org/headers', headers={'h2': 'val2_modify'})
print(r.text)
# ===== output (excerpt) =====
"H1": "val1", 
"H2": "val2_modify",

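5 The response object

cookies: returns a CookieJar object
encoding: encoding of the response body
headers: response headers
history: history of redirects
status_code: response status code, such as 200
elapsed: time between sending the request and receiving the response
text: the decoded response body
content: the response body as bytes, decompressed from raw where the server compressed it
json(): parses a JSON response
iter_content(): requires stream=True; takes a chunk_size
iter_lines(): requires stream=True; yields one line at a time
raise_for_status(): raises an exception for 4xx and 5xx status codes
close(): closes the response and releases the connection

6 Request objects

Normally a request is assembled in one go, with headers, parameters, cookies, authentication and so on. When parts of it depend on runtime conditions, you can instead build the request step by step.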

s = requests.Session()
req = requests.Request('GET', url='https://httpbin.org/get')

prep = s.prepare_request(req)
headers = {
    'User-Agent': 'Chrome/67.0.3396.62'
}
prep.prepare(
    method='POST',
    url='https://httpbin.org/post',
    headers=headers,
    data={'key': 'value'}
)

r = s.send(prep)
print(r.text)
# ======== output ========
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key": "value"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "9", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Chrome/67.0.3396.62"
  }, 
  "json": null, 
  "origin": "xx.xx.xx.xx", 
  "url": "https://httpbin.org/post"
}

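II. urllib

Python 2.x shipped two libraries, urllib and urllib2. They are very similar but differ in a few ways.
In Python 3.x the HTTP-related packages were consolidated into two packages, http and urllib; in other words, urllib and urllib2 were merged into a single urllib package.
The similarities and differences between the two Python 2 modules are:
1 Both can open a URL with urlopen
2 urllib2.urlopen accepts either a Request object or a URL (a Request object lets you set the headers for the URL), whereas urllib.urlopen accepts only a URL
3 urllib has urlencode and urllib2 does not, which is why the two were so often used together

Basics

urllib is Python's built-in HTTP request library and needs no installation. It contains four modules:
request: the basic HTTP request module, used to send requests
error: the exception module; errors raised during a request can be caught through it
parse: a utility module with many URL-handling functions, such as splitting, parsing, and joining
robotparser: mainly used to read a site's robots.txt file and decide which pages may be crawled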

urllib.request.urlopen(url,data=None,[timeout,],cafile=None,capath=None,cadefault=False,context=None)

The HTTPResponse object

Methods: read(), readinto(), getheader(name), getheaders(), fileno()
Attributes: msg, version, status, reason, debuglevel, closed

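import urllib.request
response = urllib.request.urlopen('https://www.python.org')
# Value of the Server response header
print(response.getheader('server'))
# All response headers as a list of (name, value) tuples
print(response.getheaders())
# File descriptor of the underlying socket
print(response.fileno())
# HTTP protocol version (11 means HTTP/1.1)
print(response.version)
# Status code: 200 on success, 404 means the page was not found
print(response.status)
# Debug level
print(response.debuglevel)
# Whether the response object is closed
print(response.closed)
# URL that was actually retrieved
print(response.geturl())
# Response headers as a message object
print(response.info())
# HTTP status code of the response
print(response.getcode())
# "OK" when the request succeeded
print(response.msg)
# Reason phrase returned by the server
print(response.reason)
# Page content; read it last, because the response releases its socket once the body is fully read
print(response.read().decode('utf-8'))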

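Parameters accepted by urlopen():

url: the address to open, as a str; it can also be a Request object
data: optional; it must be a byte stream, i.e. of type bytes. If data is passed, urlopen sends the request with POST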

from urllib.request import urlopen
import urllib.parse

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8') 
# data must be bytes: urlencode() from urllib.parse turns the parameter dict into a string, and bytes() encodes it with the given encoding
response = urlopen('http://httpbin.org/post',data=data)
print(response.read())
#=====output=====
b'{
"args":{},
"data":"",
"files":{},
"form":{"word":"hello"},  # the form field shows the data was submitted as a form, via POST
"headers":{"Accept-Encoding":"identity",
    "Connection":"close",
    "Content-Length":"10",
    "Content-Type":"application/x-www-form-urlencoded",
    "Host":"httpbin.org",
    "User-Agent":"Python-urllib/3.5"},
"json":null,
"origin":"114.245.157.49",
"url":"http://httpbin.org/post"}\n'
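
As a small offline check of the rule that passing data switches the request to POST, the sketch below builds two Request objects and asks each which method it would use. Nothing is sent over the network; the httpbin.org URLs are only placeholders taken from the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode() returns a str; Request and urlopen need bytes, so encode it
payload = urlencode({'word': 'hello'}).encode('utf-8')

get_req = Request('http://httpbin.org/get')
post_req = Request('http://httpbin.org/post', data=payload)

print(get_req.get_method())   # GET: no data attached
print(post_req.get_method())  # POST: data is present
print(post_req.data)          # b'word=hello'
```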

The timeout parameter sets a timeout in seconds. If no response arrives within that time, an exception is raised. It applies to HTTP, HTTPS, and FTP requests.

import urllib.request
response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)  # a 0.1-second timeout, which will normally raise an exception
print(response.read())
#=====output====
urllib.error.URLError: <urlopen error timed out>

Catching the exception

import urllib.request
import urllib.error
import socket
try:
    response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout): # check whether the reason is a timeout
        print(e.reason) # print the error message
#output
timed out
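
The check above covers a timeout that is wrapped in URLError. In current CPython, a timeout that fires while waiting for the response appears to be raised directly as socket.timeout instead, so a defensive helper can catch both. The sketch below is one way to do that under stated assumptions: SlowHandler is a hypothetical local test server that answers after one second, and fetch is a helper name invented for this example.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical test server that answers after a one-second delay."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.send_header('Content-Length', '4')
            self.end_headers()
            self.wfile.write(b'late')
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url, timeout):
    """Return the body, or a short message if the request timed out."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            return 'timed out while connecting or sending'
        raise
    except socket.timeout:
        return 'timed out while waiting for the response'


server = HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# An empty ProxyHandler keeps the request off any system-wide proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
result = fetch(opener, url, timeout=0.2)
print(result)

server.shutdown()
server.server_close()
```

Which of the two messages is printed depends on where the timeout fires; either way the caller gets a message instead of an unhandled exception.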

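Other parameters: context must be an ssl.SSLContext and specifies the SSL settings. cafile and capath specify a CA certificate file and a directory of CA certificates for HTTPS requests; both have been deprecated since Python 3.6 in favor of context.

urllib.request.Request()

urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)

Parameters:
url: the URL to request; this is the only required parameter, all others are optional
data: the data to upload; it must be a bytes object. A dict can first be encoded with urlencode() from urllib.parse
headers: a dict of request headers; headers can be set here or added later by calling add_header() on the request instance
For example, to disguise the request as coming from a browser, change the User-Agent header. An Internet Explorer-style value looks like this:
{'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'}
origin_req_host: the host name or IP address of the requesting side
unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it; the default is False. For example, if an image embedded in a page is fetched automatically and the user had no option to approve that fetch, the value should be True
method: a string naming the HTTP method to use, such as GET, POST, or PUT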

#!/usr/bin/env python
#coding:utf8
from urllib import request,parse

url='http://httpbin.org/post'
headers={
    'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)',
    'Host':'httpbin.org'
}  # request headers

form={'name':'germey'}  # form fields to submit
data = bytes(parse.urlencode(form),encoding='utf-8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
# Headers can also be added afterwards with the request's add_header() method:
#req.add_header('User-Agent','Mozilla/5.0 (compatible; MSIE 8.4; Windows NT)')
response = request.urlopen(req)
print(response.read())
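
A Request can be inspected before it is sent, which is a convenient way to see what the method and headers arguments actually do. The sketch below runs offline; the PUT URL and the demo-agent/0.1 string are placeholders chosen for this example.

```python
from urllib import parse, request

form = parse.urlencode({'name': 'germey'}).encode('utf-8')
req = request.Request('http://httpbin.org/put', data=form, method='PUT')
req.add_header('User-Agent', 'demo-agent/0.1')  # hypothetical agent string

print(req.get_method())              # PUT: an explicit method overrides the POST default
print(req.has_header('User-agent'))  # True: header names are stored capitalized
print(req.has_header('User-Agent'))  # False: the lookup is case-sensitive
print(req.get_header('User-agent'))  # demo-agent/0.1
print(req.host)                      # httpbin.org
```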

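Advanced classes in urllib.request

The BaseHandler class in urllib.request is the parent class of all other Handlers. A handler is a processor for one aspect of a request, such as login authentication, cookies, proxy settings, or redirects.
It provides the following members for direct use and for use by derived classes:
add_parent(director): adds a director (an OpenerDirector) as the parent
close(): removes any parents
parent: a valid OpenerDirector, which can be used to open a URL with a different protocol or to handle errors
default_open(req): catches all URLs; subclasses define it, and it is called before the protocol-specific open method

Handler subclasses

HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions
HTTPRedirectHandler: handles redirects
HTTPCookieProcessor: handles cookies
ProxyHandler: sets proxies; by default it reads the proxy settings from the environment, and an empty dict disables proxying
HTTPPasswordMgr: manages passwords; it maintains a table of usernames and passwords
HTTPBasicAuthHandler: handles authentication; if opening a link requires authentication, it can perform it
The OpenerDirector class is the high-level class for opening URLs. It opens a URL in three stages:

Within each stage, the order in which handler methods are called is determined by sorting the handler instances. Handlers that define a protocol_request() method are called first to pre-process the request, then protocol_open() is called to handle the request, and finally protocol_response() is called to post-process the response. Here protocol stands for the URL scheme, as in http_request().

The urlopen() function used earlier is itself an Opener provided by urllib. By building your own Opener from Handlers, you can add cookie handling, proxy settings, password handling, and so on.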

Opener methods include:

add_handler(handler): adds a handler to the chain
open(url,data=None[,timeout]): opens the given URL, in the same way as urlopen()
error(proto,*args): handles an error for the given protocol
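
To make the three stages concrete, here is a minimal sketch of a custom handler that hooks the request and response stages for the http scheme. TagHandler, EchoHandler, and the X-Demo header are names invented for this example, and a throwaway local server stands in for a real site so the code runs offline.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that echoes the X-Demo request header."""

    def do_GET(self):
        body = self.headers.get('X-Demo', 'missing').encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class TagHandler(urllib.request.BaseHandler):
    """Custom handler: tags outgoing requests and records response codes."""

    def __init__(self):
        self.statuses = []

    def http_request(self, req):
        # Stage 1: pre-process the request before it is sent
        req.add_header('X-Demo', 'tagged')
        return req

    def http_response(self, req, resp):
        # Stage 3: post-process the response after it arrives
        self.statuses.append(resp.status)
        return resp


server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

tag = TagHandler()
# An empty ProxyHandler keeps the request off any system-wide proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}), tag)
with opener.open('http://127.0.0.1:%d/' % server.server_port, timeout=5) as resp:
    echoed = resp.read().decode('utf-8')

print(echoed)        # tagged
print(tag.statuses)  # [200]

server.shutdown()
server.server_close()
```

Stage 2 (http_open) is left to the default HTTPHandler that build_opener() adds on its own.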

#!/usr/bin/env python
#coding:utf8
from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username='username'
password='password'
url='http://localhost'
p=HTTPPasswordMgrWithDefaultRealm() # create the password manager
p.add_password(None,url,username,password) # register the username and password for this URL
auth_handler=HTTPBasicAuthHandler(p) # build an authentication handler from the password manager
opener=build_opener(auth_handler)  # build an Opener
try:
    result=opener.open(url)  # open the URL; after authentication the result is the protected page
    html=result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Setting a proxy

#!/usr/bin/env python
#coding:utf8
from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler=ProxyHandler({
    'http':'http://127.0.0.1:8888',
    'https':'http://127.0.0.1:9999'
})
opener=build_opener(proxy_handler) # build an Opener
try:
    response=opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Cookies:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
cookie=http.cookiejar.CookieJar() # create a CookieJar
handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
print(cookie)
for item in cookie:
    print(item.name+"="+item.value)

Saving cookies to a file in the Mozilla browser format

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
filename='cookies.txt'
cookie=http.cookiejar.MozillaCookieJar(filename=filename) # a jar that saves cookies in the Mozilla cookies.txt format
#cookie=http.cookiejar.CookieJar() # a plain in-memory CookieJar
handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
cookie.save(ignore_discard=True,ignore_expires=True)

Cookies can also be saved in libwww-perl (LWP) format:

cookie=http.cookiejar.LWPCookieJar(filename=filename)

Reading cookies from a file:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
# The jar class must match the file format: LWPCookieJar for an LWP file, MozillaCookieJar for a Mozilla file
cookie=http.cookiejar.LWPCookieJar()
cookie.load('cookiesLWP.txt',ignore_discard=True,ignore_expires=True)

handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
print(response.read().decode('utf-8'))
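
The save and load calls can be tried without any network access. The following sketch builds one session cookie by hand, writes it to a temporary file, and reads it back; it also shows why ignore_discard=True matters. The make_cookie helper, the cookie values, and the example.com domain are all made up for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'cookies.txt')

jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie('token', 'abc123', 'example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load the file into a fresh jar, as a later run of a crawler would
restored = http.cookiejar.MozillaCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in restored]
print(pairs)  # [('token', 'abc123')]

# Without ignore_discard=True, session cookies are skipped when saving
strict_path = os.path.join(workdir, 'strict.txt')
jar.save(strict_path)
strict = http.cookiejar.MozillaCookieJar()
strict.load(strict_path)
print(len(strict))  # 0
```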

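Exception handling

urllib's error module defines the exceptions raised by the request module. When something goes wrong, the request module raises one of the exceptions defined there.

1) URLError

The URLError class comes from urllib's error module. It inherits from OSError and is the base class of the error module, so any exception raised by the request module can be handled by catching it.
It has a single attribute, reason, which gives the cause of the error.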

#!/usr/bin/env python
#coding:utf8
from urllib import request,error

try:
    response=request.urlopen('https://hehe,com/index')
except error.URLError as e:
    print(e.reason)  # if the page cannot be fetched, the program does not crash; the reason of the caught exception (for example Not Found) is printed instead

On a timeout, reason is an exception object rather than a string

#!/usr/bin/env python
#coding:utf8

import socket
import urllib.request
import urllib.error
try:
    response=urllib.request.urlopen('https://www.baidu.com',timeout=0.001)
except urllib.error.URLError as e:
    print(e.reason)
    if isinstance(e.reason,socket.timeout):
        print('time out')

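2) HTTPError

HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication request. It has three attributes:
code: the HTTP status code, such as 404 (page not found) or 500 (server error)
reason: same as in the parent class, the cause of the error
headers: the headers of the HTTP response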

#!/usr/bin/env python
#coding:utf8
from urllib import request,error

try:
    response=request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:  # catch the subclass first
    print(e.reason,e.code,e.headers,sep='\n')
except error.URLError as e:  # then catch the parent class
    print(e.reason)
else:
    print('request successfully')
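
The difference between the two exception classes is easiest to see side by side. In this sketch a hypothetical local server (NotFoundHandler) always answers 404, which produces an HTTPError, while a port with nothing listening produces a plain URLError. The classify helper is a name made up for this example.

```python
import socket
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def classify(opener, url):
    """Return a (kind, code) pair describing how the request ended."""
    try:
        with opener.open(url, timeout=5):
            return ('ok', 200)
    except HTTPError as e:   # subclass first: the server answered with an error status
        return ('http-error', e.code)
    except URLError:         # parent second: no HTTP response was received
        return ('url-error', None)


server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Reserve a port and release it again so that nothing is listening there
probe = socket.socket()
probe.bind(('127.0.0.1', 0))
closed_port = probe.getsockname()[1]
probe.close()

# An empty ProxyHandler keeps the requests off any system-wide proxy
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
missing = classify(opener, 'http://127.0.0.1:%d/nope' % server.server_port)
refused = classify(opener, 'http://127.0.0.1:%d/' % closed_port)
print(missing)
print(refused)

server.shutdown()
server.server_close()
```

If the two except clauses were swapped, the URLError branch would also swallow every HTTPError and the status code would be lost.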

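Parsing URLs

urllib's parse module defines a standard interface for handling URLs: extracting the parts of a URL, combining them, and converting links. It supports URLs with the following schemes: file,ftp,gopher,hdl,http,https,imap,mailto,mms,news,nntp,prospero,rsync,rtsp,rtspu,sftp,sip,sips,snews,svn,svn+ssh,telnet,wais

urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

As the signature shows, urlparse accepts three parameters:
urlstring: the URL to parse, as a string
scheme: the default scheme, such as http or https. It is used only when the URL itself has no scheme; if the URL specifies one, the URL's own scheme takes effect
allow_fragments: whether to split off the fragment (the anchor). If False, the fragment is not recognized and stays part of the preceding component; otherwise it is parsed separately

1) urlparse()

This function identifies a URL and splits it into scheme (protocol), netloc (domain), path, params (parameters), query (query conditions), and fragment (anchor)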

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result,sep='\n')  # the result is a named tuple
print(result.scheme,result[0])  # values can be read by attribute or by index
print(result.netloc,result[1])
print(result.path,result[2])
print(result.params,result[3])
print(result.query,result[4])
print(result.fragment,result[5])

#output
# The result is a ParseResult object with six parts:
# scheme (protocol), netloc (domain), path, params (parameters), query (query conditions), fragment (anchor)

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
http http
www.baidu.com www.baidu.com
/index.html /index.html
user user
id=5 id=5
comment comment

Specifying a default scheme, and ignoring the anchor with allow_fragments:

from urllib.parse import urlparse
result=urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https',allow_fragments=False)
print(result) 
#====output====
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5#comment', fragment='')
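
In the output above, netloc is empty and the host name ends up in path, because urlparse only recognizes a network location when it is introduced by //. The short sketch below contrasts the two spellings and also checks that a scheme written in the URL takes precedence over the default.

```python
from urllib.parse import urlparse

bare = urlparse('www.baidu.com/index.html', scheme='https')
marked = urlparse('//www.baidu.com/index.html', scheme='https')
explicit = urlparse('http://www.baidu.com/index.html', scheme='https')

print(repr(bare.netloc), bare.path)      # '' www.baidu.com/index.html
print(repr(marked.netloc), marked.path)  # 'www.baidu.com' /index.html
print(marked.geturl())                   # https://www.baidu.com/index.html
print(explicit.scheme)                   # http: the URL's own scheme wins
```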

2) urlunparse()

The opposite of urlparse(): it takes an iterable such as a list or tuple and builds a URL from it

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data)) # build a complete URL

#output
http://www.baidu.com/index.html;user?a=6#comment

3) urlsplit()

Similar to urlparse(), but it returns five parts: params is merged into path

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit
result=urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

#output
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

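4) urlunsplit()

Similar to urlunparse(): it combines the parts of a link into a complete link, and its argument is also an iterable such as a list or tuple. The only difference is that the length must be 5, because params is omitted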

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit,urlunsplit
data=['http','www.baidu.com','index.html','a=5','comment']
result=urlunsplit(data)
print(result)
#output
http://www.baidu.com/index.html?a=5#comment
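
urlsplit() and urlunsplit() combine naturally when one part of an existing URL has to change. A minimal sketch, assuming we want to add a page parameter (a name chosen only for this example) to a URL's query string:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

url = 'http://www.baidu.com/index.html?id=5#comment'
parts = urlsplit(url)

# Rebuild the query with one extra parameter
query = dict(parse_qsl(parts.query))
query['page'] = '2'

# SplitResult is a named tuple, so _replace() returns a modified copy
new_url = urlunsplit(parts._replace(query=urlencode(query)))
print(new_url)  # http://www.baidu.com/index.html?id=5&page=2#comment
```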

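5) urljoin()

Builds a complete URL by combining a base URL (base) with another URL (url). It uses the components of the base URL, namely the scheme, the domain (netloc), and the path, to fill in whatever is missing from the other URL, and returns the result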

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com','index.html'))
print(urljoin('http://www.baidu.com','http://cdblogs.com/index.html'))
print(urljoin('http://www.baidu.com/home.html','https://cnblog.com/index.html'))
print(urljoin('http://www.baidu.com?id=3','https://cnblog.com/index.html?id=6'))
print(urljoin('http://www.baidu.com','?id=2#comment'))
print(urljoin('www.baidu.com','https://cnblog.com/index.html?id=6'))

#output
http://www.baidu.com/index.html
http://cdblogs.com/index.html
https://cnblog.com/index.html
https://cnblog.com/index.html?id=6
http://www.baidu.com?id=2#comment
https://cnblog.com/index.html?id=6

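base_url contributes three components: scheme, netloc, and path. If one of them is missing from the new link, it is filled in from base_url; if the new link has it, the new link's value is used. The params, query, and fragment of base_url are generally not carried over; the exception is a new link with no path of its own, such as '#top', which keeps the base path and query. In this way urljoin() can be used to parse, join, and generate links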
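
Crawlers mostly meet urljoin() when resolving relative links found in a page. The sketch below resolves several common forms of link against one base URL; the example.com addresses are placeholders.

```python
from urllib.parse import urljoin

base = 'http://example.com/a/b/c.html?x=1'

sibling = urljoin(base, 'd.html')                  # same directory as the base page
parent = urljoin(base, '../d.html')                # one directory up
rooted = urljoin(base, '/d.html')                  # relative to the site root
other_host = urljoin(base, '//cdn.example.org/x.js')  # keeps only the scheme
anchor = urljoin(base, '#top')                     # keeps the base path and query

print(sibling)     # http://example.com/a/b/d.html
print(parent)      # http://example.com/a/d.html
print(rooted)      # http://example.com/d.html
print(other_host)  # http://cdn.example.org/x.js
print(anchor)      # http://example.com/a/b/c.html?x=1#top
```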

6) urlencode()

urlencode() is very useful when building GET request parameters: it converts a dict into a query string

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode
params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # turn the dict into GET parameters
print(url)

#output (Python 3.7+ keeps the dict's insertion order)
http://www.baidu.com?username=zs&password=123
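
Two options of urlencode() come up often in crawlers: doseq, for parameters that hold several values, and quote_via, for choosing how spaces are encoded. The parameter names and values below are made up for illustration.

```python
from urllib.parse import quote, urlencode

# Non-ASCII text is percent-encoded as UTF-8; spaces become '+'
plain = urlencode({'q': '中文 test'})

# doseq=True expands a list into one key=value pair per element
multi = urlencode({'q': 'a b', 'tag': ['x', 'y']}, doseq=True)

# quote_via=quote encodes spaces as %20 instead of '+'
percent = urlencode({'q': 'a b'}, quote_via=quote)

print(plain)    # q=%E4%B8%AD%E6%96%87+test
print(multi)    # q=a+b&tag=x&tag=y
print(percent)  # q=a%20b
```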

7) parse_qs()

parse_qs() is the exact opposite of urlencode(): it deserializes, converting GET parameters back into a dict

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode,parse_qs,urlsplit
params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # turn the dict into GET parameters

query=urlsplit(url).query  # get the query part of the URL
print(parse_qs(query))  # convert the GET parameters into a dict

#output
{'username': ['zs'], 'password': ['123']}

8) parse_qsl(): converts the parameters into a list of tuples

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode,urlsplit,parse_qsl

params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # turn the dict into GET parameters

query=urlsplit(url).query  # get the query part of the URL
print(parse_qsl(query)) # convert to a list of (name, value) tuples

#output
[('username', 'zs'), ('password', '123')]
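
The list values returned by parse_qs() exist because a key may appear more than once in a query string. The sketch below uses a made-up query to show repeated keys and the keep_blank_values option, which controls whether empty values are kept.

```python
from urllib.parse import parse_qs, parse_qsl

query = 'a=1&a=2&b='

default = parse_qs(query)                          # blank values are dropped
kept = parse_qs(query, keep_blank_values=True)     # blank values are kept
pairs = parse_qsl(query, keep_blank_values=True)   # order and repeats preserved

print(default)  # {'a': ['1', '2']}
print(kept)     # {'a': ['1', '2'], 'b': ['']}
print(pairs)    # [('a', '1'), ('a', '2'), ('b', '')]
```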

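9) quote(): converts content into URL-encoded (percent-encoded) form. Parameters containing Chinese characters can otherwise end up garbled, so use this function to encode the Chinese characters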

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote
key='中文'
url='https://www.baidu.com/s?key='+quote(key)
print(url)
#output
https://www.baidu.com/s?key=%E4%B8%AD%E6%96%87

10) unquote(): the opposite of quote(); it decodes URL-encoded text

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote,urlsplit,unquote
key='中文'
url='https://www.baidu.com/s?key='+quote(key)
print(url)
unq=urlsplit(url).query.split('=')[1] # get the parameter value

print(unquote(unq))  # decode the parameter
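
quote() leaves / unencoded by default, which is right for paths but not for values that must travel inside a single query parameter. The sketch below shows the safe argument and the quote_plus()/unquote_plus() pair that handles the + form of spaces; the sample strings are made up.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

path = quote('/path/中 文')          # '/' is kept, space becomes %20
strict = quote('a/b', safe='')       # safe='' also encodes '/'
form = quote_plus('a b&c')           # space becomes '+', '&' is encoded

print(path)    # /path/%E4%B8%AD%20%E6%96%87
print(strict)  # a%2Fb
print(form)    # a+b%26c

print(unquote(path))       # /path/中 文
print(unquote_plus(form))  # a b&c
```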

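Analyzing the Robots protocol

With urllib's robotparser module we can analyze a site's Robots protocol

1) The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, placed in the root directory of the site.

When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler fetches pages within the scope defined there; if not, the crawler visits every page it can reach directly

Here is a sample robots.txt: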

User-agent: *
Disallow: /
Allow: /public/

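This allows all search crawlers to fetch only the public directory. Save the content above as robots.txt in the site's root directory, alongside the site's entry file (index.html)

User-agent names the crawler the rules apply to. Setting it to * makes the rules apply to every crawler; setting it to Baiduspider makes them apply to the Baidu crawler. Several User-agent lines can be given to restrict several crawlers, but at least one is required

Some common search crawler names:

BaiduSpider  Baidu crawler  www.baidu.com
Googlebot  Google crawler  www.google.com
360Spider  360 crawler  www.so.com
YodaoBot  Youdao crawler  www.youdao.com
ia_archiver  Alexa crawler  www.alexa.cn
Scooter  AltaVista crawler  www.altavista.com

Disallow specifies directories that may not be fetched; in the sample above, / means that no page may be fetched

Allow is normally used together with Disallow to carve out exceptions; in the sample above, /public/ means that no page may be fetched except those in the public directory

Deny all crawlers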

User-agent: *
Disallow: /

Allow all crawlers to access any directory (leaving the file empty has the same effect)

User-agent: *
Disallow:

Deny all crawlers access to certain directories

User-agent: *
Disallow: /home/
Disallow: /tmp/

Allow only one particular crawler

User-agent: BaiduSpider
Disallow:
User-agent: *
Disallow: /

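2) robotparser

The robotparser module parses robots.txt. It provides a class, RobotFileParser, which uses a site's robots.txt file to decide whether a given crawler has permission to fetch a given page

urllib.robotparser.RobotFileParser(url='')

Commonly used methods of RobotFileParser:
set_url(): sets the URL of the robots.txt file; not needed if the URL was passed when the RobotFileParser object was created
read(): reads the robots.txt file and analyzes it; it returns nothing but performs the read and the analysis
parse(): parses robots.txt content; the argument is a list of lines from robots.txt, which are analyzed according to the syntax rules
can_fetch(): takes two arguments, a User-agent and the URL to fetch, and returns True or False depending on whether that crawler may fetch the URL
mtime(): returns the time robots.txt was last fetched and analyzed
modified(): sets the time robots.txt was last fetched and analyzed to the current time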

#!/usr/bin/env python
#coding:utf8
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()  # create the parser
rp.set_url('https://www.cnblogs.com/robots.txt') # set the robots.txt URL; it can also be passed to the constructor
rp.read()  # read and parse the file
print(rp.can_fetch('*','https://i.cnblogs.com/EditPosts.aspx?postid=9170312&update=1')) # check whether the link may be fetched
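
Rules can also be tested offline by passing lines to parse() instead of calling read(). One caveat worth checking on your own Python version: as far as I can tell, CPython's RobotFileParser tests the rules of an entry in file order and uses the first one that matches, so in this sketch Allow is listed before the blanket Disallow. The example.com URLs and the rule sets are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

allow_first = [
    'User-agent: *',
    'Allow: /public/',
    'Disallow: /',
]

rp = RobotFileParser()
rp.parse(allow_first)  # parse lines directly, no network access needed

public_ok = rp.can_fetch('*', 'http://example.com/public/page.html')
private_ok = rp.can_fetch('*', 'http://example.com/private/page.html')
print(public_ok)   # True
print(private_ok)  # False

# Same rules with Disallow first; compare the result on your interpreter
disallow_first = RobotFileParser()
disallow_first.parse(['User-agent: *', 'Disallow: /', 'Allow: /public/'])
print(disallow_first.can_fetch('*', 'http://example.com/public/page.html'))
```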

References:
https://www.jianshu.com/p/3aa45f0e0aad
http://2.python-requests.org/zh_CN/latest/user/quickstart.html
http://2.pythonrequests.org/zh_CN/latest/user/advanced.html#advanced
https://blog.csdn.net/qq_36119192/article/details/82943326
