Python ships with a commonly used, basic library for writing crawlers. In Python 2 this functionality was spread across urllib and urllib2; in Python 3 the request-related functions were consolidated into the urllib.request module for easier maintenance.
The notes below use Python 3.
The urlopen function
urlopen is one of the most commonly used functions in request. It opens a URL and returns a response object that exposes a number of useful attributes and methods.
<pre>from urllib import request

resp = request.urlopen("https://www.baidu.com")
print(resp.read())
</pre>
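The object returned by urlopen is an http.client.HTTPResponse, so besides read() you can also inspect the status code and response headers. A minimal sketch, requesting the same page as above:
<pre>from urllib import request

resp = request.urlopen("https://www.baidu.com")
print(resp.status)        # HTTP status code, e.g. 200
print(resp.getheaders())  # list of (header, value) tuples
print(resp.read(10))      # read() also accepts an optional byte count
</pre>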
The urlretrieve function
The urlretrieve function is a convenient way to save a remote resource to a local file. Its usage is shown in the code below:
<pre>from urllib import request

# urlretrieve(url, filename) downloads the resource at url and saves it as filename
request.urlretrieve("https://www.baidu.com", "baidu.html")
</pre>
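urlretrieve also accepts an optional reporthook callback that is invoked as the download progresses. A minimal sketch; the progress function and filename here are only illustrative:
<pre>from urllib import request

def progress(block_num, block_size, total_size):
    # called repeatedly with the number of blocks transferred so far,
    # the block size in bytes, and the total size reported by the server
    print(block_num * block_size, "/", total_size)

request.urlretrieve("https://www.baidu.com", "baidu.html", reporthook=progress)
</pre>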
The urlencode function
urlencode converts a dict into URL-encoded query-string data.
<pre>from urllib import parse, request

params = {"wd": "爬虫之道"}
qs = parse.urlencode(params)
url = f"https://www.baidu.com/s?{qs}"
resp = request.urlopen(url)
</pre>
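If you only need to percent-encode a single string rather than a whole dict of parameters, urllib.parse also provides quote. A quick comparison of the two:
<pre>from urllib import parse

print(parse.urlencode({"wd": "爬虫之道"}))  # wd=%E7%88%AC%E8%99%AB%E4%B9%8B%E9%81%93
print(parse.quote("爬虫之道"))              # %E7%88%AC%E8%99%AB%E4%B9%8B%E9%81%93
</pre>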
The parse_qs function
parse_qs converts URL-encoded parameters back into a dict, where each value is stored as a list.
<pre>from urllib import parse

ps = {"name": "爬虫之道", "vxcode": "spider_rold"}
results = parse.urlencode(ps)
new_results = parse.parse_qs(results)
print(new_results)
</pre>
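Running the snippet above should print the values wrapped in lists, i.e. {'name': ['爬虫之道'], 'vxcode': ['spider_rold']}. If you prefer a list of (key, value) tuples instead, urllib.parse also offers parse_qsl; a quick sketch:
<pre>from urllib import parse

qs = parse.urlencode({"name": "爬虫之道", "vxcode": "spider_rold"})
print(parse.parse_qs(qs))   # dict with list values: {'name': ['爬虫之道'], ...}
print(parse.parse_qsl(qs))  # list of tuples: [('name', '爬虫之道'), ...]
</pre>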
The urlparse and urlsplit functions
Both functions split a URL into its components; the difference is that urlsplit does not return a params component. (Note: params refers to the part of the path introduced by a ";" before the "?", e.g. the "hello" in https://www.baidu.com/s;hello?wd=hello+world.)
<pre>from urllib import parse

params = {"wd": "爬虫之道"}
qs = parse.urlencode(params)
url = f"https://www.baidu.com/s?{qs}"
result = parse.urlparse(url)
print('scheme:', result.scheme)
print('netloc:', result.netloc)
print('path:', result.path)
print('params:', result.params)
print('query:', result.query)
print('fragment:', result.fragment)
</pre>
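To see the difference between the two functions, compare them on the URL with a params segment from the note above; a quick sketch:
<pre>from urllib import parse

url = "https://www.baidu.com/s;hello?wd=hello+world"
print(parse.urlparse(url))  # ParseResult(..., path='/s', params='hello', query='wd=hello+world', ...)
print(parse.urlsplit(url))  # SplitResult(..., path='/s;hello', query='wd=hello+world', ...) -- no params field
</pre>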
The request.Request class
If you want to supply extra parameters with a request, such as request headers or request data, you need the Request class; urlopen alone gives you no way to add headers. Without a User-Agent header, a server that detects a crawler may return fake content or nothing at all, so disguising the crawler as a normal browser visit is well worth doing.
<pre>from urllib import request, parse

headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"}
params = {"wd": "爬虫之道"}
qs = parse.urlencode(params)
url = f"https://www.baidu.com/s?{qs}"
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read())
</pre>
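Request can also carry a request body. A minimal sketch of a POST request, using httpbin.org/post as a test endpoint; the form fields are made-up placeholders, and note that the body must be URL-encoded and converted to bytes:
<pre>from urllib import request, parse

headers = {'user-agent': "Mozilla/5.0"}
form = {"username": "test", "password": "test"}  # hypothetical form fields
data = parse.urlencode(form).encode("utf-8")     # request body must be bytes
req = request.Request("http://httpbin.org/post", data=data, headers=headers, method="POST")
resp = request.urlopen(req)
print(resp.read())
</pre>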
Note: if you want a challenge, try the "Mount Everest" of the web-scraping world: 拉勾网 (Lagou).
The ProxyHandler handler
When you crawl a website, it will usually have some anti-scraping measures in place, such as blocking IP addresses. If you crawl from a single IP and the site blocks it, you can no longer fetch data, so you need a counter-measure of your own. In urllib, a proxy server is configured with ProxyHandler.
The difference between requests made with and without a proxy:
<pre>from urllib import request

# Without a proxy
resp = request.urlopen('http://httpbin.org/get')
print(resp.read())

# With a proxy: the dict key must match the scheme of the request URL
handler = request.ProxyHandler({"http": "58.253.152.231:9999"})
opener = request.build_opener(handler)
req = request.Request("http://httpbin.org/get")
resp = opener.open(req)
print(resp.read())
</pre>
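If you want every subsequent urlopen call to go through the proxy instead of calling opener.open each time, the opener can be installed globally. A minimal sketch, reusing the same placeholder proxy address as above:
<pre>from urllib import request

handler = request.ProxyHandler({"http": "58.253.152.231:9999"})
opener = request.build_opener(handler)
request.install_opener(opener)  # from now on, request.urlopen uses this opener
resp = request.urlopen("http://httpbin.org/get")
print(resp.read())
</pre>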
Commonly used proxy sources:
What is a cookie?
On websites, HTTP requests are stateless: after you connect to the server and log in on the first request, the server still cannot tell who is making the second one. Cookies were introduced to solve this. After the first login, the server returns some data (the cookie) to the browser, which stores it locally; on the next request the browser automatically sends the stored cookie along with it, and the server identifies the user by examining it. Cookie storage is limited, generally to no more than 4 KB, so a cookie can only hold a small amount of data.
Cookie format:
Set-Cookie: NAME=VALUE;Expires=DATE;Path=PATH;Domain=DOMAIN_NAME;SECURE
Field meanings:
- NAME: the cookie's name
- VALUE: the cookie's value
- Expires: when the cookie expires
- Path: the path the cookie applies to
- Domain: the domain the cookie applies to
- SECURE: whether the cookie is only sent over HTTPS connections
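As a concrete illustration of this format, Python's standard http.cookies module can parse such a header value; the cookie name and attribute values below are made up:
<pre>from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie.load("uid=abc123; Path=/; Domain=example.com")  # hypothetical Set-Cookie value
print(cookie["uid"].value)      # abc123
print(cookie["uid"]["path"])    # /
print(cookie["uid"]["domain"])  # example.com
</pre>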
Simulating login with http.cookiejar and HTTPCookieProcessor
In Python, cookies are generally handled by combining the http.cookiejar module with urllib's HTTPCookieProcessor handler (a login sketch follows the list below).
- http.cookiejar: provides the objects used to store cookies
- HTTPCookieProcessor: processes cookie objects and builds the handler object
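The general pattern for a simulated login is to keep cookies in an in-memory CookieJar, POST the login form once, and then reuse the same opener for later requests. A minimal sketch; the login URL and form fields are hypothetical placeholders:
<pre>from urllib import request, parse
from http.cookiejar import CookieJar

cookiejar = CookieJar()  # cookies live in memory for the lifetime of the jar
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar))

# 1. Log in: the server's Set-Cookie responses are captured in the cookiejar
login_data = parse.urlencode({"user": "test", "pwd": "test"}).encode("utf-8")
opener.open(request.Request("http://example.com/login", data=login_data))

# 2. Reuse the same opener: the stored cookies are sent automatically
resp = opener.open("http://example.com/profile")
print(resp.read())
</pre>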
The http.cookiejar module
The module provides four main classes: CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar. Their roles are as follows:
- CookieJar: manages HTTP cookie values, stores cookies generated by HTTP requests and adds them to outgoing requests. The cookies are kept entirely in memory and are lost once the CookieJar instance is garbage-collected.
- FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar; creates a FileCookieJar instance that retrieves cookie information and stores it in a file. filename is the file to store cookies in; when delayload is True, file access is deferred, i.e. the file is only read or written when needed.
- MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
- LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 file format.
Saving cookies to a local file
To save cookies locally, use the cookiejar's save method.
<pre>from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookies.txt')
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
}
req = request.Request('http://httpbin.org/cookies', headers=headers)
resp = opener.open(req)
print(resp.read())
cookiejar.save(ignore_discard=True, ignore_expires=True)
</pre>
Loading cookies from a local file
To read cookie information back from a local file, use the cookiejar's load method:
<pre>from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookies.txt')
cookiejar.load(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)
</pre>