1. Urllib -- urllib.request

Author: 江湖十年 | Published: 2018-06-14 11:26


The request module of urllib sends a request and returns a response.
urllib.request provides the basic means of constructing HTTP requests, and it can also handle authentication, redirections, cookies, and related concerns.

urlopen()

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Example

import urllib.request


response = urllib.request.urlopen('https://www.baidu.com/')
print(type(response))  # an http.client.HTTPResponse object
body = response.read()  # read() returns bytes; a second read() would return b''
print(body.decode('utf-8'))  # decode() turns the bytes into a str
print(type(body))  # a bytes object

<class 'http.client.HTTPResponse'>
<html>

<head>

    <script>

        location.replace(location.href.replace("https://","http://"));

    </script>

</head>

<body>

    <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>

</body>

</html>
<class 'bytes'>

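Here the variable response holds the returned HTTPResponse object, which has a number of commonly used attributes and methods.

Commonly used attributes:

headers: the response headers, as an http.client.HTTPMessage

msg: for a response returned by urlopen(), the same string as reason

version: the HTTP version (11 means HTTP/1.1)

status: the status code

reason: the reason phrase returned by the server

length: the number of bytes of the body still to be read (None when the length is unknown)

Commonly used methods:

read(): reads the response body and returns it as bytes

getheaders(): returns a list of all response headers as (name, value) tuples

getheader(name): returns the value of the response header name

geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect

fileno(): returns the file descriptor of the underlying socket
For more attributes and methods, see the source of http.client.HTTPResponse.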

# Attribute tests

In [65]: response.headers
Out[65]: <http.client.HTTPMessage at 0x1fd73e3e780>

In [66]: response.msg
Out[66]: 'OK'

In [67]: response.version
Out[67]: 11

In [68]: response.status
Out[68]: 200

In [69]: response.reason
Out[69]: 'OK'

# Method tests

In [59]: response.getheaders()  # returns a list of all response headers
Out[59]:
[('Date', 'Thu, 03 May 2018 02:54:35 GMT'),
 ('Content-Type', 'text/html; charset=utf-8'),
 ('Transfer-Encoding', 'chunked'),
 ('Connection', 'Close'),
 ('Vary', 'Accept-Encoding'),
 ('Set-Cookie',
  'BAIDUID=EF104C6D8C25B57F4A3AA36BC378D3EF:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'BIDUPSID=EF104C6D8C25B57F4A3AA36BC378D3EF; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'PSTM=1525316075; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie', 'BDSVRTM=0; path=/'),
 ('Set-Cookie', 'BD_HOME=0; path=/'),
 ('Set-Cookie',
  'H_PS_PSSID=1438_21087_26306_20927; path=/; domain=.baidu.com'),
 ('P3P', 'CP=" OTI DSP COR IVA OUR IND COM "'),
 ('Cache-Control', 'private'),
 ('Cxy_all', 'baidu+be71f1e4f45482be566c40a0409b9d98'),
 ('Expires', 'Thu, 03 May 2018 02:53:55 GMT'),
 ('X-Powered-By', 'HPHP'),
 ('Server', 'BWS/1.1'),
 ('X-UA-Compatible', 'IE=Edge,chrome=1'),
 ('BDPAGETYPE', '1'),
 ('BDQID', '0xc2fe597000007ae2')]

In [60]: response.getheader('Server')  # returns the value of the Server response header
Out[60]: 'BWS/1.1'

In [61]: response.geturl()  # returns the URL of the retrieved resource
Out[61]: 'http://www.baidu.com'

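As the signature shows, urlopen() accepts several parameters; only url is required, and the rest are optional.

The data parameter

Passing the data parameter turns the request into a POST request.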

import urllib.parse
import urllib.request


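# urlencode() turns the key/value pairs of a dict into a 'key=value' string and URL-encodes non-ASCII characters
# The data parameter must be bytes, so the string is converted with bytes()
# bytes() takes two arguments here: the str to convert and the encoding to use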
data = bytes(urllib.parse.urlencode({'keywords': 'Python中文'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "keywords": "Python\u4e2d\u6587"  # the parameter submitted by the POST request, simulating a form submission
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "15", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.6"
  }, 
  "json": null, 
  "origin": "103.192.227.247", 
  "url": "http://httpbin.org/post"
}

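The test site http://httpbin.org/ provides endpoints for testing HTTP requests. http://httpbin.org/post tests POST requests and echoes back the request information; the form field holds whatever was passed through the data parameter.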
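The effect of the data parameter can also be checked without sending anything. The following is a minimal sketch, assuming the same httpbin.org URLs as above: it shows what urlencode() produces and how the presence of data changes the HTTP method that a Request object will use.

```python
import urllib.parse
import urllib.request

params = {'keywords': 'Python中文'}

# urlencode() percent-encodes the UTF-8 bytes of non-ASCII characters
encoded = urllib.parse.urlencode(params)
data = encoded.encode('utf-8')  # equivalent to bytes(encoded, encoding='utf-8')

# Nothing is sent here; the Request objects are only inspected
with_data = urllib.request.Request('http://httpbin.org/post', data=data)
without_data = urllib.request.Request('http://httpbin.org/get')

print(encoded)                    # keywords=Python%E4%B8%AD%E6%96%87
print(with_data.get_method())     # POST
print(without_data.get_method())  # GET
```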

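The timeout parameter

The timeout parameter sets a timeout in seconds. If no response arrives within that time, the request fails with a timeout error; as the traceback below shows, a timeout while connecting surfaces as urllib.error.URLError wrapping socket.timeout. If timeout is not given, the global default timeout setting is used. It applies to HTTP, HTTPS, and FTP requests.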

import urllib.request


response = urllib.request.urlopen('https://www.baidu.com/', timeout=0.01)
print(response.read().decode('utf-8'))

Traceback (most recent call last):
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\urllib\request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\http\client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\http\client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\http\client.py", line 964, in send
    self.connect()
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\http\client.py", line 1400, in connect
    server_hostname=server_hostname)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\ssl.py", line 401, in wrap_socket
    _context=self, _session=session)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\ssl.py", line 808, in __init__
    self.do_handshake()
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\ssl.py", line 1061, in do_handshake
    self._sslobj.do_handshake()
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\ssl.py", line 683, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:733: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Spider/notes/Python文档/urllib_request.py", line 18, in <module>
    response = urllib.request.urlopen('https://www.baidu.com/', timeout=0.01)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "c:\users\wangjx\appdata\local\programs\python\python36\Lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error _ssl.c:733: The handshake operation timed out>
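In practice the timeout usually needs to be caught rather than left to crash the script. The sketch below is one possible way to do that; fetch_or_none is a hypothetical helper name, not part of urllib. It handles both forms a timeout can take: wrapped in URLError (as in the traceback above) or raised directly as socket.timeout while the response is being read.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeout while connecting: URLError carries the socket.timeout in e.reason
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is not swallowed
    except socket.timeout:
        # Timeout while waiting for or reading the response: raised directly
        return None
```

With this helper, a call such as fetch_or_none('https://www.baidu.com/', timeout=0.01) would be expected to return None instead of raising.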

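Other parameters

cafile (a CA certificate file) / capath (a directory of CA certificates): deprecated since Python 3.6 and removed in Python 3.13.
Use ssl.SSLContext.load_cert_chain() instead, or let ssl.create_default_context() select the system's trusted CA certificates (this is useful when requesting HTTPS links).

cadefault: deprecated since Python 3.6.

context: must be an ssl.SSLContext instance; it specifies the SSL settings.
For more, see the official documentation: https://docs.python.org/3/library/urllib.request.html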
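Since context is the recommended replacement for the deprecated parameters, a short sketch of its use may help. open_https is a hypothetical helper name, and the default timeout of 10 seconds is an arbitrary choice for illustration.

```python
import ssl
import urllib.request

# A context that verifies server certificates against the system's trusted CA store
context = ssl.create_default_context()


def open_https(url, timeout=10):
    """Open an HTTPS URL using the explicit SSL context defined above."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)


print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```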

Request()

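On its own, urlopen() only supports simple requests. To add headers and other information to a request, use Request() to construct a more complex request.
Part of the source of Request.__init__() is shown below.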

class Request:

    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
        self.full_url = url
        self.headers = {}
        self.unredirected_hdrs = {}
        self._data = None
        self.data = data
        self._tunnel_host = None
        for key, value in headers.items():
            self.add_header(key, value)
        if origin_req_host is None:
            origin_req_host = request_host(self)
        self.origin_req_host = origin_req_host
        self.unverifiable = unverifiable
        if method:
            self.method = method
    ...

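Parameters:

url: the URL to request; required, while all other parameters are optional

data: bytes; the data to send to the server, for example the body of a POST request

headers: the request headers; they can be passed in when constructing the Request, or added afterwards by calling add_header() on the Request instance

origin_req_host: the host name or IP address of the original request initiated by the user; it defaults to the host of url

unverifiable: whether the request is unverifiable, False by default. An unverifiable request is one whose URL the user had no option to approve. For example, if the request is for an image in an HTML document and the user had no option to approve the automatic fetching of the image, unverifiable should be True

method: the request method, such as GET, POST, or PUT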
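The parameters above can be explored on a Request object without sending it. This is a small sketch; the header value demo-client/0.1 is a made-up client name, and the httpbin.org/put URL is only used as a label since nothing is sent. Note that add_header() stores header names capitalized, so the lookup key is 'User-agent' rather than 'User-Agent'.

```python
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({'keywords': 'urllib'}).encode('utf-8')
request = urllib.request.Request(
    'http://httpbin.org/put',
    data=data,
    headers={'User-Agent': 'demo-client/0.1'},  # hypothetical client name
    method='PUT',
)
# Headers can also be added after construction
request.add_header('Accept', 'application/json')

print(request.get_method())              # PUT
print(request.get_header('User-agent'))  # demo-client/0.1
print(request.get_header('User-Agent'))  # None, because the stored key is 'User-agent'
print(request.origin_req_host)           # httpbin.org
print(request.data)                      # b'keywords=urllib'
```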

Example

import urllib.parse
import urllib.request


url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Host': 'httpbin.org'
}
dict_params = {
    'keywords': 'Python中文',
    'keywords1': 'urllib'
}
data = bytes(urllib.parse.urlencode(dict_params), encoding='utf-8')
# Build the Request object, passing headers, data, and method to its constructor
request = urllib.request.Request(url, data=data, headers=headers, method='POST')

# This time urlopen() is not given the URL directly;
# it receives the Request object built above
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "keywords": "Python\u4e2d\u6587", 
    "keywords1": "urllib"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "50", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
  }, 
  "json": null, 
  "origin": "103.192.227.247", 
  "url": "http://httpbin.org/post"
}

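Advanced usage

urllib.request also provides a powerful tool: Handler.

What is a Handler?
Put simply, handlers are processors for different concerns: some deal with login authentication, some with cookies, some with proxy settings. Between them they cover nearly everything that an HTTP request can involve.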

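Handler classes in the urllib.request module:
BaseHandler: the base class of all other handlers. It defines the interface that handlers implement; subclasses may define methods such as default_open() and <protocol>_request().

HTTPDefaultErrorHandler handles HTTP error responses; every error response is turned into an HTTPError exception.

HTTPRedirectHandler handles redirections.

HTTPCookieProcessor handles cookies.

ProxyHandler configures proxies. When no proxies are passed in, it reads them from the <protocol>_proxy environment variables.

HTTPPasswordMgr manages passwords; it keeps a mapping from (realm, uri) to (user, password).

HTTPBasicAuthHandler handles authentication: if opening a link requires authentication, it can supply the credentials.
For more handler classes, see: https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler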

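With handlers understood, the next concept is the Opener:

An instance of the OpenerDirector class can be called an Opener. The urlopen() function used earlier is in fact a thin wrapper around a default Opener that urllib provides.

An Opener has an open() method, which returns the same kind of object as urlopen().

Request() and urlopen() wrap the most common request patterns, and together they cover basic requests. More advanced features, and configuration one level deeper, call for an Opener.

The relationship between Opener and Handler:
In short, an Opener is built from Handlers.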
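To make that relationship concrete, here is a minimal sketch that builds an Opener from one explicit Handler and lists everything the Opener ends up containing. It relies on two details worth stating as assumptions: the handlers list attribute of OpenerDirector, and the default handler for data: URLs, which lets open() be tried without any network access.

```python
import http.cookiejar
import urllib.request

# One explicitly chosen handler; build_opener() adds the default handlers around it
cookie_handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(cookie_handler)

# Class names of every handler the opener now holds
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)

# A data: URL is resolved locally, so this open() call needs no network
response = opener.open('data:text/plain;charset=utf-8,hello')
body = response.read()
print(body)  # b'hello'
```

The printed list shows that the explicit HTTPCookieProcessor sits alongside defaults such as HTTPRedirectHandler, which is why a custom Opener still follows redirects and raises HTTPError as usual.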

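Authentication
Some websites show an authentication dialog as soon as they are visited, and the page is only accessible after logging in with a username and password.
Example URL: http://httpbin.org/digest-auth/auth/user/passwd/MD5/never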


There are broadly two kinds of authentication: Basic Authentication and Digest Authentication.
Reference: http://blog.jobbole.com/41519/

To access such a page, use HTTPBasicAuthHandler or HTTPDigestAuthHandler to complete the authentication.
Using the HTTPDigestAuthHandler handler to complete Digest Authentication

"""
Use the HTTPDigestAuthHandler handler to complete Digest Authentication
"""
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPDigestAuthHandler, build_opener, install_opener, urlopen, Request
from urllib.error import URLError


url = 'http://httpbin.org/digest-auth/auth/user/passwd/MD5/never'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
}
request = Request(url=url, headers=headers)

username = 'user'
password = 'passwd'

# Create an HTTPPasswordMgrWithDefaultRealm object
passman = HTTPPasswordMgrWithDefaultRealm()

# Add the realm, username, and password to passman
# If the realm is not known, None can be passed instead
passman.add_password(realm='me@kennethreitz.com', uri=url, user=username, passwd=password)

# The page's response indicates that Digest authentication should be used
auth_handler = HTTPDigestAuthHandler(passman)

# Build an Opener from auth_handler; requests sent through this Opener complete the authentication
opener = build_opener(auth_handler)
install_opener(opener)

try:
    response = urlopen(request)
    json_data = response.read().decode('utf-8')
    print(json_data)
except URLError as e:
    print(e.reason)

{
  "authenticated": true, 
  "user": "user"
}

Using the HTTPBasicAuthHandler handler to complete Basic Authentication


"""
Use the HTTPBasicAuthHandler handler to complete Basic Authentication
"""
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener


basic_url = 'http://httpbin.org/basic-auth/user/passwd'
username = 'user'
password = 'passwd'

passman = HTTPPasswordMgrWithDefaultRealm()
passman.add_password(realm=None, uri=basic_url, user=username, passwd=password)
handler = HTTPBasicAuthHandler(password_mgr=passman)
opener = build_opener(handler)
response = opener.open(basic_url)
print(response.read().decode('utf-8'))

{
  "authenticated": true, 
  "user": "user"
}
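Both examples rely on a password manager to decide which credentials match a given realm and URL. The sketch below, which sends no request, illustrates why passing realm=None works with HTTPPasswordMgrWithDefaultRealm but not with the plain HTTPPasswordMgr. The realm string 'Fake Realm' is only an assumed example of what a server might announce.

```python
from urllib.request import HTTPPasswordMgr, HTTPPasswordMgrWithDefaultRealm

basic_url = 'http://httpbin.org/basic-auth/user/passwd'

# Credentials stored under realm None act as a fallback for any realm
with_default = HTTPPasswordMgrWithDefaultRealm()
with_default.add_password(None, basic_url, 'user', 'passwd')

# The plain manager only matches the exact realm the credentials were stored under
plain = HTTPPasswordMgr()
plain.add_password(None, basic_url, 'user', 'passwd')

print(with_default.find_user_password('Fake Realm', basic_url))  # ('user', 'passwd')
print(plain.find_user_password('Fake Realm', basic_url))         # (None, None)
print(with_default.find_user_password('Fake Realm', 'http://example.com/'))  # (None, None)
```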

Compared with urllib, authentication with the requests library is much simpler.

import requests
from requests.auth import HTTPDigestAuth


url = 'http://httpbin.org/digest-auth/auth/user/passwd/MD5/never'

username = 'user'
password = 'passwd'

response = requests.get(url, auth=HTTPDigestAuth(username, password))
print(response.text)

{
  "authenticated": true, 
  "user": "user"
}

Proxies

from urllib.request import ProxyHandler, build_opener
from urllib.error import URLError


url = 'https://www.baidu.com'
# ProxyHandler takes a dict: the keys are protocol names (http/https), the values are proxy URLs (scheme://ip:port)
proxy_handler = ProxyHandler({
    'https': 'https://39.134.68.16:80',
    'http': 'http://117.127.0.198:8080'
})

opener = build_opener(proxy_handler)

try:
    response = opener.open(url)
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

<html>
<head>
    <script>
        location.replace(location.href.replace("https://","http://"));
    </script>
</head>
<body>
    <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
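How ProxyHandler picks its proxies depends on the argument it receives. The following sketch compares an explicit mapping with an empty one; the address 127.0.0.1:8080 is a hypothetical local proxy used only for illustration, and no request is sent.

```python
import urllib.request

# Proxies detected from the environment, e.g. the http_proxy / https_proxy variables
print(urllib.request.getproxies())

# An explicit mapping overrides whatever the environment provides
explicit = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})  # hypothetical proxy

# An empty dict disables autodetected proxies altogether
no_proxy = urllib.request.ProxyHandler({})
direct_opener = urllib.request.build_opener(no_proxy)

print(explicit.proxies)  # {'http': 'http://127.0.0.1:8080'}
print(no_proxy.proxies)  # {}
```

Requests made through direct_opener would bypass any proxy configured in the environment, which can be handy when a system-wide proxy interferes with testing.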

Cookies
Fetching cookies and storing them in a CookieJar() object

"""
Fetch cookies and store them in a CookieJar() object
"""
import urllib.request
import http.cookiejar


# Create a CookieJar object to store the cookies
cookie = http.cookiejar.CookieJar()

# Build a cookie handler with HTTPCookieProcessor, passing in the CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)

# Build the opener with build_opener()
opener = urllib.request.build_opener(handler)

# After the GET request, the cookies are stored in the cookie variable automatically
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(f'{item.name}={item.value}')

BAIDUID=99E5872D133402B48CB472C2EFFB6645:FG=1
BIDUPSID=99E5872D133402B48CB472C2EFFB6645
H_PS_PSSID=1461_19033_21098_26350_20928
PSTM=1525339260
BDSVRTM=0
BD_HOME=0

Fetching cookies and saving them to the file cookie.txt

"""
Fetch cookies and save them to the file cookie.txt
"""
import http.cookiejar
import urllib.request


# Name of the local file the cookies will be saved to
filename = 'cookie.txt'

# Create a MozillaCookieJar to hold the cookies;
# its save() method can later write them to the local file named by filename
cookie = http.cookiejar.MozillaCookieJar(filename)

# Build the handler object
handler = urllib.request.HTTPCookieProcessor(cookie)

# Build the opener with build_opener()
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

# Save the cookies to the local file
# ignore_discard: save cookies even if they are marked to be discarded
# ignore_expires: save cookies even if they have expired
# save() overwrites the file if it already exists
cookie.save(ignore_discard=True, ignore_expires=True)

"""
cookie.txt
"""
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com  TRUE    /   FALSE   3672825284  BAIDUID B537FA08ABD4B1D04757EB2051A7154F:FG=1
.baidu.com  TRUE    /   FALSE   3672825284  BIDUPSID    B537FA08ABD4B1D04757EB2051A7154F
.baidu.com  TRUE    /   FALSE       H_PS_PSSID  1442_21102_26350_26183_20930
.baidu.com  TRUE    /   FALSE   3672825284  PSTM    1525341636
www.baidu.com   FALSE   /   FALSE       BDSVRTM 0
www.baidu.com   FALSE   /   FALSE       BD_HOME 0

Fetching cookies and saving them to the file cookie_lwp.txt

"""
Fetch cookies and save them to the file cookie_lwp.txt
LWP Cookies 2.0 is a cookie format that is easier for humans to read
"""
import urllib.request
import http.cookiejar


filename = 'cookie_lwp.txt'
cookie = http.cookiejar.LWPCookieJar(filename=filename)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)

"""
cookie_lwp.txt
"""
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="80DFD8B0D9743AF2FAD4004F1E3FDAFB:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-05-22 04:22:47Z"; version=0
Set-Cookie3: BIDUPSID=80DFD8B0D9743AF2FAD4004F1E3FDAFB; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-05-22 04:22:47Z"; version=0
Set-Cookie3: H_PS_PSSID=1433_21107_20929; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1525396119; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-05-22 04:22:47Z"; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

Reading cookies from a local file

"""
Read cookies from the local file cookie_lwp.txt and use them to request a page
"""
import urllib.request
import http.cookiejar


# When only reading cookies from a file, the filename argument of LWPCookieJar() can be omitted
cookie = http.cookiejar.LWPCookieJar()
# Read the cookies from the local file; this assumes they were saved to cookie_lwp.txt with LWPCookieJar() beforehand
cookie.load(filename='cookie_lwp.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookiejar=cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
print(response.read().decode('utf-8'))
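The save and load steps can be tried end to end without contacting any server. This is a sketch under stated assumptions: make_cookie is a hypothetical helper, the cookie name, value, and domain are made up, and the file is written to a temporary directory. It also shows what ignore_discard changes when loading.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain='.example.com'):
    """Build a session cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie_lwp.txt')

# Save: a session cookie is only written when ignore_discard=True
saved = http.cookiejar.LWPCookieJar(path)
saved.set_cookie(make_cookie('session', 'abc123'))
saved.save(ignore_discard=True, ignore_expires=True)

# Load with the same flags: the cookie comes back
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
values = {c.name: c.value for c in loaded}

# Load with default flags: session cookies are skipped
strict = http.cookiejar.LWPCookieJar()
strict.load(path)

print(values)       # {'session': 'abc123'}
print(len(strict))  # 0
```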
