Learning Python's urllib Library

Author: 见字如晤一 | Published: 2019-03-08 16:22


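Python has many libraries for making network requests. Commonly used HTTP libraries include urllib (part of the standard library) and the third-party httplib2, requests, and treq.
Here we start by learning how to use urllib.
See also the English-language tutorial on making requests with urllib.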

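urllib has four modules: request, error, parse, and robotparser.
1. request: the basic HTTP request module.
2. error: the exception handling module; you can catch these exceptions and take other action.
3. parse: a utility module that provides many URL handling functions, such as splitting, parsing, and joining.
4. robotparser: mainly used to read a site's robots.txt file and decide which URLs may be crawled and which may not; it is used less often.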
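To make the parse and robotparser modules a little more concrete, here is a minimal sketch that runs without network access. The example.com URLs and the robots.txt rules are made up for illustration; they are not from any real site.

```python
from urllib.parse import urlparse, urljoin, urlencode
from urllib.robotparser import RobotFileParser

# parse: split a URL into its components
parts = urlparse("https://www.python.org/downloads/?lang=en#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# parse: resolve a relative link against a base URL
about_url = urljoin("https://www.python.org/downloads/", "../about/")
print(about_url)

# parse: turn a dict into a query string
query = urlencode({"name": "Germey", "page": 2})
print(query)

# robotparser: feed the rules in directly instead of downloading robots.txt
# (hypothetical rules, for illustration only)
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
can_index = rp.can_fetch("*", "https://example.com/index.html")
can_private = rp.can_fetch("*", "https://example.com/private/data.html")
print(can_index, can_private)
```

In a real crawler you would more likely call rp.set_url(...) and rp.read() to fetch the site's actual robots.txt before calling can_fetch().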

I. Sending requests:
1. urlopen()

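import urllib.request
# Send a GET request and inspect the response
response = urllib.request.urlopen("https://www.python.org")
print(response.getheaders())  # list of (name, value) header tuples
print(response.read())        # response body as bytes
print(type(response))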

The printed output is not shown here. The point to note is that type(response) outputs:

<class 'http.client.HTTPResponse'>

As you can see, it is an HTTPResponse object. Its main methods include read(), readinto(), getheader(name), getheaders(), and fileno(), and its attributes include msg, version, status, reason, debuglevel, and closed.
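Since read() returns bytes, a common next step is to decode the body into text and to make sure the response gets closed. The sketch below is one way to do that; read_text is a hypothetical helper name, and falling back to UTF-8 when the server does not declare a charset is an assumption.

```python
import urllib.request

def read_text(url, timeout=10):
    """Open url and return the response body decoded to str."""
    # The with-statement closes the response even if reading fails.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # response.headers behaves like an email.message.Message; the charset
        # is taken from Content-Type, with UTF-8 assumed when it is absent.
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read()  # bytes
        return body.decode(charset)

# Example call (needs network access):
# text = read_text("https://www.python.org")
# print(text[:60])
```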

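The example above is a GET request, so it carries no parameters; it simply opens the URL and fetches the data.
So how do we send data along with the request?
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
urlopen API documentation
Here data is the request payload and is optional. Let's look at an example first: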

1.1 The data parameter

# urlopen request with the data parameter
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())

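Key point: the data parameter must be of type bytes. If the request data is a dict, first encode it with urllib.parse.urlencode() and then convert the result to bytes. When data is supplied, the request is sent as POST rather than GET.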
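The dict-to-bytes conversion can be checked on its own, without sending anything. This small sketch reuses the {'world': 'hello'} payload from the example above; parse_qs is used only to show that the encoding is reversible.

```python
from urllib.parse import urlencode, parse_qs

form = {"world": "hello"}
query = urlencode(form)        # str query string
data = query.encode("utf-8")   # bytes, the type urlopen() expects for data
print(type(query), type(data), data)

# Decoding the bytes again recovers the original fields
decoded = parse_qs(data.decode("utf-8"))
print(decoded)
```

Calling .encode("utf-8") on the string is equivalent to the bytes(..., encoding='utf8') form used above.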

1.2 The timeout parameter

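# urlopen with the timeout parameter; timeout is in seconds.
# A very small value such as timeout=0.1 will most likely time out and raise an error.
import urllib.request
response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
print(response.read())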
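When the timeout expires, urlopen raises an exception, so in practice the call is usually wrapped in try/except using the error module. The following is a minimal sketch; fetch_with_timeout is a hypothetical helper name, and returning None on failure is just one possible design.

```python
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout=1.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # Connection problems, including a timeout while connecting,
        # arrive wrapped in URLError; the cause is in exc.reason.
        if isinstance(exc.reason, socket.timeout):
            print("timed out while connecting:", url)
        else:
            print("request failed:", exc.reason)
    except socket.timeout:
        # A timeout while waiting for or reading the response
        # can be raised directly, without the URLError wrapper.
        print("timed out while reading:", url)
    return None

# Example call (needs network access):
# body = fetch_with_timeout("http://httpbin.org/get", timeout=0.1)
```

Because HTTPError is a subclass of URLError, the first except clause also catches HTTP error statuses such as 404.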

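2. Request
urlopen() can send a request, but the few simple parameters above are not enough for a complete request in real-world development. Requests usually need header information, which can be built with the more powerful Request class: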

# urlopen with a Request object
import urllib.request
request = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(request)
print(response.read())

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
Request API documentation
The first parameter, url, is required; the rest are optional. Below we build a request with fuller headers:

# urlopen with a Request that carries custom headers
import urllib.request
import urllib.parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible;MSIE 5.5;Window NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict to avoid shadowing the built-in type
form_data = {
    'name': 'Germey'
}
data = bytes(urllib.parse.urlencode(form_data), encoding='utf8')
req = urllib.request.Request(url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0 (compatible;MSIE 5.5;Window NT)"
  }, 
  "json": null, 
  "origin": "111.121.67.248, 111.121.67.248", 
  "url": "https://httpbin.org/post"
}
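A Request object can also be inspected before it is sent, which helps explain the headers echoed back in the output above. This is a small offline sketch using the same URL and form data; adding the header with add_header() instead of the headers argument is just an alternative shown for illustration.

```python
import urllib.parse
import urllib.request

form_data = {"name": "Germey"}
data = urllib.parse.urlencode(form_data).encode("utf-8")

req = urllib.request.Request("http://httpbin.org/post", data=data)
# Headers can also be added after construction.
req.add_header("User-Agent", "Mozilla/4.0 (compatible;MSIE 5.5;Window NT)")

# Nothing has been sent yet; these calls only inspect the object.
print(req.get_method())    # the method is inferred as POST because data is set
print(req.full_url)
print(req.header_items())  # header names are stored capitalized, e.g. 'User-agent'
print(len(req.data))       # body length, which is what Content-Length reports
```

When method is not passed explicitly, get_method() returns GET for a request without data and POST for one with data.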

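3. Advanced usage
The sections above cover building requests, but there are also more advanced operations, such as cookie handling and proxy settings.
These are covered in the next chapter:
urllib: obtaining and reloading cookies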