urllib is Python's built-in HTTP request library.
1) urllib.request: the request module
urllib.request.urlopen(url,data=None,[timeout,]*,cafile=None,capath=None,cadefault=False,context=None)
# Parameters:
# url: the URL to request
# data: the request body for a POST request
# timeout: timeout in seconds
# the remaining parameters are options for CA (certificate) verification
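The CA parameters configure HTTPS certificate verification; a minimal sketch using the context parameter (cafile/capath are deprecated in favor of an ssl.SSLContext since Python 3.6):
import ssl
import urllib.request

context = ssl.create_default_context()   # verify against the system CA bundle
response = urllib.request.urlopen('https://httpbin.org/get', context=context)
print(response.status)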
Examples:
a) Fetching with GET:
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
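Besides read(), the returned http.client.HTTPResponse object also carries the status code and headers; a small illustration:
print(response.status)                # e.g. 200
print(response.getheaders())          # all headers as (name, value) pairs
print(response.getheader('Server'))   # a single header by name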
b) Fetching with POST:
1. Build the form data first.
2. Use bytes together with urllib.parse.urlencode to produce data that urlopen accepts.
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())
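httpbin.org/post echoes the submitted form back as JSON, so the response can be decoded instead of printed as raw bytes; a small variant of the example above:
import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
body = json.loads(response.read().decode('utf-8'))
print(body['form'])   # {'word': 'hello'}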
c) Using the timeout option:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
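A process-wide default timeout can also be set through the socket module, as an alternative to passing timeout on every call:
import socket
socket.setdefaulttimeout(1)   # seconds; applies to all sockets created afterwards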
d) Separating the URL from the (form) data
Wrap the request with request.Request():
from urllib import request,parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org'
}
form = {   # renamed from `dict` to avoid shadowing the built-in
    'name': 'Kim'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)   # `urllib` itself was never imported; use the request submodule
print(response.read().decode('utf-8'))
e) The add_header() method
from urllib import request,parse
url = 'http://httpbin.org/post'
form = {   # renamed from `dict` to avoid shadowing the built-in
    'name': 'Kim'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
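Headers attached with add_header() can be read back from the Request object itself (get_header() and header_items() are part of the standard Request API; note that urllib stores the key capitalized, i.e. as 'User-agent'):
print(req.get_header('User-agent'))
print(req.header_items())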
Advanced operations with Handlers: build an opener with urllib.request.build_opener(handler), then send requests through it.
a) Proxy setup: when a target site limits access from a single IP, a proxy can work around the restriction:
1. Wrap the proxy settings with urllib.request.ProxyHandler().
2. Pass the proxy_handler to urllib.request.build_opener() to create an opener.
3. Open the target URL with opener.open().
import socket
import urllib.error
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://180.125.137.126:8000',
    'https': 'http://106.112.169.216:808'
})
opener = urllib.request.build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
else:
    print(response.read().decode('utf-8'))   # only read if the request succeeded
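If the opener should apply to every plain urlopen() call, it can also be installed globally with urllib.request.install_opener(); a minimal sketch:
import urllib.request

proxy_handler = urllib.request.ProxyHandler({})   # an empty mapping means: use no proxy
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)             # urlopen() now routes through this opener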
b) Working with cookies: needed for pages that are only visible after logging in.
Use the http.cookiejar module:
1. http.cookiejar.MozillaCookieJar() creates a cookie jar in the Mozilla/Firefox file format.
2. urllib.request.HTTPCookieProcessor() builds the handler.
3. urllib.request.build_opener() creates the opener.
4. Call opener.open().
import http.cookiejar,urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
-------------------------------------------------------------------------------------
Saving cookies to a file (cookie.save):
import http.cookiejar,urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)   # keep even session/expired cookies
-------------------------------------------------------------------------------------
Loading cookies from the file and attaching them to the request (cookie.load):
import http.cookiejar,urllib.request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
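http.cookiejar also provides LWPCookieJar, which saves cookies in the libwww-perl format; usage mirrors MozillaCookieJar (a small sketch; the filename is illustrative):
import http.cookiejar,urllib.request

cookie = http.cookiejar.LWPCookieJar('cookie_lwp.txt')   # illustrative filename
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)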
2) urllib.error: catch errors so the crawler stays robust.
* Catch HTTPError before URLError (HTTPError is a subclass of URLError).
from urllib import request,error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)
------------------------------------------------------------------------------
Standard pattern:
from urllib import request,error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
3) urllib.parse: the URL parsing module (splitting URLs)
from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)
Output: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
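ParseResult is a named tuple, so the components can be read by attribute or by index; a small illustration based on the result above:
print(result.scheme)   # 'https'
print(result.path)     # 'www.baidu.com/index.html'
print(result[4])       # 'id=5' (the query component)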
urlunparse:
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user','a=6', 'comment']
print(urlunparse(data))
Output: http://www.baidu.com/index.html;user?a=6#comment
urljoin:
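urljoin resolves a (possibly relative) link against a base URL; a minimal sketch with illustrative URLs:
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
# -> http://www.baidu.com/FAQ.html
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/index.php'))
# -> https://cuiqingcai.com/index.php (an absolute second argument wins)
urlencode: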
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url+urlencode(params)
print(url)
Output: http://www.baidu.com?name=germey&age=22
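urllib.parse also provides quote/unquote for percent-encoding a single URL component; a small sketch:
from urllib.parse import quote, unquote

url = 'https://www.baidu.com/s?wd=' + quote('壁纸')
print(url)           # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
print(unquote(url))  # https://www.baidu.com/s?wd=壁纸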