Web Crawlers: Using the urllib Module, Part 1

Author: 牛耀 | Source: published 2018-12-23 14:01

We use Baidu as the example and send a request to it.

# Send a request with urllib
from urllib import request
# Target URL
url = 'http://www.baidu.com/'

# request.urlopen(): send the request and get back a response object
response = request.urlopen(url, timeout=10)
"""
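Parameters of request.urlopen():
url: the target URL of the request
data=None: None by default, which sends a GET request; if it is not None, a POST request is sent
timeout: the request timeout, in seconds
cafile=None: path to a CA certificate file
capath=None: path to a directory of CA certificate files
cadefault=False: ignored
context=None: an ssl.SSLContext object; None means the default context is used, which verifies certificates
(cafile, capath and cadefault are deprecated since Python 3.6 and were removed in Python 3.13; use context instead)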
"""
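# If the request fails with an SSL certificate verification error, the following setting skips verification
import ssl
context = ssl._create_unverified_context()
# Pass it as the context argument when sending the request
response = request.urlopen(url, timeout=10, context=context)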
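The helper used above, ssl._create_unverified_context(), starts with an underscore, so it is a private function. As a minimal sketch, roughly the same context can be built from the public ssl API. The variable names default_ctx and unverified_ctx are chosen here for illustration only.

```python
import ssl

# Default context: the server certificate and host name are both verified.
default_ctx = ssl.create_default_context()

# A context that skips verification, built with public API only.
unverified_ctx = ssl.create_default_context()
# check_hostname has to be switched off before verify_mode can be CERT_NONE,
# otherwise the ssl module raises ValueError.
unverified_ctx.check_hostname = False
unverified_ctx.verify_mode = ssl.CERT_NONE
```

Either object can be passed as context= to request.urlopen(). Skipping verification removes the protection HTTPS gives against a forged server, so it is best kept to testing.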

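# Read values from the response
# Status code
code = response.status
# read() returns the page source as bytes; decode it to str so it is easy to read or store
b_html = response.read().decode('utf-8')
# All response headers, as a list of (name, value) tuples
res_headers = response.getheaders()
# Value of one specific response header
cookie_data = response.getheader('Set-Cookie')
# reason is the reason phrase that goes with the status code, for example 'OK'
reason = response.reason
# Save the page source to a local file
with open('b_baidu.page.html', 'w', encoding='utf-8') as file:
    file.write(b_html)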
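The decoding step above hard-codes utf-8, which fails or garbles text on pages served in another encoding. One possible approach, sketched below, is to read the charset from the Content-Type response header and fall back to utf-8. The helper name decode_body and the demo header object are hypothetical and not part of urllib; the demo runs offline with a hand-built header.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode bytes using the charset named in Content-Type, if there is one."""
    charset = headers.get_content_charset() or default
    # errors='replace' substitutes malformed bytes instead of raising
    return raw.decode(charset, errors='replace')


# Offline demonstration: a hand-built header object stands in for a response
demo_headers = Message()
demo_headers['Content-Type'] = 'text/html; charset=gbk'
demo_text = decode_body('百度'.encode('gbk'), demo_headers)
```

With a real response this would be called as decode_body(response.read(), response.headers), since response.headers is an http.client.HTTPMessage, which is a subclass of email.message.Message.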