Python Crawlers: Getting to Know urllib/urllib2 and reque

Author: 一叶扁舟丶 | Published: 2020-01-18 00:33

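urllib and urllib2

urllib and urllib2 are built into Python 2. For HTTP requests, urllib2 does most of the work and urllib plays a supporting role (for example, urlencode). In Python 3 both were reorganized into the urllib package (urllib.request, urllib.parse, urllib.error).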

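Building a request/response model

import urllib2
strUrl = "http://www.baidu.com"
# urlopen sends the request and returns a file-like response object
response = urllib2.urlopen(strUrl)
# read() returns the response body
print response.read()

Output (excerpt):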

<div class="s_tab" id="s_tab">
    <b>网页</b><a href="http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'news'})">新闻</a><a href="http://tieba.baidu.com/f?kw=&fr=wwwt" wdfield="kw"  onmousedown="return c({'fm':'tab','tab':'tieba'})">贴吧</a><a href="http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'zhidao'})">知道</a><a href="http://music.baidu.com/search?fr=ps&ie=utf-8&key=" wdfield="key"  onmousedown="return c({'fm':'tab','tab':'music'})">音乐</a><a href="http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'pic'})">图片</a><a href="http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'video'})">视频</a><a href="http://map.baidu.com/m?word=&fr=ps01000" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'map'})">地图</a><a href="http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8" wdfield="word"  onmousedown="return c({'fm':'tab','tab':'wenku'})">文库</a><a href="//www.baidu.com/more/"  onmousedown="return c({'fm':'tab','tab':'more'})">更多»</a>
</div>

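This retrieves the content of the entire page.

Notes:

urlopen(strUrl, data, timeout)

The first parameter, the URL, is required. The second, data, is the data to send when accessing the URL; when it is given, the request is sent as a POST. The third, timeout, sets the timeout in seconds. The last two parameters are optional.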

Sending data with GET and POST

GET and POST are the two most common ways to send data. In most cases, knowing these two is enough.

Sending data with GET

import urllib2
import urllib
values = {}
values['username'] = '136xxxx0839'
values['password'] = '123xxx'
data = urllib.urlencode(values)  # urlencode turns the dict into a query string
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001'
# The URL already contains a query string, so append with '&' rather than a second '?'
getUrl = url + '&' + data
request = urllib2.Request(getUrl)
response = urllib2.urlopen(request)
# print response.read()
print getUrl

Output:

https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001&username=136xxxx0839&password=123xxx
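
Joining the query string by hand needs care when the base URL already carries parameters. The following is a minimal sketch of a more defensive approach, written for Python 3 (where urlencode lives in urllib.parse). The helper name add_query_params and the shortened URL are assumptions made for illustration, not part of the original article.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def add_query_params(url, params):
    """Return url with params merged into its query string (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep existing pairs, including ones with empty values such as "alias=".
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened version of the login URL, used only for illustration.
base = 'https://accounts.douban.com/login?alias=&source=index_nav'
full = add_query_params(base, {'username': '136xxxx0839', 'password': '123xxx'})
print(full)
```

Because the URL is split and rebuilt, the same helper works whether or not the base URL already has a query string.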

Sending data with POST

import urllib
import urllib2
values = {}
values['username'] = '136xxxx0839'
values['password'] = '123xxx'
data = urllib.urlencode(values)
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001'
# Passing data as the second argument makes this a POST request
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

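Differences between the two methods:

With GET, the encoded data is appended to the URL as a query string. With POST, the encoded data is passed as the second argument of urllib2.Request(url, data) and travels in the request body.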
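
The difference can be inspected without sending anything over the network. This is a small sketch for Python 3, where urllib2.Request became urllib.request.Request and the POST body has to be bytes. The example.com URL is a placeholder, not one used in the article.

```python
from urllib.parse import urlencode
from urllib.request import Request

values = {'username': '136xxxx0839', 'password': '123xxx'}
encoded = urlencode(values)
url = 'http://www.example.com/login'  # placeholder URL, not from the article

# GET: the data travels in the URL and the request has no body.
get_request = Request(url + '?' + encoded)

# POST: the data travels in the body. Python 3 requires bytes here.
post_request = Request(url, data=encoded.encode('ascii'))

for req in (get_request, post_request):
    print(req.get_method(), req.full_url, req.data)
```

Nothing is sent until one of these objects is passed to urlopen, so this is a cheap way to check what a request will look like.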

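Handling request headers

Servers often check whether a request comes from a browser, so the request headers need to present the crawler as one. It is generally best to do this on every request to avoid errors such as access being denied. This browser check is one of the anti-crawler strategies that sites use.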

import urllib2
# The header value must be a plain string; wrapping it in {} would create a set
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.qq.com/'
request = urllib2.Request(url, headers=header)
response = urllib2.urlopen(request)
# The body must be decoded with the page's charset; check which charset the page declares first
print response.read().decode('gbk')
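
Rather than hard-coding the charset, it can usually be read from the Content-Type response header or from a meta tag in the page. The sketch below (Python 3, standard library only) shows one way to do that. The helper name detect_charset, the utf-8 default, and the sample page are assumptions for illustration.

```python
import re
from email.message import Message


def detect_charset(content_type, body, default='utf-8'):
    """Pick a charset from the Content-Type header, then from a <meta> tag (hypothetical helper)."""
    header = Message()
    header['Content-Type'] = content_type
    charset = header.get_content_charset()
    if charset:
        return charset
    # Fall back to a declaration inside the first bytes of the HTML itself.
    match = re.search(rb'charset=["\']?([\w-]+)', body[:2048])
    return match.group(1).decode('ascii').lower() if match else default


# Sample page standing in for a real response body.
raw = '<html><head><meta charset="gbk"></head><body>腾讯</body></html>'.encode('gbk')
charset = detect_charset('text/html', raw)
print(charset, raw.decode(charset))
```

With a real response, the first argument would be the value of the Content-Type response header and the second the bytes returned by read().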

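Open www.qq.com in a browser and press F12 to see the User-Agent that the browser sends. Two headers deserve particular attention:
User-Agent: some servers or proxies use this value to decide whether the request came from a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how to parse the HTTP body. Common values:
application/xml: used for XML RPC calls, such as RESTful/SOAP calls
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When using a RESTful or SOAP service provided by a server, a wrong Content-Type can cause the server to reject the request.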
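
As an illustration of setting both headers together, here is a minimal Python 3 sketch that builds (but does not send) a JSON request. The endpoint URL and the payload are placeholders chosen for the example.

```python
import json
from urllib.request import Request

payload = {'user': 'aaa', 'id': '123'}  # illustrative data
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36',
    'Content-Type': 'application/json',
}
# Placeholder endpoint; nothing is sent until the request is passed to urlopen().
request = Request('http://www.example.com/api',
                  data=json.dumps(payload).encode('utf-8'),
                  headers=headers)

# urllib stores header names as 'User-agent' / 'Content-type' internally.
for name, value in request.header_items():
    print(name, ':', value)
```

The body is serialized as JSON to match the declared Content-Type. Sending form-encoded data under an application/json header is the kind of mismatch a server may reject.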

requests

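requests is the most widely used HTTP library for Python, and it is very simple to use. It has to be installed first, for example with pip install requests or through PyCharm's package installer.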

Response and encoding

import requests
url = 'http://www.baidu.com'
r = requests.get(url)
print type(r)
print r.status_code
print r.encoding
#print r.content
print r.cookies

Output:

<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

GET requests

values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
# params are encoded into the query string
r = requests.get(url, params=values)
print r.url

Output:

http://www.baidu.com/?user=aaa&id=123

POST requests

values = {'user':'aaa','id':'123'}
url = 'http://www.baidu.com'
# data is form-encoded and sent in the request body, so the URL stays unchanged
r = requests.post(url, data=values)
print r.url
#print r.text

Output:

http://www.baidu.com/
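
To see where the values end up in each case without contacting the server, the request can be prepared instead of sent. This is a small sketch using requests.Request and its prepare() method, written in Python 3 syntax and assuming requests is installed.

```python
import requests

values = {'user': 'aaa', 'id': '123'}
url = 'http://www.baidu.com'

# Preparing a request builds the exact URL, headers and body without sending anything.
get_prepared = requests.Request('GET', url, params=values).prepare()
post_prepared = requests.Request('POST', url, data=values).prepare()

print(get_prepared.url)    # parameters end up in the query string
print(post_prepared.url)   # URL carries no parameters
print(post_prepared.body)  # parameters end up in the form-encoded body
print(post_prepared.headers['Content-Type'])
```

The POST request gets the application/x-www-form-urlencoded content type automatically because data was given as a dict.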

Handling request headers

# The header value must be a plain string; wrapping it in {} would create a set
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400 QQBrowser/9.7.12661.400'
header = {'User-Agent': user_agent}
url = 'http://www.baidu.com/'
r = requests.get(url, headers=header)
print r.content

Handling status codes and response headers

url = 'http://www.baidu.com'
r = requests.get(url)
if r.status_code == requests.codes.ok:
    print r.status_code
    print r.headers
    print r.headers.get('content-type')  # .get() is the recommended way to read a header field
else:
    # Raise an exception for 4xx/5xx responses
    r.raise_for_status()

Output:

200
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:57 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Wed, 17 Jan 2018 07:21:21 GMT', 'Content-Type': 'text/html'}
text/html

Handling cookies

url = 'https://www.zhihu.com/'
r = requests.get(url)
print r.cookies
print r.cookies.keys()

Output:

<RequestsCookieJar[<Cookie aliyungf_tc=AQAAACYMglZy2QsAEnaG2yYR0vrtlxfz for www.zhihu.com/>]>
['aliyungf_tc']

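Redirects and history

Handling redirects only requires setting the allow_redirects parameter: True allows redirects and False disables them. Any redirects that were followed are recorded in r.history.

url = 'http://www.baidu.com'
r = requests.get(url, allow_redirects=True)
print r.url
print r.status_code
print r.history  # empty list when no redirect happened

Output: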

http://www.baidu.com/
200
[]

Timeouts

The timeout is set with the timeout parameter, in seconds.

url = 'http://www.baidu.com'
r = requests.get(url, timeout=2)

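Proxies

# Keys are URL schemes; a dict keeps only one value per key, so list each scheme once.
# The addresses below are placeholders; replace them with real proxy servers.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
url = 'http://www.baidu.com'
r = requests.get(url, proxies=proxies)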
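
A Python dict keeps only one value per key, so a proxies mapping cannot list several proxies under 'http'. One way to rotate among several proxies is to pick one per request. The sketch below is written under the assumption of a hypothetical pool whose hosts are placeholders, and the helper name pick_proxies is likewise invented for illustration.

```python
import random

# Writing the same key three times leaves only the last entry in the dict.
duplicated = {
    'http': 'http://proxy-a.example:8080',
    'http': 'http://proxy-b.example:8080',
    'http': 'http://proxy-c.example:8080',
}
print(len(duplicated), duplicated['http'])

# Hypothetical proxy pool; the hosts are placeholders.
PROXY_POOL = [
    'http://proxy-a.example:8080',
    'http://proxy-b.example:8080',
    'http://proxy-c.example:8080',
]


def pick_proxies(pool=PROXY_POOL):
    """Build a proxies mapping (scheme -> proxy URL) from one randomly chosen proxy."""
    proxy = random.choice(pool)
    return {'http': proxy, 'https': proxy}


proxies = pick_proxies()
print(proxies)  # pass as requests.get(url, proxies=proxies)
```

Calling pick_proxies() before each request spreads traffic across the pool while keeping the mapping in the scheme-to-proxy shape that requests expects.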

Article link: https://www.haomeiwen.com/subject/aoohzctx.html