7. Python Standard Library: the urllib Library

Author: 橄榄的世界 | Published: 2018-03-24 11:12


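urllib is Python's standard package for working with URLs. In Python 2.x it was split into urllib and urllib2; in Python 3.x the two were merged into a single urllib package. The table below lists the most common renames, for quick reference when porting code.
Official documentation: https://docs.python.org/3.6/library/urllib.html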

Python 2.x	Python 3.x
import urllib2	import urllib.request, urllib.error
import urllib	import urllib.request, urllib.error, urllib.parse
import urlparse	import urllib.parse
urllib2.urlopen	urllib.request.urlopen
urllib2.Request	urllib.request.Request
urllib.quote	urllib.parse.quote
urllib.urlencode	urllib.parse.urlencode
cookielib.CookieJar	http.cookiejar.CookieJar
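
To make the table concrete, here is a minimal sketch of one common way to write imports that work on both versions: try the Python 3 locations first and fall back to the Python 2 names. It is only an illustration of the renames listed above, not a complete porting strategy, and the string passed to quote is made up.

```python
# Try the Python 3 locations first, then fall back to the Python 2 names.
try:
    from urllib.request import urlopen, Request
    from urllib.parse import quote, urlencode
    from urllib.error import URLError, HTTPError
    from http.cookiejar import CookieJar
except ImportError:  # Python 2.x
    from urllib2 import urlopen, Request, URLError, HTTPError
    from urllib import quote, urlencode
    from cookielib import CookieJar

# The rest of the script can now use the same names on either version.
encoded = quote("python urllib")
print(encoded)
```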

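1. Reading a web page: with urllib you have to choose how to decode the response body yourself, whereas requests can decode it automatically.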

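import urllib.request
f = urllib.request.urlopen('http://python.org/')
html1 = f.read()                 # urlopen returns a response object; read() returns bytes
html2 = html1.decode('utf-8')    # decode the bytes to get a str; this is the form you usually want
# Note: read() consumes the body, so calling f.read() a second time would return b''.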

#The same thing written with a with statement:
import urllib.request
with urllib.request.urlopen('http://python.org') as f:
    html = f.read().decode('utf-8')
    print(f.status)
    print(html)

#html is equivalent to r.text in the requests library:
import requests
r = requests.get('http://python.org')
print(r.status_code)
print(r.text)        # r.text is decoded automatically, using the encoding requests guesses
print(r.encoding)    # the encoding the Response object is using
r.encoding = 'utf-8'  # assign to r.encoding to change how r.text is decoded
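
Hard-coding 'utf-8' fails on pages served in another encoding. As a rough sketch, the charset can be read from the Content-Type header instead, falling back to a default. The helper name decode_body and the sample header value are assumptions made for illustration; a plain email.message.Message stands in for the response headers so the example runs offline.

```python
from email.message import Message

def decode_body(raw, headers, default="utf-8"):
    """Decode bytes using the charset from Content-Type, else a default."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors="replace")

# Offline stand-in for a response's headers (hypothetical values).
headers = Message()
headers["Content-Type"] = "text/html; charset=gbk"
text = decode_body("中文".encode("gbk"), headers)
print(text)

# With a real response it would look like:
#   with urllib.request.urlopen(url) as f:
#       html = decode_body(f.read(), f.headers)
```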

2. Percent-encoding a URL that contains Chinese characters by hand with urllib

import urllib.parse
a = urllib.parse.quote("中文")
b = urllib.parse.unquote(a)
print(a,b)

Output: %E4%B8%AD%E6%96%87 中文
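
Two details of quote are easy to miss: by default it leaves "/" unencoded, and its sibling quote_plus turns spaces into "+" for form-style data. The short sketch below shows both; the sample strings are made up.

```python
from urllib.parse import quote, quote_plus, unquote

path = "/中文/a b"
kept_slash = quote(path)              # '/' is left alone by default (safe="/")
all_encoded = quote(path, safe="")    # encode '/' as well
form_style = quote_plus("a b&c")      # form-style: space becomes '+'

print(kept_slash)
print(all_encoded)
print(form_style)
print(unquote(kept_slash))            # back to the original string
```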

3. Sending a request with custom headers via a Request object

import urllib.request
hds = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
req = urllib.request.Request('http://python.org')
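req.add_header('User-Agent','Mozilla/5.0')  # add_header takes the header name and value as two separate arguments
#req.add_header('User-Agent',hds['User-Agent'])  # alternative: reuse the value from the dict above
with urllib.request.urlopen(req) as f:    # urlopen accepts either a URL string or a Request object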
    html = f.read().decode('utf-8')

#With requests
import requests
hds = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
r = requests.get('http://python.org', headers=hds)
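
With urllib, a whole dict of headers can also be passed to the Request constructor instead of calling add_header once per header. The sketch below only builds the Request and inspects it, so nothing is sent; the header values are placeholders.

```python
import urllib.request

hds = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en"}
req = urllib.request.Request("http://python.org", headers=hds)

# Request stores header names capitalized ("User-agent"), so look them up that way.
print(req.get_header("User-agent"))
print(req.header_items())
```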

4. Setting a timeout

import urllib.request
#Just pass the timeout parameter (in seconds)
f = urllib.request.urlopen(req,timeout=1)     # req is a Request object such as the one built in section 3
f = urllib.request.urlopen('http://python.org',timeout=1)

#Full example (1 second is enough for a normal response; raise the timeout if the server is slow)
import urllib.request
for i in range(10):   # try up to 10 times, stopping at the first success
    try:
        f = urllib.request.urlopen('http://python.org',timeout=1)
        print(f.read().decode('utf-8')[:100])
        break
    except Exception as e:
        print("Request failed: "+str(e))
        # print(type(e))

#The requests version is similar
import requests
for i in range(10):   # try up to 10 times, stopping at the first success
    try:
        r = requests.get('http://python.org',timeout=0.25)   # requests responded faster than urllib.request here
        print(r.text[:100])
        break
    except Exception as e:
        print("Request {} failed: ".format(i+1)+str(e))
        print(type(e))
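
The retry loops above can be folded into a small reusable function. This is a sketch under stated assumptions: the name fetch_with_retry, its parameters, and the injectable opener argument are all hypothetical, added so the logic can be exercised without a network. It catches OSError because both urllib.error.URLError and socket timeouts are subclasses of it.

```python
import time
import urllib.request

def fetch_with_retry(url, attempts=3, timeout=1, delay=0.0,
                     opener=urllib.request.urlopen):
    """Return the response body as bytes, retrying on network errors."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as e:          # covers URLError and socket timeouts
            last_error = e
            print("attempt {} failed: {}".format(attempt, e))
            time.sleep(delay)         # optional pause before the next try
    raise last_error

# Usage (needs network access):
#   body = fetch_with_retry("http://python.org", attempts=10, timeout=1)
```

Unlike a bare `except Exception`, this lets programming errors such as a misspelled name surface immediately instead of being retried.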

5. Downloading an HTML file to disk
Images, MP3s, videos and other binary files are saved the same way, in 'wb' mode.

#Method 1:
import urllib.request

html = urllib.request.urlopen("http://www.baidu.com").read()
with open("1.html","wb") as f:     # opened in binary mode, so the bytes in html are written without decoding
    f.write(html)


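#Method 2: the most convenient
#urlretrieve(url, filename=None, reporthook=None, data=None)
#reporthook (optional) is a callback that can be used to show download progress.
#data (optional) is the data to POST to the server.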

import urllib.request
urllib.request.urlretrieve("http://www.baidu.com",filename="1.html")
#urllib.request.urlretrieve("http://www.baidu.com","1.html") 


#Method 3:
import requests

r = requests.get("http://www.baidu.com")
with open("1.html",'wb') as f:
    f.write(r.content)

# Other file types:
urllib.request.urlretrieve("XXX.jpg",filename="1.jpg")      # XXX stands for the server address
urllib.request.urlretrieve("XXX.mp3",filename="1.mp3")
urllib.request.urlretrieve("XXX.rmvb",filename="1.rmvb")
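
The reporthook parameter mentioned under Method 2 receives three arguments on each call: the number of blocks transferred so far, the block size in bytes, and the total size (which may be -1 if unknown). A minimal sketch of a progress callback follows; the function names percent_done and show_progress are made up for this example.

```python
import urllib.request

def percent_done(block_num, block_size, total_size):
    """Return progress as a percentage, or None if the total size is unknown."""
    if total_size <= 0:          # the server did not report a size
        return None
    return min(100.0, block_num * block_size * 100.0 / total_size)

def show_progress(block_num, block_size, total_size):
    """reporthook callback: urlretrieve calls it with these three arguments."""
    pct = percent_done(block_num, block_size, total_size)
    label = "unknown size" if pct is None else "{:.1f}%".format(pct)
    print("downloaded: " + label)

# Usage (needs network access):
#   urllib.request.urlretrieve("http://www.baidu.com", "1.html", reporthook=show_progress)
```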

6. GET request example
A GET URL has the form: http://host?field1=value1&field2=value2
http://www.baidu.com/s?wd="python"&rqlang=cn # wd is the search keyword, rqlang is the region

import urllib.request
import urllib.parse

base_url = "http://www.baidu.com/s?wd="
keyword = "Python爬虫"
url = base_url + urllib.parse.quote(keyword)   # percent-encode the non-ASCII keyword
html = urllib.request.urlopen(url).read()
with open("1.html","wb") as f:
    f.write(html)

#With requests
import requests

base_url = "http://www.baidu.com/s?wd="
keyword = "Python爬虫"
url = base_url + keyword     # requests percent-encodes the non-ASCII characters in the URL for you
r = requests.get(url)
#print(r.url)                # shows the encoded URL
with open("2.html","wb") as f:
    f.write(r.content)
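
When a GET request carries more than one field, building the query string with urllib.parse.urlencode is usually less error-prone than concatenating quoted pieces by hand. The sketch below reuses the wd and rqlang field names from the example URL in this section and only builds the URL, without sending anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://www.baidu.com/s"
params = {"wd": "Python爬虫", "rqlang": "cn"}   # field names from the example URL
url = base_url + "?" + urlencode(params)
print(url)

# Round-trip check: parse the query string back into a dict of lists.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```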

7. Using a proxy: urllib.request.ProxyHandler

import urllib.request  
 
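# Proxy dictionaries: each key is a URL scheme, each value is the proxy address
proxy1={'sock5': 'localhost:1080'}   # urllib has no built-in SOCKS support, so this entry is not applied to http:// requests
proxy2={'http': '183.51.191.203:9797'}
# Create a handler object with ProxyHandler
proxy_handler = urllib.request.ProxyHandler(proxy2)
# Build an opener that goes through the proxy
opener = urllib.request.build_opener(proxy_handler)
# Install it as the global default opener; urlopen() will use it from now on
urllib.request.install_opener(opener)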
  
a = urllib.request.urlopen("http://www.baidu.com").read().decode("utf8")  
print(len(a))
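
install_opener changes the behaviour of every later urlopen() call in the process. If only some requests should go through the proxy, one option is to skip the install step and call the opener's own open() method. The sketch below builds such an opener without sending anything; the proxy address is a placeholder.

```python
import urllib.request

proxies = {"http": "127.0.0.1:8080"}   # hypothetical proxy address
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# Only requests made through this opener go via the proxy:
#   html = opener.open("http://www.baidu.com", timeout=5).read()
uses_proxy = any(isinstance(h, urllib.request.ProxyHandler) for h in opener.handlers)
print(uses_proxy)
```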

8. Turning on debug logging: urllib.request.HTTPHandler, urllib.request.HTTPSHandler

import urllib.request

http_handler = urllib.request.HTTPHandler(debuglevel=1)
https_handler = urllib.request.HTTPSHandler(debuglevel=1)
opener = urllib.request.build_opener(http_handler,https_handler)
urllib.request.install_opener(opener)
urllib.request.urlopen("https://www.baidu.com")

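9. Exception handling: URLError and its subclass HTTPError

A URLError can be raised for any of four reasons:
① the server cannot be reached
② the remote URL does not exist
③ there is no network connection
④ an HTTPError is raised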

#Style 1:
import urllib.request
import urllib.error

try:
    # urllib.request.urlopen("http://www.google.com")       # example of a URLError
    urllib.request.urlopen("https://login.taobao.com/member")   # example of an HTTPError
except urllib.error.HTTPError as e:    # catch the subclass first, otherwise the URLError clause would handle it
    print(e.code,e.reason)
except urllib.error.URLError as e:
    print(e.reason)

#Style 2:
import urllib.request
import urllib.error

try:
    #urllib.request.urlopen("http://www.google.com")
    urllib.request.urlopen("https://login.taobao.com/member")
except urllib.error.URLError as e:
    if hasattr(e,"code"):        # hasattr is a built-in function; see below.
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)

'''
hasattr(obj, name, /)
    Return whether the object has an attribute with the given name.
    
    This is done by calling getattr(obj, name) and catching AttributeError.
'''
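
Both styles rely on HTTPError being a subclass of URLError. The sketch below checks that relationship offline by constructing an HTTPError by hand; the URL and the 404 values are made up for illustration and no request is sent.

```python
import urllib.error

# Build an HTTPError by hand so the example runs without a network.
err = urllib.error.HTTPError("http://example.com/missing", 404, "Not Found", None, None)

is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)
print(is_subclass)

try:
    raise err
except urllib.error.URLError as e:      # this clause also catches HTTPError
    print(getattr(e, "code", None), e.reason)
```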

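HTTP status codes and their meanings

Status code (e.code)	Reason phrase (e.reason)	Meaning
200	OK	Success
301	Moved Permanently	Permanent redirect to a new URL
302	Found	Temporary redirect to a new URL
304	Not Modified	The requested resource has not changed
400	Bad Request	Malformed request
401	Unauthorized	Authentication is required
403	Forbidden	Access denied
404	Not Found	Page not found
500	Internal Server Error	Internal server error
501	Not Implemented	The server does not support the functionality needed to fulfil the request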
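
The reason phrases in this table do not need to be memorized or hard-coded: the standard library's http.HTTPStatus enum (Python 3.5+) maps each code to its phrase. A short sketch that prints the same codes:

```python
from http import HTTPStatus

codes = (200, 301, 302, 304, 400, 401, 403, 404, 500, 501)
phrases = {code: HTTPStatus(code).phrase for code in codes}

for code in codes:
    print(code, phrases[code])
```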

10. POST requests

import urllib.request
import urllib.parse

url = "https://www.douban.com/accounts/login"
params = {'source':'index_nav',
          'form_email':'XXXX',     # account
          'form_password':'XXXX'   # password
          }
postdata = urllib.parse.urlencode(params).encode('utf-8')  # encode the form data to bytes
req = urllib.request.Request(url,postdata)
html = urllib.request.urlopen(req).read()
with open('1.html','wb') as f:
    f.write(html)

#With requests
import requests
url = "https://www.douban.com/accounts/login"
params = {'source':'index_nav',
          'form_email':'XXXX',     # account
          'form_password':'XXXX'   # password
          }
r = requests.post(url,data=params)
with open('1.html','wb') as f:
    f.write(r.content)

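#Notes:
urlencode: encodes a dict of key-value pairs and returns a string of the form "a=XXX&b=XXX".
quote: encodes a single string and returns the percent-encoded result; mostly used for Chinese and other non-ASCII characters.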
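
The difference between the two functions is easiest to see side by side. The sketch below uses made-up form fields and a placeholder URL; it also shows that attaching a data body is what switches a Request from GET to POST. Nothing is sent over the network.

```python
import urllib.parse
import urllib.request

params = {"a": "中文", "b": "x y"}            # made-up form fields
query = urllib.parse.urlencode(params)       # whole dict -> "key=value&key=value"
single = urllib.parse.quote("x y")           # one string -> percent-encoded
print(query)
print(single)

# Attaching a body switches the Request from GET to POST; nothing is sent here.
req = urllib.request.Request("http://example.com/form", data=query.encode("utf-8"))
print(req.get_method())
```

Note that urlencode writes a space as "+" (form encoding), while quote writes it as "%20".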

11. Using cookies

import urllib.request
import urllib.parse
import http.cookiejar
url = "https://www.douban.com/accounts/login"
params = {'source':'index_nav',
          'form_email':'XXXX',     # account
          'form_password':'XXXX'   # password
          }
postdata = urllib.parse.urlencode(params).encode('utf-8')  # encode the form data to bytes
req = urllib.request.Request(url, postdata, method="POST")  # build the Request object

#Create a CookieJar object
cj = http.cookiejar.CookieJar()
pro = urllib.request.HTTPCookieProcessor(cj)
opener = urllib.request.build_opener(pro)
# Install it as the global default opener; urlopen() will use it from now on
urllib.request.install_opener(opener)

html1 = urllib.request.urlopen(req).read()
with open('1.html', 'wb') as f:
    f.write(html1)


#With requests
import requests
url = "https://www.douban.com/accounts/login"
headers = {
    'Cookie':'xxxxxxx'
}
r = requests.get(url,headers=headers)
print(r.text)
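
To see which cookies the server actually set, the CookieJar can be iterated after a request. The sketch below wraps the setup in a helper whose name, make_cookie_opener, is made up for this example; it returns the opener together with its jar and does not install anything globally. The request itself is left as a comment because it needs network access.

```python
import http.cookiejar
import urllib.request

def make_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
print(len(jar))    # number of cookies stored; no request has been made yet

# After a request (needs network access), the cookies can be inspected:
#   opener.open("https://www.douban.com/accounts/login")
#   for cookie in jar:
#       print(cookie.name, cookie.value)
```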
