Crawler Notes (2): The urllib Library and URLError Exception Handling

Author: WeirdoSu | Published: 2017-12-11 17:55

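What is the urllib library:

It is a module that Python provides for working with URLs. Python 2 and Python 3 differ: Python 2 split the functionality between urllib and urllib2, while Python 3 merges it into a single urllib package (urllib.request, urllib.error, urllib.parse); see #581.

Quick start: fetching a web page with urllib: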

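# Import the package
In [1]: import urllib.request
# Open the URL and fetch the page
In [2]: file = urllib.request.urlopen("http://www.baidu.com")
# Read the whole body. In Python 3, read() returns a bytes object, while readlines() returns a list of bytes lines. read() is the approach used here.
In [3]: data = file.read()
# Read a single line. Note: once read() has consumed the body, readline() returns b''; call it on a fresh response to get the first line.
In [4]: dataline = file.readline()
# Save the page to a local file
In [22]: fhandle = open("desktop/programming/python_work/baidu.html", "wb")

In [23]: fhandle.write(data)
Out[23]: 112074

In [24]: fhandle.close()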
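
The session above closes the file handle by hand and never closes the response. A minimal sketch of the same fetch-and-save step using context managers follows. The function name fetch_to_file and its timeout default are assumptions made for illustration and do not come from the original notes.

```python
import urllib.request


def fetch_to_file(url, path, timeout=10):
    """Download url and write the raw bytes to path; return the byte count."""
    # The with-statement closes the response even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()  # bytes in Python 3, not str
    # "wb" because data is bytes; the file is closed on leaving the block.
    with open(path, "wb") as fh:
        return fh.write(data)
```

Calling fetch_to_file("http://www.baidu.com", "baidu.html") would roughly mirror In [2] through In [24].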

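Other common urllib usage:

# Response headers (info() returns an http.client.HTTPMessage):
In [25]: file.info()
Out[25]: <http.client.HTTPMessage at 0x106984400>
# HTTP status code of the fetched page:
In [26]: file.getcode()
Out[26]: 200
# URL that was actually fetched:
In [27]: file.geturl()
Out[27]: 'http://www.baidu.com'
# Percent-encode characters that are not valid in a URL: urllib.request.quote() (the documented home of this function is urllib.parse.quote())
In [28]: urllib.request.quote("http://www.sina.com.cn")
Out[28]: 'http%3A//www.sina.com.cn'
# Decode: urllib.request.unquote() (documented as urllib.parse.unquote())
In [29]: urllib.request.unquote('http%3A//www.sina.com.cn')
Out[29]: 'http://www.sina.com.cn'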
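
The output of In [28] keeps the slashes because quote() treats "/" as safe by default. The sketch below shows that default next to the safe argument and a non-ASCII value, importing from urllib.parse. The sample strings "a/b c" and "你好" are made up for illustration.

```python
from urllib.parse import quote, unquote

# By default "/" is treated as safe and left alone, so only ":" is escaped.
encoded_url = quote("http://www.sina.com.cn")

# Passing safe="" escapes "/" as well, which is what a single query value usually needs.
encoded_value = quote("a/b c", safe="")

# Non-ASCII text is encoded as UTF-8 first, then percent-escaped.
encoded_cn = quote("你好")

# unquote() reverses the encoding.
round_trip = unquote(encoded_cn)
```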

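Simulating a browser: the Headers attribute

Some sites use anti-crawler checks that reject requests without a browser-like User-Agent, so the crawler has to send one.
There are two ways to do this:

Method 1: modify the headers with build_opener()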

In [32]: headers = ("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

In [33]: opener = urllib.request.build_opener()

In [34]: opener.addheaders = [headers]

# url is assumed to have been set earlier in the session
In [35]: data = opener.open(url).read()

Method 2: add headers with add_header()

In [39]: req = urllib.request.Request(url)

In [40]: req.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

In [41]: res = urllib.request.urlopen(req).read()
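
As a small variation on method 2, Request also accepts a headers dictionary in its constructor. The sketch below only builds the request and inspects it, so nothing is sent. The constant name UA is an assumption, and the URL is the Baidu address used earlier in these notes.

```python
import urllib.request

UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

# Equivalent to creating the Request and then calling add_header().
req = urllib.request.Request("http://www.baidu.com", headers={"User-Agent": UA})

# Request normalises header names with str.capitalize(), hence "User-agent".
stored = req.get_header("User-agent")

# Nothing has been sent yet; urllib.request.urlopen(req) would perform the fetch.
```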

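Timeout settings:

If a server takes a long time to respond, pass timeout (in seconds) to urlopen() so the request raises an exception instead of waiting indefinitely.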

In [1]: import urllib.request

In [2]: for i in range(1,100):
   ...:     try:
   ...:         req = urllib.request.urlopen("http://www.baidu.com", timeout=1)
   ...:         res = req.read()
   ...:         print(len(res))
   ...:     except Exception as e:
   ...:         print("Exception raised --> " + str(e))
   ...:
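
The loop above catches every Exception, which also hides unrelated bugs. A narrower sketch follows, assuming we only want to treat timeouts as "no data" and let every other failure propagate. fetch_or_none is a hypothetical helper name, not something from the original notes.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=1.0):
    """Return the body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        # A timeout while waiting for or reading the response is raised
        # directly (socket.timeout is an alias of TimeoutError since 3.10).
        return None
    except urllib.error.URLError as e:
        # A timeout while connecting is typically wrapped in URLError,
        # with the original exception stored in e.reason.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # anything else is a real error, not a timeout
```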

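HTTP requests in practice:

HTTP requests mainly fall into six types:

GET: passes information through the URL. The data can be written directly into the URL or submitted from a form, in which case the form fields are converted into URL query data automatically;
POST: submits data to the server in the request body. It is a mainstream way to send data and keeps it out of the URL, for example when logging in;
PUT: asks the server to store a resource, usually at a specified location;
DELETE: asks the server to delete a resource;
HEAD: asks for only the HTTP response headers of the resource;
OPTIONS: returns the request methods that the current URL supports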
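
To connect the list above to urllib, here is a short sketch of how a Request object picks its HTTP method. It only constructs requests and sends nothing. The body b"wd=hello" is an arbitrary placeholder.

```python
import urllib.request

url = "http://www.baidu.com"

# No data and no explicit method: GET.
get_req = urllib.request.Request(url)

# Supplying a bytes body switches the default method to POST.
post_req = urllib.request.Request(url, data=b"wd=hello")

# Other verbs are selected with the method argument.
head_req = urllib.request.Request(url, method="HEAD")
options_req = urllib.request.Request(url, method="OPTIONS")

methods = [r.get_method() for r in (get_req, post_req, head_req, options_req)]
```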

GET request example

In [3]: keywd = "hello"
In [4]: url = "http://www.baidu.com/s?wd=" + keywd
In [5]: req = urllib.request.Request(url)
In [6]: res = urllib.request.urlopen(req).read()
In [7]: with open("desktop/programming/python_work/4.html", "wb") as f:
   ...:     f.write(res)
   ...:

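If the keyword is in Chinese, percent-encode it first with quote().

Approach:

Build the URL so that it contains the GET field names and values and follows the request format;
Create a Request object with that URL as its argument;
Open the Request object with urlopen();
Process the response as needed;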
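
The first two steps can be sketched as follows, using urllib.parse.urlencode so that a Chinese keyword is encoded automatically. build_search_request is a hypothetical helper name. The URL and the wd field are the ones used in the session above.

```python
import urllib.parse
import urllib.request


def build_search_request(keyword):
    """Return a GET Request for the Baidu search URL used in the notes."""
    # urlencode() percent-encodes the value, so Chinese keywords are safe.
    query = urllib.parse.urlencode({"wd": keyword})
    return urllib.request.Request("http://www.baidu.com/s?" + query)


req = build_search_request("你好")
# req.full_url now carries the encoded keyword; urlopen(req) would send it.
```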

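POST request example:

Approach:

Set the URL;
Build the form data and encode it with urllib.parse.urlencode;
Create a Request object whose arguments are the URL and the data to send;
Add header information with add_header() so the request looks like it comes from a browser;
Open the Request object with urllib.request.urlopen() to send the data;
Process the response.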

In [10]: url = "http://www.iqianyue.com/mypost/"

In [11]: postdata = urllib.parse.urlencode({"name":"ceo@iqianyue.com", "pass":"aA123456"}).encode('utf-8')

In [12]: req = urllib.request.Request(url, postdata)

In [13]: req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')

In [14]: res = urllib.request.urlopen(req).read()

In [15]: with open("desktop/programming/python_work/6.html", "wb") as f:
    ...:     f.write(res)
    ...:
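
To see what the encoded form body looks like before it is sent, the request-building steps can be wrapped in a small function and inspected. This is only a sketch: build_post_request is a hypothetical name, and the User-Agent is shortened to a placeholder.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields, user_agent):
    """Return a POST Request carrying fields as a form-encoded body."""
    # urlencode() returns str; Request requires the body as bytes.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    req = urllib.request.Request(url, data=body)
    req.add_header("User-Agent", user_agent)
    return req


req = build_post_request(
    "http://www.iqianyue.com/mypost/",
    {"name": "ceo@iqianyue.com", "pass": "aA123456"},
    "Mozilla/5.0",  # shortened placeholder User-Agent
)
# req.data holds the encoded body; urlopen(req) would submit it.
```

Note that "@" in the name field is escaped as %40 in req.data.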

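Proxy server settings:

Crawling the same site from the same IP for a long time can get that IP blocked, and a proxy server works around this. Proxy addresses can be found online; prefer IPs whose listed verification time is short;

Approach:

Define a function;
Give it two parameters: the proxy server address and the page URL;
Use urllib.request.ProxyHandler() to set the proxy information, in the form urllib.request.ProxyHandler({'http': proxy_address});
Use urllib.request.build_opener() to create a custom opener object; the first argument is the proxy handler and the second is the urllib.request.HTTPHandler class;
For convenience, use urllib.request.install_opener() to install it as the global default opener, so that urlopen() uses it as well;
Open the page with urlopen(), decode the result, and assign it to a variable;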

In [16]: def use_proxy(proxy_addr, url):
    ...:     proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    ...:     opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    ...:     urllib.request.install_opener(opener)
    ...:     data = urllib.request.urlopen(url).read().decode('utf-8')
    ...:     return data
    ...: 
In [21]: proxy_addr = "114.115.140.25:3128"

In [22]: res = use_proxy(proxy_addr, "http://www.baidu.com")

In [23]: print(len(res))
111712
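
install_opener() changes the behaviour of every later urlopen() call in the process. If that is not wanted, one alternative is to keep the opener local and call its open() method directly. The sketch below takes that approach; build_proxy_opener is a hypothetical name, and the proxy address is the one from the session above, which is unlikely to still be reachable.

```python
import urllib.request


def build_proxy_opener(proxy_addr):
    """Return an opener that routes http:// requests through proxy_addr."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    # build_opener() adds the default handlers (including HTTPHandler) itself.
    return urllib.request.build_opener(proxy)


opener = build_proxy_opener("114.115.140.25:3128")
# opener.open(url).read() would go through the proxy, while plain
# urllib.request.urlopen() elsewhere in the program stays unproxied.
```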

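DebugLog in practice:

To print debug logs while the program runs, turn on DebugLog.

Approach:

Create urllib.request.HTTPHandler() and urllib.request.HTTPSHandler() instances with debuglevel set to 1;
Use urllib.request.build_opener() to create a custom opener object, passing the handlers from step 1 as arguments;
Use urllib.request.install_opener() to install it as the global default opener, so that urlopen() uses it as well;
Continue with the rest of the work;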

In [24]: httphd = urllib.request.HTTPHandler(debuglevel=1)

In [25]: httpshd = urllib.request.HTTPSHandler(debuglevel=1)

In [26]: opener = urllib.request.build_opener(httphd, httpshd)

In [27]: urllib.request.install_opener(opener)

In [28]: res = urllib.request.urlopen("http://edu.51cto.com")
send: b'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: edu.51cto.com\r\nUser-Agent: Python-urllib/3.6\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date header: Content-Type header: Transfer-Encoding header: Connection header: Set-Cookie header: Server header: Vary header: Vary header: X-Powered-By header: Set-Cookie header: Set-Cookie header: Set-Cookie header: Load-Balancing header: Load-Balancing

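The exception-handling workhorse: URLError in practice:

This section uses the urllib.error module with try...except statements and focuses on URLError and HTTPError;

Causes of URLError:

The server cannot be reached;
The remote URL does not exist;
There is no network connection;
An HTTPError was raised;
HTTPError is a subclass of URLError and only covers the last case (the server answered with an error status), so catch HTTPError first and fall back to URLError for the first three: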

In [31]: try:
    ...:     urllib.request.urlopen("http://blog.baiduuss.net")
    ...: except urllib.error.HTTPError as e:
    ...:     print(e.code)
    ...:     print(e.reason)
    ...: except urllib.error.URLError as e:
    ...:     print(e.reason)
    ...:     
[Errno 8] nodename nor servname provided, or not known
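
The same two-level handling can be packaged as a function that reports which branch was taken. This is a sketch only: try_fetch and its return shape are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=5):
    """Return ("ok", body), ("http_error", code) or ("url_error", reason)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok", resp.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        return "http_error", e.code
    except urllib.error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, etc.
        return "url_error", e.reason


# HTTPError must be listed first because it is a subclass of URLError.
http_error_is_url_error = issubclass(urllib.error.HTTPError, urllib.error.URLError)
```

If the except clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.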

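Common status codes and their meanings:

200 OK: everything is fine;
301 Moved Permanently: redirected to a new URL, permanently;
302 Found: redirected to a temporary URL, not permanently;
304 Not Modified: the requested resource has not changed;
400 Bad Request: malformed request;
401 Unauthorized: the request is not authorized;
403 Forbidden: access is forbidden;
404 Not Found: the page was not found;
500 Internal Server Error: the server hit an internal error;
501 Not Implemented: the server does not support the functionality needed to fulfil the request;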
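
The standard library already carries these reason phrases in http.HTTPStatus, so a crawler does not need its own lookup table. A minimal sketch, where describe is a hypothetical helper name:

```python
from http import HTTPStatus


def describe(code):
    """Return e.g. '404 Not Found' for a numeric status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"


lines = [describe(c) for c in (200, 301, 302, 304, 400, 401, 403, 404, 500, 501)]
```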

Article link: https://www.haomeiwen.com/subject/znqwixtx.html