Python Crawler Basics: Basic Usage of the urllib Library

Author: 常伟波 | Published: 2018-11-05 19:23

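We cover the urllib library through its three main modules:

request: the most basic HTTP request module. It simulates sending a request, much like typing a URL into the browser and pressing Enter. To use it, pass the URL and any related parameters to the library's functions.
error: the exception-handling module. If a request fails, the exceptions defined here can be caught so that the program can retry or take other action instead of terminating unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.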
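As a minimal sketch of how the error and parse modules fit around request: the helper name fetch and the sample URLs below are illustrative assumptions, not part of the original article. The parse calls work on strings only, so that part runs without any network access.

```python
import urllib.error
import urllib.parse
import urllib.request


def fetch(url, timeout=10):
    """Return the body as bytes, or None if urllib raises HTTPError or URLError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No usable response: DNS failure, refused connection, missing file...
        print("URL error:", exc.reason)
    return None


# urllib.parse only manipulates strings; nothing below touches the network.
parts = urllib.parse.urlparse("http://www.baidu.com/s?wd=python#top")
query = urllib.parse.urlencode({"wd": "python", "pn": 10})
joined = urllib.parse.urljoin("http://www.baidu.com/a/b.html", "c.html")

print(parts.netloc, parts.path, parts.query, parts.fragment)
print(query)
print(joined)
```

A caller could wrap fetch in a small retry loop, since a None result signals that the request failed rather than crashing the program.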

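# Import the urllib.request module
import urllib.request

# Send a request to the given URL; urlopen() returns a file-like response object
response = urllib.request.urlopen("http://www.baidu.com")

print(type(response))

# The file-like object supports file methods such as read(), which reads the whole body and returns bytes
html = response.read()

# Print the response body (bytes type)
print(html)
# Print the status code
print(response.status)
# Get all response headers
print(response.getheaders())
# Get a single response header
print(response.getheader('Server'))
# Get the reason phrase that accompanies the status code
print(response.reason)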
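Because read() returns bytes, the body usually has to be decoded before it can be treated as text. The sketch below is one possible way to do that, assuming the server declares its charset in the Content-Type header; the helper name read_text and the UTF-8 fallback are assumptions made for illustration. The demo uses a data: URL, which urlopen() also accepts, so it runs offline.

```python
import urllib.request


def read_text(response, fallback="utf-8"):
    """Decode a response body using the charset from Content-Type, if any."""
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset, errors="replace")


# Offline demo: a data: URL carries its own body and Content-Type,
# so no server is needed to exercise the helper.
with urllib.request.urlopen("data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD") as demo:
    text = read_text(demo)

print(text)
```

With a real site, the same helper would be called as read_text(urllib.request.urlopen("http://www.baidu.com")).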

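The parameters most commonly passed to urlopen are:

url: the target URL
data: if set, the request is sent as a POST request; otherwise it is a GET request (the value must be bytes)
timeout: the timeout, in seconds
context: must be an ssl.SSLContext instance. It specifies the SSL settings, for example to skip verification of certificates issued by an unverified CA.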
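The parameters above can be combined roughly as follows. This is a sketch under stated assumptions: the helper names (build_post_request, unverified_context, post_form), the form fields, and the example.com URL are all hypothetical. Disabling certificate checks is shown only because the text mentions it; it is unsafe outside of testing.

```python
import ssl
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Encode form fields as bytes; a Request that carries data is sent as POST."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


def unverified_context():
    """Return an SSL context that skips certificate checks (testing only)."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be turned off before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def post_form(url, fields, timeout=10, context=None):
    """Send the form and return the raw response body as bytes."""
    request = build_post_request(url, fields)
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return response.read()


# Inspect the request without sending it (the URL is a placeholder).
demo = build_post_request("http://www.example.com/login", {"user": "tom", "lang": "zh"})
print(demo.get_method(), demo.data)
```

Calling post_form(url, fields, timeout=5, context=unverified_context()) would send the request with a five-second timeout and without certificate verification.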

Request

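When we need to do something more complex, or the site has anti-crawler measures, these parameters are not enough (for example, when request headers must be sent). In that case, create a Request instance and pass it to urlopen.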

# Build a Request that carries a custom User-Agent header
import urllib.request

# Pass the URL and the headers to Request() to construct a Request object
ua_header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
request = urllib.request.Request("http://www.baidu.com", headers=ua_header)

# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib.request.urlopen(request)

# Decode the page source into a string
html = response.read().decode('utf-8')

The User-Agent can also be added or modified after the Request has been created

# Add or modify a header after the Request has been created
import urllib.request

# add_header() expects the header value as a string, not a dict
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
# Pass the URL to Request() to construct a Request object
request = urllib.request.Request("http://www.baidu.com")
# Request.add_header() adds or modifies one specific header
request.add_header("User-Agent", user_agent)
# get_header() takes the header name with only the first letter capitalized and the rest lowercase
request.get_header("User-agent")
# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib.request.urlopen(request)

# Decode the page source into a string
html = response.read().decode('utf-8')
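The capitalization rule for get_header() follows from how Request stores header names. The short sketch below inspects a Request without sending it; the header value demo-agent/1.0 is a placeholder chosen for illustration.

```python
import urllib.request

request = urllib.request.Request("http://www.baidu.com")
request.add_header("User-Agent", "demo-agent/1.0")  # placeholder value

# Request stores the name via str.capitalize(), that is, as "User-agent".
stored = request.header_items()
found = request.get_header("User-agent")
missed = request.get_header("User-Agent")  # lookup is case-sensitive

print(stored)
print(found)
print(missed)
print(request.has_header("User-agent"))
```

Headers passed through the headers argument of Request() are stored the same way, so the same lookup rule applies to them.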

Using a third-party library for random User-Agent strings

https://github.com/hellysmile/fake-useragent
Install:

pip install fake-useragent

Code example:

from fake_useragent import UserAgent
ua = UserAgent()

ua.ie
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);
ua.msie
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)
ua['Internet Explorer']
# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
ua.chrome
# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2
ua.google
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
ua['google chrome']
# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
ua.firefox
# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
ua.ff
# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1
ua.safari
# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25

# and the best one: a random choice based on real-world browser usage statistics
ua.random
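A random User-Agent can then be attached to a Request, as in the sketch below. The helper names and the FALLBACK_AGENTS list are assumptions added for illustration; the fallback is used when fake-useragent is not installed or cannot load its browser data, so the example still runs without the library.

```python
import random
import urllib.request

# Hypothetical fallback pool, used only when fake-useragent is unavailable.
FALLBACK_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1",
]


def pick_user_agent():
    """Return a randomly chosen User-Agent string."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        # Library missing, or its browser data could not be loaded.
        return random.choice(FALLBACK_AGENTS)


def build_request(url):
    """Create a Request that carries a randomly chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": pick_user_agent()})


request = build_request("http://www.baidu.com")
print(request.get_header("User-agent"))
```

The resulting object can be passed to urllib.request.urlopen() exactly like the Request instances built earlier.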

Article link: https://www.haomeiwen.com/subject/jccexqtx.html
