0. Web Scraping: the urllib Library, urlopen() and Request()

Author: 那是个好男孩 | Published: 2019-04-11 00:01

**Pay attention to whether an example uses import urllib.request or from urllib import request.**
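The two import styles bind the same module under different names, which is why the examples can switch between urllib.request.urlopen(...) and request.urlopen(...). Note that a plain import urllib does not by itself load the request submodule. A minimal check that needs no network access:

```python
import urllib.request
from urllib import request

# Both names refer to the same module object, so the functions are identical.
same_module = urllib.request is request
same_function = urllib.request.urlopen is request.urlopen
print(same_module, same_function)
```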

0. urlopen()

Syntax: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

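Example 0: (in practice this function is usually called with only three parameters: url, data, and timeout)

The data argument must be of type bytes, not str. Use bytes() to convert the encoded parameters into a byte sequence in the chosen encoding (bytes differs from str in that it holds raw bytes, at the bit level something like 010010010).
response.read() returns bytes as well, so the result has to be decoded.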

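import socket
import urllib.error
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
# urlencode() returns a str; data must be bytes, so encode it as UTF-8
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
try:
    response = urllib.request.urlopen(url, data=data, timeout=1)
except urllib.error.URLError as e:
    # a connection timeout shows up as a URLError whose reason is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
    else:
        raise  # do not silently swallow other errors
else:
    # runs only when no exception was raised
    print(response.read().decode("utf-8"))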
Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.6"
  }, 
  "json": null, 
  "origin": "101.206.170.234, 101.206.170.234", 
  "url": "https://httpbin.org/post"
}
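To make the str-to-bytes step concrete, here is a small offline sketch of what happens to the form data before it is sent. It only uses urllib.parse; the variable names form and query are chosen for illustration.

```python
from urllib import parse

form = {'word': 'hello'}
query = parse.urlencode(form)      # a str in key=value form
data = query.encode('utf-8')       # bytes, the type urlopen() expects for data

print(type(query).__name__, type(data).__name__)
print(len(data))                   # number of bytes that would be sent as the body
print(parse.parse_qs(data.decode('utf-8')))   # decode and parse it back
```

The byte length printed here corresponds to the Content-Length value that httpbin echoes back in the output above.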

Example 1: inspect the status code, the response headers, and the Server field of the response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
Output:
200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '48410'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 09 Apr 2019 02:32:34 GMT'), ('Via', '1.1 varnish'), ('Age', '722'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18751-HND'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 1223'), ('X-Timer', 'S1554777154.210361,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

Calling urlopen() with just a URL is quite limiting; for example, it gives you no way to set request headers. This is where Request() comes in.

1. Request()

Example 0: (these two approaches produce the same result)

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

######################################

import urllib.request

req = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

The following examples show how to use Request() to send GET and POST requests and to set their parameters.

Example 1: (POST request)

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# named form_data rather than dict, to avoid shadowing the built-in dict type
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
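Note: headers can also be changed through build_opener(); that is not covered here, since the approach shown is enough for most needs.
You can also add headers with the add_header() method to mimic a browser, and the data argument can be built as follows: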
from urllib import request, parse

url = 'http://httpbin.org/post'
# named form_data rather than dict, to avoid shadowing the built-in dict type
form_data = {
    'name': 'Germey'
}
data = parse.urlencode(form_data).encode('utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
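A Request object can be inspected before it is sent, which is a convenient way to confirm what urlopen() will transmit. The sketch below sends nothing over the network; the /get URL is only an illustrative value.

```python
from urllib import request, parse

data = parse.urlencode({'name': 'Germey'}).encode('utf-8')

get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=data)
post_req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(get_req.get_method())     # no data and no method given
print(post_req.get_method())    # data given, so the method defaults to POST
# Request stores header names capitalized, so look it up as 'User-agent'.
print(post_req.get_header('User-agent'))
print(post_req.data)
```

This also shows that method='POST' in the examples above is optional once data is supplied, although stating it explicitly makes the intent clearer.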

Example 2: (GET request) a Baidu keyword search

from urllib import request,parse

url = 'http://www.baidu.com/s?wd='
key = '路飞'
key_code = parse.quote(key)
url_all = url + key_code
"""
# Alternative way to build the URL
url = 'http://www.baidu.com/s'
key = '路飞'
wd = parse.urlencode({'wd':key})
url_all = url + '?' + wd
"""
req = request.Request(url_all)
response = request.urlopen(req)
print(response.read().decode('utf-8'))
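The two ways of building the search URL in the example above should yield the same string. A quick offline check, using the same keyword and base URL:

```python
from urllib import parse

key = '路飞'

# Form 1: percent-encode the keyword and append it by hand.
url_quote = 'http://www.baidu.com/s?wd=' + parse.quote(key)

# Form 2: let urlencode() build the whole 'wd=...' query string.
url_urlencode = 'http://www.baidu.com/s' + '?' + parse.urlencode({'wd': key})

print(url_quote)
print(url_quote == url_urlencode)
print(parse.unquote(url_quote))   # decoding restores the original keyword
```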

At this point you may have questions about decode and about the quote() and urlencode() functions in the parse module, so here are some notes:

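parse.quote: percent-encodes a str so it can be used in a URL
parse.urlencode: turns the k: v pairs of a dict into a k=encoded-v query string
parse.unquote: turns percent-encoded data back into the original string
decode: decodes bytes into a str; decode("utf-8") pairs well with read()
encode: encodes a str into bytes
encoding: names the encoding to use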
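A short offline sketch of how these functions differ in practice. The sample string and the key names q and page are made up for illustration.

```python
from urllib import parse

text = 'a b&c'

quoted = parse.quote(text)                        # space becomes %20, & becomes %26
restored = parse.unquote(quoted)                  # back to the original str
query = parse.urlencode({'q': text, 'page': 1})   # space becomes +, pairs joined by &

print(quoted)
print(restored)
print(query)
print(parse.parse_qs(query))                      # parse the query string back into a dict
```

One detail worth noticing: quote() writes a space as %20, while urlencode() writes it as +, because urlencode() targets form-style query strings.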

>>> str0 = '我爱你'
>>> str1 = str0.encode('gb2312')    
>>> str1 
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str2 = str0.encode('gbk')
>>> str2
b'\xce\xd2\xb0\xae\xc4\xe3'
>>> str3 = str0.encode('utf-8')
>>> str3
b'\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0'
>>> str00 = str1.decode('gb2312')
>>> str00
'我爱你'
>>> str11 = str1.decode('utf-8') # raises an error, because str1 is GB2312-encoded
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    str11 = str1.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte
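When scraping, the encoding of a page is not always known in advance. One possible way to cope is to try a few encodings in turn. decode_with_fallback below is a hypothetical helper written for this illustration, not part of urllib, and the list of encodings is an assumption.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'gbk')):
    """Hypothetical helper: try each encoding in order, return the first that decodes."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, marking undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors='replace')


gbk_bytes = '我爱你'.encode('gbk')
utf8_bytes = '我爱你'.encode('utf-8')

print(decode_with_fallback(gbk_bytes))
print(decode_with_fallback(utf8_bytes))
print(gbk_bytes.decode('utf-8', errors='replace'))
```

UTF-8 is tried first on purpose: it is strict and rejects most non-UTF-8 input, whereas GBK will often decode UTF-8 bytes without error and produce garbled text.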

Another question that may come up here: what is the difference between read(), readline(), and readlines()?

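read(): the entire content as one object (bytes for an HTTP response, str for a file opened in text mode)
readline(): a single line
readlines(): the entire content as a list of lines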
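The difference is easy to see on an in-memory stream. io.BytesIO stands in for the HTTP response here, an assumption made so the sketch runs offline; both expose read(), readline(), and readlines().

```python
import io

payload = b'line 1\nline 2\nline 3\n'

# A fresh stream is created for each call, because reading consumes the stream.
everything = io.BytesIO(payload).read()
first_line = io.BytesIO(payload).readline()
all_lines = io.BytesIO(payload).readlines()

print(everything)
print(first_line)
print(all_lines)
```

With a real response the same caveat applies: once read() has consumed the body, a later readline() or readlines() call returns nothing more.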

Article link: https://www.haomeiwen.com/subject/uliaiqtx.html
