Urllib Data Scraping

Author: 部落大圣 | Published 2018-07-28 23:35


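In Python 3, the urllib and urllib2 modules from Python 2 were merged into a single package named urllib (urllib3 is a separate third-party library and is not part of it). The package is organized into the following modules: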

urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files
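
As a rough illustration of the two helper modules listed above, the sketch below splits a URL with urllib.parse and checks a path against robots.txt rules with urllib.robotparser. The URL path, the query values, and the robots.txt rules are made up for this example; nothing is fetched over the network.

```python
from urllib.parse import urlparse, parse_qs, urlencode
from urllib.robotparser import RobotFileParser

# Hypothetical URL, used only to show how a URL is split into parts.
parts = urlparse('https://movie.douban.com/search?q=python&start=20')
print(parts.scheme)   # https
print(parts.netloc)   # movie.douban.com
print(parts.path)     # /search
print(parts.query)    # q=python&start=20

# Turn the query string into a dict of lists, and a dict back into a query string.
params = parse_qs(parts.query)
query = urlencode({'q': 'python', 'start': 20})

# Hypothetical robots.txt rules, parsed from memory instead of being downloaded.
robots = RobotFileParser()
robots.parse([
    'User-agent: *',
    'Disallow: /private/',
])
allowed_public = robots.can_fetch('*', 'https://example.com/public/page')
allowed_private = robots.can_fetch('*', 'https://example.com/private/page')
print(allowed_public, allowed_private)
```

For a real site, RobotFileParser.set_url() followed by read() would download the rules instead of parsing a hand-written list.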

Sending a Request

The syntax of urllib.request.urlopen is as follows

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

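Description: urlopen is the function urllib provides for accessing a URL (sending a request). It accepts either a URL string or a Request object.
An example follows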

import urllib.request
# Open the URL (GET request, 2-second timeout)
response = urllib.request.urlopen('https://movie.douban.com/', None, 2)
# Read the response body and decode it to str
html = response.read().decode('utf-8')
# Write the result to a txt file
with open('html.txt', 'wt', encoding='utf-8') as f:
    f.write(html)

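First import the urllib.request module, then access a URL with urlopen. The request method is GET, so the data argument is set to None. The last argument sets the timeout, here 2 seconds: if the site has not returned a response within 2 seconds, the call fails with an error.
Once the server responds, response.read() returns the response body. read() returns bytes, which must be converted to str with decode(). Finally the data is written to a text file, where encoding sets the file's encoding. The charset passed to decode() must match the encoding the page was actually served in, and the file must later be read back with the same encoding it was written with; otherwise the text comes out garbled.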
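
The timeout and decoding behaviour described above can be wrapped in a small helper. This is only a sketch: the function name fetch_text and its default_charset parameter are assumptions made for this example, not part of urllib. It catches the exceptions from urllib.error and uses the charset declared in the response headers when there is one.

```python
import socket
import urllib.error
import urllib.request


def fetch_text(url, timeout=2, default_charset='utf-8'):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper written for this example.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Prefer the charset declared in the Content-Type header, if any.
            charset = response.headers.get_content_charset() or default_charset
            return response.read().decode(charset)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP error:', exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No usable response: DNS failure, refused connection, connect timeout.
        print('Request failed:', exc.reason)
    except socket.timeout:
        # The connection was made, but reading the body took too long.
        print('Timed out after', timeout, 'seconds')
    return None
```

A call such as fetch_text('https://movie.douban.com/') then returns either the page text or None. HTTPError is a subclass of URLError, so it has to be caught first.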

More Complex Requests

The syntax of urllib.request.Request is as follows

urllib.request.Request(url, data=None, headers={}, method=None)

Sample code follows

# -*- coding: utf-8 -*-
"""
Created on Sat Jul 28 22:38:07 2018

@author: 部落大圣
"""
import urllib.request

url = 'https://movie.douban.com/'
# Custom request headers
headers = {
    # Adjacent string literals are joined, so no stray spaces end up in the value
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) '
                  'Gecko/20100101 Firefox/61.0',
    'Referer': 'https://movie.douban.com/',
    'Connection': 'keep-alive',
}
# Build a Request that carries the custom headers
req = urllib.request.Request(url, headers=headers)
# Open the Request with urlopen and decode the response
html = urllib.request.urlopen(req).read().decode('utf-8')
# Write the result to a file
with open('html1.txt', 'w', encoding='utf-8') as f:
    f.write(html)
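
The data and method parameters of Request are not used in the sample above. The sketch below shows how they might be used; the endpoint and the form fields are hypothetical, and the requests are only constructed, not sent. When data is supplied, the request defaults to POST; otherwise it defaults to GET.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration.
url = 'https://example.com/search'
form = {'q': 'urllib', 'page': 1}

# data must be bytes: url-encode the form, then encode the resulting string.
data = urllib.parse.urlencode(form).encode('utf-8')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) '
                  'Gecko/20100101 Firefox/61.0',
}

# With data present, the request method defaults to POST.
post_req = urllib.request.Request(url, data=data, headers=headers)
print(post_req.get_method())  # POST

# Without data, the request method defaults to GET.
get_req = urllib.request.Request(url, headers=headers)
print(get_req.get_method())   # GET

# The method argument overrides the default.
put_req = urllib.request.Request(url, data=data, headers=headers, method='PUT')
print(put_req.get_method())   # PUT
```

Any of these objects can be passed to urllib.request.urlopen in place of a URL string, as in the sample code.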