[Step-by-Step Python Web Scraping Tutorial] Using urllib and Parsing Pages

Author: 查理不是猹 | Published: 2021-12-12 09:45

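1. The urllib library

1.1 Basic usage

Fetch the source of the Baidu home page with urllib:

import urllib.request

# 1. Define the URL you want to visit
url = 'http://www.baidu.com'

# 2. Send a request to the server, as a browser would; response holds the reply
response = urllib.request.urlopen(url)

# 3. Read the page source from the response
content = response.read().decode('utf-8')

# 4. Print the data
print(content)

read() returns the body as bytes. To turn the bytes into a string, decode them: decode('name of the encoding')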
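To see the bytes-to-string step in isolation, here is a small offline sketch. The byte string stands in for what read() would return, and decode_body is a hypothetical helper introduced for illustration, not part of urllib.

```python
# Stand-in for the bytes that response.read() would return (no network needed).
raw = '<title>百度一下</title>'.encode('utf-8')


def decode_body(data: bytes, charset: str = 'utf-8') -> str:
    """Decode a response body; undecodable bytes are replaced instead of raising."""
    return data.decode(charset, errors='replace')


text = decode_body(raw)
print(type(raw).__name__, '->', type(text).__name__)
print(text)
```

If a page is served in another encoding (gbk, for example), pass that name as the charset argument instead.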

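1.2 One type and six methods

import urllib.request

url = 'http://www.baidu.com'

# Send a request to the server, as a browser would
response = urllib.request.urlopen(url)

# One type: response is an HTTPResponse
print(type(response))

# The read calls below all consume the same stream, so after a full read()
# the later calls return empty results; try them one at a time.

# Read the whole body as bytes
content = response.read()
print(content)

# Read at most 5 bytes
content = response.read(5)
print(content)

# Read one line
content = response.readline()
print(content)

# Read all remaining lines into a list
content = response.readlines()
print(content)

# Return the status code; 200 means the request succeeded
print(response.getcode())

# Return the URL of the response
print(response.geturl())

# Return the response headers
print(response.getheaders())

One type: HTTPResponse

Six methods: read, readline, readlines, getcode, geturl, getheaders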
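The three read methods share one stream position, which is easy to miss when they are called back to back. The sketch below shows this offline; io.BytesIO is used as a stand-in for the response body (an assumption made so that no network is needed), since both are read-once byte streams.

```python
import io

# io.BytesIO stands in for an HTTPResponse body: both are read-once byte streams.
body = io.BytesIO(b'line one\nline two\nline three\n')

first_five = body.read(5)        # at most 5 bytes
rest_of_line = body.readline()   # up to and including the next newline
remaining = body.readlines()     # every remaining line, as a list
after_end = body.read()          # nothing is left by now

print(first_five, rest_of_line, remaining, after_end)
```

Each call picks up where the previous one stopped, so the final read() returns an empty bytes object.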

1.3 Downloading

import urllib.request

# Download a web page
url_page = 'http://www.baidu.com'

# url is the address to download from; filename is the name of the local file
urllib.request.urlretrieve(url_page,'baidu.html')

# Download an image
url_img = 'https://img1.baidu.com/it/u=3004965690,4089234593&fm=26&fmt=auto&gp=0.jpg'
urllib.request.urlretrieve(url= url_img,filename='lisa.jpg')

# Download a video
url_video = 'https://vd3.bdstatic.com/mda-mhkku4ndaka5etk3/1080p/cae_h264/1629557146541497769/mda-mhkku4ndaka5etk3.mp4?v_from_s=hkapp-haokan-tucheng&auth_key=1629687514-0-0-7ed57ed7d1168bb1f06d18a4ea214300&bcevod_channel=searchbox_feed&pd=1&pt=3&abtest='

urllib.request.urlretrieve(url_video,'hxekyyds.mp4')

In Python, arguments can be passed by keyword (url=..., filename=...) or positionally, by writing only the values.
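The two calling styles can be compared without touching the network. This sketch assumes a temporary local file served through a file:// URL; the file names are made up for the example.

```python
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / 'source.html'
    source.write_text('<h1>hello</h1>', encoding='utf-8')
    url = source.as_uri()  # a file:// URL, so no network access is needed

    copy_a = pathlib.Path(tmp) / 'a.html'
    copy_b = pathlib.Path(tmp) / 'b.html'

    # Positional arguments: url first, filename second
    urllib.request.urlretrieve(url, str(copy_a))
    # Keyword arguments: the same call with the parameter names spelled out
    urllib.request.urlretrieve(url=url, filename=str(copy_b))

    same = copy_a.read_bytes() == copy_b.read_bytes()
    content_a = copy_a.read_text(encoding='utf-8')

print(same, content_a)
```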

1.4 Customizing the request object

import urllib.request

url = 'https://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# urlopen() has no headers parameter, so the dict cannot be passed to it directly
# Wrap the URL and the headers in a Request object instead
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf8')
print(content)
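A Request object can be inspected before it is sent, which is a handy way to check that the headers were attached. The sketch below does not contact any server; the User-Agent value is shortened and purely illustrative.

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (example)'}  # shortened, illustrative value
request = urllib.request.Request(url='https://www.baidu.com', headers=headers)

# Request stores header names in capitalized form, so the key is 'User-agent'
ua = request.get_header('User-agent')
method = request.get_method()   # 'GET' because no data was supplied

print(request.full_url, method, ua)
```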

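1.5 The quote method for GET requests

If a GET parameter contains Chinese characters, they must be encoded first, as shown below; without encoding, the request raises an error.

Goal: fetch the page source of https://www.baidu.com/s?wd=周杰伦

After encoding, the URL is: https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6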

import urllib.request
import urllib.parse

url = 'https://www.baidu.com/s?wd='

# Customizing the request object is the first countermeasure against anti-scraping checks
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

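# Percent-encode 周杰伦 (each UTF-8 byte is written as %XX); this relies on urllib.parse
name = urllib.parse.quote('周杰伦')

# Append the encoded string to the URL
url = url + name

# Customize the request object
request = urllib.request.Request(url=url,headers=headers)

# Send a request to the server, as a browser would
response = urllib.request.urlopen(request)

# Read the response body
content = response.read().decode('utf-8')

# Print the data
print(content)

quote percent-encodes Chinese text: each UTF-8 byte becomes %XX, which makes the text safe to place in a URL.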
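The encoding step can be checked on its own, without sending a request. This sketch encodes the same search term and decodes it again with urllib.parse.unquote.

```python
import urllib.parse

encoded = urllib.parse.quote('周杰伦')
decoded = urllib.parse.unquote(encoded)   # reverses the percent-encoding
url = 'https://www.baidu.com/s?wd=' + encoded

print(encoded)
print(decoded)
print(url)
```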

1.6 The urlencode method for GET requests

urlencode is meant for URLs with several parameters, for example:

https://www.baidu.com/s?wd=周杰伦&sex=男

# Fetch the page source of https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7

import urllib.request
import urllib.parse

base_url = 'https://www.baidu.com/s?'

data = {
    'wd':'周杰伦',
    'sex':'男',
    'location':'中国台湾省'
}

new_data = urllib.parse.urlencode(data)

# Build the full request URL
url = base_url + new_data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Customize the request object
request = urllib.request.Request(url=url,headers=headers)

# Send a request to the server, as a browser would
response = urllib.request.urlopen(request)

# Read the page source
content = response.read().decode('utf-8')

# Print the data
print(content)
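To see exactly what urlencode produces, the query string can be built and parsed back offline. The sketch uses the two-parameter case from the URL shown in this section; parse_qs is the standard-library inverse.

```python
import urllib.parse

params = {'wd': '周杰伦', 'sex': '男'}

# urlencode percent-encodes every value and joins the pairs with &
query = urllib.parse.urlencode(params)
url = 'https://www.baidu.com/s?' + query

# parse_qs reverses the step; every value comes back as a list
round_trip = urllib.parse.parse_qs(query)

print(query)
print(round_trip)
```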

1.7 POST requests

import json
import urllib.request
import urllib.parse

url = 'https://fanyi.baidu.com/sug'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

data = {
    'kw':'spider'
}

# POST parameters must be URL-encoded and then converted to bytes
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url=url,data=data,headers=headers)

# Send a request to the server, as a browser would
response = urllib.request.urlopen(request)

# Read the response body
content = response.read().decode('utf-8')

# Parse the JSON string into a Python object
obj = json.loads(content)
print(obj)

POST parameters must be URL-encoded: data = urllib.parse.urlencode(data)
After URL-encoding, encode() must be called to turn the string into bytes: data = urllib.parse.urlencode(data).encode('utf-8')
POST parameters are not appended to the URL; they are passed to the Request object through its data argument:
request = urllib.request.Request(url=url,data=data,headers=headers)
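The preparation of a POST request can be verified before anything is sent. In this sketch nothing goes over the network, and the JSON text is a hypothetical stand-in for a server reply; the real payload from the endpoint may look different.

```python
import json
import urllib.parse
import urllib.request

form = {'kw': 'spider'}

# Step 1: URL-encode the form; step 2: turn the string into bytes
body = urllib.parse.urlencode(form).encode('utf-8')

request = urllib.request.Request(url='https://fanyi.baidu.com/sug', data=body)
method = request.get_method()   # 'POST' because data is set

# Hypothetical response text, used only to show json.loads
sample = '{"errno": 0, "data": []}'
obj = json.loads(sample)

print(body, method, obj)
```

Passing a str instead of bytes as data makes urlopen raise a TypeError, which is why the encode step is required.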

1.8 Exceptions

import urllib.request
import urllib.error

url = 'http://www.doudan1.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

try:
    request = urllib.request.Request(url = url, headers = headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
# HTTPError is a subclass of URLError, so it must be caught first
except urllib.error.HTTPError:
    print('The system is being upgraded...')
except urllib.error.URLError:
    print('As I said, the system is being upgraded...')
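The order of the two except clauses matters because of how the exception classes are related. The sketch below checks that relationship offline; describe is a hypothetical helper, and the exceptions are constructed by hand rather than raised by a real request.

```python
import urllib.error


def describe(exc: Exception) -> str:
    # Check the subclass first, in the same order the except clauses need
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP error {exc.code}'
    if isinstance(exc, urllib.error.URLError):
        return f'URL error: {exc.reason}'
    return 'other error'


is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)

# Hand-built exceptions (url, code, msg, headers, file object)
http_msg = describe(urllib.error.HTTPError('http://example.com', 404, 'Not Found', {}, None))
url_msg = describe(urllib.error.URLError('name resolution failed'))

print(is_subclass, http_msg, url_msg)
```

If URLError were tested first, the HTTPError case would never be reached, because every HTTPError is also a URLError.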

1.9 handler

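Why learn about handlers?

urllib.request.urlopen(url) cannot customize request headers
urllib.request.Request(url=url, data=data, headers=headers) can customize request headers
Handler: more advanced customization (as requirements grow, a Request object alone is no longer enough; dynamic cookies and proxies cannot be configured through it)

# Goal: use a handler to visit Baidu and fetch the page source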

import urllib.request

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

request = urllib.request.Request(url = url,headers = headers)

# Three steps: handler, build_opener, open

# (1) Create a handler object
handler = urllib.request.HTTPHandler()

# (2) Build an opener from the handler
opener = urllib.request.build_opener(handler)

# (3) Call open()
response = opener.open(request)
content = response.read().decode('utf-8')
print(content)
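The handler, build_opener, open sequence can be tried offline as well. This sketch assumes a data: URL, which carries its own body and is handled by one of the default handlers that build_opener adds; the header value is illustrative.

```python
import urllib.request

handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)

# Headers set here are attached to every HTTP request made through this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (example)')]

# A data: URL carries its own body, so open() works without a network
response = opener.open('data:text/plain;charset=utf-8,hello%20handler')
content = response.read().decode('utf-8')

print(type(opener).__name__, content)
```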

1.10 Proxies

Why use a proxy? Some sites block crawlers, and crawling from your real IP address makes it easy to get that address banned.

import urllib.request

url = 'http://www.baidu.com/s?wd=ip'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

# Customize the request object
request = urllib.request.Request(url = url,headers= headers)

# Without a proxy, the request would be sent directly:
# response = urllib.request.urlopen(request)

proxies = {
    'http':'118.24.219.151:16817'
}

# Three steps: handler, build_opener, open
handler = urllib.request.ProxyHandler(proxies = proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

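# Read the response body
content = response.read().decode('utf-8')

# Save it to a file
with open('daili.html','w',encoding='utf-8') as fp:
    fp.write(content)

Proxies can be obtained from a provider such as 快代理 (Kuaidaili). A proxy pool can be used in place of a single proxy.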
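A proxy pool can be as simple as a list from which one entry is picked per request. The following is a minimal sketch of that idea: the addresses come from a documentation-only IP range and are not working proxies, and build_proxy_opener is a hypothetical helper name.

```python
import random
import urllib.request

# Hypothetical proxy addresses from a documentation-only IP range
proxy_pool = [
    {'http': '192.0.2.10:8080'},
    {'http': '192.0.2.11:8080'},
    {'http': '192.0.2.12:8080'},
]


def build_proxy_opener(pool, rng=random):
    """Pick one proxy at random and return (opener, chosen proxy)."""
    proxies = rng.choice(pool)
    handler = urllib.request.ProxyHandler(proxies=proxies)
    return urllib.request.build_opener(handler), proxies


opener, chosen = build_proxy_opener(proxy_pool)
print(chosen)
```

Calling build_proxy_opener before each request spreads the traffic across the pool; opener.open(request) is then used exactly as in the single-proxy example.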

2. Parsing techniques

2.1 xpath

Installing and loading xpath

1. Install the lxml library

pip install lxml -i https://pypi.douban.com/simple

2. Import lxml.etree

from lxml import etree

3. etree.parse() parses a local file

html_tree = etree.parse('XX.html')

4. etree.HTML() parses a server response

html_tree = etree.HTML(response.read().decode('utf-8'))

5. Query DOM elements

html_tree.xpath(xpath expression)

Install the XPath Chrome extension and open it with Ctrl + Shift + X

Basic xpath syntax

1. Path queries

//: selects all descendant nodes, regardless of depth
/ : selects direct child nodes

2. Predicate queries

//div[@id]
//div[@id="maincontent"]

3. Attribute queries

//@class

4. Fuzzy queries

//div[contains(@id, "he")]
//div[starts-with(@id, "he")]

5. Content queries

//div/h1/text()

6. Logical operators

//div[@id="head" and @class="s_down"]
//title | //price

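Example:

from lxml import etree

# Parse a local file; etree.parse() uses an XML parser by default,
# so the file must be well-formed (every tag closed)
tree = etree.parse('test.html')

# Find the li elements under ul
li_list = tree.xpath('//body/ul/li')

# Find all li tags that have an id attribute; text() returns the tag content
li_list = tree.xpath('//ul/li[@id]/text()')

# Find the li tag whose id is l1; mind the nested quotes
li_list = tree.xpath('//ul/li[@id="l1"]/text()')

# Get the class attribute value of the li tag whose id is l1
li = tree.xpath('//ul/li[@id="l1"]/@class')

# Find li tags whose id contains "l"
li_list = tree.xpath('//ul/li[contains(@id,"l")]/text()')

# Find li tags whose id starts with "c"
li_list = tree.xpath('//ul/li[starts-with(@id,"c")]/text()')

# Find li tags whose id is l1 and whose class is c1
li_list = tree.xpath('//ul/li[@id="l1" and @class="c1"]/text()')

# Union: the text of the li tags whose id is l1 or l2
li_list = tree.xpath('//ul/li[@id="l1"]/text() | //ul/li[@id="l2"]/text()')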
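Since test.html is not shown, here is a self-contained variant that runs the same kinds of queries against an inline snippet. The markup and its values are made up for illustration; etree.HTML is used so that the string is parsed with the tolerant HTML parser.

```python
from lxml import etree

# Hypothetical markup standing in for test.html
html = '''
<ul>
  <li id="l1" class="c1">Beijing</li>
  <li id="l2">Shanghai</li>
  <li id="c3">Shenzhen</li>
  <li>Wuhan</li>
</ul>
'''
tree = etree.HTML(html)

with_id = tree.xpath('//ul/li[@id]/text()')                       # li tags that have an id
l1_class = tree.xpath('//ul/li[@id="l1"]/@class')                 # attribute value
contains_l = tree.xpath('//ul/li[contains(@id,"l")]/text()')      # id contains "l"
starts_c = tree.xpath('//ul/li[starts-with(@id,"c")]/text()')     # id starts with "c"
both = tree.xpath('//ul/li[@id="l1" and @class="c1"]/text()')     # two conditions

print(with_id, l1_class, contains_l, starts_c, both)
```

Every xpath() call returns a list, even when only one node matches.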

2.2 JsonPath

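jsonpath works on Python objects that have already been parsed from JSON (with json.load or json.loads), not on raw text. The examples here load the JSON from a local file.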

Installing and using jsonpath

Install with pip:

pip install jsonpath

Using jsonpath:

obj = json.load(open('json file', 'r', encoding='utf-8'))
ret = jsonpath.jsonpath(obj, 'jsonpath expression')

Example:

{
  "store": {
    "book": [
      {
        "category": "修真",
        "author": "六道",
        "title": "坏蛋是怎样练成的",
        "price": 8.95
      },
      {
        "category": "修真",
        "author": "天蚕土豆",
        "title": "斗破苍穹",
        "price": 12.99
      },
      {
        "category": "修真",
        "author": "唐家三少",
        "title": "斗罗大陆",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      {
        "category": "修真",
        "author": "南派三叔",
        "title": "星辰变",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "author": "老马",
      "color": "黑色",
      "price": 19.95
    }
  }
}

The code below parses the JSON above. For the full syntax, see this blog post:

https://blog.csdn.net/luxideyao/article/details/77802389

import json
import jsonpath

with open('jsonpath.json','r',encoding='utf-8') as fp:
    obj = json.load(fp)

# Authors of all books in the store
author_list = jsonpath.jsonpath(obj,'$.store.book[*].author')

# All authors
author_list = jsonpath.jsonpath(obj,'$..author')

# All elements under store
tag_list = jsonpath.jsonpath(obj,'$.store.*')

# The price of everything in store
price_list = jsonpath.jsonpath(obj,'$.store..price')

# The third book
book = jsonpath.jsonpath(obj,'$..book[2]')

# The last book
book = jsonpath.jsonpath(obj,'$..book[(@.length-1)]')

# The first two books
book_list = jsonpath.jsonpath(obj,'$..book[0,1]')
book_list = jsonpath.jsonpath(obj,'$..book[:2]')

# A filter expression needs a ? in front of the ()
# Select all books that have an isbn
book_list = jsonpath.jsonpath(obj,'$..book[?(@.isbn)]')

# Books that cost more than 10
book_list = jsonpath.jsonpath(obj,'$..book[?(@.price>10)]')
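To make the ".." (recursive descent) and filter ideas concrete, here is a plain-Python sketch of what such queries compute. It is not how the jsonpath package is implemented; find_all is a hypothetical helper, and the data is a trimmed copy of the example above.

```python
def find_all(node, key):
    """Collect every value stored under key, at any depth (similar to $..key)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Trimmed copy of the sample data
data = {
    'store': {
        'book': [
            {'author': '六道', 'price': 8.95},
            {'author': '天蚕土豆', 'price': 12.99},
        ],
        'bicycle': {'author': '老马', 'price': 19.95},
    }
}

authors = find_all(data, 'author')                 # similar to $..author
prices = find_all(data['store'], 'price')          # similar to $.store..price
expensive = [b for b in data['store']['book'] if b['price'] > 10]  # similar to $..book[?(@.price>10)]

print(authors, prices, expensive)
```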

2.3 BeautifulSoup

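Introduction

BeautifulSoup is often abbreviated as bs4
What is BeautifulSoup? Like lxml, BeautifulSoup is an HTML parser whose main job is to parse pages and extract data
Pros and cons

Con: slower than lxml
Pro: a friendly, easy-to-use API

Installing and creating

Install

pip install bs4 -i https://pypi.douban.com/simple
Import

from bs4 import BeautifulSoup
Create an object
- From a server response
  
  soup = BeautifulSoup(response.read().decode(), 'lxml')
- From a local file
  
  soup = BeautifulSoup(open('1.html'), 'lxml')

Note: open() falls back to the platform default encoding (gbk on Chinese-language Windows), so pass the encoding explicitly when opening the file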

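Locating nodes

1. Find nodes by tag name

soup.a # finds only the first a
soup.a.name
soup.a.attrs

2. Functions

find (returns a single object)

find('a'): finds only the first a tag

find('a', title='name')

find('a', class_='name')
find_all (returns a list)

find_all('a'): finds all a tags

find_all(['a', 'span']): returns all a and span tags

find_all('a', limit=2): finds only the first two a tags
select (returns the nodes matching a CSS selector) [☆☆☆]
- element: p
- .class: .firstname
- #id: #firstname
- Attribute selectors:
  [attribute]: li = soup.select('li[class]')
  [attribute=value]: li = soup.select('li[class="hengheng1"]')
- Hierarchy selectors:
  div p: descendant selector
  div>p: child selector, matches only direct children of a tag
  div,p: all div and p tags

Node information

Getting node content (get_text() also handles tags nested inside tags)

obj.string

obj.get_text() [recommended]
Node properties

tag.name: the tag name

tag.attrs: the attributes as a dictionary
Getting node attributes

obj.attrs.get('title') [commonly used]

obj.get('title')

obj['title']

Example: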

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div>
        <ul>
            <li id="l1">张三</li>
            <li id="l2">李四</li>
            <li>王五</li>
            <a href="" id="" class="a1">尚硅谷</a>
            <span>嘿嘿嘿</span>
        </ul>
    </div>
    <a href="" title="a2">百度</a>
    <div id="d1">
        <span>
            哈哈哈
        </span>
    </div>
    <p id="p1" class="p1">呵呵呵</p>
</body>
</html>

Parse the HTML above with BeautifulSoup:

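from bs4 import BeautifulSoup

# open() falls back to the platform default encoding (gbk on Chinese-language Windows),
# so specify the encoding when opening the file
soup = BeautifulSoup(open('bs4的基本使用.html',encoding='utf-8'),'lxml')

# Look up a node by tag name; this returns the first match
print(soup.a)
# Get the tag's attributes and their values
print(soup.a.attrs)

# Some bs4 functions
# (1) find: returns the first match
print(soup.find('a'))

# Find the tag whose title has the given value
print(soup.find('a',title="a2"))

# Find the tag whose class has the given value; note the trailing underscore in class_
print(soup.find('a',class_="a1"))

# (2) find_all: returns a list containing every a tag
print(soup.find_all('a'))

# To match several tag names, pass them to find_all as a list
print(soup.find_all(['a','span']))

# limit caps the number of results
print(soup.find_all('li',limit=2))

# (3) select (recommended)
# select returns a list and can return several nodes
print(soup.select('a'))

# A leading . selects by class (class selector); a leading # selects by id
print(soup.select('.a1'))
print(soup.select('#l1'))

# Attribute selectors: find tags by their attributes
# li tags that have an id attribute
print(soup.select('li[id]'))

# li tags whose id is l2
print(soup.select('li[id="l2"]'))

# Hierarchy selectors
# Descendant selector: li elements anywhere under a div
print(soup.select('div li'))

# Child selector: direct children only
print(soup.select('div > ul > li'))

# All a tags and all li tags
print(soup.select('a,li'))

# Getting node content
obj = soup.select('#d1')[0]

# If the tag contains only text, both string and get_text() work
# If the tag contains other tags as well as text, string returns None while get_text() still returns the text
# get_text() is recommended
print(obj.string)
print(obj.get_text())

# Node properties
obj = soup.select('#p1')[0]
# name is the tag name
print(obj.name)
# attrs returns the attributes as a dictionary
print(obj.attrs)

# Getting node attributes
print(obj.attrs.get('class'))
print(obj.get('class'))
print(obj['class'])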
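The difference between string and get_text() is easiest to see on a tiny inline document. This sketch uses made-up markup and the standard-library 'html.parser' backend (an assumption made so that lxml is not needed to run it).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the div mixes text with a nested tag, the p holds only text
html = '<div id="d1">hi <span>haha</span></div><p id="p1" class="p1">hehe</p>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.select('#d1')[0]
div_string = div.string      # None: the div has text plus a nested tag
div_text = div.get_text()    # all text inside the div, nested tags included

p = soup.select('#p1')[0]
p_string = p.string          # the p holds only text, so string works
p_class = p['class']         # class can hold several values, so a list comes back

print(div_string, div_text, p_string, p_class)
```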
