Python Web Scraping 3: Matching with Regular Expressions

Author: 0清婉0 | Published: 2020-12-28 20:56


2. Non-greedy matching with (.*?)

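\d matches one digit character

\w matches one letter, digit, or underscore

\s matches one whitespace character, such as a newline, a tab, or an ordinary space

\S matches one non-whitespace character

\n matches one newline character, i.e. one press of the Enter key

\t matches one tab character, i.e. one press of the Tab key (a tab is a single character, not a run of spaces)

. matches any one character except a newline

* matches zero or more repetitions of the preceding expression

+ matches one or more repetitions of the preceding expression

? non-greedy qualifier when placed after * or +; usually combined with . and * as .*?

() groups the expression inside the parentheses; when a pattern contains one group, re.findall returns only the text matched by that group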
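To make the table above concrete, here is a small sketch that applies several of these metacharacters to one string. The sample text and variable names are made up for illustration and are not from the original article.

```python
import re

# Hypothetical sample text containing digits, a tab, and a newline
text = 'Order 42 shipped\tto room_7\non 2020-12-28'

digits = re.findall(r'\d+', text)               # runs of one or more digits
words = re.findall(r'\w+', text)                # runs of letters, digits, underscores
blanks = re.findall(r'\s', text)                # every single whitespace character
date = re.findall(r'(\d+)-(\d+)-(\d+)', text)   # three groups give one tuple per match

print(digits)  # ['42', '7', '2020', '12', '28']
print(words)   # ['Order', '42', 'shipped', 'to', 'room_7', 'on', '2020', '12', '28']
print(blanks)  # [' ', ' ', '\t', ' ', '\n', ' ']
print(date)    # [('2020', '12', '28')]
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting backslashes before the regex engine sees them.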

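Example 1:

import re

# 文本A and 文本B ("text A" and "text B") stand for the fixed text before and after the content we want
res = '文本A百度新闻文本B，新闻标题文本A新闻财经文本B，文本A搜狗新闻文本B新闻网址'

p_source = '文本A(.*?)文本B'

source = re.findall(p_source, res)

print(source)  # ['百度新闻', '新闻财经', '搜狗新闻'] (one entry per 文本A...文本B pair)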
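The difference the ? makes is easiest to see by running the same text through a greedy and a non-greedy pattern. The following sketch reuses the string from Example 1; the names greedy and lazy are only illustrative.

```python
import re

res = '文本A百度新闻文本B，新闻标题文本A新闻财经文本B，文本A搜狗新闻文本B新闻网址'

# Greedy: .* grabs as much as possible, so it runs from the first 文本A to the last 文本B
greedy = re.findall('文本A(.*)文本B', res)

# Non-greedy: .*? stops at the nearest 文本B, giving one result per pair
lazy = re.findall('文本A(.*?)文本B', res)

print(greedy)  # ['百度新闻文本B，新闻标题文本A新闻财经文本B，文本A搜狗新闻']
print(lazy)    # ['百度新闻', '新闻财经', '搜狗新闻']
```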

Example 2:

import re

res = '<div class="news-source"><div class="c-img c-img1 c-img-circle news-source-icon_1tdlx c-gap-right-xsmall">网易新闻 2020年12月27日 18:37</div></div>'

p_info = '<div class="news-source">(.*?)</div>'

info = re.findall(p_info, res)

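print(info)  # ['<div class="c-img c-img1 c-img-circle news-source-icon_1tdlx c-gap-right-xsmall">网易新闻 2020年12月27日 18:37'] (the inner <div> tag is captured too)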
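In Example 2 the captured text still begins with the inner div tag. One possible way to drop it, sketched here as an assumption rather than the author's method, is to put a .*?> filler in front of the group and then split the result into source and date on the first space.

```python
import re

res = '<div class="news-source"><div class="c-img c-img1 c-img-circle news-source-icon_1tdlx c-gap-right-xsmall">网易新闻 2020年12月27日 18:37</div></div>'

# .*?> consumes the inner <div ...> tag; (.*?) then captures only the visible text
p_info = '<div class="news-source">.*?>(.*?)</div>'
info = re.findall(p_info, res)

# Split on the first space only: the source name comes first, the date and time follow
source, date = info[0].split(' ', 1)

print(info)    # ['网易新闻 2020年12月27日 18:37']
print(source)  # 网易新闻
print(date)    # 2020年12月27日 18:37
```

The split assumes the source name itself contains no space, which holds for this sample but may not hold for every page.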

3. Non-greedy matching with .*?

import re

res = '<h3>文本C<变化的网址>文本D新闻标题</h3>'

p_title = '<h3>文本C.*?文本D(.*?)</h3>'

title = re.findall(p_title, res)

print(title) # ['新闻标题']

import re

res = '<h3 class="c-title"><a href=" 网址" data-click="{英文& 数字}"><em> 阿里巴巴</em> 代码竞赛现全球首位AI评委能为代码质量打分</a></h3>'

p_title = '<h3 class="c-title">.*?>(.*?)</a>'

title = re.findall(p_title, res)

print(title)  # .*?> skips over the content we do not need; (.*?) captures the content we are looking for

# ['<em> 阿里巴巴</em> 代码竞赛现全球首位AI评委能为代码质量打分']

res2 = '<h3 class="c-title"><a href="https://www.baidu.com/"></a>'

p_href = '<h3 class="c-title"><a href="(.*?)"'

href = re.findall(p_href, res2)

print(href) # ['https://www.baidu.com/']
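The two patterns above can also be merged so that the link and the title come back together. This is a sketch with a shortened, made-up HTML fragment; when a pattern has two groups, re.findall returns one tuple per match.

```python
import re

# Hypothetical fragment modeled on the examples above
res = '<h3 class="c-title"><a href="https://www.baidu.com/" data-click="{}"><em>阿里巴巴</em> 代码竞赛</a></h3>'

# Group 1 captures the href value; .*?> skips the remaining attributes; group 2 captures the title
p_both = '<h3 class="c-title"><a href="(.*?)".*?>(.*?)</a>'
pairs = re.findall(p_both, res)

print(pairs)  # [('https://www.baidu.com/', '<em>阿里巴巴</em> 代码竞赛')]

for href, title in pairs:
    print(href, title)
```

Extracting both fields in one pass keeps each link paired with its own title, which separate findall calls do not guarantee if one of the fields is missing on some entries.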

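4. The re.S flag: letting . match newlines

By default . does not match a newline, so (.*?) and .*? cannot match across line breaks. Passing re.S (an alias of re.DOTALL) as the third argument makes . match newlines as well:

re.findall(pattern, original_text, re.S)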

import re

res = ''' 文本A
百度新闻文本B'''

p_source = ' 文本A(.*?)文本B'

source = re.findall(p_source, res, re.S)

print(source)  # ['\n百度新闻'] (the newline is captured as part of the match)
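To see what re.S changes, the same pattern can be run with and without the flag. This is a minimal sketch; the variable names are illustrative.

```python
import re

res = '文本A\n百度新闻文本B'
pattern = '文本A(.*?)文本B'

# Without re.S the dot cannot cross the newline, so nothing matches
without_flag = re.findall(pattern, res)

# With re.S the dot also matches '\n', so the text on the next line is captured
with_flag = re.findall(pattern, res, re.S)

# strip() removes the leading newline that was captured along with the text
cleaned = [s.strip() for s in with_flag]

print(without_flag)  # []
print(with_flag)     # ['\n百度新闻']
print(cleaned)       # ['百度新闻']
```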

import re

res = '''<h3 class="c-title">
<a href="http://baijiahao.baidu.com/s?id=163111&wfr=spider&for=pc"
data-click="{
英文&数字
}"
target="_blank"
>
<em>阿里巴巴</em> 代码竞赛现全球首位
</a>
'''

p_href = '<h3 class="c-title">.*?<a href="(.*?)"'

p_title = '<h3 class="c-title">.*?>(.*?)</a>'

href = re.findall(p_href, res, re.S)

title = re.findall(p_title, res, re.S)

print(href)  # ['http://baijiahao.baidu.com/s?id=163111&wfr=spider&for=pc']

print(title)  # ['\n<em>阿里巴巴</em> 代码竞赛现全球首位\n'] (the surrounding newlines are captured too)
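When a page contains several such blocks, one option is to compile the pattern once with re.S and collect a link and a title from every block. The HTML below and the example.com URLs are invented for this sketch.

```python
import re

# Hypothetical multi-line HTML with two result blocks
html = '''<h3 class="c-title">
<a href="http://example.com/a"
target="_blank"
>
First headline
</a>
</h3>
<h3 class="c-title">
<a href="http://example.com/b"
target="_blank"
>
Second headline
</a>
</h3>'''

# re.S lets every .*? in the pattern run across line breaks
pattern = re.compile('<h3 class="c-title">.*?<a href="(.*?)".*?>(.*?)</a>', re.S)

# Each match yields (href, title); strip() drops the newlines around the title
results = [(href, title.strip()) for href, title in pattern.findall(html)]

print(results)
# [('http://example.com/a', 'First headline'), ('http://example.com/b', 'Second headline')]
```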

5. Additional notes

(1) The sub() function: cleaning up content extracted with a regular expression

# re.sub(pattern_to_replace, replacement, original_string)

import re

title = ['<em>阿里巴巴</em> 代码竞赛全球首位AI评委能为代码质量打分']

title[0] = re.sub('<em>', '', title[0])

title[0] = re.sub('</em>', '', title[0])

print(title[0])  # 阿里巴巴 代码竞赛全球首位AI评委能为代码质量打分 (the space after </em> remains)

import re

title = ['<em>阿里巴巴</em> 代码竞赛全球首位AI评委能为代码质量打分']

title[0] = re.sub('<.*?>', '', title[0])

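print(title[0])  # 阿里巴巴 代码竞赛全球首位AI评委能为代码质量打分 (the space after </em> remains)

# <.*?> matches anything of the form <...>, i.e. any tag

# '' is the replacement: an empty string, so each match is deleted

# the title[0] on the left of = receives the cleaned title

# the title[0] passed as the last argument is the original title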
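The same cleanup can be applied to a whole list of extracted titles at once. A minimal sketch, assuming the titles were captured with surrounding whitespace and inline tags; the sample list is illustrative.

```python
import re

# Hypothetical raw results as they might come back from re.findall
titles = ['\n<em>阿里巴巴</em> 代码竞赛现全球首位\n', '<em>华能</em>信托']

# re.sub returns a new string; strip() then trims leading and trailing whitespace
cleaned = [re.sub('<.*?>', '', t).strip() for t in titles]

print(cleaned)    # ['阿里巴巴 代码竞赛现全球首位', '华能信托']
print(titles[1])  # <em>华能</em>信托  (the original list is unchanged)
```

Because re.sub does not modify its input, the result has to be assigned somewhere, either back into the list as in the article or into a new list as here.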

(2) Square brackets []: a metacharacter such as * placed inside square brackets is treated as a literal character

import re

company = '* 华能信托'

company1 = re.sub('[*]', '', company)

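print(company1)  # ' 華能信托'.replace('華', '华') is not needed; the output is ' 华能信托' with a leading space, use company1.strip() to remove it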
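Square brackets are one of several ways to match a literal *. The sketch below compares them with a backslash escape and with re.escape, and shows that a bare * is rejected as a pattern; the variable names are illustrative.

```python
import re

company = '* 华能信托'

by_class = re.sub('[*]', '', company)             # * inside [] is literal
by_escape = re.sub(r'\*', '', company)            # backslash escape
by_helper = re.sub(re.escape('*'), '', company)   # re.escape builds the escaped pattern

# All three leave the space that followed the asterisk
print(repr(by_class))    # ' 华能信托'
print(by_class.strip())  # 华能信托

# A bare * has nothing to repeat, so compiling it raises re.error
try:
    re.sub('*', '', company)
    raised = False
except re.error:
    raised = True

print(raised)  # True
```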


Article link: https://www.haomeiwen.com/subject/jnntoktx.html
