Python Crawler 7: Regular Expressions with re

Author: _百草_ | Published: 2022-07-26 09:19


Regex parsing is only one way to extract data from a page. BeautifulSoup and lxml are other common choices, and both parse HTML elements directly.

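1. Regular expressions

Regular expression (regex)
A string-matching pattern, or rule, used to search for or replace text that fits the rule.
A typical crawl goes through these steps:
step 0: determine the page type (static or dynamic)
step 0.5: work out the pattern of the page URLs
step 1: inspect the elements to get the overall structure of the page
step 2: use a parsing module to extract the information you want
step 3: fetch and save the data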

Why use them

Test whether a string fits a pattern
Replace text
Extract substrings from a string based on a pattern match
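
As a minimal sketch of these three uses (the sample string and variable names are made up for illustration):

```python
import re

text = "order 1024 shipped on 2022-07-26"

# 1. Test: does the string contain something shaped like a date?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# 2. Replace: mask every digit
masked = re.sub(r"\d", "#", text)

# 3. Extract: pull out every run of digits
numbers = re.findall(r"\d+", text)

print(has_date)  # True
print(masked)    # order #### shipped on ####-##-##
print(numbers)   # ['1024', '2022', '07', '26']
```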

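1.1 Metacharacters

. matches any character except a newline; to match a literal dot, use \.

\w matches any word character (letter, digit, or underscore)
\W matches any character that is not a letter, digit, or underscore

\s matches any whitespace character, such as a space, newline \n, or tab \t
\S matches any non-whitespace character

\d matches a digit
\D matches a non-digit

\n matches a newline
\t matches a tab
\b matches a word boundary, i.e. the position between a word character and a non-word character: placed at the start of a pattern it finds matches at the beginning of a word, placed at the end it finds matches at the end of a word

^ matches the start of the string
$ matches the end of the string

a|b matches either a or b
() groups part of a regular expression: it matches the expression inside the parentheses and captures it as a group
[...] matches any one character in the set
[^...] matches any one character that is not in the set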
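
A short sketch of a few of these metacharacters in action; the sample string is an assumption chosen so that each pattern has something to find:

```python
import re

sample = "cat concat\tcats 42"

whole_word = re.findall(r"\bcat\b", sample)    # \b keeps "concat" and "cats" out
digits = re.findall(r"\d", sample)             # each digit separately
spaces = re.findall(r"\s", sample)             # spaces and the tab
anchored = re.findall(r"^cat|42$", sample)     # "cat" at the start or "42" at the end
not_lower = re.findall(r"[^a-z\s]", sample)    # neither lowercase letter nor whitespace

print(whole_word)  # ['cat']
print(digits)      # ['4', '2']
print(spaces)      # [' ', '\t', ' ']
print(anchored)    # ['cat', '42']
print(not_lower)   # ['4', '2']
```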

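1.2 Quantifiers

* repeats zero or more times
+ repeats one or more times
? repeats zero times or once
{n} repeats exactly n times
{n,} repeats n or more times
{n,m} repeats between n and m times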
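
To see how each quantifier changes what matches, here is a small sketch that runs all six against the same made-up word list:

```python
import re

words = "gd god good goood"

star = re.findall(r"go*d", words)        # zero or more "o"
plus = re.findall(r"go+d", words)        # one or more "o"
optional = re.findall(r"go?d", words)    # zero or one "o"
exactly = re.findall(r"go{2}d", words)   # exactly two "o"
at_least = re.findall(r"go{2,}d", words) # two or more "o"
between = re.findall(r"go{1,2}d", words) # one or two "o"

print(star)      # ['gd', 'god', 'good', 'goood']
print(plus)      # ['god', 'good', 'goood']
print(optional)  # ['gd', 'god']
print(exactly)   # ['good']
print(at_least)  # ['good', 'goood']
print(between)   # ['god', 'good']
```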

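1.3 Character sets

[0123456789] lists every allowed character; the set matches if any one of them equals the character being tested
[0-9] same as above
[a-z] matches any lowercase letter
[A-Z] matches any uppercase letter
[0-9a-zA-Z] matches any digit or any uppercase or lowercase letter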
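
The sketch below tries these sets on a made-up string. It also shows a common pitfall: a comma inside the brackets is not a separator, it is one more character in the set.

```python
import re

sample = "a1,B2"

digits = re.findall(r"[0-9]", sample)
lower = re.findall(r"[a-z]", sample)
upper = re.findall(r"[A-Z]", sample)
alnum = re.findall(r"[0-9a-zA-Z]", sample)
with_comma = re.findall(r"[0-9,]", sample)  # the comma is matched literally

print(digits)      # ['1', '2']
print(lower)       # ['a']
print(upper)       # ['B']
print(alnum)       # ['a', '1', 'B', '2']
print(with_comma)  # ['1', ',', '2']
```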

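1.4 Greedy and non-greedy mode

Matching is greedy by default: a quantifier matches as many characters as it can.
For example, with {n,m} greedy matching aims for m repetitions first, while non-greedy matching stops at n repetitions if the rest of the pattern can still match.
To switch a quantifier from greedy to non-greedy, add ? after it.

Greedy vs. non-greedy mode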
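
A minimal comparison of the two modes, using a made-up HTML fragment and digit string:

```python
import re

html = "<b>one</b><b>two</b>"

greedy = re.findall(r"<b>.*</b>", html)   # .* runs to the last </b>
lazy = re.findall(r"<b>.*?</b>", html)    # .*? stops at the first </b>

# {n,m}: greedy takes up to m digits, non-greedy settles for n
digits_greedy = re.match(r"\d{2,4}", "123456").group()
digits_lazy = re.match(r"\d{2,4}?", "123456").group()

print(greedy)         # ['<b>one</b><b>two</b>']
print(lazy)           # ['<b>one</b>', '<b>two</b>']
print(digits_greedy)  # 1234
print(digits_lazy)    # 12
```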

1.5 Escape characters

To match a special character literally, put a backslash in front of it. The special characters are:
. * + ? ^ $ [ ] ( ) { } | \
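
The sketch below escapes special characters by hand and then with re.escape, which adds the backslashes for you. The sample price string is made up for illustration.

```python
import re

price = "Total: $3.50 (approx.)"

# Escape $ and . by hand to match them literally
exact = re.search(r"\$3\.50", price).group()

# An unescaped dot matches any character, so "3x50" slips through
loose = re.fullmatch(r"3.50", "3x50") is not None
strict = re.fullmatch(r"3\.50", "3x50") is not None

# re.escape builds a pattern that matches the given text literally
escaped = re.escape("(approx.)")
found = re.search(escaped, price).group()

print(exact)   # $3.50
print(loose)   # True
print(strict)  # False
print(found)   # (approx.)
```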

2. The re module

2.1 re.match

Matches at the start of the string


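import re

"""
re.match(pattern, string, flags=0)
Tries to match at the start of the string. Returns a Match object on success,
or None if the pattern does not match at the start.
- pattern: the regular expression to match
- string: the string to match against
- flags: flags that control how matching works, e.g. case-insensitive or multi-line matching
"""
# Exercise: match phone numbers that do not end in 4 or 7

phones = ["13166781234", "15876546657", "15912345678", "1321234567997"]
for phone in phones:
    # [0-35689] lists the allowed last digits; a comma inside [] would be matched literally
    res = re.match(r"1\d{9}[0-35689]$", phone)  # $ anchors the end of the string
    print(res)
    if res:
        print("Phone number that does not end in 4 or 7:", res.group())
    else:
        print("%s is not a number we want" % phone)
# None
# 13166781234 is not a number we want
# None
# 15876546657 is not a number we want
# <re.Match object; span=(0, 11), match='15912345678'>
# Phone number that does not end in 4 or 7: 15912345678
# None
# 1321234567997 is not a number we want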

2.1.1 groups & group

"""
groups() and group(num=0) return the matched text
- group()  the whole match; several group numbers can be passed at once, in which case
            a tuple with the text of each of those groups is returned
            group(n) the text captured by the n-th group
- groups() a tuple with the text of every captured group, equivalent to (group(1), group(2), ...)

span() the (start, end) indices of the whole match
    span(n): when the pattern has groups, the indices of the text captured by the n-th group
"""
s = '1234567890QWER'
match_obj = re.match(r'\d{2}', s)   # \d matches a digit, {n} sets the count
print(match_obj)  # <re.Match object; span=(0, 2), match='12'>
if match_obj:
    print("match_obj.span() = ", match_obj.span())  # match_obj.span() =  (0, 2)
    print("match_obj.groups() = ", match_obj.groups())  # match_obj.groups() =  ()
    print("match_obj.group() = ", match_obj.group())  # match_obj.group() =  12
    # print("match_obj.group(1) = ", match_obj.group(1))  # IndexError: no such group (the pattern has no groups)
else:
    print("No Match!")
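
The pattern above has no groups, so groups() comes back empty. As a complementary sketch, here is a pattern with three capturing groups (the date pattern and sample string are assumptions for illustration):

```python
import re

m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2022-07-26 09:19")

print(m.group())      # 2022-07-26  (the whole match)
print(m.group(1))     # 2022        (first group)
print(m.group(1, 3))  # ('2022', '26')
print(m.groups())     # ('2022', '07', '26')
print(m.span())       # (0, 10)
print(m.span(2))      # (5, 7)      (where the second group sits)
```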

2.2 re.search

Scans the whole string and stops at the first match

"""
re.search(pattern, string, flags)
Scan through string looking for a match to the pattern, returning a Match object, or None if no match was found.
In other words: it scans the whole string and returns a Match object for the first match, or None if nothing matches.
"""
s = "Cats are smarter than dogs"
search_obj = re.search(r'(.*?) are (.*?) (.*?) ', s)
# () captures a group; . matches any single character; * means zero or more; ? after a quantifier turns off greedy mode
print("search_obj = ", search_obj)
# search_obj =  <re.Match object; span=(0, 22), match='Cats are smarter than '>
if search_obj:
    print("search_obj.groups() =", search_obj.groups())  # search_obj.groups() = ('Cats', 'smarter', 'than')
    print("search_obj.group() =", search_obj.group())  # search_obj.group() = Cats are smarter than
    print("search_obj.group(1) =", search_obj.group(1))  # search_obj.group(1) = Cats
    print("search_obj.span() =", search_obj.span())  # search_obj.span() = (0, 22)
else:
    print("No search!")
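
The difference between re.match and re.search is easiest to see side by side. The sketch below also includes re.fullmatch, which the text above does not cover; it requires the whole string to match. The sample strings are made up.

```python
import re

text = "id: 42"

at_start = re.match(r"\d+", text)          # the string does not start with a digit
anywhere = re.search(r"\d+", text)         # finds the first run of digits anywhere
whole_ok = re.fullmatch(r"\d+", "42")      # the entire string is digits
whole_bad = re.fullmatch(r"\d+", "42a")    # the trailing "a" breaks the match

print(at_start)          # None
print(anywhere.group())  # 42
print(whole_ok.group())  # 42
print(whole_bad)         # None
```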

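2.3 re.sub

"""
re.sub(pattern, repl, string, count=0, flags=0)
Replaces the matches in a string and returns a new string
- pattern: the regular expression (a pattern string or a compiled pattern object)
- repl: the replacement string; it can also be a function
- string: the original string to search
- count: maximum number of replacements; the default 0 replaces every match
- flags: flags that modify how the pattern matches
"""
phone = "158-2765-1234  # 手机号码"

# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)  # \D matches a non-digit; \d matches a digit
print("phone after substitution:", num)  # phone after substitution: 15827651234
print(phone)  # 158-2765-1234  # 手机号码 (the original string is unchanged)


# repl can also be a function
# If it is a callable, it's passed the Match object and must return a replacement string to be used.
def double(matched):
    v = int(matched.group())
    print(v)
    return str(v * 2)

# print(re.sub(r"\D+?", lambda x: x * 2, phone))
# TypeError: unsupported operand type(s) for *: 're.Match' and 'int'
# The callable receives a Match object, so call .group() to get the matched text first
print(re.sub(r"\D+?", lambda x: x.group() * 2, "158-2765-1234"))  # 158--2765--1234

doubled = re.sub(r"(\d+)", double, phone)   # double is the function defined above
print("repl as a function:", doubled)  # repl as a function: 316-5530-2468  # 手机号码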
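
Two more features of re.sub that are handy when cleaning scraped text: backreferences such as \1 in the replacement string, and the count argument. The sketch also uses re.subn, which is not covered above; it returns the new string together with the number of replacements. The sample date is made up.

```python
import re

date = "2022-07-26"

# \1, \2, \3 in repl refer to the captured groups
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

# count=1 replaces only the first match
first_only = re.sub(r"-", "/", date, count=1)

# subn also reports how many replacements were made
new_text, n = re.subn(r"-", "/", date)

print(swapped)      # 26/07/2022
print(first_only)   # 2022/07-26
print(new_text, n)  # 2022/07/26 2
```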

2.4 re.compile

regex = re.compile(pattern=r"\w+", flags=0)
# Compile a regular expression pattern, returning a Pattern object (a compiled regular expression object).
# - pattern: the regular expression string
# - flags: flags that modify how the pattern matches
print(regex)  # re.compile('\\w+')
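
Compiling pays off when the same pattern is applied many times, for example once per line of a page. A minimal sketch, with a made-up pattern and list of lines:

```python
import re

word = re.compile(r"[a-z]+", re.I)  # compiled once, reused for every line

lines = ["Hello world", "123", "Regex FTW"]
counts = [len(word.findall(line)) for line in lines]

print(counts)                   # [2, 0, 2]
print(word.pattern)             # [a-z]+
print(bool(word.flags & re.I))  # True
```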

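2.5 regex.findall

# regex.findall(string, pos, endpos)
# findall(self, string: AnyStr, pos: int = ..., endpos: int = ...) -> list[Any]
# string: the target string
# pos/endpos: the positions in the target string where matching starts/stops
s = "<h1>Python re模块用法详解</h1>"  # sample string inferred from the printed output
print(regex.findall(s))  # ['h1', 'Python', 're模块用法详解', 'h1']
"""
regex = re.compile(pattern, flags)
regex.findall(string)
Used together, these two calls are
equivalent to re.findall(pattern, string)
"""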

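The pos and endpos arguments of a compiled pattern's findall limit the part of the string that is searched. A small sketch, with a made-up pattern and sample text:

```python
import re

regex = re.compile(r"\d+")
text = "a1 b22 c333 d4444"

everything = regex.findall(text)
from_3 = regex.findall(text, 3)          # start searching at index 3
window = regex.findall(text, 3, 11)      # search only text[3:11], i.e. "b22 c333"

print(everything)  # ['1', '22', '333', '4444']
print(from_3)      # ['22', '333', '4444']
print(window)      # ['22', '333']
```
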
2.6 re.findall

re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string, i.e. a list of the matched text.
If one or more capturing groups are present in the pattern, return a list of groups (only the text captured by the groups is returned); this will be a list of tuples if the pattern has more than one group.
Empty matches are included in the result.
 - pattern: the regular expression
 - string: the target string
 - flags: flags that modify how the pattern matches

re.finditer
re.finditer(pattern, string, flags=0)
Like findall, it finds every substring that the regular expression matches, but returns them as an iterator of Match objects.
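
Because re.finditer yields Match objects, it gives access to positions as well as text, which re.findall does not. A sketch with a made-up key=value string:

```python
import re

text = "x=1, y=22, z=333"
pattern = r"(\w)=(\d+)"

# findall: only the captured text, one tuple per match
as_list = re.findall(pattern, text)

# finditer: full Match objects, so span() is available too
pairs = [(m.group(1), int(m.group(2)), m.span()) for m in re.finditer(pattern, text)]

print(as_list)  # [('x', '1'), ('y', '22'), ('z', '333')]
print(pairs)    # [('x', 1, (0, 3)), ('y', 22, (5, 9)), ('z', 333, (11, 16))]
```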

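2.7 re.split

split(pattern, string, maxsplit=0, flags=0)
# Cuts the target string at every match (the matched text is removed) and returns the list of remaining substrings
# The result is the complement of what re.findall() returns
 - pattern: the regular expression
 - string: the target string
 - maxsplit: if nonzero, at most maxsplit splits are made, and the rest of the string is returned as the last element of the list
 - flags: flags that modify how the pattern matches

s = "<h1>Python re模块用法详解</h1>"
print(re.findall(r"\w+", s))  # ['h1', 'Python', 're模块用法详解', 'h1']

res = re.split(r"\w+", s)
print(res)  # ['<', '>', ' ', '</', '>']
res = re.split(r"\w+", s, maxsplit=3)
print(res)  # ['<', '>', ' ', '</h1>']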
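
A more typical use of re.split is cutting on several delimiters at once. The sketch below also shows that a capturing group in the pattern keeps the delimiters in the result; the sample strings are made up.

```python
import re

row = "a, b;c  d"

fields = re.split(r"[,;\s]+", row)           # split on commas, semicolons, whitespace
with_seps = re.split(r"([,;])", "a,b;c")     # a capturing group keeps the delimiters
limited = re.split(r",", "a,b,c", maxsplit=1)

print(fields)     # ['a', 'b', 'c', 'd']
print(with_seps)  # ['a', ',', 'b', ';', 'c']
print(limited)    # ['a', 'b,c']
```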

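2.8 flags

"""
flags
Regular expression modifiers
re.A  ASCII-only matching: \w, \W, \b, \B, \d, \D, \s and \S match only ASCII instead of Unicode
re.I  makes matching case-insensitive (IGNORECASE)
re.L  makes \w, \W, \b, \B and case-insensitive matching depend on the current locale (LOCALE); usable only with bytes patterns
re.M  multi-line mode: ^ and $ also match at the start and end of each line
re.S  makes . match any character including a newline (by default . does not match a newline)
re.U  makes \w, \W, \b, \B, \s, \S follow the Unicode character properties database; this is already the default for str patterns in Python 3
re.X  for readability, ignores whitespace in the pattern and treats everything after # as a comment

Combine several flags with |
e.g. re.I | re.M sets both the I and the M flag
"""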

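A short sketch of the most commonly used flags (re.I, re.M, re.S, re.X); the sample strings and the version pattern are made up for illustration:

```python
import re

text = "Python\npython\nPYTHON"

no_flags = re.findall(r"^python$", text)            # ^ and $ only see the whole string
multi_line = re.findall(r"^python$", text, re.M)    # ^ and $ work per line
both = re.findall(r"^python$", text, re.I | re.M)   # per line and case-insensitive

dot_default = re.search(r"a.b", "a\nb")             # . does not match the newline
dot_all = re.search(r"a.b", "a\nb", re.S)           # with re.S it does

# re.X lets the pattern be laid out over several lines with comments
version = re.compile(r"""
    (\d+)   # major version
    \.      # a literal dot
    (\d+)   # minor version
""", re.X)
parts = version.search("v3.11").groups()

print(no_flags)     # []
print(multi_line)   # ['python']
print(both)         # ['Python', 'python', 'PYTHON']
print(dot_default)  # None
print(dot_all is not None)  # True
print(parts)        # ('3', '11')
```
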
Exercise

html = """
<div class="arc-info">
    <span class="position">
        <span class="iconfont iconfont-home2"></span> 
        <a href="/">首页</a> &gt; 
        <a href="/python_spider/">Python爬虫</a>
    </span>
    <span class="read-num">阅读：12,736</span>
</div>
       """
# Create the compiled pattern object; re.S lets . match newlines as well
pattern = re.compile(r"<span.*?>(.*?)</span>", re.S)
res = pattern.findall(html)
print(res)  # ['\n        <span class="iconfont iconfont-home2">', '阅读：12,736']
# With flags=0 the output is ['', '阅读：12,736']
# With a greedy group, (.*) instead of (.*?), the output is ['\n        <span class="iconfont iconfont-home2"></span> \n        <a href="/">首页</a> &gt; \n        <a href="/python_spider/">Python爬虫</a>\n    </span>\n    <span class="read-num">阅读：12,736']

2.9 Regular expression groups

Use groups ( ) to extract just the information you want

s = "菜鸟教程 www.runoob.com"

# Extract everything
regex = re.compile(r"\w+\s+\w+\.\w+\.\w+")
print(regex.findall(s))  # ['菜鸟教程 www.runoob.com']; \. matches a literal dot

# Match the first item only
# regex2 = re.compile(r"\w+\s+")
# print(regex2.findall(s))  # ['菜鸟教程 ']
regex2 = re.compile(r"(\w+)\s+")
print(regex2.findall(s))  # ['菜鸟教程']; only the text matched inside ( ) is returned

# Extract several items, returned as a tuple
regex3 = re.compile(r"(\w+)\s+(\w+\.\w+\.\w+)")
print(regex3.findall(s))  # [('菜鸟教程', 'www.runoob.com')]
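
Groups can also be given names with (?P<name>...), or made non-capturing with (?:...). Neither is covered above, so treat this as an optional extension. The sketch reuses the sample string from this section, and the group names name and host are made up.

```python
import re

s = "菜鸟教程 www.runoob.com"

# Named groups: read the captures by name instead of by number
m = re.search(r"(?P<name>\w+)\s+(?P<host>[\w.]+)", s)
print(m.group("name"))  # 菜鸟教程
print(m.group("host"))  # www.runoob.com
print(m.groupdict())    # {'name': '菜鸟教程', 'host': 'www.runoob.com'}

# Non-capturing group: (?:www\.)? takes part in the match but is not returned
domain = re.findall(r"(?:www\.)?(\w+)\.com", s)
print(domain)  # ['runoob']
```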

Exercise: parse a web page to get version numbers

with open("test.html", "r", encoding="utf-8") as f:
    html = f.read()
# print(html)
# A sample line from the file:
# <span data-v-695815d8="" style="font-size: 16px;">1.0.0&nbsp;&nbsp;</span>
pattern = re.compile(r'<span .*? style="font-size: 16px;">(\d+\.\d+\.\d+?)&nbsp;&nbsp;</span>', re.S)
print(pattern.findall(html))  # ['1.0.1', '1.0.0']

