Building a Tower from Sand -- Crawler Series (3) (Regular Expressions)

Author: 爱做饭的老谢 | Source: published 2017-09-29 14:24, read 24 times

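Copyright notice: this is an original article by the author. You may repost it freely, but you must credit the source in a clearly visible place!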

Why learn regular expressions

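Processing text and data is a big deal, and it makes up a large share of our day-to-day work: word processing, web form filling, streams of information from databases, stock quotes, news lists, and so on. Because we usually do not know in advance the exact content of the text or data a program will have to handle, it is very useful to describe that text or data as a pattern the computer can recognize and act on. Regular expressions (advanced text pattern matching) do exactly that: they combine ordinary characters and special symbols into a pattern that the computer uses to find, within the text, the set of strings we have defined. Python crawlers rely heavily on regular expressions, and if you do not learn them well you will struggle to learn crawling well. And without crawling, how will you scrape girls' profiles, help your girlfriend get her work done faster, or snap up tickets...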

Special symbols and characters used in regular expressions

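Notation	Description	Example
literal	Matches the literal string value	foo
re1|re2	Matches regular expression re1 or re2	foo|bar
.	Matches any single character (except a newline)	b.b
^	Matches the start of the string	^Dear
$	Matches the end of the string	/bin/*sh$
*	Matches zero or more occurrences of the preceding regular expression	[A-Za-z0-9]*
+	Matches one or more occurrences of the preceding regular expression	[a-z]+\.com
?	Matches zero or one occurrence of the preceding regular expression	goo?
{N}	Matches exactly N occurrences of the preceding regular expression	[0-9]{3}
{M,N}	Matches from M to N occurrences of the preceding regular expression (no space after the comma)	[0-9]{5,9}
[...]	Matches any single character from the character class	[aeiou]
[..x-y..]	Matches any single character in the range x to y	[0-9], [A-Za-z]
[^...]	Matches any single character that is not in the character class, including any ranges listed in it	[^aeiou], [^A-Za-z0-9]
(*|+|?|{})?	Non-greedy version of any of the repetition symbols above (*, +, ?, {})	.*?[a-z]
(...)	Matches the enclosed regular expression and saves it as a subgroup	([0-9]{3})?, f(oo|u)bar
Special characters
\d	Matches any decimal digit, same as [0-9] for ASCII text (\D is the inverse: any non-digit)	data\d+\.txt
\w	Matches any alphanumeric character or underscore, same as [A-Za-z0-9_] for ASCII text (\W is the inverse of \w)	[A-Za-z_]\w+
\s	Matches any whitespace character, same as [ \n\t\r\v\f] (\S is the inverse of \s)	of\sthe
\b	Matches a word boundary (\B is the inverse of \b)	\bThe\b
\N	Matches saved subgroup number N	price: \16
\c	Matches the special character c literally (that is, removes its special meaning)	\., \\, \*
\A, \Z	Match the start and the end of the string	\ADear, Dear\Z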
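To make a few rows of the table concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration and are not from the original article.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Dear Bob, call 555-1234 or 987-6543. data12.txt and data7.txt are attached."

# ^ anchors at the start of the string; \b marks a word boundary.
starts_with_dear = re.search(r"^Dear\b", text) is not None

# {N} repeats the preceding character class exactly N times.
phones = re.findall(r"[0-9]{3}-[0-9]{4}", text)

# \d+ is one or more digits; \. is a literal dot (a bare . would match any character).
files = re.findall(r"data\d+\.txt", text)

# Greedy .* grabs as much as it can; non-greedy .*? stops at the first opportunity.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

# \1 refers back to whatever subgroup 1 matched (here: a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

print(starts_with_dear, phones, files, greedy, lazy, doubled)
```

The greedy and non-greedy results differ only because of the trailing ?, which is the behavior described in the "non-greedy" row above.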

Python's regular expression module: re

The main functions and methods of the re module are:

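compile(pattern, flags=0)
Compiles the regular expression pattern and returns a regex object. flags is optional; see the API documentation for the full list. A commonly used flag is re.S, which makes . match every character including newlines (without it, . matches everything except a newline).
match(pattern, string, flags=0)
Tries to match pattern at the beginning of string. Returns a match object on success, otherwise None.
search(pattern, string, flags=0)
Searches string for the first occurrence of pattern; flags is optional. Returns a match object on success, otherwise None.
findall(pattern, string[, flags])
Finds all non-overlapping occurrences of pattern in string and returns them as a list of strings (a list of tuples if the pattern contains more than one subgroup).
finditer(pattern, string[, flags])
Same as findall(), but returns an iterator instead of a list; the iterator yields one match object per match.
split(pattern, string, maxsplit=0)
Splits string into a list, using the matches of pattern as delimiters, and returns that list. At most maxsplit splits are made (by default, the string is split at every match).
sub(pattern, repl, string, count=0)
Replaces the places in string that match pattern with the string repl. At most count replacements are made; if count is not given, every match is replaced.
group(num=0)
Match object method. Returns the entire match (or the subgroup numbered num).
groups()
Match object method. Returns a tuple containing all matched subgroups (an empty tuple if the pattern has no subgroups).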
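The calls listed above can be sketched side by side as follows. The log line and variable names are hypothetical, chosen only to show what each call returns.

```python
import re

# Hypothetical log line, invented for this illustration.
line = "user=alice id=42; user=bob id=7"
pattern = re.compile(r"user=(\w+) id=(\d+)")

# match() only succeeds at position 0; search() scans the whole string.
at_start = pattern.match(line)
not_at_start = re.match(r"id=\d+", line)   # None: the line does not start with "id="
first_id = re.search(r"id=(\d+)", line)

whole = at_start.group()    # the entire match
name = at_start.group(1)    # subgroup 1 only
pair = at_start.groups()    # all subgroups as a tuple

# With two subgroups in the pattern, findall() returns a list of tuples of strings.
pairs = pattern.findall(line)

# finditer() yields match objects, so span() and group() are available on each.
spans = [m.span() for m in pattern.finditer(line)]

# maxsplit and count limit how many splits or replacements are made.
parts = re.split(r";\s*", line, maxsplit=1)
masked = re.sub(r"\d+", "#", line, count=1)

print(whole, name, pair, pairs, spans, parts, masked)
```

Note that findall() returns plain strings or tuples, while finditer(), match(), and search() give match objects.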

Hands-on practice

Using literal patterns

import re

content = '''Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you could report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata,'''

# A literal pattern matches the string exactly as written
pattern = re.compile('find', re.S)

# findall() returns a list of the matched strings
items = re.findall(pattern, content)
print('findall() returns a list: %s' % items)
print('*'*100)
# finditer() returns an iterator of match objects; call group() on each one to get the matched text
items = re.finditer(pattern, content)
print('finditer() returns an iterator: %s' % items)
for index, item in enumerate(items):
    print('item[%d] = %s, item-value = %s' % (index, item, item.group()))
print('*'*100)

# match() returns at most one match object. Remember this!
# It only checks whether the pattern matches at the very start of the string, so the call below returns None
items = re.match(pattern, content)
print(items)
print('*'*100)

# search() finds the first occurrence of the pattern and returns a match object
item = re.search(pattern, content)
print('item = %s, item-value = %s' % (item, item.group()))
print('*'*100)

# sub() replaces the places the pattern matches; here every "find" becomes "hello"
item = re.sub(pattern, 'hello', content)
print(item)
print('*'*100)

pattern = re.compile('find|we', re.S)
# findall() returns a list of the matched strings
items = re.findall(pattern, content)
print('findall() returns a list: %s' % items)
print('*'*100)

Output

findall() returns a list: ['find', 'find']
****************************************************************************************************
finditer() returns an iterator: <callable_iterator object at 0x000001464B11AE80>
item[0] = <_sre.SRE_Match object; span=(100, 104), match='find'>, item-value = find
item[1] = <_sre.SRE_Match object; span=(346, 350), match='find'>, item-value = find
****************************************************************************************************
None
****************************************************************************************************
item = <_sre.SRE_Match object; span=(100, 104), match='find'>, item-value = find
****************************************************************************************************
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you hello a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you could report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you hello any errata, please report them by visiting http://www.packtpub.com/submit-errata,
****************************************************************************************************
findall() returns a list: ['we', 'find', 'we', 'find']
****************************************************************************************************
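One thing the listing above takes for granted is that search() finds something. When nothing matches, both search() and match() return None, and calling group() on None raises AttributeError. A minimal sketch of a guard follows; the helper name first_match and the sample sentence are made up for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if the pattern is not found."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Hypothetical sample sentence, shortened from the article's text.
sample = "If you find any errata, please report them."

found = first_match(r"find", sample)      # the pattern occurs in the middle of the string
missing = first_match(r"bug", sample)     # no occurrence, so no match object to call group() on
at_start = re.match(r"find", sample)      # match() fails: "find" is not at position 0
anchored = re.match(r"If", sample)        # match() succeeds: the string starts with "If"

print(found, missing, at_start, anchored.group())
```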

Putting regular expressions together

# | means "or": x|y|z|... matches x or y or z, and so on
pattern = re.compile(r'find|we', re.S)
# findall() returns a list of the matched strings
items = re.findall(pattern, content)
print('findall() returns a list: %s' % items)
print('*'*100)

# . matches any character except a newline (with re.S it matches newlines too)
pattern = re.compile(r'f..d', re.S)  # f and d with exactly two characters between them
items = re.findall(pattern, content)
print(items)
print('*'*100)

# ^ matches the start of the string
pattern = re.compile(r'^Although', re.S)  # matches Although at the start of the string
items = re.findall(pattern, content)
print(items)
print('*'*100)

# $ matches the end of the string
pattern = re.compile(r'submit-errata,$', re.S)
items = re.findall(pattern, content)
print(items)
print('*'*100)

# Match a string that starts with Although and ends with submit-errata,
# (re.S is what lets . cross the line breaks inside content)
pattern = re.compile(r'^Although.*?submit-errata,$', re.S)
items = re.findall(pattern, content)
print(items)
print('*'*100)

# split() splits the string
pattern = re.compile(r' ', re.S)
items = re.split(pattern, content)
print(items)
print('*'*100)

# \b matches a word boundary: \bAlthough matches at the start of a word, Although\b at the end of a word
pattern = re.compile(r'\bAlthough\b', re.S)  # matches exactly the whole word Although
items = re.findall(pattern, content)
print(items)
print('*'*100)

# Find the host name in the string
pattern = re.compile(r'\w{3}\.\w+\.\w{3}', re.S)  # three word characters, a dot, a name, a dot, three word characters
items = re.findall(pattern, content)
print(items)
print('*'*100)

# Find the URL in the string
pattern = re.compile(r'[a-zA-Z]+://[^\s]*', re.S)  # scheme, ://, then everything up to the next whitespace
items = re.findall(pattern, content)
print(items)
print('*'*100)

Output

findall() returns a list: ['we', 'find', 'we', 'find']
****************************************************************************************************
['find', 'find']
****************************************************************************************************
['Although']
****************************************************************************************************
['submit-errata,']
****************************************************************************************************
['Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or\nthe code—we would be grateful if you could report this to us. By doing so, you can\nsave other readers from frustration and help us improve subsequent versions of this\nbook. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata,']
****************************************************************************************************
['Although', 'we', 'have', 'taken', 'every', 'care', 'to', 'ensure', 'the', 'accuracy', 'of', 'our', 'content,', 'mistakes', 'do', 'happen.', 'If', 'you', 'find', 'a', 'mistake', 'in', 'one', 'of', 'our', 'books—maybe', 'a', 'mistake', 'in', 'the', 'text', 'or\nthe', 'code—we', 'would', 'be', 'grateful', 'if', 'you', 'could', 'report', 'this', 'to', 'us.', 'By', 'doing', 'so,', 'you', 'can\nsave', 'other', 'readers', 'from', 'frustration', 'and', 'help', 'us', 'improve', 'subsequent', 'versions', 'of', 'this\nbook.', 'If', 'you', 'find', 'any', 'errata,', 'please', 'report', 'them', 'by', 'visiting', 'http://www.packtpub.com/submit-errata,']
****************************************************************************************************
['Although']
****************************************************************************************************
['www.packtpub.com']
****************************************************************************************************
['http://www.packtpub.com/submit-errata,']
****************************************************************************************************
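Every pattern above is compiled with re.S, so it may help to see what the flag changes. The sketch below uses a made-up two-line string, and also shows re.M, which the article does not use, for contrast with how ^ and $ behave on multi-line text.

```python
import re

# Hypothetical two-line sample, invented for this illustration.
text = "first line\nsecond line"

# Without re.S, . stops at the newline; with re.S it crosses it.
no_flag = re.findall(r"first.*line", text)
dotall = re.findall(r"first.*line", text, re.S)
lazy = re.findall(r"first.*?line", text, re.S)

# Without re.M, ^ and $ anchor to the whole string only.
starts = re.findall(r"^\w+", text)
ends = re.findall(r"\w+$", text)

# With re.M, ^ and $ also anchor at every line break.
starts_m = re.findall(r"^\w+", text, re.M)
ends_m = re.findall(r"\w+$", text, re.M)

print(no_flag, dotall, lazy, starts, ends, starts_m, ends_m)
```

This is why r'^Although.*?submit-errata,$' needs re.S to match the multi-line content: without the flag, .*? could never get past the first line break.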

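Note: whether you are learning to write crawlers or doing any other kind of text processing, regular expressions are something you simply have to learn. They need regular use; go a long time without them and you will quickly forget them. Most important of all: if you are reading this article, be sure to try things out yourself, because only practice will show you where the problems are. There is no universal regular expression, only a suitable one, and the same match can be written with different regular expressions. As the saying goes, it doesn't matter whether a cat is black or white, as long as it catches mice.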
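As a small illustration of that last point, here is a sketch in which two differently written patterns return the same URL, and a third, slightly stricter one drops the trailing comma. The sample sentence is adapted from the article's text; the second and third patterns are alternatives suggested here, not the author's.

```python
import re

# Sample sentence adapted from the article's text.
sentence = "please report them by visiting http://www.packtpub.com/submit-errata, thanks"

# Two ways to say "a scheme, ://, then everything up to the next whitespace".
by_class = re.findall(r"[a-zA-Z]+://[^\s]*", sentence)
by_shorthand = re.findall(r"https?://\S+", sentence)

# A stricter variant that also stops at a comma, so the trailing comma is not captured.
no_comma = re.findall(r"https?://[^\s,]+", sentence)

print(by_class, by_shorthand, no_comma)
```

Which one is "right" depends on the text being scraped: the stricter variant would cut off a URL that legitimately contains a comma.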

For more articles, follow my blog: http://www.gavinxyj.com

You are welcome to follow my WeChat official account: 爱做饭的老谢. Lao Xie keeps working hard...
