The re module

Author: 安哥生个信 | Published: 2018-06-20 07:10

Python regular expression syntax

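Symbol	Meaning	Example
.	Matches any character except a newline; with the DOTALL flag it matches any character, including a newline.
^	Matches the start of the string.
$	Matches the end of the string.	'test' matches in both 'test' and 'testtool', but 'test$' matches in 'test' and not in 'testtool'.
*, +, ?	'*' matches 0 or more repetitions of the preceding item, '+' matches 1 or more, and '?' matches 0 or 1.	'abc*' matches 'ab', 'abc', 'abcc' and so on.
*?, +?, ??	Non-greedy versions of the quantifiers above; they match as few characters as possible.	<.*> matches the whole string '<H1>title</H1>' (greedy); <.*?> matches only <H1> (non-greedy).
{m}	Repeats the preceding item exactly m times.	a{6} matches exactly 6 'a' characters.
{m,n}	Repeats the preceding item m to n times.	a{2,4} matches 2 to 4 a's, a{2,} matches 2 or more, a{,4} matches 0 to 4.
{m,n}?	Repeats the preceding item m to n times, taking as few repetitions as possible.	In the string 'aaaaaa', a{2,4} matches 4 a's, but a{2,4}? matches only 2.
\	Escapes a special character, or introduces a special sequence.
[]	Defines a character set.	[abc] matches a, b or c; [a-z] matches any lowercase letter; [a-zA-Z0-9] matches any letter or digit; [^6] matches any character except 6.
|	Alternation; matches exactly one of the expressions.	A|B: the alternatives are tried from left to right, so B is tried only if A does not match.
(...)	Groups the enclosed regular expression and captures the text it matches.
(?#...)	Comment; the contents of the parentheses are ignored.
(?=...)	Lookahead: matches only if '...' matches next, without consuming it.	In 'pythonretest', pythonre(?=test) matches 'pythonre'.
(?!...)	Negative lookahead: matches only if '...' does not match next.	pythonre(?!test) matches 'pythonre' only when it is not followed by 'test'.
(?<=...)	Lookbehind: matches only if the current position is preceded by '...'.	'(?<=abc)def' matches 'def' in 'abcdef'.
(?<!...)	Negative lookbehind: matches only if the current position is not preceded by '...'.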
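
The four lookaround forms at the end of the table are easy to misread, so here is a minimal sketch of how they behave. The sample strings are the ones used in the table; the variable names are only illustrative.

```python
import re

# Lookahead: match 'pythonre' only when 'test' follows (or does not follow).
followed = re.search(r"pythonre(?=test)", "pythonretest")
not_followed = re.search(r"pythonre(?!test)", "pythonretest")

# Lookbehind: match 'def' only when 'abc' precedes (or does not precede) it.
after_abc = re.search(r"(?<=abc)def", "abcdef")
not_after_abc = re.search(r"(?<!abc)def", "abcdef")

print(followed.group())   # pythonre
print(not_followed)       # None
print(after_abc.group())  # def
print(not_after_abc)      # None
```

Note that the lookaround text itself is never part of the match: `followed.group()` is 'pythonre', not 'pythonretest'.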

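Special sequence	Meaning
\A	Matches only at the start of the string.
\b	Matches the empty string at the beginning or end of a word.
\B	Matches the empty string that is not at the beginning or end of a word.
\d	Matches any decimal digit; equivalent to [0-9] for ASCII text.
\D	Matches any non-digit character; equivalent to [^0-9] for ASCII text.
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text.
\S	Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v] for ASCII text.
\w	Matches any letter, digit or underscore; equivalent to [a-zA-Z0-9_] for ASCII text.
\W	Matches any character that is not a letter, digit or underscore; equivalent to [^a-zA-Z0-9_] for ASCII text.
\Z	Matches only at the end of the string.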
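
As a rough illustration of the word-boundary and character-class sequences, the sketch below runs a few of them over one made-up sample string (the string and variable names are assumptions, not from the original article).

```python
import re

text = "cat concat cats 42"

# \b anchors at a word boundary, so only the standalone word 'cat' matches.
whole_words = re.findall(r"\bcat\b", text)

# \B requires that there is no boundary right before 'cat',
# so this finds words that contain 'cat' after their first letter.
inside_words = re.findall(r"\w*\Bcat\w*", text)

# \d+ collects runs of digits; \w+ collects runs of word characters.
numbers = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)

print(whole_words)   # ['cat']
print(inside_words)  # ['concat']
print(numbers)       # ['42']
print(words)         # ['cat', 'concat', 'cats', '42']
```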

Common functions in the re module

re.match

re.match(pattern, string, flags=0)

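match tries to match the pattern at the beginning of the string and returns a match object on success. It returns None if the pattern cannot match there, including when the string ends before the whole pattern has been matched. The match does not have to reach the end of the string.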

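Parameters:

pattern: the regular expression, given as a string (see the re.compile section below)
string: the string to match against
flags: optional; the matching mode (see the re.compile section below)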

In[25]: import re
In[26]: print(re.match(r"www","com.www.com"))
None
In[27]: print(re.match(r"www","www.com"))
<_sre.SRE_Match object; span=(0, 3), match='www'>
In[28]: print(re.match(r"www","www.com").group())
www
In[30]: print(re.match(r"wwww","www.com"))
None
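
Because match returns None on failure, calling .group() directly on the result raises an AttributeError when nothing matched. The sketch below shows one way to guard against that, and contrasts match with re.fullmatch, which is not covered in this article but is part of the standard re module. The helper name starts_with_www is hypothetical.

```python
import re

def starts_with_www(host):
    """Return True if `host` begins with 'www' (hypothetical helper)."""
    # re.match returns None on failure, so test the result before using it.
    return re.match(r"www", host) is not None

print(starts_with_www("www.com"))      # True
print(starts_with_www("com.www.com"))  # False

# match only anchors the start; trailing text is allowed.
prefix = re.match(r"www", "wwwxyz")
print(prefix.group())                  # www

# fullmatch additionally requires the pattern to consume the whole string.
whole = re.fullmatch(r"www", "wwwxyz")
print(whole)                           # None
```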

re.search()

re.search(pattern, string, flags=0)

search scans the string for a match at any position and returns a match object for the first one it finds; if there is no match, it returns None.

In[32]: print(re.search(r"com","www.com").group())
com
In[33]: print(re.search(r"\dcom","www.4com+www.5com").group())
4com

group()

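group() and group(0) return the whole string matched by the RE.

group(n,m) returns a tuple of the strings matched by groups n and m, where a group is a parenthesized part of the pattern.

groups() returns a tuple of the strings matched by all groups in the RE.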

In[46]: line = "This is a test"
In[28]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).group()
Out[28]: 
'This is a test'
In[29]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).group(0)
Out[29]: 
'This is a test'
In[26]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).group(2)
Out[26]: 
'a '
In[27]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).group(3)
Out[27]: 
'test'
In[25]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).group(1,3)
Out[25]: 
('is ', 'test')
In[30]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).groups()
Out[30]: 
('is ', 'a ', 'test')

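matched.string is the string that was passed to match(). It equals matched.group(0) in this example only because the pattern matches the entire string.

matched.start(n) returns the start index of group n, and matched.end(n) returns its end index.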

In[31]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).start(1)
Out[31]: 
5
In[32]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).end(1)
Out[32]: 
8
In[33]: s1 = re.match( r'This (\w+\s)(\w+\s)(\w+)', line).start(1)
In[34]: e1 = re.match( r'This (\w+\s)(\w+\s)(\w+)', line).end(1)
In[35]: re.match( r'This (\w+\s)(\w+\s)(\w+)', line).string
Out[35]: 
'This is a test'
In[36]: s = re.match( r'This (\w+\s)(\w+\s)(\w+)', line).string  # avoid shadowing the built-in str
In[38]: s[s1:e1]
Out[38]: 
'is '
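
To make the difference between the string attribute and group(0) visible, the following sketch uses a shortened pattern that deliberately stops before the last word. The pattern is an assumption chosen for illustration; span(n) is the standard shorthand for (start(n), end(n)).

```python
import re

line = "This is a test"
# This pattern stops before 'test', so the match covers only part of the input.
m = re.match(r"This (\w+\s)(\w+\s)", line)

print(repr(m.group(0)))  # 'This is a '
print(repr(m.string))    # 'This is a test'
print(m.span(1))         # (5, 8)

# Slicing the input with start/end gives the same text as group(n).
second = m.string[m.start(2):m.end(2)]
print(second == m.group(2))  # True
```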

(?P=name)

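(?P=name) is a backreference to an earlier group named name, defined with (?P<name>...). It matches the same text that the named group matched. In the example below the group is called tagName.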

In[3]: reNamedGroupTestStr = u'标签：<a href="/tag/情侣电话粥/">情侣电话粥</a>'
In[7]: foundTagA = re.search(u'.+?<a href="/tag/(?P<tagName>.+?)/">(?P=tagName)</a>', reNamedGroupTestStr) # (?P=tagName) refers back to the earlier (?P<tagName>.+?)
In[8]: foundTagA.group("tagName")
Out[8]: 
'情侣电话粥'
In[12]: foundTagA.groups()
Out[12]: 
('情侣电话粥',)  # (?P=tagName) is a backreference, not a new group, so there is only one group
In[13]: re.search(u'.+?<a href="/tag/(.+?)/">(.+?)</a>', reNamedGroupTestStr).groups()
Out[13]: 
('情侣电话粥', '情侣电话粥') # two groups
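
A common use of a named backreference is checking that an opening and closing tag agree. The sketch below is a simplified illustration under that assumption; the pattern and the group names tag and body are made up for this example and are not a general HTML parser.

```python
import re

# (?P<tag>...) names the group; (?P=tag) must match the same text again.
paired = re.compile(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>")

ok = paired.search("<b>bold</b>")
mismatch = paired.search("<b>bold</i>")

print(ok.group("tag"), ok.group("body"))  # b bold
print(ok.groupdict())                     # {'tag': 'b', 'body': 'bold'}
print(mismatch)                           # None
```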

re.findall

re.findall(pattern,string,flags=0)

findall scans the whole string and returns a list of all non-overlapping substrings that match.

In[64]: print(re.findall(r"\d+","this 123 is a 456 test"))
['123', '456']
In[65]: print(re.search(r"\d+","this 123 is a 456 test"))
<_sre.SRE_Match object; span=(5, 8), match='123'>
In[66]: print(re.match(r"\d+","this 123 is a 456 test"))
None
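
One detail worth knowing is that the shape of the list returned by findall depends on how many groups the pattern contains. The sample text below is invented for illustration.

```python
import re

text = "width=10, height=20"

# No groups: each item is the whole match.
whole = re.findall(r"\w+=\d+", text)
# One group: each item is that group's text.
values = re.findall(r"\w+=(\d+)", text)
# Several groups: each item is a tuple of the groups.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)   # ['width=10', 'height=20']
print(values)  # ['10', '20']
print(pairs)   # [('width', '10'), ('height', '20')]
```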

re.finditer

re.finditer(pattern,string,flags=0)

finditer returns an iterator that yields each match object in order.

In[69]: for i in re.finditer(r"\d+","this 123 is a 456 test"):
   ...:     print(i.group())
   ...:     
123
456
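
Since finditer yields full match objects rather than strings, it is a natural fit when the positions of the matches are needed as well. A small sketch using the same sample string:

```python
import re

text = "this 123 is a 456 test"

# Collect both the matched text and its (start, end) position.
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(found)  # [('123', (5, 8)), ('456', (14, 17))]
```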

re.split

re.split(pattern, string, maxsplit=0, flags=0)

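split splits the string at each match of the pattern and returns the resulting list; if the pattern does not match, it returns a list containing only the original string.

maxsplit limits the number of splits; 0 means split at every match.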

In[3]: re.split(r"\d","one1two2three3four")
Out[3]: 
['one', 'two', 'three', 'four']
In[4]: re.split(r"\d","one1two2three3four",maxsplit=2)
Out[4]: 
['one', 'two', 'three3four']
In[6]: re.split(r"\d","one,two,three")
Out[6]: 
['one,two,three']
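
If the delimiters themselves are needed, wrapping the pattern in a capturing group makes split keep them in the result. This behavior is not mentioned above, so treat the following as a supplementary sketch.

```python
import re

s = "one1two2three3four"

plain = re.split(r"\d", s)
# A capturing group in the pattern keeps the delimiters in the output.
with_delims = re.split(r"(\d)", s)
# maxsplit=1 stops after the first split.
first_only = re.split(r"\d", s, maxsplit=1)

print(plain)        # ['one', 'two', 'three', 'four']
print(with_delims)  # ['one', '1', 'two', '2', 'three', '3', 'four']
print(first_only)   # ['one', 'two2three3four']
```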

re.sub

re.sub(pattern, repl, string, count=0, flags=0)

re.subn(pattern, repl, string, count=0, flags=0)

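sub replaces the substrings matched by the RE and returns the new string; if nothing matches, it returns the original string unchanged.

subn returns a tuple (new string, number of substitutions); if nothing matches, it returns the original string with a count of 0.

count limits the number of substitutions; 0 means replace every match.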

In[9]: re.sub(r"\d","\n","one1two2three3four")
Out[9]: 
'one\ntwo\nthree\nfour'
In[10]: re.sub(r"\d","\n","one1two2three3four",count=2)
Out[10]: 
'one\ntwo\nthree3four'
In[11]: re.sub(r"\d","\n","one,two,three,four")
Out[11]: 
'one,two,three,four'
In[12]: re.subn(r"\d","\n","one1two2three3four")
Out[12]: 
('one\ntwo\nthree\nfour', 3)
In[13]: re.subn(r"\d","\n","one,two,three,four")
Out[13]: 
('one,two,three,four', 0)
In[14]: re.subn(r"\d","\n","one1two2three3four",count=2)
Out[14]: 
('one\ntwo\nthree3four', 2)
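
The repl argument does not have to be a fixed string. It can refer to captured groups with \1, \2 and so on, or it can be a function that receives the match object and returns the replacement. The sketch below shows both; the sample inputs and the callback name double are made up for illustration.

```python
import re

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

def double(match):
    """Replacement callback: receives the match object, returns a string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "a1 b20 c300")

# subn also reports how many substitutions were made.
result, n = re.subn(r"\d+", "#", "a1 b20 c300", count=2)

print(swapped)    # host@user
print(doubled)    # a2 b40 c600
print(result, n)  # a# b# c300 2
```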

re.compile

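compile compiles a regular expression and returns a Pattern object.

re.compile(pattern, flags=0)

Parameters:

pattern: the regular expression, given as a string
flags: optional; the matching mode
1. re.I ignores case
2. re.M multiline mode; ^ and $ also match at the start and end of each line. Without it, ^ matches only at the start of the string and $ only at the end of the string
3. re.S the dot . matches any character, including \n. Without it, . does not match a newline
4. re.X verbose mode; the pattern can be written across several lines for readability, and whitespace and anything after # are ignored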
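
Several flags can be applied at once by combining them with the bitwise OR operator. The sketch below combines re.I and re.M on an invented three-line string; it is an illustration, not an example from the original article.

```python
import re

text = "Name: Ann\nname: bob\nNAME: Cy"

# Flags are combined with the bitwise OR operator.
pattern = re.compile(r"^name:\s*(\w+)$", re.I | re.M)
per_line = pattern.findall(text)

# Without re.M, ^ and $ only anchor at the ends of the whole string,
# so the same pattern finds nothing in this multi-line text.
whole_string = re.findall(r"^name:\s*(\w+)$", text, re.I)

print(per_line)      # ['Ann', 'bob', 'Cy']
print(whole_string)  # []
```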

re.I

In[5]: pattern = re.compile("I",re.I) # build an RE that matches the letter i regardless of case
In[6]: re.match(pattern, "i am an Lau")
Out[6]: 
<_sre.SRE_Match object; span=(0, 1), match='i'>

re.M

In[15]: re.findall(r"\w+$","I am An Liu\nAn Lau") # find the word characters at the end of the string
Out[15]: 
['Lau']
In[16]: re.findall(r"\w+$","I am An Liu\nAn Lau",re.M) # find the word characters at the end of each line
Out[16]: 
['Liu', 'Lau']

re.S

In[35]: re.findall(".","te\tst\n") # . matches \t but cannot match \n
Out[35]: 
['t', 'e', '\t', 's', 't']
In[36]: re.findall(".","te\tst\n",re.S) # with re.S, . also matches \n
Out[36]: 
['t', 'e', '\t', 's', 't', '\n']

re.X

In[46]: line = "this is a test"
In[47]: re.match( r'(\w+\s)(\w+\s)(\w+)', line).group() # match the first three words
Out[47]: 
'this is a'
In[48]: pattern_x = re.compile(r"""(\w+\s) # first word
   ...:                        (\w+\s) # second word
   ...:                        (\w+) # third word""") # the RE written across several lines
In[49]: re.match(pattern_x,line) # returns None: without re.X the whitespace and comments are part of the pattern
In[50]: pattern_x = re.compile(r"""(\w+\s) # first word
   ...:                        (\w+\s) # second word
   ...:                        (\w+) # third word""",re.X) # enable re.X mode
In[51]: re.match(pattern_x,line) # matches successfully
Out[51]: 
<_sre.SRE_Match object; span=(0, 9), match='this is a'>

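Greedy and non-greedy matching

When a subexpression with a standard quantifier can either match or not match, the engine always tries to match first. This is called greedy mode.

The quantifiers mentioned earlier, {m}, {m,n}, {m,}, ?, * and +, are all greedy.
Some NFA regex engines, Python's re among them, also support lazy quantifiers, written by appending ? to a standard quantifier. When a subexpression with a lazy quantifier can either match or not match, the engine skips the match first, and matches only when that is required for the whole expression to succeed. This is called non-greedy mode.

The lazy quantifiers are {m}?, {m,n}?, {m,}?, ??, *? and +?.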

In[3]: pattern1 = re.compile(r"www\..*") # greedy
In[4]: match1 = pattern1.match("www.baidu.com")
In[5]: print(match1.group())
www.baidu.com
In[8]: pattern2 = re.compile(r"www\..*?") # non-greedy
In[9]: match2 = pattern2.match("www.baidu.com")
In[10]: print(match2.group())
www.
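
In the non-greedy example above, .*? ends up matching nothing because no later part of the pattern forces it to grow. The sketch below, using made-up sample strings, shows a lazy quantifier expanding only as far as the rest of the pattern requires.

```python
import re

text = 'say "hi" and "bye"'

# Greedy: .* runs to the last quote in the string.
greedy = re.findall(r'".*"', text)
# Lazy: .*? stops at the first closing quote it can reach.
lazy = re.findall(r'".*?"', text)

# A lazy quantifier still expands when the rest of the pattern needs it.
forced = re.match(r"www\..*?com", "www.baidu.com").group()

# The same contrast for a bounded quantifier.
most = re.match(r"a{2,4}", "aaaaaa").group()
least = re.match(r"a{2,4}?", "aaaaaa").group()

print(greedy)  # ['"hi" and "bye"']
print(lazy)    # ['"hi"', '"bye"']
print(forced)  # www.baidu.com
print(most)    # aaaa
print(least)   # aa
```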
