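Quantifiers in regular expressions are greedy by default: each one tries to match as many characters as possible.
Appending ? to a quantifier turns greedy mode off, so that it matches as few characters as possible. How much it finally matches depends on the rule that comes after it: the non-greedy part consumes only as much as is needed for the following rule to match.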
>>> import re
>>> re.match(r"aa(\d+)","aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)","aa2343ddd").group(1)
'2'
>>> re.match(r"aa(\d+)ddd","aa2343ddd").group(1)
'2343'
# Greedy mode is off here, but the rule that follows (ddd) still has to match, so the digits are captured in full
>>> re.match(r"aa(\d+?)ddd","aa2343ddd").group(1)
'2343'
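The difference is easiest to see when the rule after the quantifier can match at more than one position. The following is a minimal sketch; the sample string `<a><b>` and the variable names are illustrative assumptions, not taken from the examples above.

```python
import re

# Hypothetical sample text with two places where ">" could end the match.
text = "<a><b>"

# Greedy: .+ grabs as much as it can, so the match ends at the LAST ">".
greedy = re.match(r"<.+>", text).group()

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST ">".
lazy = re.match(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```

In both cases the closing `>` must still match; greedy mode only decides which candidate position is tried first.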
Example: filtering a web page:
>>> s = """<img data-original="https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973_201611131917_small.jpg" src="https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973_201611131917_small.jpg" style="display: inline;">"""
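# re.findall returns a list of strings, not a match object, so calling .group() on the result fails
>>> re.findall(r"https.+?\.jpg", s).group()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'list' object has no attribute 'group'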
>>> re.findall(r"https.+?\.jpg", s)
['https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973_201611131917_small.jpg', 'https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973_201611131917_small.jpg']
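To show why the non-greedy `.+?` matters in this pattern, here is a small sketch that runs both variants on a shortened tag. The `example.com` URL and the variable names are placeholders assumed for illustration; the order-preserving de-duplication step is an optional extra, not part of the original example.

```python
import re

# Hypothetical, shortened <img> tag with the same URL in two attributes.
html = ('<img data-original="https://example.com/a_small.jpg" '
        'src="https://example.com/a_small.jpg" style="display: inline;">')

# Non-greedy: each match stops at the first ".jpg" it reaches.
lazy = re.findall(r"https.+?\.jpg", html)

# Greedy: a single match runs from the first "https" to the last ".jpg".
greedy = re.findall(r"https.+\.jpg", html)

# Optional: drop duplicate URLs while keeping their original order.
unique = list(dict.fromkeys(lazy))

print(len(lazy), len(greedy))
print(unique)
```

The greedy variant swallows the text between the two attributes, which is why the non-greedy form is the one used above.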
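- Example 1: strip everything after the host part of each URL
A rule for the trailing part is hard to write, so start from the front instead: capture the part to keep and discard the rest.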
>>> s = """http://www.interoem.com/messageinfo.asp?id=35
... http://3995503.com/class/class09/news_show.asp?id=14
... http://lib.wzmc.edu.cn/news/onews.asp?id=769
... http://www.zy-ls.com/alfx.asp?newsid=377&id=6
... http://www.fincm.com/newslist.asp?id=415"""
>>> re.sub(r"(http://.+?/).*", lambda x: x.group(1), s)
'http://www.interoem.com/\nhttp://3995503.com/\nhttp://lib.wzmc.edu.cn/\nhttp://www.zy-ls.com/\nhttp://www.fincm.com/'
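The same substitution can be sketched with a backreference in place of the lambda. The cross-check with `urllib.parse.urlsplit` is a different technique (URL parsing rather than regex), added here only as a comparison; the variable names are assumptions, and only two of the URLs above are used to keep it short.

```python
import re
from urllib.parse import urlsplit

urls = ("http://www.interoem.com/messageinfo.asp?id=35\n"
        "http://www.zy-ls.com/alfx.asp?newsid=377&id=6")

# \1 re-inserts the captured "http://host/" prefix. The trailing .* stops at
# the end of each line because "." does not match a newline by default.
trimmed = re.sub(r"(http://.+?/).*", r"\1", urls)

# Cross-check without regex: rebuild "scheme://host/" from the parsed parts.
parsed = "\n".join(
    f"{parts.scheme}://{parts.netloc}/"
    for parts in map(urlsplit, urls.splitlines())
)

print(trimmed)
print(trimmed == parsed)
```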
- Example 2: find all English words
Method 1:
>>> s = "hello world ha ha"
>>> re.split(r" ", s)
['hello', 'world', 'ha', 'ha']
Method 2:
>>> s
'hello world ha ha'
>>> re.findall(r"\b[a-zA-Z]+\b", s)
['hello', 'world', 'ha', 'ha']
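The two methods give the same result on this clean input, but they can diverge on messier text. A minimal sketch, assuming a made-up sentence with punctuation and a doubled space:

```python
import re

# Hypothetical input: a comma, an exclamation mark, and two spaces in a row.
text = "hello, world  ha ha!"

# Splitting on a single space keeps punctuation attached to the words and
# yields an empty string between the two consecutive spaces.
by_split = re.split(r" ", text)

# Matching runs of letters between word boundaries returns only the words.
by_findall = re.findall(r"\b[a-zA-Z]+\b", text)

print(by_split)    # ['hello,', 'world', '', 'ha', 'ha!']
print(by_findall)  # ['hello', 'world', 'ha', 'ha']
```

For this reason Method 2 is usually the safer choice when the input is not already cleanly separated by single spaces.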