
Use requests to scrape the blog titles from the home page of https://china-testing.github.io/ (10 in total).
Reference answer:
01_blog_title.py
import requests
from bs4 import BeautifulSoup


def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    # Each blog post on the home page is wrapped in an <article> tag.
    events = soup.findAll('article')
    for event in events:
        event_details = {}
        # The post title is the link inside each article's <h1> heading.
        event_details['name'] = event.find('h1').find("a").text
        print(event_details)


get_upcoming_events('https://china-testing.github.io/')
Execution result:
$ python3 01_blog_title.py
{'name': '10分钟学会API测试'}
{'name': 'python数据分析快速入门教程4-数据汇聚'}
{'name': 'python数据分析快速入门教程6-重整'}
{'name': 'python数据分析快速入门教程5-处理缺失数据'}
{'name': 'python库介绍-pytesseract: OCR光学字符识别'}
{'name': '软件自动化测试初学者忠告'}
{'name': '使用opencv转换3d图片'}
{'name': 'python opencv3实例(对象识别和增强现实)2-边缘检测和应用图像过滤器'}
{'name': 'numpy学习指南3rd3:常用函数'}
{'name': 'numpy学习指南3rd2:NumPy基础'}
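As a variation, the same titles can be pulled out with a CSS selector instead of the nested find() calls. The sketch below is only illustrative: it assumes the home page keeps its <article>/<h1>/<a> structure, and the function name get_blog_titles is chosen here for clarity rather than taken from the original script.

import requests
from bs4 import BeautifulSoup


def get_blog_titles(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    # select() takes a CSS selector; 'article h1 a' picks the title link
    # inside each post's <h1> heading, mirroring the find() chain above.
    for link in soup.select('article h1 a'):
        print({'name': link.text})


get_blog_titles('https://china-testing.github.io/')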

Scraping python.org with Requests and Beautiful Soup
- Goal: scrape the name, location, and time of each event listed at https://www.python.org/events/python-events/.
01_events_with_requests.py
import requests
from bs4 import BeautifulSoup


def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    # Upcoming events are the <li> items inside <ul class="list-recent-events">.
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)


get_upcoming_events('https://www.python.org/events/python-events/')
Execution result:
$ python3 01_events_with_requests.py
{'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May 2018'}
{'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May 2018'}
{'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June 2018'}
{'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June 2018'}
{'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June 2018'}
{'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June 2018'}
Note: the list of upcoming events changes over time, so the output will not be exactly the same on every run.
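For real-world use, the request step can be made a little more defensive. The sketch below adds a timeout and an HTTP status check; these two additions (the 10-second timeout and raise_for_status()) are suggestions of this article, not part of the original example.

import requests
from bs4 import BeautifulSoup


def get_upcoming_events(url):
    # timeout keeps the script from hanging on a slow connection;
    # raise_for_status() turns HTTP 4xx/5xx responses into exceptions.
    req = requests.get(url, timeout=10)
    req.raise_for_status()
    soup = BeautifulSoup(req.text, 'lxml')
    for event in soup.find('ul', {'class': 'list-recent-events'}).findAll('li'):
        print({
            'name': event.find('h3').find('a').text,
            'location': event.find('span', {'class': 'event-location'}).text,
            'time': event.find('time').text,
        })


get_upcoming_events('https://www.python.org/events/python-events/')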
find() and find_all()
Beautiful Soup defines many methods for searching the parse tree, and they all work in very similar ways. find() and find_all() are the most commonly used: find() returns the first matching element (or None if there is no match), while find_all() returns a list of all matches.
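A short interactive illustration of the difference (this snippet is added here for clarity and is not from the original text):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p><a>one</a><a>two</a></p>', 'html.parser')
>>> soup.find('a')                 # first match only
<a>one</a>
>>> soup.find_all('a')             # list of all matches
[<a>one</a>, <a>two</a>]
>>> soup.find('table') is None     # find() returns None when nothing matches
True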
Filters
- Strings
The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches exactly that tag name. The following example finds all the <b> tags in the document:
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
If you pass in a byte string, Beautiful Soup assumes it is encoded as UTF-8.

- Regular expressions
If you pass in a regular expression object, Beautiful Soup matches tag names against it using its search() method. In the example below, the first loop finds all tags whose names start with the letter "b" (<body> and <b>); the second finds all tags whose names contain the letter "t":
>>> import re
>>> for tag in soup.find_all(re.compile("^b")):
... print(tag.name)
...
body
b
>>> for tag in soup.find_all(re.compile("t")):
... print(tag.name)
...
html
title
- Lists
If you pass in a list, Beautiful Soup matches content against any element of the list. The following example finds all the <a> tags and all the <b> tags in the document:
>>> soup.find_all(["a", "b"])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
- True
True matches any tag name. The following example finds every tag in the document, but returns none of the text nodes:
>>> for tag in soup.find_all(True):
... print(tag.name)
...
html
head
title
body
p
b
p
a
a
a
p
- Functions
You can also pass in a function. It takes a single tag as its only argument and returns a boolean: True if the tag should be included in the results, False otherwise.
>>> def has_class_but_no_id(tag):
... return tag.has_attr('class') and not tag.has_attr('id')
...
>>> soup.find_all(has_class_but_no_id)
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
>>> soup.find_all(lambda tag:tag.has_attr('class') and not tag.has_attr('id'))
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
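The same function also works with find(), which returns only the first matching tag (a small illustration added here):
>>> soup.find(has_class_but_no_id)
<p class="title"><b>The Dormouse's story</b></p>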