pip install bs4
pip install lxml (built on C libraries)
Searching with find_all and find
![](https://img.haomeiwen.com/i6878902/96d05b053a487ecf.png)
![](https://img.haomeiwen.com/i6878902/ef91c3aedbfa14b1.png)
![](https://img.haomeiwen.com/i6878902/5fbdd9ea65f6d84e.png)
![](https://img.haomeiwen.com/i6878902/af1c80c583ef1aaa.png)
![](https://img.haomeiwen.com/i6878902/d59d3a7fc208ef3f.png)
![](https://img.haomeiwen.com/i6878902/1131087ccf7c155d.png)
find_all returns every match; find returns only the first one.
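A minimal sketch of the difference between the two calls. The HTML snippet is a made-up sample, not the markup from the screenshots, and the built-in "html.parser" is used only so the example runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<table>
  <tr class="even"><td>Beijing</td></tr>
  <tr><td>Shanghai</td></tr>
  <tr class="even"><td>Guangzhou</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # "lxml" also works if it is installed

first_row = soup.find("tr")                      # first match only (a Tag, or None)
all_rows = soup.find_all("tr")                   # every match, as a list-like ResultSet
even_rows = soup.find_all("tr", class_="even")   # filter by class; note the trailing underscore
first_two = soup.find_all("tr", limit=2)         # stop after two matches

print(first_row.td.string)
print(len(all_rows), len(even_rows), len(first_two))
```

class_ carries an underscore because class is a reserved word in Python.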
Getting a tag's attributes
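A short sketch of the usual ways to read attributes from a tag. The link below is an invented example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = '<a id="home" class="nav main" href="https://example.com/">Home</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]        # dictionary-style access; raises KeyError if the attribute is missing
title = link.get("title")  # .get() returns None instead of raising
classes = link["class"]    # class is multi-valued, so it comes back as a list
attrs = link.attrs         # all attributes as a dict

print(href, title, classes, attrs)
```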
Getting the text under a tag
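A small sketch of three common ways to pull text out of a tag, using an invented paragraph. stripped_strings is the one the weather scraper in these notes relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = "<p>Temperature: <b>31</b> degrees</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

bold_text = p.b.string              # the single string inside <b>
full_text = p.get_text()            # all nested text joined together
pieces = list(p.stripped_strings)   # each text fragment, with surrounding whitespace removed

print(bold_text)
print(full_text)
print(pieces)
```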
CSS selectors
![](https://img.haomeiwen.com/i6878902/98b558f5474cec67.png)
![](https://img.haomeiwen.com/i6878902/f0d67cceaee50a60.png)
Searching with select
![](https://img.haomeiwen.com/i6878902/2e3ccd3cefc24ed2.png)
![](https://img.haomeiwen.com/i6878902/498fd10f72ac7b88.png)
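A hedged sketch of select with a few common CSS selectors. The markup, the id forecast, and the class cities are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<div id="forecast">
  <table class="cities">
    <tr class="even"><td><a href="/bj">Beijing</a></td></tr>
    <tr><td><a href="/sh">Shanghai</a></td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.select("table.cities tr")     # descendant selector: every tr inside table.cities
even = soup.select("tr.even")             # tag plus class
links = soup.select("#forecast a[href]")  # id selector plus "has this attribute"
first = soup.select_one("td > a")         # direct child; select_one returns the first match or None

hrefs = [a["href"] for a in links]
print(len(rows), len(even), hrefs, first.string)
```

select always returns a list, so it plays the role of find_all; select_one plays the role of find.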
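.string returns None once a tag holds more than one child node (for example, text spread over several lines with nested tags); use .contents instead.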
![](https://img.haomeiwen.com/i6878902/6f2a25c99f2a2d67.png)
![](https://img.haomeiwen.com/i6878902/764649653e85a709.png)
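To make the .string limitation concrete, here is a sketch on an invented div that contains several children separated by line breaks.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one container, several child nodes on separate lines.
html = "<div>\n<a>Beijing</a>\n<span>sunny</span>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

single = div.string                  # None, because the div has more than one child
children = div.contents              # every direct child, including the newline strings
texts = list(div.strings)            # every text fragment, whitespace included
clean = list(div.stripped_strings)   # text fragments with whitespace-only entries dropped

print(single)
print(children)
print(clean)
```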
Scraping the weather forecast
![](https://img.haomeiwen.com/i6878902/ca9b748fbecba11a.png)
![](https://img.haomeiwen.com/i6878902/b027e9c7ed984a30.png)
![](https://img.haomeiwen.com/i6878902/92b662064b502d9f.png)
![](https://img.haomeiwen.com/i6878902/a9fc9e5fda4ab957.png)
pip install html5lib: this parser automatically fills in incomplete HTML tags, but it is not as fast as lxml.
![](https://img.haomeiwen.com/i6878902/073c7375bca8446b.png)
![](https://img.haomeiwen.com/i6878902/60dce12c9c296a43.png)
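One way to see how the parsers differ is to feed the same broken markup to each of them and compare the repaired output. This is only a sketch: the broken snippet is invented, and parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical broken markup: the <li> and <ul> tags are never closed.
broken_html = "<ul><li>Beijing<li>Shanghai"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken_html, parser)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        continue
    results[parser] = str(soup)

for name, repaired in results.items():
    print(f"{name:12} {repaired}")
```

Each parser repairs the markup in its own way, so it is worth checking the output on the real page before settling on one.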
Full code

    import requests
    from bs4 import BeautifulSoup
    from pyecharts.charts import Bar

    # Rows collected from every region page, each as {'city': str, 'temp': int}
    ALL_DATA = []


    def parse_page(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        text = response.content.decode('utf-8')
        # html5lib fills in incomplete tags, at the cost of being slower than lxml
        soup = BeautifulSoup(text, 'html5lib')
        conMidtab = soup.find('div', class_='conMidtab')
        tables = conMidtab.find_all('table')
        for table in tables:
            # Skip the two header rows of each table
            trs = table.find_all('tr')[2:]
            for index, tr in enumerate(trs):
                tds = tr.find_all('td')
                city_td = tds[0]
                if index == 0:
                    # The first data row has an extra leading cell, so the city is in the second cell
                    city_td = tds[1]
                # Count from the end so the extra leading cell does not shift the index
                high_temp = tds[-5]
                city = list(city_td.stripped_strings)[0]
                temp = list(high_temp.stripped_strings)[0]
                # print({'city': city, 'temp': int(temp)})
                ALL_DATA.append({'city': city, 'temp': int(temp)})


    def main():
        urls = [
            'http://www.weather.com.cn/textFC/hb.shtml',
            'http://www.weather.com.cn/textFC/db.shtml',
            'http://www.weather.com.cn/textFC/hd.shtml',
            'http://www.weather.com.cn/textFC/hz.shtml',
            'http://www.weather.com.cn/textFC/hn.shtml',
            'http://www.weather.com.cn/textFC/xb.shtml',
            'http://www.weather.com.cn/textFC/xn.shtml',
            'http://www.weather.com.cn/textFC/gat.shtml',
        ]
        for url in urls:
            parse_page(url)

        # Sort and plot once, after every page has been parsed
        ALL_DATA.sort(key=lambda data: data['temp'], reverse=True)
        data = ALL_DATA[0:10]
        cities = list(map(lambda x: x['city'], data))
        temps = list(map(lambda x: x['temp'], data))
        bar = Bar()
        bar.add_xaxis(cities)
        bar.add_yaxis("Top 10 hottest cities", temps)
        bar.render('temperature.html')


    if __name__ == "__main__":
        main()
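The row handling in the scraper can be tried offline on a tiny table. This is a sketch under assumptions: the markup is a simplified stand-in for the real page, extract_rows is a hypothetical helper, and "html.parser" replaces html5lib so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one region page.
# The first data row has one extra leading cell (the province).
SAMPLE = """
<div class="conMidtab">
<table>
  <tr><th>header row 1</th></tr>
  <tr><th>header row 2</th></tr>
  <tr><td>Hebei</td><td>Shijiazhuang</td><td>sunny</td><td>wind</td><td>33</td>
      <td>clear</td><td>wind</td><td>21</td><td>detail</td></tr>
  <tr><td>Tangshan</td><td>cloudy</td><td>wind</td><td>30</td>
      <td>clear</td><td>wind</td><td>19</td><td>detail</td></tr>
</table>
</div>
"""


def extract_rows(html):
    """Return [{'city': ..., 'temp': ...}] using the same indexing as the scraper."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="conMidtab")
    rows = []
    for table in container.find_all("table"):
        for index, tr in enumerate(table.find_all("tr")[2:]):  # skip two header rows
            tds = tr.find_all("td")
            # Only the first data row carries the extra leading cell.
            city_td = tds[1] if index == 0 else tds[0]
            city = list(city_td.stripped_strings)[0]
            # Indexing from the end is unaffected by the extra leading cell.
            temp = int(list(tds[-5].stripped_strings)[0])
            rows.append({"city": city, "temp": temp})
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```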
![](https://img.haomeiwen.com/i6878902/37d3bce9cf17625d.png)
Collect the results in a list
![](https://img.haomeiwen.com/i6878902/73d3c126ce2ecf94.png)
Sorting
![](https://img.haomeiwen.com/i6878902/c823273d6d1273be.png)
Convert temp to int before sorting
![](https://img.haomeiwen.com/i6878902/d87e92507bbfc153.png)
![](https://img.haomeiwen.com/i6878902/c577920052253d97.png)
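A quick sketch of why the conversion matters, using invented readings. Strings compare character by character, so "9" sorts above "35"; integers compare by value.

```python
# Hypothetical scraped values: temperatures arrive as strings.
raw = [("Beijing", "31"), ("Harbin", "9"), ("Guangzhou", "35"), ("Lhasa", "18")]

# Sorting the strings directly gives a misleading order.
as_strings = sorted((temp for _, temp in raw), reverse=True)

# Convert to int first, then sort in descending order and keep the top entries.
data = [{"city": city, "temp": int(temp)} for city, temp in raw]
data.sort(key=lambda item: item["temp"], reverse=True)
top = data[:3]

cities = [item["city"] for item in top]
temps = [item["temp"] for item in top]

print(as_strings)
print(cities, temps)
```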
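pyecharts documentation
Note that the pyecharts API is now written differently from what the screenshot shows; see the documentation for details. To list the highest temperatures first, sort in descending order with reverse=True.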
![](https://img.haomeiwen.com/i6878902/b3f531cd26785f09.png)