美文网首页
2019-06-25——BeautifulSoup4

2019-06-25——BeautifulSoup4

作者: ElfACCC | 来源:发表于2019-06-25 14:21 被阅读0次

    pip install bs4
    pip install lxml(用c语言库)

    find_all和find找

    image.png
    image.png
    image.png
    image.png
    image.png
    image.png

    find_all找所有,find找第一个

    获得标签属性 image.png

    获得标签下的文字

    css选择器


    image.png

    select找

    image.png
    image.png

    string多行就获取不到了,要用contents

    ime.png
    image.png

    爬取天气预报

    image.png
    image.png
    image.png
    image.png

    pip install html5lib,这个解析器能自动补充不完整的html标签,但是没有lxml快


    完整代码

    import requests
    from bs4 import BeautifulSoup
    from pyecharts.charts import Bar
    
    ALL_DATA = []
    
    def parse_page(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        response = requests.get(url,headers=headers)
        text = response.content.decode('utf-8')
        soup = BeautifulSoup(text,'html5lib')
        conMidtab = soup.find('div',class_='conMidtab')
        tables = conMidtab.find_all('table')
        for table in tables:
            trs = table.find_all('tr')[2:]
            for index,tr in enumerate(trs):
                tds = tr.find_all('td')
                city_td = tds[0]
                if index == 0:
                    city_ed = tds[1]
                high_temp = tds[-5]
                city = list(city_td.stripped_strings)[0]
                temp = list(high_temp.stripped_strings)[0]
                #print({'city':city,'temp':int(temp)})
                ALL_DATA.append({'city':city,'temp':int(temp)})
        
        ALL_DATA.sort(key=lambda data:data['temp'],reverse=True)
        data = ALL_DATA[0:10]
        cities = list(map(lambda x:x['city'],data))
        temps = list(map(lambda x:x['temp'],data))
    
    
        bar = Bar()
        bar.add_xaxis(cities)
        bar.add_yaxis("高温城市TOP10", temps)
        bar.render('temperture.html')
    
    def main():
        urls = ['http://www.weather.com.cn/textFC/hb.shtml',
        'http://www.weather.com.cn/textFC/db.shtml',
        'http://www.weather.com.cn/textFC/hd.shtml',
        'http://www.weather.com.cn/textFC/hz.shtml',
        'http://www.weather.com.cn/textFC/hn.shtml',
        'http://www.weather.com.cn/textFC/xb.shtml',
        'http://www.weather.com.cn/textFC/xn.shtml',
        'http://www.weather.com.cn/textFC/gat.shtml',]
    
        for url in urls:
            parse_page(url)
    
    if __name__ == "__main__":
        main()
    
    image.png

    放一个列表


    image.png

    排序

    image.png
    之前要把temp变成int image.png
    image.png

    pyecharts文档

    注意pyecharts写法和图中不一样了,详见文档,高温要倒序,reverse=True


    image.png

    相关文章

      网友评论

          本文标题:2019-06-25——BeautifulSoup4

          本文链接:https://www.haomeiwen.com/subject/rgmzqctx.html