01_Douban

Author: 过桥 | Published 2017-03-09 14:16

    Preface

    A small goal for 2017: learn Python in depth~

    Scraping Target

    Scrape the Douban Movie Top 250 list.

    Packages Used

    import codecs  # optional; sets the character encoding of the output file
    import requests  # fetches the pages
    from bs4 import BeautifulSoup  # parses the fetched HTML
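
    requests and BeautifulSoup are third-party packages; if they are not installed yet, a typical setup (package names only, versions unpinned) is:

    ```shell
    pip install requests beautifulsoup4
    ```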
    

    Implementation Steps

    1. Write the fetch pseudocode
    url = 'https://www.douban.com/'
    webPage = requests.get(url).text
    soup = BeautifulSoup(webPage,"html.parser")
    print(soup.title)  # <title>豆瓣</title>
    
    2. Analyze the target pages
    • Inspecting the page with F12 shows that the entries are rendered as a list inside grid_view; each "row" has the poster image on the left and the film title plus other descriptive text on the right.
      (Screenshot: the 肖申克的救赎 / The Shawshank Redemption entry)
    • Work out the pagination mechanism, focusing on how the URL differs between pages.
      (Screenshot: the pagination controls)
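    The pagination analysis above boils down to a single `start` query parameter: each page holds 25 films, so the ten page URLs can also be built up front instead of following the "next" link (a small sketch; the variable names are illustrative).

    ```python
    # Douban Top 250 pages differ only in the "start" query parameter:
    # start=0 is ranks 1-25, start=25 is ranks 26-50, and so on.
    BASE_URL = 'https://movie.douban.com/top250'

    page_urls = ['{}?start={}'.format(BASE_URL, page * 25) for page in range(10)]

    print(page_urls[0])   # first page, ranks 1-25
    print(page_urls[-1])  # last page, ranks 226-250
    ```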
    3. Extract the film titles
    import codecs  # optional; sets the character encoding of the output file
    import requests  # fetches the pages
    from bs4 import BeautifulSoup  # parses the fetched HTML
    
    DOWNLOAD_URL = 'http://movie.douban.com/top250/'
    
    def download_page(url):
        # a browser-like User-Agent helps avoid being blocked
        return requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
        }).content
    
    def parse_html(html):
        soup = BeautifulSoup(html, "html.parser")
        movie_list_soup = soup.find('ol', attrs={'class': 'grid_view'})
    
        movie_name_list = []
    
        print("About to parse page: " + soup.find('span', attrs={'class': 'thispage'}).getText())
    
        # rank of the first film on this page (25 films per page)
        top_num = 1 + (int(soup.find('span', attrs={'class': 'thispage'}).getText()) - 1) * 25
    
        for movie_li in movie_list_soup.find_all('li'):
            detail = movie_li.find('div', attrs={'class': 'hd'})
    
            movie_name_list.append("## Top " + str(top_num))
    
            # film title (main title plus alternate titles, concatenated)
            movie_name = ""
    
            for sp in detail.find_all('span', attrs={'class': 'title'}):
                movie_name += sp.text
    
            img = movie_li.find('div', attrs={'class': 'pic'}).find('a').find('img')  # poster <img>, used in step 4
    
            movie_name_list.append("### " + movie_name)
    
            top_num += 1
    
        # on the last page the "next" span has no <a>, so guard both lookups
        next_span = soup.find('span', attrs={'class': 'next'})
        next_page = next_span.find('a') if next_span else None
    
        if next_page:
            return movie_name_list, DOWNLOAD_URL + next_page['href']
        return movie_name_list, None
    
    def main():
        url = DOWNLOAD_URL
    
        with codecs.open('douban_moviesList_top250.md', 'wb', encoding='utf-8') as fp:
            while url:
                html = download_page(url)
                movies, url = parse_html(html)
                fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))
    
    if __name__ == '__main__':
        main()
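
    The markdown layout that main() writes can be checked offline with a tiny sketch (the file name and sample entry here are illustrative):

    ```python
    import codecs
    import os
    import tempfile

    # main() writes one "## Top N" heading followed by a "### <title>" heading
    # per film; this reproduces that layout for a single entry.
    movies = ['## Top 1', '### 肖申克的救赎 / The Shawshank Redemption']

    path = os.path.join(tempfile.gettempdir(), 'douban_demo.md')
    with codecs.open(path, 'w', encoding='utf-8') as fp:
        fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))

    with codecs.open(path, 'r', encoding='utf-8') as fp:
        print(fp.read())
    ```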
    
    (Screenshot: the extracted film titles)
    4. Building on the titles, parse the remaining fields
    • Poster download sketch
        img = movie_li.find('div', attrs={'class': 'pic'}).find('a').find('img')
        
        try:
            img_req = requests.get(img["src"], timeout=20)
            img_localhost = 'douban_moviesList_top250\\' + str(top_num) + '.jpg'
            with open(img_localhost, 'wb') as f:
                f.write(img_req.content)
    
            movie_name_list.append('![](douban_moviesList_top250/' + str(top_num) + '.jpg "douban_moviesList_top250")')
            
        except requests.exceptions.RequestException:  # also covers the timeout above, not just connection errors
            print('[ERROR] Could not download this image; dead URL: ' + img["src"])
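
    The snippet above saves into a douban_moviesList_top250 folder and will raise an IOError on the first run if that folder does not exist yet; creating it up front avoids that (a minimal sketch, assuming Python 3's `exist_ok`):

    ```python
    import os

    # Create the output folder for the poster images if it is not there yet;
    # exist_ok=True makes this a no-op on later runs.
    out_dir = 'douban_moviesList_top250'
    os.makedirs(out_dir, exist_ok=True)

    # os.path.join keeps the path separator portable (the snippet hard-codes '\\')
    img_localhost = os.path.join(out_dir, '{}.jpg'.format(1))
    print(img_localhost)
    ```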
    

    Summary

    Since most Douban pages are static, this was fairly straightforward; the main points were the pagination loop, exception handling for image downloads, and BeautifulSoup's find/find_all...
    Well, that's it. I will certainly not admit to being able to scrape Douban girls or anything like that...
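
    The find/find_all distinction mentioned above, shown on a hand-written fragment shaped like Douban's markup (class names from the script; the titles are illustrative):

    ```python
    from bs4 import BeautifulSoup

    SAMPLE = '''<ol class="grid_view">
      <li><div class="hd"><span class="title">Movie A</span></div></li>
      <li><div class="hd"><span class="title">Movie B</span></div></li>
    </ol>'''

    soup = BeautifulSoup(SAMPLE, 'html.parser')

    # find returns the first match (or None); find_all returns a list of all matches
    first = soup.find('span', attrs={'class': 'title'})
    every = soup.find_all('span', attrs={'class': 'title'})

    print(first.text)               # Movie A
    print([s.text for s in every])  # ['Movie A', 'Movie B']
    ```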
    Full code

    豆瓣Top250 (Douban Top 250)


        Original link: https://www.haomeiwen.com/subject/wupwgttx.html