Scraping the Douban Movie TOP250 - Complete Example Code

Author: 帅气的_xiang | Published: 2017-04-19 19:14

    Target site: https://movie.douban.com/top250
    Goal: scrape the titles of the 250 films on the Douban Movie Top 250 chart and save them to the file movies.txt

    Source code:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # @Time    : 2017/4/19 16:30
    # @Author  : zxp
    # @Site    : 
    # @File    : Douban_top250.py
    # @Software: PyCharm
    import codecs
    import requests
    from bs4 import BeautifulSoup
    
    DOWNLOAD_URL = 'https://movie.douban.com/top250'
    
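    def download_page(url):
        """Fetch a page and return its raw bytes."""
        # Send a browser-like User-Agent so the request is less likely to be rejected as a crawler.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
        }
        data = requests.get(url, headers=headers).content
        return data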
    
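    def parse_html(html):
        """Return (movie names on this page, URL of the next page or None)."""
        soup = BeautifulSoup(html, 'html.parser')
        # The film entries are read from <ol class="grid_view">, one <li> per film.
        movie_list_soup = soup.find('ol', attrs={'class': 'grid_view'})
    
        movie_name_list = []
    
        for movie_li in movie_list_soup.find_all('li'):
            detail = movie_li.find('div', attrs={'class': 'hd'})
            # find() returns only the first <span class="title"> inside the entry.
            movie_name = detail.find('span', attrs={'class': 'title'}).getText()
            movie_name_list.append(movie_name)
    
        # If the "next" span has no <a> child, find('a') returns None and the crawl stops.
        next_page = soup.find('span', attrs={'class': 'next'}).find('a')
        if next_page:
            # The href is appended to the base URL, which assumes it is a relative query string.
            return movie_name_list, DOWNLOAD_URL + next_page['href']
        return movie_name_list, None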
    
    
    def main():
        url = DOWNLOAD_URL
    
        with codecs.open('movies.txt', 'wb', encoding='utf-8') as fp:
            while url:
                html = download_page(url)
                movies, url = parse_html(html)
                fp.write(u'{movies}\n'.format(movies='\n'.join(movies)))
    
    
    if __name__ == '__main__':
        main()
    

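    ① A User-Agent header is sent to imitate a browser, so that the server is less likely to identify the request as a crawler and refuse it.
    ② When the content is known in advance, the page-to-page jumps can simply be hard-coded. To make the crawler behave more like a real crawler, though, this code finds the next-page link in the pagination bar and follows it.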
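
The two notes above can be illustrated with a small sketch that touches no network. It assumes Python 3, and it assumes that list pages are addressed by a `start` query parameter in steps of 25, which the post itself does not state. `build_page_urls` and `build_request` are hypothetical helper names. The standard library's `urllib.request.Request` stands in for `requests` here only to show the header being attached without sending anything.

```python
from urllib.request import Request

BASE_URL = 'https://movie.douban.com/top250'
# Assumptions for illustration: 25 films per page, selected by a `start` offset.
PAGE_SIZE = 25
TOTAL_MOVIES = 250

# Same User-Agent string as in the source code above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')


def build_page_urls(base_url=BASE_URL, page_size=PAGE_SIZE, total=TOTAL_MOVIES):
    """Return every list-page URL up front instead of following "next" links."""
    return ['{0}?start={1}'.format(base_url, start)
            for start in range(0, total, page_size)]


def build_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a request that carries a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


if __name__ == '__main__':
    for page_url in build_page_urls():
        print(page_url)
```

Hard-coding the offsets is shorter, but it may stop working if the page size or URL scheme changes, which is the trade-off note ② points at when it prefers following the next-page link.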
