Scraping Douban Movies (requests + lxml)

Author: NJUNLP | Published: 2018-08-18 12:55

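    I. Abstract

    This article uses the requests and lxml libraries to scrape the title, director, lead actors, rating and one-line summary of every movie on the Douban movie list. The approach is fairly old-fashioned, but it is still a useful exercise.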



    II. Environment

    1. PyCharm
    2. Python 3.6
    3. requests
    4. lxml

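    III. Approach

    (1) The start page is https://movie.douban.com/top250. The list spans 10 pages with 25 movies per page, and each movie sits inside its own <li> tag.
    (2) There are two ways to move from page to page. The first relies on the pattern in the page URLs: page one is https://movie.douban.com/top250?start=0&filter=, page two is https://movie.douban.com/top250?start=25&filter=, and page three is https://movie.douban.com/top250?start=50&filter=. Only the value of start changes, growing by 25 per page, so a for loop can generate every page URL. The second way is to extract the link behind the "next page" (后页) button on each page and use it to build the URL of the following page.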
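
The first paging method can be sketched as follows. This is only a minimal illustration that assumes the URL pattern, page size and page count described above still hold; `build_page_urls` and the constant names are hypothetical and do not appear in the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25    # movies per page, as stated above
PAGE_COUNT = 10   # total number of pages, as stated above


def build_page_urls(base_url=BASE_URL, page_size=PAGE_SIZE, page_count=PAGE_COUNT):
    """Return the URL of every list page, in order (hypothetical helper)."""
    urls = []
    for page in range(page_count):
        # start is the zero-based index of the first movie on the page
        query = urlencode({"start": page * page_size, "filter": ""})
        urls.append(base_url + "?" + query)
    return urls


if __name__ == "__main__":
    for page_url in build_page_urls():
        print(page_url)
```

Each generated URL can then be passed to a page parser in turn. The trade-off is that the page count is hard-coded, whereas following the "next page" link stops by itself when the list ends.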

    IV. Implementation

    import requests
    import random
    from lxml import etree
    
    UA_LIST = [
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    headers = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
      'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests decodes Brotli only when a brotli package is installed
      'Accept-Language': 'zh-CN,zh;q=0.9',
      'Connection': 'keep-alive',
      'Host': 'movie.douban.com',
      'User-Agent': random.choice(UA_LIST)
    }
    
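    def downloadHtml(url):
        """Download a page and return its HTML text, or "" if the request fails."""
        try:
            # Pick a fresh User-Agent for every request instead of once at import time
            request_headers = dict(headers, **{'User-Agent': random.choice(UA_LIST)})
            r = requests.get(url, headers=request_headers, timeout=10)
            r.raise_for_status()
            # Guess the encoding from the response body
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""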
    
    def parse(url):
        """Print the title, director/cast line, rating and one-line summary of each movie on a page."""
        response = downloadHtml(url)
        if not response:
            print("Download failed: " + url)
            return
        html = etree.HTML(response)
        # Process one <li> at a time: if a movie lacks one of the fields,
        # zip() over four page-wide lists would pair values from different movies.
        for item in html.xpath("//*[@id='content']/div/div[1]/ol/li"):
            names = item.xpath("./div/div[2]/div[1]/a/span[1]/text()")
            director = item.xpath("./div/div[2]/div[2]/p[1]/text()[1]")
            review = item.xpath("./div/div[2]/div[2]/div/span[2]/text()")
            introduce = item.xpath("./div/div[2]/div[2]/p[2]/span/text()")
            content = {
                'names': names[0] if names else '',
                # Collapse newlines, non-breaking spaces and runs of spaces into single spaces
                'director': ' '.join(director[0].split()) if director else '',
                'review': review[0] if review else '',
                'introduce': introduce[0] if introduce else ''
            }
            print(content)
    
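    def URL(url):
        """Return the href of the "next page" (后页) link, or None when there is none."""
        response = downloadHtml(url)
        if not response:
            return None
        html = etree.HTML(response)
        links = html.xpath("//*[@id='content']/div/div[1]/div[2]/span[3]/a/@href")
        if not links:
            # No "next page" link: the last page has been reached
            print("Finished")
            return None
        return links[0]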
    
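    if __name__ == '__main__':
        base_url = "https://movie.douban.com/top250"
        start_url = base_url
        # Follow the "next page" (后页) link until there is none left
        while True:
            try:
                parse(start_url)
                next_href = URL(start_url)
            except Exception:
                break
            if not next_href:
                break
            # The href is a query string such as "?start=25&filter="
            start_url = base_url + next_href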
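
The main loop above builds the next URL by appending the extracted href to the base address, which works as long as the href is a bare query string. A slightly more defensive sketch, assuming nothing beyond the standard library, resolves the href against the current page with `urllib.parse.urljoin`; `next_page_url` is a hypothetical helper, not part of the original code.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the href of the next-page link against the current page (hypothetical helper)."""
    if not href:
        return None  # no link found: the last page has been reached
    return urljoin(current_url, href)


if __name__ == "__main__":
    print(next_page_url("https://movie.douban.com/top250", "?start=25&filter="))
```

Because `urljoin` follows the usual relative-reference rules, the same call also copes with an href that is an absolute path or a full URL.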
    
    

    V. Results

    [Screenshot of the console output: image not available]
