Scraping the 20 worst-rated movies on Maoyan with Python 2.7 + lxml

Author: Chelsea_Dagger | Published: 2017-11-07 18:44

    Preface


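    BeautifulSoup is a very powerful library, but it is not the only popular parser. lxml, for example, uses XPath syntax and is also an efficient way to parse HTML. If you have never quite gotten used to BeautifulSoup, give XPath a try. (Highly recommended!)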

    Installing lxml:

    pip install lxml
    

    lxml tutorial (for reference only)
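
For readers new to XPath, here is a minimal sketch of how `etree.HTML` and `xpath()` fit together. The HTML snippet is made up for illustration: it reuses the class names that appear in this post's XPath expressions, but the `<dd>` wrapper, the `<i>` tags, and the sample titles are assumptions, not a copy of the real Maoyan page. The helper name `parse_movies` is also hypothetical. Unlike the full script, this sketch reads the title and the score from the same entry using relative XPath (`.//`), which is one possible way to keep the two paired.

```python
from lxml import etree

# Hypothetical markup, loosely modeled on the class names used in this post.
SAMPLE_HTML = u"""
<div class="movies-list">
  <dl class="movie-list">
    <dd>
      <div class="channel-detail movie-item-title"><a href="/films/1">Movie A</a></div>
      <div class="channel-detail channel-detail-orange"><i>8.</i><i>5</i></div>
    </dd>
    <dd>
      <div class="channel-detail movie-item-title"><a href="/films/2">Movie B</a></div>
      <div class="channel-detail channel-detail-orange">No rating yet</div>
    </dd>
  </dl>
</div>
"""


def parse_movies(html):
    """Return one {'name', 'url', 'score'} dict per <dd> entry."""
    selector = etree.HTML(html)
    movies = []
    for dd in selector.xpath('//dl[@class="movie-list"]/dd'):
        # A leading ".//" makes the query relative to the current <dd>.
        link = dd.xpath('.//div[@class="channel-detail movie-item-title"]/a')[0]
        score_div = dd.xpath('.//div[@class="channel-detail channel-detail-orange"]')[0]
        # itertext() yields the div's own text plus the text of its children,
        # so it covers both the split "8." + "5" case and the plain-text case.
        score = u''.join(score_div.itertext()).strip()
        movies.append({
            'name': link.xpath('text()')[0],
            'url': link.xpath('@href')[0],
            'score': score,
        })
    return movies


movies = parse_movies(SAMPLE_HTML)
for movie in movies:
    print(movie['name'] + ' ' + movie['score'] + ' ' + movie['url'])
```

`xpath()` always returns a list, which is why each result is indexed with `[0]`. On a real page it is worth checking that the list is not empty first.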


    Code

    # -*- coding: UTF-8 -*-
    
    import requests
    from lxml import etree
    # requests sends the HTTP requests; lxml parses the HTML
    
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')
    # Works around Chinese encoding problems in Python 2.7
    
    ori_url = 'http://maoyan.com/films?sortId=1&offset={}'
    # Maoyan movie listing URL; offset starts at 0 and grows by 30 per page (30 movies per page)
    
    headers={
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Cookie': 'your cookie',
        # Fill in your own browser cookie
        'Host': 'maoyan.com',
        'Referer': 'http://maoyan.com/',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    }
    
    all_url=[]
    
    for i in range(11):
        offset = str(i*30)
        req_url = ori_url.format(offset)
        all_url.append(req_url)
    # 11 pages in total; the URL changes with the offset
    
    movie_item=list()
    i = 0
    j = 0
    for url in all_url:
        html = requests.get(url, headers=headers).text
        selector = etree.HTML(html)    
        infos = selector.xpath('//div[@class="movies-list"]/dl[@class="movie-list"]//div[@class="channel-detail movie-item-title"]/a')
        # Use XPath to grab each movie's name and URL
        j = i
        for info in infos:
            movie_item.append(dict())
            movie_url = 'http://maoyan.com' + info.xpath('@href')[0]
            movie_name = info.xpath('text()')[0]
            movie_item[i]['name'] = movie_name
            movie_item[i]['url'] = movie_url
            i += 1
    
        score = selector.xpath('//div[@class="channel-detail channel-detail-orange"]')
        # Use XPath to grab each movie's score (two cases: rated / no rating yet)
        for item in score:
            if item.text is None:
                sc= item.getchildren()[0].text+item.getchildren()[1].text
            else:
                sc= item.text
            movie_item[j]['score'] = sc
            j+=1
    
    movie_item = sorted(movie_item, key=lambda item: item['score'], reverse=False)
    # Sort by score, lowest first (the scores are compared as strings)
    print len(movie_item)
    # Write the results to a local file (the ./p_data directory must already exist)
    with open('./p_data/movieinfos.txt', 'w') as f:
        for movie in movie_item:
            f.write(str(movie['name']) + '    ' + str(movie['score']) + '    ' + str(movie['url']) + '\n')
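
The script above sorts the score strings as they are and writes every scraped movie to the file, so the "top 20" still has to be read off the first lines by hand, and unrated entries are mixed into the same list. As a hedged alternative (not what the original script does), the sketch below converts scores to numbers, skips anything that is not a number, and keeps only the lowest `n`. The helper names `parse_score` and `worst_rated`, and the sample data, are hypothetical.

```python
def parse_score(text):
    """Return the score as a float, or None if the text is not a number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def worst_rated(movies, n=20):
    """Return the n lowest-scored movies, ignoring unrated ones."""
    rated = [(parse_score(m['score']), m) for m in movies]
    rated = [(score, m) for score, m in rated if score is not None]
    rated.sort(key=lambda pair: pair[0])
    return [m for _, m in rated[:n]]


# Made-up sample data in the same shape as movie_item in the script.
sample = [
    {'name': 'A', 'score': '9.1', 'url': 'http://maoyan.com/films/1'},
    {'name': 'B', 'score': '2.5', 'url': 'http://maoyan.com/films/2'},
    {'name': 'C', 'score': 'No rating yet', 'url': 'http://maoyan.com/films/3'},
    {'name': 'D', 'score': '10.0', 'url': 'http://maoyan.com/films/4'},
]

for movie in worst_rated(sample, n=2):
    print(movie['name'] + ' ' + movie['score'])
```

Comparing numbers instead of strings matters once scores have different lengths: as strings, `'10.0'` sorts before `'2.5'`, while as floats it does not.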
    
    

    Final result

    image.png

    P.S. In my view, Blade Runner and Alien: Covenant are both very good films, and their directors are highly skilled too (mixed feelings).
