Python in 10 Minutes: Given a Movie's English Title, Scrape Its Chinese Title and Box Office from Maoyan

Author: 2890bd62c72a | Source: published 2019-08-15 21:05, read 3 times

    [root@xxn maoyan]# cat cat.py
    #!/usr/bin/env python
    #coding:utf-8
    
    import requests
    from bs4 import BeautifulSoup
    
    def movieurl(url):
        """
        Return the URL of the movie's detail page, taken from the first search result.
        """
        headers = {
            "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36",
        }
        response = requests.get(url,headers=headers,timeout=10)
        soup= BeautifulSoup(response.text,'lxml')
        href = soup.find_all('div',class_="channel-detail movie-item-title")[0]
        movieurl = "http://maoyan.com%s" % href.find('a')['href']
        return movieurl
    
    def moveinfo(url):
        """
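        Return the movie's Chinese title and its box-office unit.
        If no unit is found on the page, the box office is listed as "暂无" (not yet available).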
        """
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36",
        }
        response = requests.get(url, headers=headers,timeout=5)
        soup = BeautifulSoup(response.text, 'lxml')
        Chinesename = soup.find('div',class_="movie-brief-container").h3.string
        try:
            boxofficeunit = soup.find_all('div',class_="movie-index-content box")[0].find('span',class_='unit').string
        except (IndexError, AttributeError):
            # No box-office block or no unit span on the page: 0 marks "no data"
            boxofficeunit = 0
        return Chinesename,boxofficeunit
    
    if __name__ == '__main__':
        Moviename = input("Enter the movie's English title: ")
        Moviename = Moviename.replace(' ','+')
        url = "http://maoyan.com/query?kw=%s&type=0" % Moviename
        Chinesename, boxofficeunit = moveinfo(movieurl(url))
        print(Chinesename, boxofficeunit)
    
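The search URL in `cat.py` is built by replacing spaces with `+` by hand. As a hedged alternative, the standard library's `urllib.parse.urlencode` can escape spaces, colons, and other reserved characters in one step. The helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are hypothetical and not part of the original script; the `kw` and `type` parameters are taken from the URL used in the listing.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL string used in cat.py
SEARCH_ENDPOINT = "http://maoyan.com/query"


def build_search_url(english_title):
    """Hypothetical helper: build the Maoyan search URL for a title."""
    # urlencode quotes each value with quote_plus by default,
    # so ' ' becomes '+' and ':' becomes '%3A'.
    params = {"kw": english_title, "type": 0}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("Star Wars: The Last Jedi"))
```

Letting the library do the escaping avoids having to add a new `replace` call for every special character that shows up in a title.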

    [root@xxn maoyan]# cat maoyan.py
    #!/usr/bin/env python
    # coding=utf-8
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import random
    from PIL import Image
    import pytesseract
    import os
    import cat
    
    def imagedownlod(url):
        """
        Save a screenshot of the movie's detail page. Only the box-office figure is needed, so image loading is disabled to speed things up.
        """
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        USER_AGENTS=[
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4882.400 QQBrowser/9.7.13059.400'
        ]
        # Pick a random User-Agent from USER_AGENTS to mimic a regular browser
        dcap["phantomjs.page.settings.userAgent"] = (random.choice(USER_AGENTS))
        # Skip image loading; pages load much faster without it
        dcap["phantomjs.page.settings.loadImages"] = False
        # Set both capabilities first, then create a single driver instance
        driver = webdriver.PhantomJS(desired_capabilities=dcap)
        try:
            driver.set_window_size(1366, 3245)
            driver.get(url)
            driver.save_screenshot("maoyan.png")
        finally:
            driver.quit()  # always shut down the PhantomJS process
    
    def crop_image(image_path,crop_path):
        """
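        The original plan was to use webdriver to locate the box-office element and compute the four crop coordinates from its position and size. The position could be read reliably, but the screenshots came out in different sizes, so the crop was off.
        So a different approach is used: every screenshot is resized to one fixed size. Because the box-office figure sits at a fixed position, this makes the scraper more robust.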
        """
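        # Absolute coordinates of the crop region
        left = 668
        top = 388
        right = 668+158
        bottom = 388+54
        # Open the image, normalize its size, crop the region, and save it
        img = Image.open(image_path)
        # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
        out = img.resize((1366, 3245),Image.LANCZOS)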
        out.save('maoyannew.png')
        im = Image.open('maoyannew.png')
        im = im.crop((left, top, right, bottom))
        im.save(crop_path)
        os.remove('maoyannew.png')
    
    def words(image):
        """
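        Because screenshots of different sizes are normalized to one size, pytesseract fails to recognize the digits in some of them.
        So the image is first converted to grayscale and then recognized with config="-psm 8 -c tessedit_char_whitelist=1234567890".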
        """
        im = Image.open(image).convert('L')
        im.save(image)
        # Note: Tesseract 4 and later spell this option "--psm"; "-psm" is the Tesseract 3.x form
        number =  pytesseract.image_to_string(Image.open(image),config="-psm 8 -c tessedit_char_whitelist=1234567890")
        os.remove(image)
        return number
    
    if __name__ == '__main__':
        Moviename = input("Enter the movie's English title: ")
        Moviename = Moviename.replace(' ','+')
        url = "http://maoyan.com/query?kw=%s&type=0" % Moviename
        detailurl = cat.movieurl(url)  # resolve the detail page once and reuse it
        Chinesename,boxofficeunit = cat.moveinfo(detailurl)
        imagedownlod(detailurl)
        crop_image('maoyan.png','piaofang.png')
        print(words('piaofang.png'))
        os.remove('maoyan.png')
    
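The `crop_image` docstring describes the original idea of deriving the four crop coordinates from the element's position and size. A minimal sketch of that arithmetic follows, assuming the position and size arrive as dictionaries shaped like Selenium's `WebElement.location` (`x`, `y`) and `WebElement.size` (`width`, `height`). The function `crop_box` and its scaling parameters are hypothetical and do not appear in the original script.

```python
def crop_box(location, size, shot_size, target_size=(1366, 3245)):
    """Return (left, top, right, bottom) in the normalized image.

    location, size: dicts shaped like Selenium's element.location and
    element.size. shot_size: (width, height) of the raw screenshot.
    target_size: the size every screenshot is resized to.
    """
    # Scale factors that map raw screenshot pixels to normalized pixels
    scale_x = target_size[0] / shot_size[0]
    scale_y = target_size[1] / shot_size[1]
    left = round(location["x"] * scale_x)
    top = round(location["y"] * scale_y)
    right = round((location["x"] + size["width"]) * scale_x)
    bottom = round((location["y"] + size["height"]) * scale_y)
    return (left, top, right, bottom)


if __name__ == "__main__":
    # A screenshot already at the target size reproduces the
    # hard-coded box used in crop_image.
    print(crop_box({"x": 668, "y": 388}, {"width": 158, "height": 54}, (1366, 3245)))
```

Scaling the coordinates instead of hard-coding them would only help if the element position can be read reliably for every page, which is the assumption the author found did not hold for the crop size.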

    [root@xxn maoyan]# cat catseye.py 
    #!/usr/bin/env python
    # coding=utf-8
    import cat
    import maoyan
    import os

    def main():
        moviename = input("Enter the movie's English title: ")
        # Chain the replacements so the second one does not discard the first
        Moviename = moviename.replace(' ','+')
        Moviename = Moviename.replace(':','%3A')
        url = "http://maoyan.com/query?kw=%s&type=0" % Moviename
        detailurl = cat.movieurl(url)  # resolve the detail page once and reuse it
        Chinesename,boxofficeunit = cat.moveinfo(detailurl)
        if boxofficeunit == 0:
            # A unit of 0 means no unit was found, so the box office is not yet
            # available and there is no need to crop the screenshot and OCR the digits
            print("English title you searched for: " + moviename)
            print("Chinese title: " + Chinesename)
            print("Box office: " + 'not available yet')
        else:
            maoyan.imagedownlod(detailurl)
            maoyan.crop_image('maoyan.png','piaofang.png')
            number = maoyan.words('piaofang.png')
            print("English title you searched for: " + moviename)
            print("Chinese title: " + Chinesename)
            print("Box office: " + str(number) + str(boxofficeunit))
            os.remove('maoyan.png')
    if __name__ == '__main__':
        main()
    
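The branch in `main` can also be expressed as a small pure function, which makes the "no box office" case easy to check without touching the network. This is a sketch only: `format_box_office` is a hypothetical name, and stripping whitespace from the OCR text is an assumption about what `pytesseract.image_to_string` may return, not something the original script does.

```python
def format_box_office(ocr_text, unit):
    """Hypothetical helper: combine the OCR digits with the unit string.

    unit == 0 is the sentinel that cat.moveinfo returns when the page has
    no unit element, which the script treats as "no box office yet".
    """
    if unit == 0:
        return "not available yet"
    # Assumption: OCR output may carry spaces or trailing newlines.
    digits = "".join(ocr_text.split())
    return digits + str(unit)


if __name__ == "__main__":
    print(format_box_office("1234\n", "万"))
    print(format_box_office("", 0))
```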

Test:

Article link: https://www.haomeiwen.com/subject/lxutsctx.html