Given a Movie's English Title, Scrape Its Chinese Title and Box Office from Maoyan with Python in 10 Minutes

Author: 2890bd62c72a | Source: published 2019-08-15 21:05, read 3 times


[root@xxn maoyan]# cat cat.py
#!/usr/bin/env python
#coding:utf-8

import requests
from bs4 import BeautifulSoup

def movieurl(url):
    """
    Get the URL of the movie's detail page from the search results.
    """
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36",
    }
    response = requests.get(url,headers=headers,timeout=10)
    soup= BeautifulSoup(response.text,'lxml')
    href = soup.find_all('div',class_="channel-detail movie-item-title")[0]
    movieurl = "http://maoyan.com%s" % href.find('a')['href']
    return movieurl

def moveinfo(url):
    """
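    Return the movie's Chinese title and the box office unit.
    If no unit is found, the page shows the box office as "暂无" (not available yet).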
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36",
    }
    response = requests.get(url, headers=headers,timeout=5)
    soup = BeautifulSoup(response.text, 'lxml')
    Chinesename = soup.find('div',class_="movie-brief-container").h3.string
    try:
        boxofficeunit = soup.find_all('div',class_="movie-index-content box")[0].find('span',class_='unit').string
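    except (IndexError, AttributeError):
        # IndexError: the box office block is missing; AttributeError: it has no unit span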
        boxofficeunit = 0
    return Chinesename,boxofficeunit

if __name__ == '__main__':
    Moviename = input("Enter the movie's English title: ")
    Moviename = Moviename.replace(' ','+')
    url = "http://maoyan.com/query?kw=%s&type=0" % Moviename
    Chinesename, boxofficeunit = moveinfo(movieurl(url))
    print(Chinesename, boxofficeunit)
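
To see what the two functions in cat.py are looking for without hitting the live site, here is a minimal offline sketch. The HTML snippets are hand-written stand-ins that only reuse the class names from the code above; the real Maoyan markup may differ, and the helper names `first_result_url` and `parse_detail` are made up for this illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from cat.py.
SEARCH_HTML = """
<div class="channel-detail movie-item-title"><a href="/films/1">First hit</a></div>
<div class="channel-detail movie-item-title"><a href="/films/2">Second hit</a></div>
"""
DETAIL_HTML = """
<div class="movie-brief-container"><h3>示例电影</h3></div>
<div class="movie-index-content box"><span class="unit">亿</span></div>
"""


def first_result_url(html, base="http://maoyan.com"):
    """Return the absolute URL of the first search hit, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # A class_ string with spaces matches the exact value of the class attribute.
    hits = soup.find_all("div", class_="channel-detail movie-item-title")
    if not hits:
        return None
    return urljoin(base, hits[0].find("a")["href"])


def parse_detail(html):
    """Return (Chinese title, box office unit); the unit is None when absent."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("div", class_="movie-brief-container").h3.string
    box = soup.find("div", class_="movie-index-content box")
    unit = box.find("span", class_="unit") if box else None
    return name, (unit.string if unit else None)


print(first_result_url(SEARCH_HTML))
print(parse_detail(DETAIL_HTML))
```

Returning None for a missing hit or unit is a choice made for this sketch; cat.py itself uses 0 as the "no unit" marker.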


[root@xxn maoyan]# cat maoyan.py
#!/usr/bin/env python
# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import random
from PIL import Image
import pytesseract
import os
import cat

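def imagedownlod(url):
    """
    Take a screenshot of the movie's detail page and save it as maoyan.png.
    We only need the box office figure, so image loading is disabled to speed things up.
    """
    # Note: webdriver.PhantomJS needs Selenium 3.x or earlier; it was removed in Selenium 4.
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    USER_AGENTS=[
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4882.400 QQBrowser/9.7.13059.400'
    ]
    # Pick a random User-Agent from USER_AGENTS to pose as a regular browser
    dcap["phantomjs.page.settings.userAgent"] = (random.choice(USER_AGENTS))
    # Skipping images makes the page load much faster
    dcap["phantomjs.page.settings.loadImages"] = False
    # Create the driver once, after all capabilities are set
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    try:
        driver.set_window_size(1366, 3245)
        driver.get(url)
        driver.save_screenshot("maoyan.png")
    finally:
        driver.quit()  # shut down the PhantomJS process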

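def crop_image(image_path,crop_path):
    """
    I originally wanted to use webdriver to locate the box office element and compute
    the four crop parameters from its position and size. The position comes back fine,
    but the screenshots differ in size, so the crop goes wrong.
    Instead, every screenshot is resized to the same dimensions. The box office figure
    sits at a fixed position, so this makes the scraper more robust.
    """
    # Absolute coordinates of the region to crop
    left = 668
    top = 388
    right = 668+158
    bottom = 388+54
    # Open the image, normalise its size, then crop the region and save it
    img = Image.open(image_path)
    # Image.LANCZOS is the high-quality filter formerly also exposed as Image.ANTIALIAS
    out = img.resize((1366, 3245),Image.LANCZOS)
    out.crop((left, top, right, bottom)).save(crop_path)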

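def words(image):
    """
    Because screenshots of different sizes are normalised, pytesseract fails to
    recognise the digits in some of them. So the image is converted to grayscale first
    and then read with config="-psm 8 -c tessedit_char_whitelist=1234567890".
    """
    im = Image.open(image).convert('L')
    im.save(image)
    # psm 8 treats the image as a single word; the whitelist restricts output to digits.
    # Tesseract 4 and later spell the option "--psm" rather than "-psm".
    number =  pytesseract.image_to_string(Image.open(image),config="-psm 8 -c tessedit_char_whitelist=1234567890")
    os.remove(image)
    return number.strip()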

if __name__ == '__main__':
    Moviename = input("Enter the movie's English title: ")
    Moviename = Moviename.replace(' ','+')
    url = "http://maoyan.com/query?kw=%s&type=0" % Moviename
    detailurl = cat.movieurl(url)  # resolve the detail page once and reuse it
    imagedownlod(detailurl)
    crop_image('maoyan.png','piaofang.png')
    print(words('piaofang.png'))
    os.remove('maoyan.png')
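
The normalise-then-crop idea can be tried without a browser or Tesseract. Below is a minimal sketch that runs the same resize, crop, and grayscale steps on a synthetic blank image; the canvas size and box come from maoyan.py, while `crop_fixed_region` and the stand-in image are assumptions made for this illustration.

```python
from PIL import Image

CANVAS = (1366, 3245)                   # size every screenshot is normalised to
BOX = (668, 388, 668 + 158, 388 + 54)   # left, top, right, bottom of the box office area


def crop_fixed_region(img):
    """Resize to the common canvas, cut out the fixed box, convert to grayscale."""
    normalised = img.resize(CANVAS, Image.LANCZOS)
    return normalised.crop(BOX).convert("L")


# A blank image half the canvas size stands in for a real screenshot.
shot = Image.new("RGB", (683, 1622), "white")
region = crop_fixed_region(shot)
print(region.size, region.mode)
```

Whatever the input size, the cropped region is always 158 by 54 pixels, which is what lets the OCR step treat every page the same way.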


[root@xxn maoyan]# cat catseye.py 
#!/usr/bin/env python
# coding=utf-8
import cat
import maoyan
import os

def main():
    moviename = input("Enter the movie's English title: ")
    # URL-encode spaces and colons; chain both replacements so neither is lost
    Moviename = moviename.replace(' ','+').replace(':','%3A')
    url = "http://maoyan.com/query?kw=%s&type=0" % Moviename
    detailurl = cat.movieurl(url)  # resolve the detail page once and reuse it
    Chinesename,boxofficeunit = cat.moveinfo(detailurl)
    if boxofficeunit == 0:
        # A unit of 0 means no unit was found, so the box office is not available yet
        # and there is no need to crop the screenshot and OCR the digits.
        print("English title you searched for: " + moviename)
        print("Chinese title: " + Chinesename)
        print("Box office: " + 'not available yet')
    else:
        maoyan.imagedownlod(detailurl)
        maoyan.crop_image('maoyan.png','piaofang.png')
        number = maoyan.words('piaofang.png')
        print("English title you searched for: " + moviename)
        print("Chinese title: " + Chinesename)
        print("Box office: " + str(number) + str(boxofficeunit))
        os.remove('maoyan.png')
if __name__ == '__main__':
    main()
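
Replacing spaces and colons by hand leaves other characters, such as `&`, unescaped. As a minimal sketch of an alternative, the standard library's `urlencode` can build the same query string; the helper name `search_url` is made up here, and the endpoint and parameters are simply the ones used in the scripts above.

```python
from urllib.parse import urlencode


def search_url(title, base="http://maoyan.com/query"):
    """Build the search URL, escaping every reserved character in the title."""
    return base + "?" + urlencode({"kw": title, "type": 0})


print(search_url("Thor: Ragnarok"))
# http://maoyan.com/query?kw=Thor%3A+Ragnarok&type=0
```

`urlencode` turns spaces into `+` and `:` into `%3A`, the same output the chained `replace` calls aim for, and it also handles characters those calls miss.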

Test:


Article link: https://www.haomeiwen.com/subject/lxutsctx.html
