Python Web Crawler: Scraping Box Office or Popularity Data for Recent Movies!

Author: 14e61d025165 | Published 2019-07-14 14:59 | Read 9 times

Goal

To understand how data on a dynamic website can be obtained, this post walks through a simple analysis.

Notes

Python resource-sharing group: 484031800

The approach and the original code come from: https://book.douban.com/subject/27061630/

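Building the downloader

The downloader fetches raw content: the original web page and, for dynamic pages, the responses returned to the page's JS requests.

It only needs to imitate a browser by sending a sensible request header (a User-Agent) and return the page text.

The code is as follows: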

<pre>import requests
import chardet


class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        # Pretend to be a desktop browser so the site serves the normal page.
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        if r.status_code == 200:  # compare values with ==, not "is"
            # Guess the encoding from the raw bytes before decoding to text.
            r.encoding = chardet.detect(r.content)['encoding']
            return r.text
        return None
</pre>

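Building the parser

The parser extracts data from the downloaded content.

Box office figures, movie titles, and similar fields are all obtained through the parser.

The dynamic data being parsed comes from the responses to the page's JS requests.

The address of such a request can be found by opening the F12 developer console, going to Network --> JS, and inspecting the requests listed there.

For example, the address for a movie that is currently showing: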

http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http://movie.mtime.com/257982/&t=201907121611461266&Ajax_CallBackArgument0=257982
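
The address above is a fixed endpoint plus a handful of query parameters, so it can be assembled from a movie page URL and its id. The following is only a sketch: `build_rating_url` and `API_BASE` are hypothetical names, and the `1266` suffix on the timestamp is copied from the example address without knowing what it represents.

```python
import time
from urllib.parse import urlencode

API_BASE = 'http://service.library.mtime.com/Movie.api'


def build_rating_url(movie_url, movie_id, now=None):
    """Assemble the rating request URL for one movie page (hypothetical helper)."""
    # Timestamp shaped like the example: YYYYmmddHHMMSS plus the literal "1266".
    t = time.strftime('%Y%m%d%H%M%S1266', now or time.localtime())
    params = [
        ('Ajax_CallBack', 'true'),
        ('Ajax_CallBackType', 'Mtime.Library.Services'),
        ('Ajax_CallBackMethod', 'GetMovieOverviewRating'),
        ('Ajax_CrossDomain', '1'),
        ('Ajax_RequestUrl', movie_url),
        ('t', t),
        ('Ajax_CallBackArgument0', movie_id),
    ]
    # safe=':/' keeps the embedded page URL unescaped, as in the example address.
    return API_BASE + '?' + urlencode(params, safe=':/')


print(build_rating_url('http://movie.mtime.com/257982/', '257982'))
```

Passing the parameters as a list of pairs keeps them in the same order as the address observed in the browser.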

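From the returned text, extract the JSON-formatted part, then use the json module to read the box office and other fields.

A JSON viewer such as https://www.json.cn/ helps with inspecting the structure.

Movies that have not been released yet are handled the same way.

The two kinds of data differ in structure, so separate function branches handle the different cases that can come up during parsing.

The code is as follows: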

<pre>import re
import json


class HtmlParser(object):
    def parser_url(self, page_url, response):
        # Return unique (movie page URL, movie id) pairs found in the page.
        if response is None:
            return []
        pattern = re.compile(r'(http://movie\.mtime\.com/(\d+)/)')
        return list(set(pattern.findall(response)))  # remove duplicates

    def parser_json(self, url, response):
        # url is the JS request URL; response is the text it returned.
        # The JSON payload sits between the first "=" and the first ";".
        match = re.search(r'=(.*?);', response)
        if match is None:
            return None
        value = json.loads(match.group(1))
        if value.get('value').get('isRelease'):
            return self.parser_json_release(value, url)
        return self.parser_json_notRelease(value, url)

    def parser_json_release(self, value, url):
        # Released movie: read the rating and, when present, the box office.
        isRelease = 1
        movieTitle = value.get('value').get('movieTitle')
        RatingFinal = value.get('value').get('movieRating').get('RatingFinal')
        try:
            TotalBoxOffice = value.get('value').get('boxOffice').get('TotalBoxOffice')
            TotalBoxOfficeUnit = value.get('value').get('boxOffice').get('TotalBoxOfficeUnit')
        except AttributeError:  # no boxOffice entry in the response
            TotalBoxOffice = "None"
            TotalBoxOfficeUnit = "None"
        return isRelease, movieTitle, RatingFinal, TotalBoxOffice, TotalBoxOfficeUnit, url

    def parser_json_notRelease(self, value, url):
        # Unreleased movie: store the popularity ranking in the rating slot.
        isRelease = 0
        movieTitle = value.get('value').get('movieTitle')
        try:
            RatingFinal = value.get('value').get('hotValue').get('Ranking')
        except AttributeError:  # no hotValue entry in the response
            RatingFinal = -1
        TotalBoxOffice = 'None'
        TotalBoxOfficeUnit = 'None'
        return isRelease, movieTitle, RatingFinal, TotalBoxOffice, TotalBoxOfficeUnit, url
</pre>
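
To make the extraction step concrete, here is a small sketch run against a made-up response. The `var ... = {...};` shape is an assumption inferred from the `=(.*?);` pattern used above. The field names follow the parser code, while the values and the helper name `extract_payload` are invented for illustration.

```python
import json
import re

# Made-up response text; the values are placeholders, not real site data.
sample = ('var result_2019 = { "value": { "isRelease": true, '
          '"movieTitle": "Example Movie", '
          '"movieRating": { "RatingFinal": 7.5 }, '
          '"boxOffice": { "TotalBoxOffice": "1.23", '
          '"TotalBoxOfficeUnit": "100M" } } };')


def extract_payload(response):
    """Return the parsed JSON between the first '=' and the first ';', or None."""
    match = re.search(r'=(.*?);', response)
    if match is None:
        return None
    return json.loads(match.group(1))


payload = extract_payload(sample)
info = payload['value']
print(info['movieTitle'], info['movieRating']['RatingFinal'])

# boxOffice may be absent, so fall back to an empty dict before reading it.
box = info.get('boxOffice') or {}
print(box.get('TotalBoxOffice'), box.get('TotalBoxOfficeUnit'))
```

Because the match is lazy and stops at the first semicolon, a response whose JSON itself contains a `;` would be cut short and `json.loads` would raise an error, which is worth keeping in mind when a particular movie fails to parse.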

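Building the data store

The storage backend is SQLite, which is why the parser represents isRelease as 0 and 1.

Storing requires connecting with sqlite3, creating the database and table, executing SQL statements, inserting rows, and so on.

Following the original author's approach, rows are first buffered in memory; once more than 10 have accumulated, they are inserted into the SQLite database.

The code is as follows: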

<pre>import sqlite3


class DataOutput(object):
    def __init__(self):
        self.cx = sqlite3.connect("MTime.db")
        self.create_table('MTime')
        self.datas = []  # in-memory buffer of rows waiting to be written

    def create_table(self, table_name):
        values = '''
        id integer primary key autoincrement,
        isRelease boolean not null,
        movieTitle varchar(50) not null,
        RatingFinal_HotValue real not null default 0.0,
        TotalBoxOffice varchar(20),
        TotalBoxOfficeUnit varchar(10),
        sourceUrl varchar(300)
        '''
        self.cx.execute('create table if not exists %s(%s)' % (table_name, values))

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)
        if len(self.datas) > 10:
            self.output_db('MTime')

    def output_db(self, table_name):
        # "?" placeholders let sqlite3 quote the values, so titles containing
        # quotes do not break the statement.
        cmd = ("insert into %s (isRelease,movieTitle,RatingFinal_HotValue,"
               "TotalBoxOffice,TotalBoxOfficeUnit,sourceUrl) "
               "values (?,?,?,?,?,?)" % table_name)
        self.cx.executemany(cmd, self.datas)
        self.cx.commit()
        # Clear the buffer in one step; removing items while looping over
        # the list would skip rows.
        self.datas = []

    def output_end(self):
        if len(self.datas) > 0:
            self.output_db('MTime')
        self.cx.close()
</pre>
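
The buffer-then-flush idea can be tried without touching a database file. The sketch below uses an in-memory SQLite database and a reduced two-column table; `BufferedWriter`, `BATCH_SIZE`, and the `movies` table are hypothetical names chosen for the illustration, not part of the original code.

```python
import sqlite3

BATCH_SIZE = 10  # flush once more than this many rows are buffered


class BufferedWriter:
    """Hypothetical stand-in for the storage class, backed by an in-memory database."""

    def __init__(self, conn):
        self.conn = conn
        self.buffer = []
        conn.execute('create table if not exists movies '
                     '(id integer primary key autoincrement, '
                     'isRelease integer, movieTitle text)')

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) > BATCH_SIZE:
            self.flush()

    def flush(self):
        # Placeholders make sqlite3 handle quoting, so a title with an
        # apostrophe is stored as-is.
        self.conn.executemany(
            'insert into movies (isRelease, movieTitle) values (?, ?)',
            self.buffer)
        self.conn.commit()
        self.buffer.clear()


conn = sqlite3.connect(':memory:')
writer = BufferedWriter(conn)
for i in range(25):
    writer.add((i % 2, "Movie '%d'" % i))
writer.flush()  # write whatever is still buffered at the end

count = conn.execute('select count(*) from movies').fetchone()[0]
print(count)  # 25
```

Clearing the buffer only after the whole batch has been written, rather than removing rows one at a time during the loop, keeps every buffered row from being skipped.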

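Main program

Create the objects above during initialization.

Then fetch the root URL and extract the hundred-plus movie URLs found on that page.

Parse each movie URL in turn and store the result.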

<pre>import HtmlDownloader
import HtmlParser
import DataOutput
import time


class Spider(object):
    def __init__(self):
        self.downloader = HtmlDownloader.HtmlDownloader()
        self.parser = HtmlParser.HtmlParser()
        self.output = DataOutput.DataOutput()

    def crawl(self, root_url):
        content = self.downloader.download(root_url)
        urls = self.parser.parser_url(root_url, content) or []
        for url in urls:
            print('.')
            # url is a (movie page URL, movie id) pair from parser_url.
            t = time.strftime("%Y%m%d%H%M%S1266", time.localtime())
            rank_url = 'http://service.library.mtime.com/Movie.api'\
                '?Ajax_CallBack=true'\
                '&Ajax_CallBackType=Mtime.Library.Services'\
                '&Ajax_CallBackMethod=GetMovieOverviewRating'\
                '&Ajax_CrossDomain=1'\
                '&Ajax_RequestUrl=%s'\
                '&t=%s'\
                '&Ajax_CallBackArgument0=%s' % (url[0], t, url[1])
            rank_content = self.downloader.download(rank_url)
            try:
                data = self.parser.parser_json(rank_url, rank_content)
            except Exception:
                # Report the failing address and skip it, so no undefined
                # or stale row gets stored.
                print(rank_url)
                continue
            self.output.store_data(data)
        self.output.output_end()
        print('ed')


if __name__ == '__main__':
    spider = Spider()
    spider.crawl('http://theater.mtime.com/China_Beijing/')
</pre>
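
The crawl loop reads `url[0]` and `url[1]` because a pattern with two capture groups makes `findall` return tuples, one item per group. A short sketch on a made-up HTML fragment (the ids `111` and `222` are placeholders, not real listings) shows the shape of the result and the effect of deduplicating with `set`:

```python
import re

# Made-up HTML fragment with one duplicated link.
html = ('<a href="http://movie.mtime.com/111/">A</a>'
        '<a href="http://movie.mtime.com/222/">B</a>'
        '<a href="http://movie.mtime.com/111/">A again</a>')

# Outer group: the full page URL; inner group: the numeric movie id.
pattern = re.compile(r'(http://movie\.mtime\.com/(\d+)/)')

# set() drops the duplicate pair; sorted() only makes the order predictable.
pairs = sorted(set(pattern.findall(html)))
for page_url, movie_id in pairs:
    print(page_url, movie_id)
```

A `set` does not preserve order, so the movies are visited in an arbitrary order unless the pairs are sorted as done here.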

Current result

As shown below:

[image]
