Scraping Maoyan Movies with Python

Author: 大吉大利马大宝 | Published 2017-09-28 17:16


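This summer I moved from Dongying to Xi'an, and the Xi'an summer was hot enough to finish a person off. I finally dragged myself through to autumn, ha ha ha. I've been here almost three months and National Day is coming up, so while the highways are toll-free I'm going to rush B back to Dongying for a couple of days of being waited on hand and foot. A few hands of cards every day and a little something to drink: what's not to love?
Lately I've been stealing spare moments to learn a bit of Python, mainly because I'll definitely need data collection later on (wo ai xue xi, "I love studying").
The goal this time is to scrape the information for every movie on the Maoyan Top 100 board and save it to a txt file. It's fairly simple and makes good practice for beginners.
First, a rough analysis of the overall idea. Please enjoy this painting by Dabao, leading light of the abstract school.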

无标题.png

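The flow is: the client (our computer) sends a request --------> to the server (Maoyan) -------> the server responds --------> and returns the result to us --------> we take the result and shape it the way we want.
Page URL: http://maoyan.com/board/4?offset=?????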
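The ????? in the URL is the offset value. As a small sketch, assuming the offsets run 0, 10, ..., 90 (ten pages of ten movies, which is what the full script loops over), the page URLs could be built like this. The names BASE_URL and page_urls are illustrative helpers of mine, not part of the original script.

```python
BASE_URL = 'http://maoyan.com/board/4?offset='


def page_urls(pages=10, per_page=10):
    """Build one board URL per page (hypothetical helper, not in the original script)."""
    return [BASE_URL + str(page * per_page) for page in range(pages)]


urls = page_urls()
print(urls[0])   # first page of the board
print(urls[-1])  # last page of the board
```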

TIM截图20170928154503.png

As you can see, all the information for a single movie sits inside one <dd>..</dd> element, so that is the block we scrape.
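To make the <dd> idea concrete, here is a minimal offline sketch that runs the script's regular expression against a made-up fragment. The SAMPLE markup is my own imitation of the structure the regex expects (class names taken from the pattern itself); it is not copied from the real page, and the movie data in it is invented.

```python
import re

# Made-up fragment imitating one <dd> block; the values are placeholders.
SAMPLE = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" title="Example Movie" class="image-link">
    <img data-src="http://example.com/poster.jpg" alt="Example Movie" class="board-img" />
  </a>
  <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
'''

# Same pattern as the script; each (...) group captures one field.
# re.S lets "." match newlines, so one match can span a whole <dd> block.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)" .*?name">.*?">(.*?)</a>'
    r'.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

matches = PATTERN.findall(SAMPLE)
first = matches[0]
movie = {
    'index': first[0],
    'image': first[1],
    'title': first[2],
    'actor': first[3].strip()[3:],  # drop the 3-character label in front of the names
    'time': first[4].strip()[5:],   # drop the 5-character label in front of the date
    'score': first[5] + first[6],   # integer part plus fractional part
}
print(movie)
```

Every .*? is non-greedy, so each piece stops at the first place the next literal can match. That keeps one match from swallowing several <dd> blocks.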
The full code is below:

import requests
from requests.exceptions import RequestException  # raised when a request fails
import re  # regular expressions
import json
from multiprocessing import Pool  # process pool (worker processes, not threads)

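# Fetch the HTML source of one page
def get_one_page(url):
    try:
        # The request itself can raise, so it has to sit inside the try block
        response = requests.get(url, timeout=10)
        if response.status_code == 200:  # 200 means the request succeeded
            return response.text
        return None
    except RequestException:
        return None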
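# Main function: scrape one page and save its entries
def main(offset):
    # offset selects the page: 10 pages in total, passed in as 0, 10, ..., 90
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)  # this is where each movie gets written to the file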
# Write one record to the file
def write_to_file(content):
    # 'a' opens result.txt in append mode; the with block closes the file for us
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

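# Extract the movie fields from the page
def parse_one_page(html):
    # Regex match (I don't really understand this part yet, time to hit the books).
    # Each (...) group captures one field; re.S lets "." match newlines.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)" .*?name">.*?">(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        # Turn each match into the format we like
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the first 3 characters (the label before the names)
            'time': item[4].strip()[5:],   # drop the first 5 characters (the label before the date)
            'score': item[5] + item[6],    # integer part plus fractional part
        }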
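# Entry point
if __name__ == '__main__':
    # Plain sequential version:
    # for i in range(10):
    #     main(i * 10)
    # Parallel version: multiprocessing.Pool runs main in several worker processes
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])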
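
A note on the Pool used above: multiprocessing.Pool starts worker processes, not threads. If threads are what you want (the work here is mostly waiting on the network), one possible sketch uses concurrent.futures.ThreadPoolExecutor from the standard library. The names fetch_page and run_in_threads are hypothetical; fetch_page is only a stand-in so the example runs offline, and in the script the worker would be main.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(offset):
    """Stand-in for main(offset): returns a label instead of hitting the network."""
    return 'page with offset ' + str(offset)


def run_in_threads(worker, offsets, max_workers=5):
    """Call worker(offset) for every offset on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, offsets))


results = run_in_threads(fetch_page, [i * 10 for i in range(10)])
print(results[0])
```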

The final result: the result.txt file
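
Each line of result.txt is one JSON object, so the file is easy to load back. Because the pool workers finish in no fixed order, the lines are not guaranteed to be in rank order; the sketch below sorts them by index after loading. The helper name load_results and the demo record are my own, and the demo writes to a temporary file so it does not touch a real result.txt.

```python
import json
import os
import tempfile


def load_results(path):
    """Read a file with one JSON object per line into a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a throwaway file holding one made-up record
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    record = {'index': '1', 'title': 'Example Movie', 'score': '9.6'}
    f.write(json.dumps(record, ensure_ascii=False) + '\n')

movies = load_results(demo_path)
movies.sort(key=lambda m: int(m['index']))  # restore rank order
print(movies[0]['title'])
```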