Analyzing Ajax Requests to Scrape Street-Snap Photos from Toutiao (今日头条)

Author: 何苦_python_java | Source: published 2018-04-26 16:43 | Read 0 times

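This article uses Toutiao (今日头条) as an example of scraping page data by analyzing Ajax requests. The target is Toutiao's street-snap (街拍) photo galleries; once scraped, the images of each gallery are downloaded into their own local folder.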

1. Preparation

Before starting, make sure the Requests library is installed. If it is not, see the installation instructions in Chapter 1.
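
A quick way to confirm that the library can be found before running anything else is sketched below. It uses only the standard library; `is_installed` is a hypothetical helper name and not part of Requests.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    if is_installed('requests'):
        print('requests is available')
    else:
        print('requests is missing; install it with: pip install requests')
```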

2. Analyzing the target

Before scraping, we need to work out the scraping logic. Start by opening the Toutiao home page, http://www.toutiao.com/, shown here:

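[Figure: home page content]

There is a search box in the top-right corner. Since we want street-snap photos, type “街拍” into it and search. The results look like this: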

[Figure: search results]

This takes us to the search results page.

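Now open the developer tools and look at the network requests. Start with the first one, whose URL is the current page address: http://www.toutiao.com/search/?keyword=街拍. Open the Preview tab to view the Response Body. If the page content were rendered directly from this request, the source of this first response would have to contain the text shown in the results. To verify this, search the response for part of a result title, for example “路人”:

[Figure: searching the response for “路人”]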
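
The same check can be scripted. The following is a minimal sketch, assuming the page source is already available as a string; `rendered_in_source` and the two sample strings are hypothetical and only illustrate the idea.

```python
def rendered_in_source(html, keyword):
    """Return True if keyword appears literally in the raw page source."""
    return keyword in html


# Hypothetical samples: a server-rendered page contains the result text,
# while an Ajax-driven page ships only an empty container plus scripts.
static_page = '<div class="result"><a>路人街拍</a></div>'
ajax_page = '<div id="app"></div><script src="main.js"></script>'

print(rendered_in_source(static_page, '路人'))
print(rendered_in_source(ajax_page, '路人'))
```

If the keyword is missing from the source of the first request, as in the second sample, the visible results are probably filled in by later Ajax requests.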

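import os
from hashlib import md5
from multiprocessing.pool import Pool
from urllib.parse import urlencode

import redis
import requests

# Result pages to fetch: offsets run from GROUP_START * 20 to GROUP_END * 20.
GROUP_START = 1
GROUP_END = 5

# Redis client; not used by the functions below. Adjust the host to your environment.
rediscli = redis.StrictRedis(host='192.168.199.108', port=6379, db=0)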


def get_page(offset):
    # Query parameters of the Ajax search request; offset selects the page.
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '3',
        'from': 'gallery',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        pass
    # A non-200 status or a connection failure yields None.
    return None


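def get_images(json):
    # json is the parsed response dict, or None if the request failed.
    data = json.get('data') if json else None
    if data:
        for item in data:
            image_list = item.get('image_list')
            title = item.get('title')
            # Some entries carry no image_list or title; skip them.
            if not image_list or not title:
                continue
            for image in image_list:
                yield {
                    'image': image.get('url'),
                    'title': title
                }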


def save_image(item):
    # One folder per gallery title; exist_ok avoids a race between worker processes.
    os.makedirs(item.get('title'), exist_ok=True)
    try:
        local_image_url = item.get('image')
        # Swap the thumbnail path segment ('list') for the full-size one ('large').
        new_image_url = local_image_url.replace('list', 'large')
        # The URL is assumed to be protocol-relative, so the scheme is prepended.
        response = requests.get('http:' + new_image_url)
        if response.status_code == 200:
            # Name the file after the MD5 of its bytes so duplicates are detected.
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image')


def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
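
The URL and file-name handling in the script can be tried offline, without touching the network. The sketch below reuses the same query parameters and string operations. The helper names (`build_search_url`, `large_image_url`, `image_file_path`) and the sample image URL are hypothetical and exist only for illustration.

```python
from hashlib import md5
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_url(offset, keyword='街拍'):
    """Build the Ajax search URL for one page of results."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': '3',
        'from': 'gallery',
    }
    return 'https://www.toutiao.com/search_content/?' + urlencode(params)


def large_image_url(thumb_url):
    """Turn a protocol-relative thumbnail URL into a full-size one.

    str.replace swaps every occurrence of 'list', as the script does.
    """
    return 'http:' + thumb_url.replace('list', 'large')


def image_file_path(title, content):
    """Name the file after the MD5 of its bytes so identical images collide."""
    return '{0}/{1}.{2}'.format(title, md5(content).hexdigest(), 'jpg')


# Same offsets as the script with GROUP_START = 1 and GROUP_END = 5.
offsets = [x * 20 for x in range(1, 5 + 1)]
print(offsets)  # [20, 40, 60, 80, 100]

# urlencode percent-encodes the keyword; parse_qs decodes it again.
url = build_search_url(offsets[0])
print(parse_qs(urlparse(url).query)['keyword'])  # ['街拍']

# Hypothetical thumbnail URL on a reserved, non-resolving domain.
print(large_image_url('//example.invalid/list/abc'))  # http://example.invalid/large/abc

# Identical bytes always map to the same path, which is how duplicates are skipped.
print(image_file_path('demo', b'same bytes') == image_file_path('demo', b'same bytes'))  # True
```

With GROUP_START = 1 the offsets begin at 20, so the first page of results (offset 0) is skipped. Setting GROUP_START = 0 would include it.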