Python 3 Crawler Project in Practice (2): Scraping Website Images via Ajax Requests

Author: Thunder_Storm | Published 2020-03-03 15:52


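This project uses Ajax requests to scrape images from a website. Ajax is a technique for exchanging data with the server and updating only part of a page, without reloading the whole page.
I won't go into Ajax itself here. Let's start from the page and analyze it concretely.
https://www.toutiao.com/
This is the Toutiao site. Type "街拍" (street photography) into the search box and click search to reach the results page. Press F12 to open the developer tools, click the Network tab, and select the XHR filter to see the requests below (if none appear, press Ctrl+R to refresh). XHR (XMLHttpRequest) is the request type the page uses for Ajax. Open the first request and you will see a lot of information, such as the Request URL. Here we only analyze that URL.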

[Figure: finding the Ajax request]

Request URL: https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1583219891329

The request carries these parameters: aid, app_name, offset, format, keyword, autoload, count, en_qc, cur_tab, from, pd, timestamp.
By constructing the same parameters ourselves, we can request the same content.
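Rather than reading the parameters off the URL by eye, the captured Request URL can be split with the standard library. This is only a minimal sketch; the variable names are illustrative and not part of the project code.

```python
from urllib.parse import urlsplit, parse_qs

# The Request URL copied from the Network tab above.
request_url = (
    "https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search"
    "&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20"
    "&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1583219891329"
)

# parse_qs maps each parameter name to a list of values and percent-decodes
# them, so the keyword comes back as readable text.
query = parse_qs(urlsplit(request_url).query)
params = {name: values[0] for name, values in query.items()}

for name, value in params.items():
    print(name, "=", value)
```

The decoded keyword is the search term typed into the box, and the tenth parameter is named from.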

[Figure: meaning of selected parameters]

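Many of these parameters are fixed, such as aid and app_name. I won't analyze them here; look them up if you are curious. A few are more interesting, such as offset.
If you scroll the page so that it loads more results, new Ajax requests appear, and their offset parameter changes: each additional page adds 20 to offset. So by varying offset we can fetch several pages of images. We build the list of offsets with group = ([x*20 for x in range(GROUP_START, GROUP_STOP+1)]).
The keyword parameter is the search term we typed, in URL-encoded (percent-encoded) form.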
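Both points can be checked in a few lines. The sketch below reuses GROUP_START and GROUP_STOP with the values from the project code; `offsets` and `encoded` are names chosen here for illustration.

```python
from urllib.parse import quote

GROUP_START = 0
GROUP_STOP = 5

# One offset per page of 20 results.
offsets = [x * 20 for x in range(GROUP_START, GROUP_STOP + 1)]
print(offsets)   # [0, 20, 40, 60, 80, 100]

# Percent-encode the search term as UTF-8, as the browser does.
encoded = quote("街拍")
print(encoded)   # %E8%A1%97%E6%8B%8D
```

The encoded string matches the keyword value in the captured Request URL.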
The last one is timestamp. In testing I found that it can be omitted and the images are still returned, but here is how to build it anyway:

timestamp = int(time.time() * 1000)  # milliseconds, like the 13-digit value in the captured URL

The parameters are built in the following form:

params = {
        'aid':'24',
        'app_name':'web_search',
        'offset':offset,
        'format':'json',
        'keyword':'街拍',
        'autoload':'true',
        'count':'20',
        'en_qc':'1',
        'cur_tab':'1',
        'from':'search_tab',
        'pd':'synthesis',
        # timestamp parameter; omitting it makes no difference
        #'timestamp':timestamp
    }

Join these parameters to the base URL:
url = 'https://www.toutiao.com/api/search/content/?'+urlencode(params)

and the site can then be requested. The full project code is given further down.
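The URL construction can be tried without sending any request. In the sketch below, `build_search_url` and `BASE_URL` are hypothetical names introduced for illustration; the parameter values are the ones listed above, with the key spelled from as in the captured Request URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.toutiao.com/api/search/content/?"

def build_search_url(offset, keyword="街拍"):
    """Hypothetical helper: return the search API URL for one page of results."""
    params = {
        "aid": "24",
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": "20",
        "en_qc": "1",
        "cur_tab": "1",
        "from": "search_tab",
        "pd": "synthesis",
    }
    # urlencode percent-encodes the values and joins the pairs with '&'.
    return BASE_URL + urlencode(params)

url = build_search_url(20)
print(url)
```

Printing the URL and comparing it with the one shown in the browser is a quick way to spot a misspelled parameter name.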
The first few runs hit some problems, as shown in the figure below:

[Figure: folder-name error]

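The error occurs because we use the article title as the folder name, and some titles contain characters that are not allowed in folder names.
Two naming problems have been fixed so far: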

#OSError: [Errno 22] The filename, directory name, or volume label syntax is incorrect.: 'jiepai/29张街拍，定格不一样的"空城"纸坊'
#OSError: [Errno 22] The filename, directory name, or volume label syntax is incorrect.: 'jiepai/街拍小技巧|让你的街拍照片秒变时尚大片！'

So the code includes some handling for this:
dir = dir.replace('"','\'').replace(' ','').replace('|','_')
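There are too many variants of this error, though; so far only these three characters are handled (double quote, space, and |).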
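One way to cover the remaining cases in a single step is to replace every character that Windows rejects in file and folder names. This is a sketch of an alternative, not the approach used in the project code; `safe_folder_name` and `INVALID_CHARS` are hypothetical names.

```python
import re

# Characters Windows rejects in file and folder names, plus control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def safe_folder_name(title):
    """Hypothetical helper: make an article title usable as a folder name."""
    cleaned = INVALID_CHARS.sub("_", title)
    # Windows also rejects names that end in a space or a dot.
    return cleaned.strip().rstrip(". ") or "untitled"

print(safe_folder_name('29张街拍，定格不一样的"空城"纸坊'))
print(safe_folder_name("街拍小技巧|让你的街拍照片秒变时尚大片！"))
```

The helper is meant to be applied to the title only, before it is joined with the "jiepai/" prefix, because it would also replace the path separator.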
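Setup: you need to create a "jiepai" folder in the directory you run the script from; otherwise you will get a folder-not-found error. (That's all I can think of for now; if other problems come up, leave a comment and we can sort them out.)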
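An alternative to creating the folder by hand is to let the script create it. os.makedirs creates missing parent folders, and with exist_ok=True it does not fail when the folder is already there, which also matters when several pool workers try to create the same folder at once. A minimal sketch, using a temporary directory as a stand-in for the working directory; "some_title" is a placeholder.

```python
import os
import tempfile

# A temporary directory stands in for the working directory here.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "jiepai", "some_title")

    # Creates both "jiepai" and "some_title" in one call.
    os.makedirs(target, exist_ok=True)
    # Calling it again is harmless because of exist_ok=True.
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)
    print(created)
```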
After running the code, you will find folders like the ones below inside your "jiepai" folder; open one and the images are inside.

[Figure: example of a successful crawl]

Here is the project source code:

import requests
from urllib.parse import urlencode
from multiprocessing import freeze_support
from multiprocessing.pool import Pool
import time
from hashlib import md5
import os
timestamp = int(time.time() * 1000)  # milliseconds, like the 13-digit value in the captured URL
def get_page(offset):
    params = {
        'aid':'24',
        'app_name':'web_search',
        'offset':offset,
        'format':'json',
        'keyword':'街拍',
        'autoload':'true',
        'count':'20',
        'en_qc':'1',
        'cur_tab':'1',
        'from':'search_tab',
        'pd':'synthesis',
        # timestamp parameter; omitting it makes no difference
        #'timestamp':timestamp
    }
    headers = {
        'cookie': 'tt_webid=6709993811802818062; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6709993811802818062; UM_distinctid=16bbfdeff8f460-0e02d1b4d98e6d-37607e02-1fa400-16bbfdeff90492; CNZZDATA1259612802=558541297-1562289443-https%253A%252F%252Fwww.google.com%252F%7C1562289443; __tasessionId=aorag2kb71562292191190; csrftoken=e4aae62081d10a9bb97fb5cd48e5cfa7; s_v_web_id=03c096aa5abb1e5a2f9edc5b4be5e8f3'
    }
    url = 'https://www.toutiao.com/api/search/content/?'+urlencode(params)
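    try:
        response = requests.get(url,headers=headers,timeout=10)
        if response.status_code == 200:
            return response.json()
    except (requests.RequestException, ValueError) as e:
        # network failure, timeout, or a response body that is not valid JSON
        print("error",e.args)
    # reached on any failure, including a non-200 status code
    return None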
def parse_page(json):
    if json and json.get('data'):  # get_page returns None when the request fails
        for item in json.get('data'):
            try:
                title = item.get('title')
                images = item.get('image_list')
            except:
                continue
            else:
                if title is None or images is None:
                    continue
                else:
                    for image in images:
                        yield {
                            'title': title,
                            'image': image.get('url')
                        }
def save_image(item):
    image =item.get('image')
    title=item.get('title')
    dir = "jiepai/"+title
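    #OSError: [Errno 22] The filename, directory name, or volume label syntax is incorrect.: 'jiepai/29张街拍，定格不一样的"空城"纸坊'
    #OSError: [Errno 22] The filename, directory name, or volume label syntax is incorrect.: 'jiepai/街拍小技巧|让你的街拍照片秒变时尚大片！'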
    dir = dir.replace('"','\'').replace(' ','').replace('|','_')
    if not os.path.exists(dir):
        os.mkdir(dir)
    try:
        response = requests.get(image)
        if response.status_code == 200:
            file_path = "{0}/{1}.{2}".format(dir,md5(response.content).hexdigest(),'jpg')
            print(file_path)
            if not os.path.exists(file_path):
                with open(file_path,'wb') as f:
                    f.write(response.content)
            else:
                print('Already downloaded',file_path)
    except requests.ConnectionError:
        print('Failed to save image')

def main(offset):
    print("main",offset)
    json = get_page(offset)
    for item in parse_page(json):
        print(item)
        save_image(item)


GROUP_START = 0
GROUP_STOP = 5
if __name__ == '__main__':
    freeze_support()
    pool = Pool()
    group = ([x*20 for x in range(GROUP_START, GROUP_STOP+1)])
    print(group)
    pool.map(main, group)
    pool.close()
    pool.join()
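
The parsing step can be tried without touching the network by feeding it a hand-written payload. The sketch below restates parse_page in a slightly condensed form; the sample dictionary is made up to match the fields the listing reads (data, title, image_list, url) and is not real API output.

```python
def parse_page(payload):
    """Yield one {'title', 'image'} dict per image found in the payload."""
    if not payload or not payload.get("data"):
        return
    for item in payload["data"]:
        title = item.get("title")
        images = item.get("image_list")
        # Skip entries that are not articles with images.
        if title is None or images is None:
            continue
        for image in images:
            yield {"title": title, "image": image.get("url")}

# Made-up sample shaped like the fields the listing reads.
sample = {
    "data": [
        {"title": "demo", "image_list": [{"url": "http://example.com/a.jpg"},
                                         {"url": "http://example.com/b.jpg"}]},
        {"title": "no images"},           # skipped: no image_list
        {"image_list": [{"url": "x"}]},   # skipped: no title
    ]
}

results = list(parse_page(sample))
print(results)

# A failed request (None) yields nothing instead of raising.
empty = list(parse_page(None))
print(empty)
```

Because parse_page is a generator, main can save each image as soon as it is yielded instead of building the full list first.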