Python Crawler | Scraping Dynamically Loaded Websites

Author: YocnZhao | Published 2019-05-23 16:34

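The previous post (https://www.jianshu.com/p/bbf4386f7527) covered how to scrape static websites. While scraping, we may find that some sites do not put their content in the HTML at all; instead, they load it dynamically via AJAX.
Take http://tu.duowan.com/gallery/138916.html#p1 as an example.
In the browser it is easy to find the address of the original image, so we eagerly request the page with our crawler, only to find that the response contains no image address at all. Baffling. Compare the two figures below.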

[Figure: the page in the browser's F12 developer tools]

[Figure: the HTML returned to the crawler]
Clearly, the HTML we fetched contains no image addresses, while the page in the browser does.
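
The same comparison can be made in code rather than by eye. Below is a minimal sketch, written for Python 3, that lists every `<img>` source in an HTML string using the standard library's `html.parser`. The two HTML snippets and the helper names (`ImgSrcCollector`, `find_image_sources`) are made up for illustration; they are not the real duowan pages.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_image_sources(html_text):
    """Return the list of <img src=...> values found in html_text."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Hypothetical stand-ins: what a crawler might receive vs. what the
# browser shows after its scripts have filled in the gallery.
static_html = '<html><body><div id="gallery"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><div id="gallery"><img src="http://example.com/a.jpg"></div></body></html>'

print(find_image_sources(static_html))
print(find_image_sources(rendered_html))
```

If the list comes back empty for the HTML your crawler fetched but the browser clearly shows images, the content is most likely being filled in by a script after the page loads.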
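The addresses are missing because the site uses AJAX, that is, XMLHttpRequest (XHR), to load them after the page itself has loaded. So how do we find the real addresses?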

[Figure: XHR requests shown in the browser's F12 tools]
Among the XHR requests we can find the request URL: http://tu.duowan.com/index.php?r=show/getByGallery/&gid=138916&_=1558600256687. Requesting this link returns JSON:

[Figure: where the addresses really live]
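
The XHR URL has two variable parts: the gallery id (`gid`) and the `_` parameter. The `_` value looks like a millisecond timestamp, which browsers commonly append to avoid cached responses; that reading is an assumption, not something the site documents. Under that assumption, a small Python 3 sketch for building the URL for any gallery could look like this (`API_TEMPLATE` and `build_gallery_url` are illustrative names):

```python
import time

# URL pattern copied from the XHR request seen in the browser.
API_TEMPLATE = "http://tu.duowan.com/index.php?r=show/getByGallery/&gid=%d&_=%d"


def build_gallery_url(gallery_id, now=None):
    """Build the XHR URL for one gallery.

    now is a Unix time in seconds; it defaults to the current time.
    The "_" parameter is assumed to be that time in milliseconds.
    """
    if now is None:
        now = time.time()
    millis = int(round(now * 1000))
    return API_TEMPLATE % (gallery_id, millis)


print(build_gallery_url(138916, now=1558600256.687))
```

Passing `now` explicitly keeps the function easy to test; in a real crawl you would simply call `build_gallery_url(gid)`.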
That makes things easy. Now that we have found the real addresses, we can proceed the same way as before.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Note: this script targets Python 2 (urllib2, print statements).

import urllib2, urllib, os, time, json


class Pic:
    """One picture to download: source URL, description, and local file path."""

    def __init__(self, url, desc, path):
        self.url = url
        self.desc = desc
        self.path = path


# Root directory for downloads; one sub-directory is created per gallery.
local_root = "/Users/y/PythonWorkSpace/DUOWAN/"


def test():
    # 20000-20700
    start_index = 137882
    end_index = 138930
    for i in range(start_index, end_index):
        download_pic(i)
    return


def download_pic(index):
    # The "_" parameter roughly mimics the timestamp the browser appends to the XHR request.
    curr_time = str(time.time()).replace(".", "0")
    url = "http://tu.duowan.com/index.php?r=show/getByGallery/&gid=%d&_=%s" % (index, curr_time)
    print "Starting task %s" % url
    # Request takes three parameters: url, data, headers. Without the data
    # argument, the headers have to be added afterwards, as done here.
    request = urllib2.Request(url)
    request.add_header("User-Agent",
                       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
    request.add_header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7,fr;q=0.6")
    request.add_header("Accept",
                       "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")

    response = urllib2.urlopen(request)
    # print response.code
    if response.code != 200:
        return
    html = response.read()
    if html.strip() == '':
        return
    # Named "data" rather than "dict" so the built-in dict type is not shadowed.
    data = json.loads(html, encoding="GBK")
    # print data.keys()
    # print data[u'picInfo']
    pic_list = []
    pic_info = data[u'picInfo']
    current_dir = local_root + str(index) + "/"
    for info in pic_info:
        source = info[u'source']
        desc = info[u'add_intro']
        if source.endswith("gif"):
            suffix = '.gif'
        elif source.endswith("jpg"):
            suffix = '.jpg'
        else:
            # Skip this one picture instead of abandoning the whole gallery.
            continue
        path = current_dir + desc + suffix
        pic = Pic(source, desc, path)
        pic_list.append(pic)

    for pic in pic_list:
        if not os.path.exists(current_dir):
            # makedirs also creates missing parent directories.
            os.makedirs(current_dir)
        print "------------- start download ---------------", pic.url, pic.path
        urllib.urlretrieve(pic.url, pic.path)

    print 'Taking a break: sleeping 3s'
    time.sleep(3)
    return


if __name__ == "__main__":
    test()
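
The script above is Python 2. For readers on Python 3, here is a hedged sketch of just the JSON-handling step, separated from the network calls so it can be tried offline. It relies on the `picInfo`, `source`, and `add_intro` fields used above; the sample payload, the `plan_downloads` name, and the file-name sanitizing rule are assumptions added for illustration.

```python
import json
import os
import re


def plan_downloads(payload_text, target_dir):
    """Turn the gallery JSON text into a list of (image_url, local_path) pairs."""
    payload = json.loads(payload_text)
    plan = []
    for info in payload.get("picInfo", []):
        source = info.get("source", "")
        extension = os.path.splitext(source)[1].lower()
        if extension not in (".gif", ".jpg"):
            # Skip only this picture; the rest of the gallery is still processed.
            continue
        # Replace characters that are unsafe in file names (including "/")
        # with underscores, so a description cannot change the target path.
        name = re.sub(r'[\\/:*?"<>|\s]+', "_", info.get("add_intro", "")).strip("_")
        file_name = (name or "untitled") + extension
        plan.append((source, os.path.join(target_dir, file_name)))
    return plan


# Hypothetical payload shaped like the response described above.
sample = json.dumps({"picInfo": [
    {"source": "http://example.com/a/1.jpg", "add_intro": "first pic"},
    {"source": "http://example.com/a/2.GIF", "add_intro": "a/b"},
    {"source": "http://example.com/a/3.png", "add_intro": "skipped"},
]})

for url, path in plan_downloads(sample, "out"):
    print(url, "->", path)
```

Each resulting pair can then be handed to `urllib.request.urlretrieve(url, path)` after creating the directory with `os.makedirs(target_dir, exist_ok=True)`. Note that two pictures with the same description would still map to the same file name, just as in the original script.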

And that's a wrap~~~