Instagram Crawler Notes (Part 1)

Author: 10f0a5e89b6d | Published 2018-08-26 14:42

Hello everyone! I'm an IU fan. One day, while scrolling through IU's Instagram feed, I got the idea of crawling the photos and videos she has posted. No sooner said than done, so off we go!

Environment: Python 3.7 on Windows 10.
PS: the full source code is attached at the end of this post.
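Apart from the standard library, the code below only relies on the third-party packages requests and lxml; if they are not installed yet, pip install requests lxml should take care of it.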


Open and analyze the target page

First, open IU's profile page: https://www.instagram.com/dlwlrma/

(Screenshot: IU's profile page)
I opened the page in Firefox and pressed F12 to bring up the developer tools. Inspecting the elements shows that every image link on the page sits inside a div with class="v1Nh3 kIKUG _bz0w", as shown below:

OK, now let's switch to the page source in the debugger and search for that div. It does not appear anywhere in the raw source, so we can conclude that the div is rendered dynamically.
(Screenshot: searching the page source for the div)
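As a quick sanity check, the same thing can be confirmed from Python: the class string of the grid div simply is not in the static HTML the server returns. A minimal sketch (the User-Agent string is arbitrary, and depending on your network you may also need the proxy settings used in the full script below):

import requests

# Fetch the raw profile HTML; no JavaScript is executed here, so we see exactly
# what the server sends back.
raw_html = requests.get('https://www.instagram.com/dlwlrma/',
                        headers={'User-Agent': 'Mozilla/5.0'}).text
print('v1Nh3 kIKUG _bz0w' in raw_html)  # expected: False (the grid is rendered client-side)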
So the question becomes: the data must be embedded somewhere in the page's HTML in a form we haven't identified yet. Let's track down where it actually lives. Copy the name of one of the jpg files we saw earlier, for example 39099041_724879754520599_610565124800905216_n.jpg, and search the page source for that name instead.

Haha, nice, found it! This line is clearly where the image data comes from. OK, let's copy the whole line out and see what it actually is.
Huh, it's a window._sharedData blob wrapped in a script tag, and, conveniently, it looks like JSON.

OK, now pull window._sharedData out on its own, pretty-print it, and take a look at the data:
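The full blob is large; the slice that matters for this post is shaped roughly like this (an abbreviated sketch reconstructed from the keys used in the code further down; everything else is omitted and the URL is just a placeholder):

window._sharedData = {
    "entry_data": {
        "ProfilePage": [{
            "graphql": {
                "user": {
                    "edge_owner_to_timeline_media": {
                        "edges": [
                            {"node": {"display_url": "https://.../39099041_724879754520599_610565124800905216_n.jpg"}},
                            ...
                        ]
                    }
                }
            }
        }]
    }
};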

Heh, pretty neat, isn't it?
Now let's write some code! The first small goal is simply to grab the window._sharedData blob. Easy enough, right?
# -*- coding: utf-8 -*- 
from lxml import etree
import requests

# Request headers that imitate a normal browser visit.
headers = {
    "Origin": "https://www.instagram.com/",
    "Referer": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Host": "www.instagram.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    # Only advertise encodings that requests can decode out of the box.
    "accept-encoding": "gzip, deflate",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instagram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

BASE_URL = 'https://www.instagram.com/dlwlrma/'

# Local proxy used to reach Instagram; adjust or remove to match your own network setup.
proxy = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080'
}


def crawler():
    try:
        # Fetch the profile page and parse the HTML.
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        # Grab the text of every inline <script> tag.
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        for js_tag in all_js_tags:
            # The blob we want starts with "window._sharedData".
            if js_tag.strip().startswith('window._sharedData'):
                print(js_tag)
    except Exception as e:
        print("An exception occurred!")
        raise e


if __name__ == '__main__':
    crawler()

Look at the output: we have successfully grabbed the window._sharedData blob!


Analyzing window._sharedData shows that the jpg links live under ["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["edges"]. OK, let's process the data once more and collect all of the jpg links:
def crawler():
    try:
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        new_imgs_url = []
        for js_tag in all_js_tags:
            if js_tag.strip().startswith('window._sharedData'):
                # Drop the trailing ';' and the "window._sharedData = " prefix,
                # keeping only the JSON object (the leading '{' is re-added below).
                data = js_tag[:-1].split('= {')[1]
                js_data = json.loads('{' + data)
                edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"][
                    "edges"]
                # Each edge is one post; display_url points to the full-size image.
                for edge in edges:
                    new_imgs_url.append(edge["node"]["display_url"])
        for i in new_imgs_url:
            print(i)

    except Exception as e:
        print("An exception occurred!")
        raise e
(Screenshot: the extracted jpg links)

OK, now let's write a download method to fetch the links we collected:

def download(imgs_urls, save_img_path):
    for i in imgs_urls:
        print(i)
        # Pull the original filename out of the URL (name_re is defined in the full script below).
        img_name = name_re.findall(i)[0]
        img_mp_path = save_img_path + img_name
        print("Downloading " + img_name)
        with open(img_mp_path, 'wb+') as f:
            f.write(requests.get(i, proxies=proxy).content)
            time.sleep(1)

And that's it, the download is done. Go have a look at the images in the folder!
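As a side note, instead of matching the filename with a regular expression (name_re in the full script below), you could also just take the last segment of the URL path with the standard library. A minimal sketch, with filename_from_url being a hypothetical helper rather than part of the script:

from urllib.parse import urlparse

def filename_from_url(url):
    # The media filename is the last path segment; urlparse strips any query string.
    return urlparse(url).path.rsplit('/', 1)[-1]

# filename_from_url('https://.../39099041_724879754520599_610565124800905216_n.jpg')
# returns '39099041_724879754520599_610565124800905216_n.jpg'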



Full source code for this post

# -*- coding: utf-8 -*-
from lxml import etree
import re
import json
import requests
import time

# Request headers that imitate a normal browser visit.
headers = {
    "Origin": "https://www.instagram.com/",
    "Referer": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Host": "www.instagram.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    # Only advertise encodings that requests can decode out of the box.
    "accept-encoding": "gzip, deflate",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instagram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

BASE_URL = 'https://www.instagram.com/dlwlrma/'

# Local proxy used to reach Instagram; adjust or remove to match your own network setup.
proxy = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080'
}

# Matches the media filename at the end of a CDN URL, e.g. 39099041_..._n.jpg or ..._n.mp4.
name_re = re.compile(r'[0-9_]+[a-zA-Z]+\.(?:jpg|mp4)')
save_path = 'E:/IU/'  # make sure this folder exists before running

def crawler():
    try:
        # Fetch the profile page and parse the HTML.
        res = requests.get(BASE_URL, headers=headers, proxies=proxy)
        html = etree.HTML(res.content.decode())
        # Grab the text of every inline <script> tag.
        all_js_tags = html.xpath('//script[@type="text/javascript"]/text()')
        new_imgs_url = []
        for js_tag in all_js_tags:
            # The blob we want starts with "window._sharedData".
            if js_tag.strip().startswith('window._sharedData'):
                # Drop the trailing ';' and the "window._sharedData = " prefix,
                # keeping only the JSON object (the leading '{' is re-added below).
                data = js_tag[:-1].split('= {')[1]
                js_data = json.loads('{' + data)
                edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"][
                    "edges"]
                # Each edge is one post; display_url points to the full-size image.
                for edge in edges:
                    new_imgs_url.append(edge["node"]["display_url"])
        for i in new_imgs_url:
            print(i)
        download(new_imgs_url, save_path)

    except Exception as e:
        print("An exception occurred!")
        raise e


def download(imgs_urls, save_img_path):
    for i in imgs_urls:
        print(i)
        # Pull the original filename out of the URL.
        img_name = name_re.findall(i)[0]
        img_mp_path = save_img_path + img_name
        print("Downloading " + img_name)
        with open(img_mp_path, 'wb+') as f:
            f.write(requests.get(i, proxies=proxy).content)
            # Be polite: wait a second between downloads.
            time.sleep(1)


if __name__ == '__main__':
    crawler()

PS: This is my first post on Jianshu, so if anything is lacking or wrong, please let me know.
Of course, for our beloved IU we can't settle for crawling only 12 images. The next post will cover how to get more links, so see you there!

