Python crawler for Baidu Images: full-size detail images

Author: leoryzhu | Published: 2019-03-19 15:16

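Last time we learned how to crawl Baidu's image list. It was fast, but I was not satisfied: the images in the list are only thumbnails, and the high-resolution image appears only after you click through to the detail page. So I kept digging.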

Implementation steps

Click an image in the list to open its detail page. The request URL of the detail page looks like this:
https://image.baidu.com/search/detail?ct=503316480&z=0&ipn=d&word=%E6%98%8E%E6%98%9F&step_word=&hs=0&pn=2&spn=0&di=6752966330&pi=0&rn=1&tn=baiduimagedetail&is=0%2C0&istype=2&ie=utf-8&oe=utf-8&in=&cl=2&lm=-1&st=-1&cs=371978350%2C138525231&os=3779051497%2C2039068748&simid=0%2C0&adpicid=0&lpn=0&ln=1785&fr=&fmq=1552974833622_R&fm=result&ic=&s=undefined&hd=&latest=&copyright=&se=&sme=&tab=0&width=&height=&face=undefined&ist=&jit=&cg=&bdtype=0&oriquery=&objurl=http%3A%2F%2Fimg.zcool.cn%2Fcommunity%2F015abf5a92a96aa801219231c32adc.jpg%401280w_1l_2o_100sh.jpg&fromurl=ippr_z2C%24qAzdH3FAzdH3Fooo_z%26e3Bzv55s_z%26e3Bv54_z%26e3BvgAzdH3Fo56hAzdH3FZM3YyMTIdMTI%3D_z%26e3Bip4s&gsm=0&rpstart=0&rpnum=0&islist=&querylist=&force=undefined
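It is long and ugly, and it cannot be read from anywhere directly, because it too is generated dynamically. Some of its parameters, however, appear in the list-image data (shown as a screenshot in the original post), so it is enough to fill in the corresponding values. In my tests only a few parameters turned out to be required: the code in this post sends just word, di, tn, cs and os.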
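
As a minimal sketch of "filling in the corresponding values", the snippet below builds a detail-page URL from one list item using only the five parameters that the script in this post sends. The helper name build_detail_url and the sample_item dictionary are hypothetical; the di, cs and os values are copied from the example URL above, and whether Baidu still accepts this reduced parameter set is an assumption.

```python
from urllib.parse import urlencode

# Endpoint taken from the example detail URL in this post.
DETAIL_ENDPOINT = "https://image.baidu.com/search/detail"


def build_detail_url(word, item):
    """Build a detail-page URL from one entry of the list JSON.

    Hypothetical helper: it sends only the parameters the post found necessary.
    """
    params = {
        "word": word,
        "di": item["di"],
        "tn": "baiduimagedetail",
        "cs": item["cs"],
        "os": item["os"],
    }
    # urlencode percent-encodes the keyword and the commas in cs/os.
    return DETAIL_ENDPOINT + "?" + urlencode(params)


# Sample values copied from the long example URL above.
sample_item = {
    "di": "6752966330",
    "cs": "371978350,138525231",
    "os": "3779051497,2039068748",
}
url = build_detail_url("明星", sample_item)
print(url)
```

Passing the same dictionary through the params argument of requests.get, as the full script does, produces an equivalent query string.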

Inspect the image in the browser's developer tools and copy its address.

Open Network -> All. The first entry is the response to the detail-page request.

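Press Ctrl+F and search for the address you just copied. Note that the address may not match exactly; if nothing is found, delete some of its parameters and search again. It turns out that the image address can also be found in the page's JS code.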

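So once the detail-page response has been fetched, the image address can be found either with a regular expression or by parsing the HTML. With that, Baidu's full-size high-resolution images can be downloaded.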
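
The regular-expression route might look like the sketch below. It assumes the full-size image sits in an img tag whose id is hdFirstImgObj, which is the element the script in this post selects with XPath. The function name find_detail_image_url and the sample_html string are made up for illustration, and real detail pages may be laid out differently.

```python
import re

# Matches the whole <img ...> tag that carries id="hdFirstImgObj".
IMG_TAG = re.compile(r'<img\b[^>]*\bid=["\']hdFirstImgObj["\'][^>]*>', re.IGNORECASE)
# Pulls the src attribute value out of that tag.
SRC_ATTR = re.compile(r'\ssrc=["\']([^"\']+)["\']', re.IGNORECASE)


def find_detail_image_url(html):
    """Return the src of the hdFirstImgObj image, or None if it is absent."""
    tag = IMG_TAG.search(html)
    if tag is None:
        return None
    src = SRC_ATTR.search(tag.group(0))
    return src.group(1) if src else None


# Made-up HTML fragment; the image address is the objurl from the example URL.
sample_html = (
    '<div><img id="hdFirstImgObj" '
    'src="http://img.zcool.cn/community/'
    '015abf5a92a96aa801219231c32adc.jpg@1280w_1l_2o_100sh.jpg"></div>'
)
print(find_detail_image_url(sample_html))
```

Parsing the HTML with lxml and XPath, as the script does, is usually less brittle than a regular expression when attribute order or quoting changes.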

Implementation code

import json
import os
import re
import urllib.parse

import requests
from lxml import etree

page_num = 30  # number of results requested per list page
photo_dir = "D:\\data\\pic\\face\\photo"


def getDetailImage(word, word_dir):
    num = 0
    url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={0}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&word={0}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={1}&rn="+str(page_num)+"&gsm=1e&1552975216767="
    # Matches a backslash that is not followed by /, u or ".
    regex = re.compile(r'\\(?![/u"])')
    while num < 50:
        # pn is the offset of the first result on this list page.
        page_url = url.format(urllib.parse.quote(word), num * page_num)
        print(page_url)
        response = requests.get(page_url)

        # The response contains sequences such as \xa0, or a stray backslash
        # as in \http, that are not valid JSON escapes and make json.loads
        # fail. Doubling those backslashes first lets the text parse.
        json_data = json.loads(regex.sub(r"\\\\", response.text))
        for item in json_data['data']:
            try:
                # Only these parameters are needed for the detail page.
                params = {
                    "word": word,
                    "di": item['di'],
                    "tn": "baiduimagedetail",
                    "cs": item['cs'],
                    "os": item['os'],
                }
                detail_url = "http://image.baidu.com/search/detail"
                detail_response = requests.get(detail_url, params=params)
                selector = etree.HTML(detail_response.text)
                # The full-size image is the <img> with id hdFirstImgObj.
                pic_url = selector.xpath("//img[@id='hdFirstImgObj']/@src")[0]
                print(pic_url)
                name = pic_url.split('/')[-1]
                headers = {
                    "Referer": page_url,
                }

                image_response = requests.get(pic_url, headers=headers)
                with open(os.path.join(word_dir, name), 'wb') as f:
                    f.write(image_response.content)
            except Exception as exc:
                # Best effort: report the failed item and move on.
                print("skipped item:", exc)

        num = num + 1


if __name__ == "__main__":
    word = input("Enter a search keyword (a person's name, a place name, etc.): ")
    word_dir = os.path.join(photo_dir, word)
    # Create the target directory, including missing parents.
    os.makedirs(word_dir, exist_ok=True)
    getDetailImage(word, word_dir)
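
The backslash repair used in getDetailImage can be tried in isolation, without any network access. The sketch below applies the same regular expression to a made-up response fragment; the field names in the sample text and the function name repair_json_escapes are illustrative only. The approach is a heuristic: it also doubles backslashes in escapes that were already valid, such as \n, so those come out as a literal backslash followed by a letter.

```python
import json
import re

# Same pattern as in the script: a backslash not followed by /, u or ".
STRAY_BACKSLASH = re.compile(r'\\(?![/u"])')


def repair_json_escapes(text):
    """Double stray backslashes so that json.loads accepts the text."""
    # In a replacement template, r"\\\\" stands for two literal backslashes.
    return STRAY_BACKSLASH.sub(r"\\\\", text)


# Made-up fragment: \x is not a valid JSON escape, \/ is.
raw = r'{"data": [{"fromPageTitle": "a\xa0b", "thumbURL": "http:\/\/example.com\/1.jpg"}, {}]}'

try:
    json.loads(raw)
    parsed_without_repair = True
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError.
    parsed_without_repair = False

items = json.loads(repair_json_escapes(raw))["data"]
print(parsed_without_repair)
print(items[0]["thumbURL"])
```

The trailing empty object in the sample stands for a list entry without di, cs or os; in the script such an entry raises a KeyError inside the try block and is skipped.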