Parsing the JS loading of Jandan image URLs

Author: ever_hu | Published: 2017-11-18 19:33 | Views: 370

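    When I first started learning web scraping I got a rough idea of how scrapy works, but I never used it at work afterwards and forgot most of it. Recently I wanted to relearn scrapy and decided to practice by scraping the picture board on Jandan (jandan.net/ooxx). When I actually tried it, though, the response contained no image links: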

    <li id="comment-3617895">
      <div>
        <div class="row">
          <div class="author">
            <strong title="防伪码:8c7a19c8025844512774c2c6103cae3c0e9d9b5f" class="">进击的肥喵</strong>
            <br>
            <small>
              <a href="#footer" title="@回复" onclick="document.getElementById('comment').value += &#39;@&lt;a href=&quot;//jandan.net/ooxx/page-310#comment-3617895&quot;&gt;进击的肥喵&lt;/a&gt;: &#39;">@4 hours ago</a>
    </span></small></div>
    <div class="text">
      <span class="righttext">
        <a href="//jandan.net/ooxx/page-310#comment-3617895">3617895</a>
      </span>
      <p>
        <img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)"/>
        <span class="img-hash">6fadiP6jpEOinbyOjDMf5F1MT01mhMHpB0oC562st3bqZwhPR+OhO+YvbyrNqKyKMmBNGSDh7Gk0I+B+zcKmrgCm3n1M0bXlNjhOjdDps9/hCO039Uo2+w</span>
      </p>
    </div>
    <div class="jandan-vote">
      <span class="tucao-like-container">
        <a title="圈圈/支持" href="javascript:;" class="comment-like like" data-id="3617895" data-type="pos">OO</a>
        [<span>45</span>
        ]
                                
      </span>
      <span class="tucao-unlike-container">
        <a title="叉叉/反对" href="javascript:;" class="comment-unlike unlike" data-id="3617895" data-type="neg">XX</a>
        [<span>3</span>
        ]
    
                                <a href="javascript:;" class="tucao-btn" data-id="3617895">吐槽 [0] </a>
      </span>
    </div>
    </div></div></li>
    

    However, when we open the browser's developer tools, the image URL is there:

    main_page.png

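    At first I assumed the images were loaded asynchronously, but the network panel showed no follow-up requests. That made me wonder whether the image address was already in the returned page and was only decoded by JS afterwards. Going back to the returned page, I found something important, the img-hash: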

    <p>
        <img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)"/>
        <span class="img-hash">6fadiP6jpEOinbyOjDMf5F1MT01mhMHpB0oC562st3bqZwhPR+OhO+YvbyrNqKyKMmBNGSDh7Gk0I+B+zcKmrgCm3n1M0bXlNjhOjdDps9/hCO039Uo2+w</span>
      </p>
    

    The img element starts out as a blank image, and its onload handler calls jandan_load_img to load the real picture. The img-hash span very likely holds the image URL.
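As a minimal sketch of that first step, the img-hash values can be pulled out of the returned HTML before any decoding is attempted. The class and helper names below (ImgHashParser, extract_img_hashes) are hypothetical and only for illustration; the sample markup and hash are the ones shown above. It uses only the standard library, whereas the full script later in the article uses lxml for the same job.

```python
from html.parser import HTMLParser

# The img-hash value shown in the page excerpt above.
HASH = ("6fadiP6jpEOinbyOjDMf5F1MT01mhMHpB0oC562st3bqZwhPR+OhO+YvbyrNqKyKMmBN"
        "GSDh7Gk0I+B+zcKmrgCm3n1M0bXlNjhOjdDps9/hCO039Uo2+w")


class ImgHashParser(HTMLParser):
    """Collect the text of every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "img-hash" in classes:
            self._in_hash = True
            self.hashes.append("")

    def handle_data(self, data):
        if self._in_hash:
            self.hashes[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_hash = False


def extract_img_hashes(html):
    """Return the stripped text of all img-hash spans in `html`."""
    parser = ImgHashParser()
    parser.feed(html)
    return [h.strip() for h in parser.hashes]


sample = (
    '<p>\n'
    '  <img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)"/>\n'
    '  <span class="img-hash">' + HASH + '</span>\n'
    '</p>'
)
print(extract_img_hashes(sample))
```

Each returned string is still encoded; it only becomes a URL after the decoding step discussed next.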

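    A global search for jandan_load_img shows that it is defined in a JS file: http://cdn.jandan.net/static/min/xxxxxxxxxxxxxxxxxxxxxxx.js (the file name may differ in practice, but the path is the same). We set a breakpoint on this function: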

    jandan_load_img.png

    Here e is the img-hash, and the value returned by passing it to S45fAAhlWwSoItVgdyMFW4jIPId52kxV is the image URL. Let's look at that function:

    function S45fAAhlWwSoItVgdyMFW4jIPId52kxV(n, k, x, f) {
      var k = k ? k : "DECODE";
      var x = x ? x : "";
      var f = f ? f : 0;
      var g = 4;
      x = md5(x);
      var w = md5(x.substr(0, 16));
      var u = md5(x.substr(16, 16));
      if (g) {
        if (k == "DECODE") {
          var t = n.substr(0, g)
        } else {
          var b = md5(microtime());
          var d = b.length - g;
          var t = b.substr(d, g)
        }
      } else {
        var t = ""
      }
      var r = w + md5(w + t);
      var m;
      if (k == "DECODE") {
        n = n.substr(g);
        m = base64_decode(n)
      } else {
        f = f ? f + time() : 0;
        tmpstr = f.toString();
        if (tmpstr.length >= 10) {
          n = tmpstr.substr(0, 10) + md5(n + u).substr(0, 16) + n
        } else {
          var e = 10 - tmpstr.length;
          for (var p = 0; p < e; p++) {
            tmpstr = "0" + tmpstr
          }
          n = tmpstr + md5(n + u).substr(0, 16) + n
        }
        m = n
      }
      var h = new Array(256);
      for (var p = 0; p < 256; p++) {
        h[p] = p
      }
      var q = new Array();
      for (var p = 0; p < 256; p++) {
        q[p] = r.charCodeAt(p % r.length)
      }
      for (var o = p = 0; p < 256; p++) {
        o = (o + h[p] + q[p]) % 256;
        tmp = h[p];
        h[p] = h[o];
        h[o] = tmp
      }
      var l = "";
      m = m.split("");
      for (var v = o = p = 0; p < m.length; p++) {
        v = (v + 1) % 256;
        o = (o + h[v]) % 256;
        tmp = h[v];
        h[v] = h[o];
        h[o] = tmp;
        l += chr(ord(m[p]) ^ (h[(h[v] + h[o]) % 256]))
      }
      if (k == "DECODE") {
        if ((l.substr(0, 10) == 0 || l.substr(0, 10) - time() > 0) && l.substr(10, 16) == md5(l.substr(26) + u).substr(0, 16)) {
          l = l.substr(26)
        } else {
          l = ""
        }
      } else {
        l = base64_encode(l);
        var c = new RegExp("=","g");
        l = l.replace(c, "");
        l = t + l
      }
      return l
    }
    

    Besides decoding the hash into a URL (DECODE), this function can apparently also encrypt (ENCODE). I only need decryption, so the rewrite only has to implement the DECODE branch.
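The two 256-iteration loops and the XOR loop in the JS function appear to match the standard RC4 stream cipher: h is the state table, q holds the key bytes, and v and o are the two indices. The following is a sketch of that core on its own, not the site's code; the rc4 name, the key, and the payload are made up for illustration. Because RC4 is a plain XOR with a keystream, applying it twice with the same key returns the original input, which is why one routine can serve both ENCODE and DECODE.

```python
def rc4(key, data):
    """Apply the RC4 keystream derived from `key` to `data` (both bytes)."""
    # Key scheduling: permute 0..255 under control of the key bytes.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Keystream generation: XOR each input byte with one keystream byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


key = b"example-key"             # hypothetical key, for illustration only
plain = b"//example.com/a.jpg"   # hypothetical payload
cipher = rc4(key, plain)
print(rc4(key, cipher) == plain)
```

In the site's function the key is not used directly: it is built from MD5 digests of x and of the first four characters of the hash, and the input is base64-decoded first.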

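    The Python implementation is as follows:

    # Because the function's x argument changes over time, the original code here was not general; see the implementation below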
    

    Update

    The x argument of decrypt() has changed:

    image.png

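    When I first tested, x had the same value in two fresh sessions, so I assumed it was constant. It now appears to be updated periodically, so its value has to be extracted from the JS with _pat_x = re.compile(r'f\.remove\(\);var c=.+?\(e,"(.+?)"\)'). The JS URL can be parsed from the HTML.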
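To show what that pattern captures, here is a small sketch that runs it against a made-up one-line JS fragment. The function name is the one from the article, but the fragment itself and the key "HYPOTHETICAL_KEY" are assumptions for illustration, not real site data; extract_x is a hypothetical helper that returns None instead of raising when the pattern is not found.

```python
import re

# Same pattern as in the article, written as a raw string.
PAT_X = re.compile(r'f\.remove\(\);var c=.+?\(e,"(.+?)"\)')


def extract_x(js_text):
    """Return the second argument of the decode call, or None if absent."""
    match = PAT_X.search(js_text)
    return match.group(1) if match else None


# Made-up fragment shaped like the call site the pattern targets.
sample_js = 'f.remove();var c=S45fAAhlWwSoItVgdyMFW4jIPId52kxV(e,"HYPOTHETICAL_KEY");'

print(extract_x(sample_js))
print(extract_x("var unrelated = 1;"))
```

Returning None makes it easier to notice when the site's minified JS changes shape and the pattern stops matching, rather than failing with an AttributeError on .group(1).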

    js.png
    # python2 & python3
    
    import base64
    import re
    import requests
    import sys
    
    from hashlib import md5
    from lxml import etree
    
    # Raw string, so the regex backslashes are not treated as string escapes.
    _pat_x = re.compile(r'f\.remove\(\);var c=.+?\(e,"(.+?)"\)')
    
    if sys.version_info[0] == 2:  # python2
        to_int = ord
    else:  # python3
        to_int = int
    
    page_headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'Host': 'jandan.net',
        'Upgrade-Insecure-Requests': '1',
        'Referer': 'http://jandan.net',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    }
    
    js_headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'Host': 'cdn.jandan.net',
        'Referer': 'http://jandan.net',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    }
    
    
    def decrypt(n, x):
        """
        :param n: img-hash
        :param x: x from js
        :return:
        """
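        g = 4  # length of the prefix of n that is mixed into the key
        # Derive the key material from x, as the JS does.
        x = md5(x.encode('utf8')).hexdigest()
        w = md5(x[:16].encode('utf8')).hexdigest()
        u = md5(x[16:].encode('utf8')).hexdigest()  # checksum key; unused because validation is skipped
    
        t = n[:g]
        r = w + md5((w + t).encode('utf8')).hexdigest()  # 64-character cipher key
    
        n = n[g:]
        # The hash has no '=' padding, so restore only as much as is missing.
        m = base64.b64decode(n + (-len(n) % 4) * '=')
    
        # Key scheduling over the 256-entry state table.
        h = list(range(256))
        q = [ord(r[i % 64]) for i in range(256)]
        o = 0
        for p in range(256):
            o = (o + h[p] + q[p]) & 0xFF
            h[p], h[o] = h[o], h[p]
    
        # XOR every decoded byte with the generated keystream.
        l = ''
        v = 0
        o = 0
        for p in m:
            v = (v + 1) & 0xFF
            o = (o + h[v]) & 0xFF
            h[v], h[o] = h[o], h[v]
            l += chr(to_int(p) ^ (h[(h[v]+h[o]) & 0xFF]))
        # Drop the 10-character expiry and 16-character checksum header.
        l = l[26:]
        if not l.startswith('http:'):
            l = 'http:' + l
        return l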
    
    
    def get_x(js_url):
        """
        :param js_url: js_url from page
        :return:
        """
        # headers must be passed by keyword; the second positional argument is params.
        js = requests.get(js_url, headers=js_headers)
        x = _pat_x.search(js.text).group(1)
        return x
    
    
    def request_url(url):
        resp = requests.get(url, headers=page_headers)
        doc = etree.HTML(resp.content)
        js_url = doc.xpath('//script[contains(@src, "cdn.jandan.net/static/min")]/@src')[0]
        if not js_url.startswith('http:'):
            js_url = 'http:' + js_url
        x = get_x(js_url)
    
        hash_images = doc.xpath('//*[@class="img-hash"]/text()')
        image_urls = []
        for item in hash_images:
            url = decrypt(item, x)
            image_urls.append(url)
        return image_urls
    
    
    if __name__ == '__main__':
        image_urls = request_url('http://jandan.net/ooxx')
        for url in image_urls:
            print(url)
    
    image.png
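The Python decrypt() above simply drops the first 26 characters, while the JS DECODE branch first checks them: characters 0 to 9 are an expiry timestamp (all zeros meaning no expiry) and characters 10 to 25 are the first 16 hex digits of md5(payload + u). The sketch below shows that check in isolation. The names validate and md5_hex, the input "hypothetical", and the example payload are assumptions for illustration, and it assumes the decoded text is plain ASCII.

```python
import time
from hashlib import md5


def md5_hex(text):
    """Hex MD5 digest of a text string."""
    return md5(text.encode("utf8")).hexdigest()


def validate(decoded, u, now=None):
    """Return the payload if the header checks pass, else an empty string."""
    now = time.time() if now is None else now
    expiry, checksum, payload = decoded[:10], decoded[10:26], decoded[26:]
    if not expiry.isdigit():
        return ""
    # An all-zero expiry means "never expires"; otherwise it must be in the future.
    not_expired = int(expiry) == 0 or int(expiry) > now
    # The checksum ties the payload to the key-derived value u.
    intact = checksum == md5_hex(payload + u)[:16]
    return payload if (not_expired and intact) else ""


u = md5_hex("hypothetical")              # stands in for the u derived from x
payload = "//example.com/a.jpg"          # hypothetical decoded URL
good = "0000000000" + md5_hex(payload + u)[:16] + payload
bad = good[:-1] + "X"                    # corrupt the last character

print(validate(good, u))
print(repr(validate(bad, u)))
```

A check like this would turn a wrong x or a changed algorithm into an empty result instead of the garbled output some readers report below.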

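      Reader comments

      • 不爱笑的男孩: OP, I got an error: http:?弌蜋姜畍w⑧踉{彇?W尨櫷'觞?Traceback (most recent call last):
        File "decodeUrl.py", line 103, in <module>
        print(url)
        IOError: [Errno 34] Result too large
        ever_hu: Jandan has since changed its decoding method; now you can simply base64-decode the img_hash value.
      • WritingHere: OP, this function returns garbled text for me. I'm on py3.
      • 482f06555e60: Where is the global search in Chrome?
        Could you describe the JS debugging steps in detail? I struggled with it for a long time and still couldn't figure it out. Thanks.
        ever_hu: The Chrome global search shortcut is `Ctrl+Shift+F` (Windows) or `Command+Option+F` (Mac). Debugging is hard to explain in a few sentences. Roughly: find the JS function, determine its inputs, and rewrite it from the function body (or run it in a JS environment). For more complex cases you need to set breakpoints and step through; if the function calls other functions, determine their inputs as well and rewrite them too.
      • MR_Fack: What is the principle behind this decryption?
      • TinXie: I'd like to know too~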
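Following ever_hu's reply above that the site later needed only a base64 decode of the img_hash value, a minimal sketch of that simpler path might look like this. The decode_img_hash name and the sample value are hypothetical; the sample is generated in the snippet from a made-up URL rather than taken from the site, and the padding step assumes the hash may arrive without its trailing '=' characters.

```python
import base64


def decode_img_hash(img_hash):
    """Base64-decode an img_hash value, restoring any stripped '=' padding."""
    padded = img_hash + "=" * (-len(img_hash) % 4)
    return base64.b64decode(padded).decode("utf8")


# Build a sample hash from a made-up URL, with the padding stripped.
sample_hash = base64.b64encode(b"//example.com/a.jpg").decode("ascii").rstrip("=")

print(decode_img_hash(sample_hash))
```

If the output is unreadable, the site has most likely changed its scheme again and the JS needs to be inspected afresh.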

      Article link: https://www.haomeiwen.com/subject/ercyvxtx.html