美文网首页Python3之路
python爬虫 -- 有道翻译小工具

python爬虫 -- 有道翻译小工具

作者: faris_shi | 来源:发表于2018-02-24 08:05 被阅读596次

    最近的项目中需要自动翻译一些英文的文章,所以就在简书上找了找有没有有道翻译的小爬虫,结果就找到了一篇文章 破解有道翻译反爬虫机制

    可能是有道那边有调整的原因,结果文章中有些小问题,在此做一下更正。

    浏览器打开 http://fanyi.youdao.com/,右键 -> 检查 -> Network项。在翻译框中输入 beautiful后,我们会发现请求来了。

    我们已经确定了很多信息:

    • URL:http://fanyi.youdao.com/translate_osmartresult=dict&smartresult=rule

    • 请求方法为 POST。

    • 请求参数

    我们再试了些其他值后发现,果真变化的只有那几个参数:

    • i : 我们要翻译的词或者句子

    • salt : 加密用到的盐

    • sign : 签名字符串

    然后跟着作者的思路,来到了哪个迷迷茫茫的JS文件中,终于找到:

    嗯,没有问题,和作者说的没有差错。

    只是秘钥变了 ebSeFb%=XZ%T[KZ)c(sy!。但是比较好奇,有道是如何更换秘钥的,策略又是什么?

    先用作者文章中的代码,更换秘钥,直接跑起来,BUT,BUT,BUT没有跑通:

    {'errorCode': 50}
    

    静下心来,好好想想。可能是没有加 header,有道应该会有防爬虫的简单机制。再试:

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
        "Accept":"application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Content-Type":"application/x-www-form-urlencoded; charset=UTF-8",
        "Cookie":"_ntes_nnid=c686062b6d8c9e3f11e2a8413b5bb9a8,1517022642199; OUTFOX_SEARCH_USER_ID_NCOO=1367486017.479911; OUTFOX_SEARCH_USER_ID=722357816@10.168.11.24; DICT_UGC=be3af0da19b5c5e6aa4e17bd8d90b28a|; JSESSIONID=abcCzqE6R9jTv5rTtoWgw; fanyi-ad-id=40789; fanyi-ad-closed=1; ___rl__test__cookies=1519344925194",
        "Referer":"http//fanyi.youdao.com/",
        "X-Requested-With": "XMLHttpRequest"
    }
    

    执行还是报错:

     Desktop python3 aa.py
    请输入需要翻译的单词:hello
    Traceback (most recent call last):
      File "aa.py", line 94, in <module>
        print(YouDaoFanYi().translate(content))
      File "aa.py", line 77, in translate
        dictResult = json.loads(result)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 349, in loads
        s = s.decode(detect_encoding(s), 'surrogatepass')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
    

    什么情况,把响应body打印出来,更是惊奇:

    b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xabV*)J\xcc+\xceI,I\rJ-.\xcd)Q\xb2\x8a\x8e\xaeV*I\x072\x94\x9e\xec]\xf0t\xe9^%\x1d\xa5\xe2\xa2d 7#5''_\xa966VG)\xb5\xa8(\xbf\xc89?%U\xc9\xca@G\xa9\xa4\xb2\x00\xc8PJ\xcd3\xaa\xca\xd0u\xf6\x08\x06\xe9\xc8M,*\x81\x99X\r\x94*)\xcaL-\x06\x1a\xae\x04\x94\xcc\xd3Sx\xb1p\xc5\xf3%\xbb^N_\xf7\xb4a\xe6\xfb==\n\xcf\x9a\xbb\x9e.m\x7f\xd61\xed\xe9\x94%/\xb6n\x7f\xb6y\xc5\xb3\x96\xfeg\xd3\xb7=\x9f\xd5\xf2|\xca\x8a\x17\xeb\xd7\xc6\x14\xc5\xe4\x01\xb5f\xe6\x95\xe8)<\x9d\xd6\xf4~\xcf\xec\xa7\x93;\x9e\xef\x9d\x0e\x15\x07\x1a\xa9\xe1\x01r\x9f\xe6\x93]\xbb\x9eN\xe8\x05\xcak<\xdb<U\xf3\xe9\xfc\xe6g[f\x83\x15\x01\x9d\rq\xa8am-\x00\x9f\x1b\xb6\x04\xf7\x00\x00\x00"
    

    继续检查,继续找原因,突然我看到了一个令我觉悟的信息:

    嗯,这个应该是问题的关键。python3urllib应该没有实现这个机制。

    def readData(resp):
        info = resp.info()
        encoding = info.get('Content-Encoding')
        transferEncoding = info.get('Transfer-Encoding:')
        if transferEncoding != 'chunked' and encoding != 'gzip':
            return resp.read()
        str = ""
        while True:
            chunk = resp.read(4096)
            if not chunk: break
            decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            data = decomp.decompress(chunk)
            str = str + data.decode("utf-8")
        return str
    

    嗯,应该没问题了,再试一下:

    Desktop python3 aa.py
    请输入需要翻译的单词:beautiful
    ['美丽的']
    

    完成,没有问题了。

    最后给出完整的代码:

    # -*- coding: utf-8 -*-
    
    
    import urllib.request
    import urllib.parse
    import json
    import zlib
    
    import time
    import random
    import hashlib
    
    
    class YouDaoFanYi:
    
        def __init__(self):
            self.url = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
    
            self.headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
                "Accept":"application/json, text/javascript, */*; q=0.01",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
                "Content-Type":"application/x-www-form-urlencoded; charset=UTF-8",
                "Cookie":"_ntes_nnid=c686062b6d8c9e3f11e2a8413b5bb9a8,1517022642199; OUTFOX_SEARCH_USER_ID_NCOO=1367486017.479911; OUTFOX_SEARCH_USER_ID=722357816@10.168.11.24; DICT_UGC=be3af0da19b5c5e6aa4e17bd8d90b28a|; JSESSIONID=abcCzqE6R9jTv5rTtoWgw; fanyi-ad-id=40789; fanyi-ad-closed=1; ___rl__test__cookies=1519344925194",
                "Referer":"http//fanyi.youdao.com/",
                "X-Requested-With": "XMLHttpRequest"
            }
    
            self.data = {
                "from":"AUTO",
                "to": "AUTO",
                "smartresult": "dict",
                "client": "fanyideskweb",
                "doctype": "json",
                "version": "2.1",
                "keyfrom": "fanyi.web",
                "action": "FY_BY_REALTIME",
                "typoResult": "false"
            }
    
            self.client = 'fanyideskweb'
            self.secretKey = 'ebSeFb%=XZ%T[KZ)c(sy!'
    
        @staticmethod
        def readData(resp):
            info = resp.info()
            encoding = info.get('Content-Encoding')
            transferEncoding = info.get('Transfer-Encoding:')
            if transferEncoding != 'chunked' and encoding != 'gzip':
                return resp.read()
            str = ""
            while True:
                chunk = resp.read(4096)
                if not chunk: break
                decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
                data = decomp.decompress(chunk)
                str = str + data.decode("utf-8")
            return str
    
        def translate(self, content):
            data = dict(self.data)
            salt = str(int(time.time() * 1000) + random.randint(1, 10))
            sign = hashlib.md5((self.client + content + salt + self.secretKey).encode('utf-8')).hexdigest()
    
            data["client"] = self.client
            data["salt"] = salt
            data["sign"] = sign
            data["i"]=content
    
            data = urllib.parse.urlencode(data).encode('utf-8')
            request = urllib.request.Request(url=self.url, data=data, headers=self.headers, method='POST')
            response = urllib.request.urlopen(request)
            #result=response.read()
            #print(result)
            result = YouDaoFanYi.readData(response)
            response.close()
            dictResult = json.loads(result)
    
            paragraphs=[]
            for paragraph in dictResult["translateResult"]:
                line=""
                for a in paragraph:
                    line = line + a["tgt"]
    
                if(len(line) != 0):
                    paragraphs.append(line)
    
            return paragraphs
    
    
    #有道翻译中,是支持翻译多段落文章的,所以调用translate之后,会返回一个数组,数组里的元素就是翻译过后的段落。输入几个段落,输出就有几个段落。
    
    content = input('请输入需要翻译的单词:').replace("\\n", "\n")
    
    print(YouDaoFanYi().translate(content))
    
    

    相关文章

      网友评论

        本文标题:python爬虫 -- 有道翻译小工具

        本文链接:https://www.haomeiwen.com/subject/lpnfxftx.html