1. Automatic translation with Python

Author: 小小秤 | Published: 2020-11-06 17:48

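    Programmers often have to work from English-language manuals. In my case it was an English manual in .chm format, whose interface is shown in Figure 1. To make the manual easier to consult, I wrote a Python program that translates the document through an online translation service. The implementation is described below: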


    Figure 1. Screenshot of the original English manual

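    1. Decompile the .chm file with hh.exe, which ships with Windows
    Open cmd and run: hh -decompile <destination folder> <source CHM file name>
    The command I used is shown in Figure 2


    Figure 2. Decompiling the .chm file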
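
    The same command can also be issued from Python, which is convenient when several manuals have to be unpacked. The sketch below is only an illustration: the function names are made up for this example, and it assumes a Windows machine where hh.exe can be found on the PATH.

```python
import subprocess
from pathlib import Path


def build_decompile_command(dest_dir, chm_path):
    """Build the argument list for: hh -decompile <dest_dir> <chm_path>."""
    return ["hh.exe", "-decompile", str(dest_dir), str(chm_path)]


def decompile_chm(dest_dir, chm_path):
    """Create dest_dir if needed, then run hh.exe (Windows only)."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(build_decompile_command(dest_dir, chm_path))
    return completed.returncode
```

    With hypothetical paths, decompile_chm(r'D:\manual_out', r'D:\manual.chm') would run hh -decompile D:\manual_out D:\manual.chm.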

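    The decompiled files are shown in Figure 3


    Figure 3. The decompiled files
    2. The decompiled output contains many file types, but only the .htm files need translating, so we collect every .htm file in the decompiled folder. This calls for recursion; the code is as follows: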
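    import os

    htmlFiles = []  # filled in by getHtmlFiles

    # Collect all .htm/.html files, recursing into subfolders
    def getHtmlFiles(dir):
        global htmlFiles
        fileNames = os.listdir(dir)
        # First record the matching files in this folder ...
        for name in fileNames:
            if name.lower().endswith((".htm", ".html")):
                htmlFiles.append(os.path.join(dir, name))
                print(os.path.join(dir, name))

        # ... then descend into each subfolder
        for name in fileNames:
            if os.path.isdir(os.path.join(dir, name)):
                getHtmlFiles(os.path.join(dir, name))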

    The printed file paths are shown in Figure 4


    Figure 4. Printed paths of all the .htm files
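
    As an aside, the standard library can do the directory recursion itself. The following is a minimal sketch using os.walk rather than the hand-written recursion above; the names is_htm_file and find_htm_files are invented for this example, and the result is sorted, so the order may differ from the order in which the original function prints the paths.

```python
import os


def is_htm_file(name):
    """True for names ending in .htm or .html, ignoring case."""
    return name.lower().endswith((".htm", ".html"))


def find_htm_files(root_dir):
    """Return the paths of all .htm/.html files below root_dir, sorted."""
    matches = []
    # os.walk visits root_dir and every subfolder beneath it
    for folder, _subdirs, file_names in os.walk(root_dir):
        for name in file_names:
            if is_htm_file(name):
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```

    Checking the file extension with endswith avoids picking up names such as page.htm.bak, which a plain substring test for ".htm" would also match.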

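    3. With the .htm paths collected, we can read the contents of each file. In practice, after I replaced the English text with the translation, the browser displayed garbled characters, so the language and charset declarations in the .htm header must be deleted first. The code is as follows: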

    # Delete header declarations to fix garbled Chinese text
    def deleteHtmlHead(index, path):
        print('==1_Delete header attributes to fix garbled Chinese Start==')
        print('****'+str(index)+'****'+str(path))

        filedata = ""
        # No encoding is passed to open(), so the platform default is used
        with open(path, "r") as html:
            for line in html:
                # Drop any line carrying a language, charset or doctype declaration
                if ('<meta http-equiv="Content-Language" content="en-us">' in line
                        or 'charset=windows-1252' in line
                        or 'doctype HTML' in line):
                    continue
                filedata += line

        with open(path, "w") as html:
            html.write(filedata)
        print('==1_Delete header attributes to fix garbled Chinese END==')
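
    A different way to avoid garbled output, which is not what the code above does, is to re-save each file as UTF-8 and point the charset declaration at UTF-8 instead of deleting it. The sketch below assumes the decompiled files really are windows-1252, as their meta tags claim; declare_utf8 and convert_to_utf8 are names invented for this example. Any later step that opens the files would then also need encoding="utf-8".

```python
import re

# Matches e.g. charset=windows-1252 or charset="ISO-8859-1"; group 1 keeps
# the "charset=" part together with an optional opening quote.
CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)[\w-]+', re.IGNORECASE)


def declare_utf8(html_text):
    """Rewrite every charset declaration in the markup to utf-8."""
    return CHARSET_RE.sub(r"\g<1>utf-8", html_text)


def convert_to_utf8(path, source_encoding="windows-1252"):
    """Read one file in its source encoding and save it back as UTF-8."""
    with open(path, "r", encoding=source_encoding) as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(declare_utf8(text))
```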
    

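    4. The original .htm file, once opened, looks as shown in Figure 5.


    Figure 5. Original .htm content

    The text could be extracted with BeautifulSoup or htmlparser, but neither gave me satisfactory results, so my code instead takes whatever lies between a '>' and the next '<'. Translation quality really depends on how accurately the English text can be extracted from the .htm file. Many special-case strings need handling in practice, and my handling of them is far from perfect.
    5. Send the extracted English text to the Youdao translation interface
    6. Replace the English in the original file with the Chinese translation to complete the job
    The code for steps 4-6 is as follows: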

    import json
    import re
    import sys

    import requests

    # Get the translation result
    def translator(text):
        """
        input : text  the string to translate
        output: the 'translateResult' list from the response
                (entries carry 'src' and 'tgt'), or None on failure
        """
        # API
        url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'
        # Request parameters; 'i' is the text to translate
        key = {
            'type': "AUTO",
            'i': text,
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "ue": "UTF-8",
            "action": "FY_BY_CLICKBUTTON",
            "typoResult": "true"
        }
        # `key` is the form data posted to the Youdao server
        response = requests.post(url, data=key)
        # Check whether the server responded successfully
        if response.status_code != 200:
            print("Youdao call failed")
            return None
        # A blocked IP receives this Chinese notice instead of JSON
        if '非常抱歉,来自您ip的请求异常频繁,为了保护其他用户的正常访问,只能暂时禁止您目前的访问。' in response.text:
            print("Requests from this IP are too frequent")
            return None
        # Parse the JSON body and return the nested result list
        result = json.loads(response.text)
        return result['translateResult']
    
    def getEnContentAndTrans(index, path):
        print('==2_Split text Start==')
        print('****'+str(index)+'****'+str(path))
        with open(path, "r") as html:
            orgContent = html.read()
        # Only the part before the first "Example (C)" marker is translated
        if 'Example (C)' in orgContent:
            marker = 'Example (C)'
        elif 'Example</span> (C)' in orgContent:
            marker = 'Example</span> (C)'
        else:
            marker = None
        parts = orgContent.split(marker) if marker else [orgContent]
        print('==2_Split text END==')
        print('==3_Extract and translate English text Start==')
        print('****'+str(index)+'****'+str(path))
        dealText = parts[0]
        chTransText = dealText

        findLeft = True   # True while looking for the next '>'
        leftPosition = 0

        for i in range(len(dealText)):
            if dealText[i] == ">" and findLeft:
                leftPosition = i
                findLeft = False
            if dealText[i] == "<" and not findLeft:
                findLeft = True

                enText = dealText[leftPosition+1:i]
                checkText = re.sub(r"\s", "", enText)
                # Ignore &nbsp; entities
                checkNbspText = checkText.replace('&nbsp;', '')
                # Ignore fragments that are only digits or only upper-case letters
                checkNum = checkNbspText.replace('.', '')
                isDigit = checkNum.isdigit()
                isUpcase = checkNum.isupper()

                if len(checkNbspText) > 0 and not isDigit and not isUpcase:
                    transResult = translator(enText)
                    if transResult is None:
                        # Stop with a non-zero exit status, e.g. when the IP is blocked
                        print("Translation failed at file "+str(index))
                        sys.exit(1)
                    ch = ""
                    for paragraph in transResult:
                        for item in paragraph:
                            ch += item['tgt']
                    print(enText+'___'+ch)
                    if 'CAN' in enText or 'Can' in enText or 'can' in enText:
                        # Manual-specific workaround: put the term "CAN" in front of the translation
                        ch = 'CAN'+ch
                    elif '手动RC系列30' in ch:
                        # Manual-specific workaround for a mistranslated title
                        ch = ch.replace('手动RC系列30', 'RC30系列手册', 1)
                    chTransText = chTransText.replace(enText, ch, 1)
                elif len(checkText) > 0:
                    # Log the fragments that were skipped
                    print('%%%%%%'+enText)
        print('==3_Extract and translate English text END==')
        print('==4_Save translated file Start==')
        print('****'+str(index)+'****'+str(path))
        with open(path, "w") as html:
            chTransText = chTransText.replace('。', '.')
            html.write(chTransText)
            # Write back the untranslated example sections, restoring the marker
            for part in parts[1:]:
                html.write(marker)
                html.write(part)
        print('==4_Save translated file END==')
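
    To make the extraction rule in step 4 easier to see on its own, here is a compact sketch of the same idea: take the text between a '>' and the next '<', then skip fragments that are blank, purely numeric or entirely upper-case. It uses a regular expression in place of the character loop, and the names needs_translation and extract_fragments are invented for this illustration.

```python
import re

# A '>' followed by at least one non-'<' character, up to the next '<'
TEXT_BETWEEN_TAGS = re.compile(r">([^<]+)<")


def needs_translation(fragment):
    """False for blank text, plain numbers and all-upper-case identifiers."""
    compact = re.sub(r"\s", "", fragment).replace("&nbsp;", "")
    if not compact:
        return False
    probe = compact.replace(".", "")
    return not (probe.isdigit() or probe.isupper())


def extract_fragments(html_text):
    """Return the text fragments that would be sent to the translator."""
    return [text for text in TEXT_BETWEEN_TAGS.findall(html_text)
            if needs_translation(text)]
```

    Like the loop above, this treats the markup as plain characters, so text inside script blocks or comments would be picked up as well.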
    

    The translated result is shown in Figure 6


    Figure 6. Translation result
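
    The code above treats translateResult as a list of lists whose entries carry 'src' and 'tgt' fields. A small helper for joining the translated pieces might look like the sketch below; join_translation is a name invented here, the sample data in the usage note is made up, and skipping empty 'tgt' values is a defensive assumption rather than documented behaviour of the service.

```python
def join_translation(translate_result):
    """Concatenate every non-empty 'tgt' field of a nested result list."""
    return "".join(
        sentence["tgt"]
        for paragraph in translate_result
        for sentence in paragraph
        if sentence.get("tgt")
    )
```

    For example, join_translation([[{"src": "Hello.", "tgt": "你好。"}]]) gives "你好。".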

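    7. Use a .chm builder to package all the translated files back into a .chm file
    8. Some problems. The Youdao interface can only be called 1000 times per hour. My workaround was to connect the computer to a phone hotspot; once Youdao blocked the IP, I removed and reinserted the SIM card, which gave me a different IP address. The complete code (including every method written during debugging, useful or not) is as follows: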

    
    
    import json
    import requests
    import os
    from bs4 import BeautifulSoup
    import re
    import sys
    
    
    
    isFirst = True
    htmlFiles=[]
    
    # Collect all .htm/.html files, recursing into subfolders
    def getHtmlFiles(dir):
        global htmlFiles
        fileNames = os.listdir(dir)
        # First record the matching files in this folder ...
        for name in fileNames:
            if name.lower().endswith((".htm", ".html")):
                htmlFiles.append(os.path.join(dir, name))
                print(os.path.join(dir, name))

        # ... then descend into each subfolder
        for name in fileNames:
            if os.path.isdir(os.path.join(dir, name)):
                getHtmlFiles(os.path.join(dir, name))
    
    
    # Delete header declarations to fix garbled Chinese text
    def deleteHtmlHead(index, path):
        print('==1_Delete header attributes to fix garbled Chinese Start==')
        print('****'+str(index)+'****'+str(path))

        filedata = ""
        # No encoding is passed to open(), so the platform default is used
        with open(path, "r") as html:
            for line in html:
                # Drop any line carrying a language, charset or doctype declaration
                if ('<meta http-equiv="Content-Language" content="en-us">' in line
                        or 'charset=windows-1252' in line
                        or 'doctype HTML' in line):
                    continue
                filedata += line

        with open(path, "w") as html:
            html.write(filedata)
        print('==1_Delete header attributes to fix garbled Chinese END==')
    
    # Extract the text from the HTML (earlier BeautifulSoup attempt; its call in main is disabled)
    def handHtmlContent(index, path):
        print('========================')
        print('==2_Process HTML text Start==')
        print('========================')
        print(str(index)+'___'+str(path))
        with open(path, "r") as html:
            soup = BeautifulSoup(html.read(), "lxml")
        # soup.text is the document text with all tags removed
        org = str(soup.text)
        print('========================')
        print('==2_Process HTML text End==')
        print('========================')
        return org
    
    
    # Get the translation result
    def translator(text):
        """
        input : text  the string to translate
        output: the 'translateResult' list from the response
                (entries carry 'src' and 'tgt'), or None on failure
        """
        # API
        url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'
        # Request parameters; 'i' is the text to translate
        key = {
            'type': "AUTO",
            'i': text,
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "ue": "UTF-8",
            "action": "FY_BY_CLICKBUTTON",
            "typoResult": "true"
        }
        # `key` is the form data posted to the Youdao server
        response = requests.post(url, data=key)
        # Check whether the server responded successfully
        if response.status_code != 200:
            print("Youdao call failed")
            return None
        # A blocked IP receives this Chinese notice instead of JSON
        if '非常抱歉,来自您ip的请求异常频繁,为了保护其他用户的正常访问,只能暂时禁止您目前的访问。' in response.text:
            print("Requests from this IP are too frequent")
            return None
        # Parse the JSON body and return the nested result list
        result = json.loads(response.text)
        return result['translateResult']

    # Youdao translation function: done
    
    # Write the translated text back into the source file (earlier approach; its call in main is disabled)
    def updateHtmlContent(index, path, transResult):
        print('=========================')
        print('==4_Replace text with Chinese Start==')
        print('=========================')
        print(str(index) + '___' + str(path))

        with open(path, "r") as html:
            updateContent = html.read()
        for paragraph in transResult:
            item = paragraph[0]
            en = item['src']
            ch = item['tgt']
            # Skip empty results and fragments containing non-breaking spaces
            if ch is None or len(en) <= 0 or '\xa0' in en:
                continue
            # Replace only the first occurrence of the English sentence
            updateContent = updateContent.replace(en, ch, 1)

        with open(path, "w") as html:
            html.write(updateContent)
        print('=========================')
        print('==4_Replace text with Chinese END==')
        print('=========================')
    
    #def checkOnlyNBSP(text):
    
    def getEnContentAndTrans(index, path):
        print('==2_Split text Start==')
        print('****'+str(index)+'****'+str(path))
        with open(path, "r") as html:
            orgContent = html.read()
        # Only the part before the first "Example (C)" marker is translated
        if 'Example (C)' in orgContent:
            marker = 'Example (C)'
        elif 'Example</span> (C)' in orgContent:
            marker = 'Example</span> (C)'
        else:
            marker = None
        parts = orgContent.split(marker) if marker else [orgContent]
        print('==2_Split text END==')
        print('==3_Extract and translate English text Start==')
        print('****'+str(index)+'****'+str(path))
        dealText = parts[0]
        chTransText = dealText

        findLeft = True   # True while looking for the next '>'
        leftPosition = 0

        for i in range(len(dealText)):
            if dealText[i] == ">" and findLeft:
                leftPosition = i
                findLeft = False
            if dealText[i] == "<" and not findLeft:
                findLeft = True

                enText = dealText[leftPosition+1:i]
                checkText = re.sub(r"\s", "", enText)
                # Ignore &nbsp; entities
                checkNbspText = checkText.replace('&nbsp;', '')
                # Ignore fragments that are only digits or only upper-case letters
                checkNum = checkNbspText.replace('.', '')
                isDigit = checkNum.isdigit()
                isUpcase = checkNum.isupper()

                if len(checkNbspText) > 0 and not isDigit and not isUpcase:
                    transResult = translator(enText)
                    if transResult is None:
                        # Stop with a non-zero exit status, e.g. when the IP is blocked
                        print("Translation failed at file "+str(index))
                        sys.exit(1)
                    ch = ""
                    for paragraph in transResult:
                        for item in paragraph:
                            ch += item['tgt']
                    print(enText+'___'+ch)
                    if 'CAN' in enText or 'Can' in enText or 'can' in enText:
                        # Manual-specific workaround: put the term "CAN" in front of the translation
                        ch = 'CAN'+ch
                    elif '手动RC系列30' in ch:
                        # Manual-specific workaround for a mistranslated title
                        ch = ch.replace('手动RC系列30', 'RC30系列手册', 1)
                    chTransText = chTransText.replace(enText, ch, 1)
                elif len(checkText) > 0:
                    # Log the fragments that were skipped
                    print('%%%%%%'+enText)
        print('==3_Extract and translate English text END==')
        print('==4_Save translated file Start==')
        print('****'+str(index)+'****'+str(path))
        with open(path, "w") as html:
            chTransText = chTransText.replace('。', '.')
            html.write(chTransText)
            # Write back the untranslated example sections, restoring the marker
            for part in parts[1:]:
                html.write(marker)
                html.write(part)
        print('==4_Save translated file END==')
    
    def main():
        # Raw string so the backslashes in the Windows path are not read as escapes
        rootDir = r'D:\用户手册\RC30_Manual'
        print('==0_Collect all HTML files Start==')
        global isFirst
        if isFirst:
            getHtmlFiles(rootDir)
            isFirst = False
        print('==0_Collect all HTML files END===')

        # The start index 207 is hard-coded, presumably to resume an
        # interrupted run; use 0 to process every file
        for i in range(207, len(htmlFiles)):
            deleteHtmlHead(i, str(htmlFiles[i]))
            getEnContentAndTrans(i, str(htmlFiles[i]))

            # Earlier approach, kept for reference but disabled:
            # content = handHtmlContent(i, htmlFiles[i])
            # transResult = translator(content)
            # print(transResult)
            # updateHtmlContent(i, htmlFiles[i], transResult)


    if __name__ == '__main__':
        main()
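
    For the call limit described in step 8, a gentler option than changing IP addresses would be to cache repeated fragments and to pause once the hourly budget is used up. The following is a rough sketch under stated assumptions: the 1000-calls-per-hour figure is taken from my own observation above, ThrottledTranslator is a name invented for this example, and the wrapped function is assumed to return None on failure, as translator does.

```python
import time


class ThrottledTranslator:
    """Wrap a translate function with a result cache and a call budget."""

    def __init__(self, translate, max_calls=1000, period=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._translate = translate
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = []  # times of the calls made inside the current window
        self._cache = {}   # source text -> successful result

    def _prune(self, now):
        """Forget calls that are older than one period."""
        self._stamps = [t for t in self._stamps if now - t < self._period]

    def __call__(self, text):
        # Repeated fragments are answered from the cache without a request
        if text in self._cache:
            return self._cache[text]
        now = self._clock()
        self._prune(now)
        if len(self._stamps) >= self._max_calls:
            # Budget used up: wait until the oldest call leaves the window
            self._sleep(self._period - (now - self._stamps[0]))
            now = self._clock()
            self._prune(now)
        self._stamps.append(now)
        result = self._translate(text)
        if result is not None:
            self._cache[text] = result
        return result
```

    In the script above this could be used as throttled = ThrottledTranslator(translator), with getEnContentAndTrans calling throttled(enText) in place of translator(enText).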
    
    

    Article link: https://www.haomeiwen.com/subject/jjotbktx.html