38 Python: Batch-Translating English Words

Author: Viterbi | Published: 2022-12-09 10:55

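    Purpose: given a batch of English text, build an English-Chinese vocabulary list and export it as an Excel file.

    This code does the following:

    1. Takes the URL of an English article and downloads the page automatically;
    2. Translates all the English words on the page;
    3. Writes the translations to an Excel file.

    Techniques involved:

    1. pandas: reading CSV, merging DataFrames, writing Excel
    2. The requests library for downloading HTML pages
    3. BeautifulSoup for parsing HTML
    4. Python regular expressions for English tokenization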

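    1. Read the English-Chinese dictionary file

    The dictionary file comes from https://github.com/skywind3000/ECDICT. To use it:

    1. Download the code archive: https://github.com/skywind3000/ECDICT/archive/master.zip
    2. Unzip master.zip, then extract the stardict.csv file it contains

    import pandas as pd
    
    # Note: replace the stardict.csv path with the location of your own copy
    df_dict = pd.read_csv("D:/tmp/ECDICT-master/stardict.csv")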
    
    
        d:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (11) have mixed types.Specify dtype option on import or set low_memory=False.
          interactivity=interactivity, compiler=compiler, result=result)
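
The DtypeWarning above appears because pandas infers column types chunk by chunk and column 11 ends up with mixed types. Since only `word` and `translation` are used later, one way to avoid both the warning and the memory cost is to load just those two columns as strings. A minimal sketch, assuming the CSV header names match the columns shown in the sample output; the `load_dictionary` helper and the tiny in-memory CSV are made up for illustration:

```python
import io

import pandas as pd


def load_dictionary(csv_source):
    """Load only the word and translation columns as strings.

    csv_source may be a file path or any file-like object.
    """
    return pd.read_csv(
        csv_source,
        usecols=["word", "translation"],  # ignore every other column
        dtype=str,                        # no type inference, so no DtypeWarning
        keep_default_na=False,            # keep words such as "nan" or "null" as text
    )


# Tiny stand-in for stardict.csv (hypothetical rows, not real dictionary data)
sample_csv = io.StringIO(
    "word,phonetic,translation,frq\n"
    "apple,,n. 苹果,100\n"
    "nan,,n. 奶奶,200\n"
)
df_small = load_dictionary(sample_csv)
print(df_small)
```

`keep_default_na=False` matters for this use case: with the default settings pandas would read the dictionary word `nan` as a missing value, and that word could then never match during the merge.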
    
    df_dict.shape
    
    
    
        (3402564, 13)
    
    
    
    df_dict.sample(10).head()
    
             word               phonetic    definition  translation                                              pos  collins  oxford  tag  bnc  frq  exchange  detail  audio
    3370509  WWDH               NaN         NaN         [网络] 淇楄壋                                            NaN  NaN      NaN     NaN  NaN  NaN  NaN       NaN     NaN
    518014   chauhtan (chotan)  NaN         NaN         卓丹                                                     NaN  NaN      NaN     NaN  NaN  NaN  NaN       NaN     NaN
    389953   breviarist         NaN         NaN         [网络] 短笛师                                            NaN  NaN      NaN     NaN  NaN  NaN  NaN       NaN     NaN
    951231   electric-vehicle   NaN         NaN         abbr. “EV”的变体;“electric car”的变体\n[网络] 电动汽车  NaN  NaN      NaN     NaN  NaN  NaN  NaN       NaN     NaN
    91258    Albionian          æl'biәniәn  NaN         [地质]阿尔比翁期                                         NaN  NaN      NaN     NaN  0.0  0.0  NaN       NaN     NaN

    # Drop every column except word and translation
    df_dict = df_dict[["word", "translation"]]
    df_dict.head()
    
    
    <div>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>word</th>
          <th>translation</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>'a</td>
          <td>na. 一\nn. 英文字母表的第一字母;【乐】A音\nart. 冠以不定冠词主要表示类别\...</td>
        </tr>
        <tr>
          <th>1</th>
          <td>'A' game</td>
          <td>[网络] 游戏;一个游戏;一局</td>
        </tr>
        <tr>
          <th>2</th>
          <td>'Abbāsīyah</td>
          <td>[地名] 阿巴西耶 ( 埃 )</td>
        </tr>
        <tr>
          <th>3</th>
          <td>'Abd al Kūrī</td>
          <td>[地名] 阿卜杜勒库里岛 ( 也门 )</td>
        </tr>
        <tr>
          <th>4</th>
          <td>'Abd al Mājid</td>
          <td>[地名] 阿卜杜勒马吉德 ( 苏丹 )</td>
        </tr>
      </tbody>
    </table>
    </div>
    
    
    
    2. Download the web page and get its content

    import requests
    
    # A page from the official pandas documentation
    url = "https://pandas.pydata.org/docs/user_guide/indexing.html"
    
    html_cont = requests.get(url).text
    
    html_cont[:100]
    
    
    
        '\n\n<!DOCTYPE html>\n\n<html xmlns="http://www.w3.org/1999/xhtml">\n  <head>\n    <meta charset="utf-8" />'
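
`requests` decodes `.text` using the charset named in the HTTP headers, and it can fall back to ISO-8859-1 for text responses when the headers name none. If the page is really UTF-8, a character such as `…` then comes out as `â\x80¦`, which is the kind of debris visible in the text extracted in step 3. The sketch below reproduces the effect with plain bytes and shows one possible guard. `fetch_html` is a hypothetical helper, and forcing UTF-8 is an assumption that fits this page, not a general rule:

```python
def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8 (hypothetical helper)."""
    import requests  # imported lazily so the offline demo below runs without it

    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()   # fail loudly on 4xx/5xx instead of parsing an error page
    resp.encoding = "utf-8"   # assumption: the page declares <meta charset="utf-8">
    return resp.text


# Offline demo: the same bytes decoded two ways
raw = "IO tools (text, CSV, HDF5, …)".encode("utf-8")
wrong = raw.decode("latin-1")   # what a wrong charset guess produces
right = raw.decode("utf-8")
print(repr(wrong))
print(repr(right))
```

When the real charset is unknown, `resp.apparent_encoding` (a guess based on the response bytes) is another option, at the cost of an extra detection pass.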
    
    

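    3. Extract the body text from the HTML

    That is, strip the HTML tags and keep the text.

    from bs4 import BeautifulSoup
    # Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
    soup = BeautifulSoup(html_cont, "html.parser")
    html_text = soup.get_text()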
    
    
    html_text[:500]
    
    
    
    
        '\n\n\nIndexing and selecting data — pandas 1.0.1 documentation\n\n\n\n\n\n\n\n\n\n\n\n\nMathJax.Hub.Config({"tex2jax": {"inlineMath": [["$", "$"], ["\\\\(", "\\\\)"]], "processEscapes": true, "ignoreClass": "document", "processClass": "math|output_area"}})\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\n\n\nWhat\'s New in 1.0.0\n\n\nGetting started\n\n\nUser Guide\n\n\nAPI reference\n\n\nDevelopment\n\n\nRelease Notes\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIO tools (text, CSV, HDF5, â\x80¦)\n\n\nIndexing and selecting data\n\n\nMultiIndex / advanced indexing\n\n\nMerge, join, a'
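
The extracted text above still contains the MathJax configuration. Depending on the Beautiful Soup version and parser, `get_text()` may include the contents of `<script>` and `<style>` elements, and those later show up as junk words. One possible refinement is to delete those elements before extracting the text. A minimal sketch; `html_to_text` is a hypothetical helper and the HTML snippet is a made-up stand-in for the downloaded page:

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document, without script/style content."""
    soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
    for tag in soup(["script", "style"]):      # soup(...) is shorthand for find_all(...)
        tag.decompose()                        # remove the element and its contents
    # separator=" " keeps words from adjacent elements from running together
    return soup.get_text(separator=" ", strip=True)


sample_html = """
<html><head><title>Indexing</title>
<script>MathJax.Hub.Config({"tex2jax": {}})</script>
<style>.dataframe { color: red; }</style></head>
<body><h1>Indexing and selecting data</h1><p>Use <code>df.loc</code>.</p></body></html>
"""
text = html_to_text(sample_html)
print(text)
```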
    
    

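    4. Tokenize the English text and clean the data

    # Tokenize: split on spaces, newlines, and common punctuation
    import re
    # Raw string so the backslash escapes reach the regex engine unchanged
    word_list = re.split(r"""[ ,.\(\)/\n|\-:=\$\["']""", html_text)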
    word_list[:10]
    
    
        ['', '', '', 'Indexing', 'and', 'selecting', 'data', '—', 'pandas', '1']
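
Splitting on a hand-picked set of delimiters leaves empty strings and stray symbols in the result, as the output above shows. An alternative is to match the words themselves rather than the separators. The sketch below is one such variant, not the approach used in this article; the `tokenize` name and the pattern are illustrative choices:

```python
import re

# One or more ASCII letters, optionally followed by an apostrophe part ("What's")
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Return the English words found in text, in order of appearance."""
    return WORD_RE.findall(text)


sample = "\n\nIndexing and selecting data — pandas 1.0.1 documentation\nWhat's New"
print(tokenize(sample))
# → ['Indexing', 'and', 'selecting', 'data', 'pandas', 'documentation', "What's", 'New']
```

The trade-off is that identifiers containing digits are broken up: `tex2jax` would yield `tex` and `jax`.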
    
    
    
    # Load the stop-word list (copied from the web; stored under the current directory)
    with open("./datas/stop_words/stop_words.txt", encoding="utf-8") as fin:
        stop_words = set(fin.read().split("\n"))
    list(stop_words)[:10]
    
    
    
        ['',
         'itself',
         'showed',
         'throughout',
         'pointed',
         'n',
         'against',
         'name',
         'none',
         'ran']
    
    
    
    # Clean the data
    word_list_clean = []
    for word in word_list:
        word = str(word).lower().strip()
        # Skip empty strings, numbers, single-character tokens, and stop words
        if not word or word.isnumeric() or len(word)<=1 or word in stop_words:
            continue
        word_list_clean.append(word)
    word_list_clean[:20]
    
    
    
    
        ['indexing',
         'selecting',
         'data',
         'pandas',
         'documentation',
         'mathjax',
         'hub',
         'config',
         'tex2jax',
         'inlinemath',
         '\\\\',
         '\\\\',
         ']]',
         'processescapes',
         'true',
         'ignoreclass',
         'document',
         'processclass',
         'math',
         'output_area']
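
The cleaned list still contains leftovers such as backslash-only tokens and `]]`, which will never match a dictionary entry. A stricter filter could keep only purely alphabetic tokens. A minimal sketch under that assumption; `clean_words` and the small stop-word set are hypothetical:

```python
def clean_words(words, stop_words):
    """Lower-case tokens and keep alphabetic, multi-letter, non-stop words."""
    cleaned = []
    for word in words:
        word = word.lower().strip()
        # isascii() + isalpha() rejects digits, punctuation and mixed tokens like "tex2jax"
        if len(word) <= 1 or not (word.isascii() and word.isalpha()):
            continue
        if word in stop_words:
            continue
        cleaned.append(word)
    return cleaned


tokens = ["Indexing", "and", "data", "1", "\\\\", "]]", "tex2jax", "output_area", "a", ""]
print(clean_words(tokens, stop_words={"and", "the"}))
# → ['indexing', 'data']
```

This filter also drops hyphenated words and contractions, so it suits a vocabulary list better than a faithful word count.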
    
    

    5. Build a DataFrame from the tokens

    df_words = pd.DataFrame({
        "word": word_list_clean
    })
    df_words.head()
    
       word
    0  indexing
    1  selecting
    2  data
    3  pandas
    4  documentation

    df_words.shape
    
        (4915, 1)
    
    
    
    # Count word frequencies
    df_words = (
        df_words
        .groupby("word")["word"]
        .agg(count="size")
        .reset_index()
        .sort_values(by="count", ascending=False)
    )
    df_words.head(10)
    
          word       count
    620   df         161
    659   dtype      87
    1274  true       86
    593   dataframe  80
    1038  pd         75
    917   loc        72
    970   nan        72
    721   false      58
    914   list       58
    835   indexing   53
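
The same kind of frequency table can also be produced with `Series.value_counts`, which some readers may find easier to follow than the `groupby` chain. A small sketch on made-up data; the result has the same `word` and `count` columns:

```python
import pandas as pd

df_demo = pd.DataFrame({"word": ["df", "loc", "df", "dtype", "df", "loc"]})

df_freq = (
    df_demo["word"]
    .value_counts()               # counts per distinct word, sorted descending
    .rename_axis("word")          # name the index so reset_index yields a "word" column
    .reset_index(name="count")    # turn the Series into a two-column DataFrame
)
print(df_freq)
```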

    6. Merge with the dictionary

    # Inner join on the shared "word" column
    df_merge = pd.merge(
        left=df_dict,
        right=df_words,
        on="word"
    )
    
    df_merge.sample(10)
    
         word       translation                                                               count
    658  team       n. 队, 组\nvt. 把马(牛)套在同一辆车上, 把...编成一组\nvi. 驾驶卡车, 协作  3
    523  providing  conj. 以...为条件, 假如                                                   1
    394  lines      n. 台词                                                                   1
    118  columns    塔器                                                                      49
    136  conforms   v. 遵守( conform的第三人称单数 ); 顺应; 相一致; 相符合                    1
    529  python     n. 大蟒, 巨蟒\n[计] Python 程序设计语言;人生苦短,我用 Python            26
    185  determine  v. 决定, 决心                                                             1
    285  forward    a. 向前的, 早的, 迅速的, 在前的, 进步的\nvt. 促进...的生长, 转寄, 运...   1
    49   arguments  n. 参数                                                                   3
    564  reported   a. 报告的;据报道的                                                       1

    df_merge.shape
    
    
    
        (718, 3)
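
`pd.merge` performs an inner join by default, so words without a dictionary entry do not appear in the 718-row result. To inspect what was dropped, one option is a left join from the word list with `indicator=True`. A minimal sketch with made-up frames standing in for `df_words` and `df_dict`:

```python
import pandas as pd

words_demo = pd.DataFrame({"word": ["dataframe", "python", "indexing"], "count": [80, 26, 53]})
dict_demo = pd.DataFrame({
    "word": ["python", "indexing"],
    "translation": ["n. 大蟒, 巨蟒", "n. 索引"],  # hypothetical entries
})

merged = pd.merge(
    words_demo,
    dict_demo,
    on="word",
    how="left",        # keep every word from the article
    indicator=True,    # adds a _merge column: "both" or "left_only"
)
# Words that found no translation
missing = merged.loc[merged["_merge"] == "left_only", "word"].tolist()
# Translated words, most frequent first
found = (
    merged[merged["_merge"] == "both"]
    .drop(columns="_merge")
    .sort_values("count", ascending=False)
)
print(missing)
print(found)
```

Putting the word list on the left also keeps the result in word-list order, so sorting by `count` afterwards gives a vocabulary list ranked by how often each word occurs in the article.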
    

    7. Save to Excel

    df_merge.to_excel("./38. batch_chinese_english.xlsx", index=False)
    

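    Possible future improvements:

    1. Accept batch input in txt/Excel/Word/PDF form and generate the vocabulary list;
    2. Offer it as a web page or WeChat mini program for online use;
    3. Let users mark or upload words they already know, and filter those out each time.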
