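38 Batch-Translating English Words with Python
Purpose: take a batch of English text, build an English-Chinese vocabulary list from it, and provide the result as an Excel download.
This code:
- takes the URL of an English article and downloads the page automatically;
- translates every English word found on the page;
- exports the translations to an Excel file.
Techniques involved:
- pandas: reading CSV, merging multiple datasets, writing Excel
- the requests library for downloading the HTML page
- BeautifulSoup for parsing the HTML page
- Python regular expressions for English word tokenization
1. Read the English-Chinese dictionary file
The dictionary file comes from https://github.com/skywind3000/ECDICT. Steps:
- Download the packaged code: https://github.com/skywind3000/ECDICT/archive/master.zip
- Unzip master.zip, then extract the stardict.csv file inside it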
import pandas as pd
# Note: replace the stardict.csv path with the location of your own copy
# low_memory=False reads the file in one pass, which avoids the DtypeWarning about mixed types in column 11
df_dict = pd.read_csv("D:/tmp/ECDICT-master/stardict.csv", low_memory=False)
df_dict.shape
(3402564, 13)
df_dict.sample(10).head()
| | word | phonetic | definition | translation | pos | collins | oxford | tag | bnc | frq | exchange | detail | audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3370509 | WWDH | NaN | NaN | [网络] 淇楄壋 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 518014 | chauhtan (chotan) | NaN | NaN | 卓丹 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 389953 | breviarist | NaN | NaN | [网络] 短笛师 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 951231 | electric-vehicle | NaN | NaN | abbr. “EV”的变体;“electric car”的变体\n[网络] 电动汽车 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 91258 | Albionian | æl'biәniәn | NaN | [地质]阿尔比翁期 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN |
# Drop every column except word and translation
df_dict = df_dict[["word", "translation"]]
df_dict.head()
| | word | translation |
|---|---|---|
| 0 | 'a | na. 一\nn. 英文字母表的第一字母;【乐】A音\nart. 冠以不定冠词主要表示类别\... |
| 1 | 'A' game | [网络] 游戏;一个游戏;一局 |
| 2 | 'Abbāsīyah | [地名] 阿巴西耶 ( 埃 ) |
| 3 | 'Abd al Kūrī | [地名] 阿卜杜勒库里岛 ( 也门 ) |
| 4 | 'Abd al Mājid | [地名] 阿卜杜勒马吉德 ( 苏丹 ) |
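
Because only the `word` and `translation` columns are used from here on, the read and the column selection could also be done in a single call. A minimal sketch, assuming the CSV header uses the column names shown above; the inline `sample_csv` text is made-up stand-in data so that the snippet runs without the full dictionary file:

```python
import io

import pandas as pd

# Hypothetical stand-in for stardict.csv: same header style, three made-up rows.
sample_csv = io.StringIO(
    "word,phonetic,translation,frq\n"
    "data,,n. 数据,10\n"
    "index,,n. 索引,20\n"
    "merge,,v. 合并,\n"
)

# usecols loads only the listed columns; dtype=str skips per-column type guessing.
df_small = pd.read_csv(sample_csv, usecols=["word", "translation"], dtype=str)
print(df_small.shape)
```

Loading two columns instead of thirteen should also reduce memory use noticeably on a file of this size.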
### 2. Download the web page and get its content
```python
import requests
# A page from the official pandas documentation
url = "https://pandas.pydata.org/docs/user_guide/indexing.html"
html_cont = requests.get(url).text
html_cont[:100]
```
'\n\n<!DOCTYPE html>\n\n<html xmlns="http://www.w3.org/1999/xhtml">\n <head>\n <meta charset="utf-8" />'
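
The `<meta charset="utf-8" />` tag in this output says the page is UTF-8. However, `requests` chooses the codec for `.text` from the HTTP headers, and for a text response whose `Content-Type` header names no charset it falls back to ISO-8859-1. That is a likely explanation for sequences such as `â\x80¦` in the extracted text of the next step. The sketch below reproduces the effect on a hard-coded byte string (an illustrative stand-in, not the real response), so it needs no network access:

```python
# UTF-8 bytes for a phrase that ends with the ellipsis character (U+2026).
raw = "IO tools (text, CSV, HDF5, …)".encode("utf-8")

# Decoding with the wrong codec turns the 3-byte ellipsis into 3 separate characters.
wrong = raw.decode("latin-1")
# Decoding with the codec the page declares restores the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

With `requests`, the same correction can be made by setting `resp.encoding = "utf-8"` on the response object before reading `resp.text` (`resp` is a hypothetical variable name), or by handing the raw bytes in `resp.content` to BeautifulSoup so that it detects the encoding itself.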
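3. Extract the body text from the HTML
That is, strip the HTML tags and keep only the text.
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise BeautifulSoup picks one itself and emits a warning
soup = BeautifulSoup(html_cont, "html.parser")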
html_text = soup.get_text()
html_text[:500]
'\n\n\nIndexing and selecting data — pandas 1.0.1 documentation\n\n\n\n\n\n\n\n\n\n\n\n\nMathJax.Hub.Config({"tex2jax": {"inlineMath": [["$", "$"], ["\\\\(", "\\\\)"]], "processEscapes": true, "ignoreClass": "document", "processClass": "math|output_area"}})\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\n\n\nWhat\'s New in 1.0.0\n\n\nGetting started\n\n\nUser Guide\n\n\nAPI reference\n\n\nDevelopment\n\n\nRelease Notes\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIO tools (text, CSV, HDF5, â\x80¦)\n\n\nIndexing and selecting data\n\n\nMultiIndex / advanced indexing\n\n\nMerge, join, a'
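
The output above still contains the `MathJax.Hub.Config(...)` script source. Whether `get_text()` includes the contents of `<script>` and `<style>` elements depends on the Beautiful Soup version and parser, so deleting those elements explicitly is one way to keep them out of the word list in any setup. A minimal sketch on a made-up HTML snippet (`sample_html` is illustrative, not the pandas page):

```python
from bs4 import BeautifulSoup

# Made-up page with a script and a style block around the visible text.
sample_html = (
    "<html><head><title>Demo</title>"
    "<script>var x = 1;</script>"
    "<style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# Delete script and style elements so their source never reaches get_text().
for tag in soup(["script", "style"]):
    tag.decompose()

# A space separator keeps words from neighbouring elements apart.
text = soup.get_text(separator=" ", strip=True)
print(text)
```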
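4. English tokenization and data cleaning
# Tokenize: split on spaces, punctuation, and newlines
# (raw string, so the backslash escapes reach the regex engine unchanged)
import re
word_list = re.split(r"""[ ,.\(\)/\n|\-:=\$\["']""", html_text)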
word_list[:10]
['', '', '', 'Indexing', 'and', 'selecting', 'data', '—', 'pandas', '1']
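
Splitting on a fixed set of delimiters leaves behind empty strings and symbols such as `—`, as the output above shows. An alternative, sketched below under the assumption that only alphabetic words are wanted, is to match the words themselves with `re.findall`. The pattern and the sample sentence are illustrative choices, and this pattern splits mixed tokens such as `tex2jax` into `tex` and `jax`:

```python
import re

# One or more ASCII letters, optionally followed by an apostrophe part (e.g. "What's").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

sample_text = "Indexing and selecting data — pandas 1.0.1 documentation\nWhat's New"
tokens = WORD_RE.findall(sample_text)
print(tokens)
```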
# Load the stop-word list (copied from the web), stored under the current directory
with open("./datas/stop_words/stop_words.txt") as fin:
    stop_words = set(fin.read().split("\n"))
list(stop_words)[:10]
['',
'itself',
'showed',
'throughout',
'pointed',
'n',
'against',
'name',
'none',
'ran']
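
The set above contains an empty string, which comes from a trailing newline or blank lines in the file, and `open` without an `encoding` argument relies on the platform default. A small helper can address both. This is a sketch, assuming the file is UTF-8 with one word per line; `load_stop_words` is a hypothetical name:

```python
import tempfile
from pathlib import Path


def load_stop_words(path):
    """Return the stop words in `path` (one per line) as a lowercase set."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip whitespace and skip blank lines so "" never enters the set.
    return {line.strip().lower() for line in lines if line.strip()}


# Demo on a throwaway file; the three words are made-up sample content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "stop_words.txt"
    demo_path.write_text("itself\nAgainst\n\nname\n", encoding="utf-8")
    demo_stop_words = load_stop_words(demo_path)

print(sorted(demo_stop_words))
```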
# Clean the data
word_list_clean = []
for word in word_list:
    word = str(word).lower().strip()
    # Skip empty strings, numbers, single-character tokens, and stop words
    if not word or word.isnumeric() or len(word) <= 1 or word in stop_words:
        continue
    word_list_clean.append(word)
word_list_clean[:20]
['indexing',
'selecting',
'data',
'pandas',
'documentation',
'mathjax',
'hub',
'config',
'tex2jax',
'inlinemath',
'\\\\',
'\\\\',
']]',
'processescapes',
'true',
'ignoreclass',
'document',
'processclass',
'math',
'output_area']
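
The cleaned list above still contains backslash-only tokens and `]]`, because they are neither numeric nor a single character. If only alphabetic words are of interest, a stricter filter is one option. The sketch below keeps a token only when it consists entirely of ASCII letters; `clean_words` and the tiny stop-word set are illustrative, and this rule also drops identifiers such as `tex2jax` and `output_area`:

```python
import re

# A token qualifies only if it is two or more lowercase ASCII letters.
ALPHA_RE = re.compile(r"[a-z]{2,}")


def clean_words(tokens, stop_words):
    """Lowercase tokens and keep alphabetic, non-stop words of length >= 2."""
    cleaned = []
    for token in tokens:
        word = str(token).lower().strip()
        if ALPHA_RE.fullmatch(word) and word not in stop_words:
            cleaned.append(word)
    return cleaned


sample_tokens = ["Indexing", "and", "", "1", "\\\\", "]]", "tex2jax", "Data", "a"]
result = clean_words(sample_tokens, stop_words={"and"})
print(result)
```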
5. Build a DataFrame from the tokens
df_words = pd.DataFrame({
    "word": word_list_clean
})
df_words.head()
| | word |
|---|---|
| 0 | indexing |
| 1 | selecting |
| 2 | data |
| 3 | pandas |
| 4 | documentation |
df_words.shape
(4915, 1)
# Count word frequencies
df_words = (
    df_words
    .groupby("word")["word"]
    .agg(count="size")
    .reset_index()
    .sort_values(by="count", ascending=False)
)
df_words.head(10)
| | word | count |
|---|---|---|
| 620 | df | 161 |
| 659 | dtype | 87 |
| 1274 | true | 86 |
| 593 | dataframe | 80 |
| 1038 | pd | 75 |
| 917 | loc | 72 |
| 970 | nan | 72 |
| 721 | false | 58 |
| 914 | list | 58 |
| 835 | indexing | 53 |
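
The same frequency table can also be produced with `Series.value_counts`, which counts and sorts in one call. A sketch on a made-up word list; the `rename_axis` and `reset_index` steps are there only to end up with the same `word` and `count` column names as above:

```python
import pandas as pd

# Made-up tokens standing in for word_list_clean.
sample_words = ["df", "loc", "df", "dtype", "df", "loc"]
df_sample = pd.DataFrame({"word": sample_words})

df_counts = (
    df_sample["word"]
    .value_counts()              # counts per word, most frequent first
    .rename_axis("word")         # name the index so reset_index yields a "word" column
    .reset_index(name="count")   # turn the Series into a two-column DataFrame
)
print(df_counts)
```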
6. Merge with the dictionary
# Inner join (the default): only words present in both frames are kept
df_merge = pd.merge(
    left=df_dict,
    right=df_words,
    left_on="word",
    right_on="word"
)
df_merge.sample(10)
| | word | translation | count |
|---|---|---|---|
| 658 | team | n. 队, 组\nvt. 把马(牛)套在同一辆车上, 把...编成一组\nvi. 驾驶卡车, 协作 | 3 |
| 523 | providing | conj. 以...为条件, 假如 | 1 |
| 394 | lines | n. 台词 | 1 |
| 118 | columns | 塔器 | 49 |
| 136 | conforms | v. 遵守( conform的第三人称单数 ); 顺应; 相一致; 相符合 | 1 |
| 529 | python | n. 大蟒, 巨蟒\n[计] Python 程序设计语言;人生苦短,我用 Python | 26 |
| 185 | determine | v. 决定, 决心 | 1 |
| 285 | forward | a. 向前的, 早的, 迅速的, 在前的, 进步的\nvt. 促进...的生长, 转寄, 运... | 1 |
| 49 | arguments | n. 参数 | 3 |
| 564 | reported | a. 报告的;据报道的 | 1 |
df_merge.shape
(718, 3)
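
`pd.merge` performs an inner join by default, so the 718 rows of `df_merge` cover only the words that have a dictionary entry; the rest are silently discarded. To see which words were dropped, a left join with `indicator=True` marks where each row was found. A sketch on two made-up frames (`toy_dict` and `toy_words` are illustrative stand-ins for `df_dict` and `df_words`):

```python
import pandas as pd

toy_dict = pd.DataFrame({"word": ["data", "index"], "translation": ["n. 数据", "n. 索引"]})
toy_words = pd.DataFrame({"word": ["data", "df", "index"], "count": [5, 9, 2]})

# how="left" keeps every extracted word; _merge records where each row was found.
merged = pd.merge(toy_words, toy_dict, on="word", how="left", indicator=True)

# Rows marked "left_only" have no match in the dictionary.
missing = merged.loc[merged["_merge"] == "left_only", "word"].tolist()
print(missing)
```

Since the tokens were lowercased during cleaning, dictionary entries that contain capitals (such as `Albionian` in the sample above) can never match, which is one possible source of missing words.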
7. Save to Excel
df_merge.to_excel("./38. batch_chinese_english.xlsx", index=False)
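
The merged frame follows the dictionary's order rather than frequency order (in the sample above, the row indices increase alphabetically with the words). If the most frequent words should appear at the top of the spreadsheet, the frame can be sorted before writing. A sketch on a made-up frame; `toy_merge` and the file name `vocab_sorted.xlsx` are hypothetical:

```python
import pandas as pd

# Made-up merge result standing in for df_merge.
toy_merge = pd.DataFrame({
    "word": ["arguments", "columns", "python"],
    "translation": ["n. 参数", "塔器", "n. 大蟒, 巨蟒"],
    "count": [3, 49, 26],
})

# Most frequent words first; ties are broken alphabetically.
df_out = (
    toy_merge
    .sort_values(by=["count", "word"], ascending=[False, True])
    .reset_index(drop=True)
)
print(df_out["word"].tolist())

# Writing is then the same call as above (needs an Excel engine such as openpyxl):
# df_out.to_excel("vocab_sorted.xlsx", index=False)
```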
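Possible future improvements:
- accept batch input from txt/excel/word/pdf files and generate a vocabulary list from it;
- offer the tool as a web page or WeChat mini program for online use;
- let users mark or upload words they already know, and filter those out on every run.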
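
The last idea, filtering out words the user already knows, maps onto a short pandas filter. A sketch, assuming the known words arrive as a plain Python set; `known_words` and the toy frame are hypothetical:

```python
import pandas as pd

toy_vocab = pd.DataFrame({
    "word": ["data", "conforms", "python"],
    "count": [40, 1, 26],
})

# Hypothetical set of words the user has marked as known.
known_words = {"data", "python"}

# isin builds a boolean mask; ~ inverts it to keep only the unknown words.
df_new = toy_vocab[~toy_vocab["word"].isin(known_words)].reset_index(drop=True)
print(df_new)
```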