jieba is an excellent third-party library for Chinese word segmentation.
Install it with:
pip install jieba
How jieba segments Chinese text: it identifies words by means of a Chinese lexicon. Using the lexicon, jieba determines the association probability between characters; characters with a high probability of belonging together are grouped into words, which form the segmentation result.
jieba's three segmentation modes:
- Precise mode: splits the text into exact words, with no redundant words.
- Full mode: scans out all possible words in the text, with redundancy.
- Search-engine mode: on the basis of precise mode, segments long words again.
Common jieba functions:
- jieba.lcut(s): precise mode; returns the segmentation result as a list.
- jieba.lcut(s, cut_all=True): full mode; returns the result as a list, with redundancy.
- jieba.lcut_for_search(s): search-engine mode; returns the result as a list, with redundancy.
- jieba.add_word(w): adds a new word w to the segmentation dictionary.

In practice, jieba.lcut(s) is all you need for most tasks.
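A minimal sketch of these calls (the sentence is a common teaching example, and "蟒蛇语言" is an arbitrary made-up word; exact output can vary with jieba's dictionary version):

import jieba

s = "中国是一个伟大的国家"
print(jieba.lcut(s))                  # precise mode, e.g. ['中国', '是', '一个', '伟大', '的', '国家']
print(jieba.lcut(s, cut_all=True))    # full mode: all candidate words, with redundancy
print(jieba.lcut_for_search(s))       # search-engine mode: long words segmented again
jieba.add_word("蟒蛇语言")             # register a custom word in the dictionary
print(jieba.lcut("Python是蟒蛇语言"))  # '蟒蛇语言' should now stay as one token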
A word-frequency example on Hamlet:
Text link
#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()                    # lowercase so "The" and "the" count as one word
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_{|}·~‘’':
        txt = txt.replace(ch, " ")       # replace punctuation with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()                # split on whitespace into a list of words
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1   # count occurrences, defaulting to 0
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
'''
This sorts a list of (word, count) pairs by the second element of each
pair, from largest to smallest. In list.sort(), the lambda passed as key
selects which field of each element to sort on; the default order is
ascending, and reverse=True gives descending order.
'''
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Counting character appearances in Romance of the Three Kingdoms (三国演义):
Text link
#CalThreekingdomsV1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# High-frequency words that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此",
            "商议", "如何", "主公", "军士", "左右", "军马", "引兵",
            "大喜", "天下", "次日", "东吴", "于是", "今日", "不敢"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:                   # skip single characters
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"                   # merge aliases of Zhuge Liang
    elif word == "关公" or word == "云长":
        rword = "关羽"                   # merge aliases of Guan Yu
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"                   # merge aliases of Liu Bei
    elif word == "孟德" or word == "丞相":
        rword = "曹操"                   # merge aliases of Cao Cao
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1   # count under the merged name (rword, not word)
for word in excludes:
    del counts[word]                     # drop the non-name words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(20):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))