Using the jieba word segmentation library, complete the following two tasks:
(1) Find the ten most frequent words in the file "threekingdoms.txt".
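import jieba

# Read the file as raw bytes; jieba.lcut accepts bytes and decodes them internally
with open("threekingdoms.txt", "rb") as f:
    txt = f.read()

# Segment the whole text into a list of words
words = jieba.lcut(txt)

# Count every word, skipping single-character tokens (mostly punctuation and particles)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

# Print the top 15 entries, five more than the task requires
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))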
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\15228\AppData\Local\Temp\jieba.cache
Loading model cost 1.518 seconds.
Prefix dict has been built succesfully.
1586
曹操 953
孔明 836
将军 772
却说 656
玄德 586
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 383
如此 378
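The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below is only an illustration, not the original solution: the helper name `top_words` and the sample token list are hypothetical, and the function works on any list of tokens such as the one returned by `jieba.lcut`. It strips whitespace before the length check, on the assumption that the blank entry counted 1586 times in the output above is a line-break token such as "\r\n" that survives because the file was read in binary mode.

```python
from collections import Counter


def top_words(tokens, n=10, min_len=2):
    """Return the n most frequent tokens that have at least min_len visible characters."""
    counts = Counter(
        token for token in tokens
        # Drops single characters and whitespace-only tokens such as "\r\n"
        if len(token.strip()) >= min_len
    )
    return counts.most_common(n)


# Hypothetical sample standing in for the result of jieba.lcut(txt)
sample_tokens = ["曹操", "曰", "孔明", "\r\n", "曹操", "，", "孔明", "曹操", "荆州"]

for word, count in top_words(sample_tokens, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter.most_common(n)` replaces the manual `list(...)`, `sort` and index loop, and it does not raise an error when fewer than `n` distinct words exist.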
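(2) Count how many times the names "关羽" (Guan Yu), "曹操" (Cao Cao), "诸葛亮" (Zhuge Liang), "刘备" (Liu Bei) and other characters appear in the file "threekingdoms.txt".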
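import jieba

# Frequent words from task (1) that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

# Read the file as raw bytes; jieba.lcut accepts bytes and decodes them internally
with open("threekingdoms.txt", "rb") as f:
    txt = f.read()
words = jieba.lcut(txt)

# Count words, merging the different names of one character into a single key
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# Remove the non-name words; pop() with a default avoids a KeyError if a word is absent
for word in excludes:
    counts.pop(word, None)

# Sort by count, highest first, and print the top 5 entries
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(5):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))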
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\15228\AppData\Local\Temp\jieba.cache
Loading model cost 0.680 seconds.
Prefix dict has been built succesfully.
1586
曹操 1451
孔明 1383
刘备 1253
关羽 784
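The `elif` chain used to merge names grows by one branch for every new alias. A lookup table keeps the same mapping in one place. The sketch below is a hedged alternative rather than the original solution: `ALIASES`, `EXCLUDES`, `count_names` and the sample tokens are hypothetical names introduced here, with the alias pairs copied from the code above. Unlike the original, it filters excluded words while counting and ignores whitespace-only tokens.

```python
from collections import Counter

# Alias -> canonical name, mirroring the elif chain in the exercise
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

# Frequent words that are not character names
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_names(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens after mapping aliases to one canonical name."""
    counts = Counter()
    for token in tokens:
        # Skip single characters and whitespace-only tokens
        if len(token.strip()) < 2:
            continue
        # Fall back to the token itself when it has no alias
        name = aliases.get(token, token)
        if name not in excludes:
            counts[name] += 1
    return counts


# Hypothetical sample standing in for the result of jieba.lcut(txt)
sample_tokens = ["曹操", "丞相", "孟德", "玄德曰", "将军", "关公", "云长", "诸葛亮", "曰"]

for name, count in count_names(sample_tokens).most_common(4):
    print("{0:<10}{1:>5}".format(name, count))
```

Adding another alias then means adding one dictionary entry instead of editing the loop.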