Example 8 --- The jieba Library and Text Word-Frequency Statistics

Author: glRu | Published: 2020-06-10 16:29


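Basic Statistics Calculation

Extending the Idea

- Getting multiple values: how to read an unknown number of values from the console

- Separating multiple functions: modular design

- Making full use of functions: rely on the built-in functions Python provides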
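
The three points above can be sketched together as follows. This is a minimal illustration, not the code of the earlier example: the names `parse_numbers`, `mean`, and `median` are hypothetical, and the input lines are passed in as a list so the logic can be tried without a console. In interactive use they would come from repeated `input()` calls until an empty line is entered.

```python
def parse_numbers(lines):
    """Collect numbers until the first empty line, which acts as the stop signal."""
    numbers = []
    for line in lines:
        line = line.strip()
        if line == "":
            break
        numbers.append(float(line))
    return numbers


def mean(numbers):
    # The built-ins sum() and len() do all the work
    return sum(numbers) / len(numbers)


def median(numbers):
    # sorted() returns a new list, so the caller's data is left untouched
    ordered = sorted(numbers)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


if __name__ == "__main__":
    # "9" comes after the empty line, so it is never read
    data = parse_numbers(["3", "1", "4", "1", "5", "", "9"])
    print(mean(data), median(data))
```

Each function does one job, so the statistics can be reused on data that comes from a file or another program instead of the console.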

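The jieba Library

- Chinese text has to be segmented before individual words can be obtained

- jieba is a widely used third-party library for Chinese word segmentation; it must be installed separately (for example with pip install jieba)

- jieba provides three segmentation modes; for the simplest use, one function is enough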

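How jieba Segments Text

- It uses a Chinese dictionary to determine the association probabilities between Chinese characters

- Characters with a high probability of belonging together are grouped into words, which form the segmentation result

- Besides segmenting, users can also add their own custom words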
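
The idea of "dictionary plus probabilities" can be illustrated with a toy segmenter. This is only a sketch of the principle, not jieba's code: the dictionary `TOY_FREQ` and its counts are made up for illustration, the function name `segment` is hypothetical, and the real library is considerably more elaborate.

```python
import math

# Hypothetical toy dictionary: word -> frequency (numbers invented for illustration)
TOY_FREQ = {
    "中国": 50, "中": 10, "国": 10, "是": 80, "一个": 40, "一": 20, "个": 20,
    "伟大": 30, "伟": 2, "大": 25, "的": 100, "国家": 45, "家": 10,
}


def segment(text, freq=TOY_FREQ):
    """Split text into the word sequence with the highest total probability."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (best log-probability for text[i:], end index of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters get a smoothing count of 1;
            # unknown longer strings are not considered words
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count > 0:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    # Follow the recorded end indices to rebuild the chosen words
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


if __name__ == "__main__":
    print(segment("中国是一个伟大的国家"))
```

Adding an entry to `TOY_FREQ` plays the same role as adding a custom word: a string that was previously split into single characters can then be kept together.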

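The Three jieba Segmentation Modes

- Precise mode: splits the text exactly, with no redundant words

- Full mode: scans out every possible word in the text, so the result contains redundancy

- Search-engine mode: starts from precise mode and splits long words again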

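Common jieba Functions

- jieba.lcut(s): precise mode; returns the segmentation result as a list

- jieba.lcut(s, cut_all=True): full mode; returns a list that contains redundant words

- jieba.lcut_for_search(s): search-engine mode; returns a list that contains redundant words

- jieba.add_word(w): adds the new word w to the segmentation dictionary

Analyzing the "Text Word-Frequency Statistics" Problem

- English text: word frequencies in Hamlet

https://python123.io/resources/pye/hamlet.txt

- Chinese text: characters in Romance of the Three Kingdoms (《三国演义》)

https://python123.io/resources/pye/threekingdoms.txt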

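"Hamlet English Word Frequency" Example Code

# CalHamletV1.py
def get_text():
    # Read the file, lowercase it, and replace punctuation with spaces
    with open('hamlet.txt', 'r') as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '!"#$%&()+,-./:;<=>?@[\\]^_{|}~':
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = get_text()
words = hamletTxt.split()

# Count how often each word occurs
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort by count, highest first, and print the ten most frequent words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

"Character Appearances in Romance of the Three Kingdoms" Example Code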

# CalThreeKingdomV2.py
import jieba

# File name as in the download URL (threekingdoms.txt)
with open('threekingdoms.txt', 'r', encoding="utf-8") as f:
    txt = f.read()

# Frequent words that are not character names
exclude = {'将军', '却说', '荆州', '二人', '不可', '不能', '如此'}

words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge the different names of one character into a single entry
    elif word == '诸葛亮' or word == '孔明曰':
        rword = '孔明'
    elif word == '关公' or word == '云长':
        rword = '关羽'
    elif word == '玄德' or word == '玄德曰':
        rword = '刘备'
    elif word == '孟德' or word == '丞相':
        rword = '曹操'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# Drop the excluded words; pop() does not fail if a word never occurred
for word in exclude:
    counts.pop(word, None)

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
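
As the list of aliases grows, the chain of elif branches gets long. One possible alternative, shown here as a sketch rather than as the original program, keeps the aliases in a lookup table. The names `ALIASES`, `EXCLUDE`, and `count_characters` are hypothetical; the alias pairs and excluded words are the ones used in the listing above. The function takes a list of tokens, so it can be tried without jieba or the novel's text.

```python
# Hypothetical alias table: each variant maps to one canonical character name
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDE = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_characters(words, aliases=ALIASES, exclude=EXCLUDE):
    """Count multi-character tokens, merging aliases and skipping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Fall back to the token itself when it has no alias
        name = aliases.get(word, word)
        if name in exclude:
            continue
        counts[name] = counts.get(name, 0) + 1
    # Highest count first
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    tokens = ["玄德", "曰", "孔明曰", "诸葛亮", "将军", "玄德曰", "孔明", "关公"]
    for name, count in count_characters(tokens):
        print("{0:<10}{1:>5}".format(name, count))
```

With the full text, the call would be count_characters(jieba.lcut(txt))[:10]. Adding another alias or excluded word then means editing the table, not the loop.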

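Extending Word-Frequency Statistics

- Dream of the Red Chamber (《红楼梦》), Journey to the West (《西游记》), Water Margin (《水浒传》)…

- Government work reports, research papers, news reports…

- And the next step? Word clouds are still to come…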
