Python: Notes on Word Frequency Counting for Chinese and English Text


Author: xu一直在路上 | Published: 2019-01-05 14:22

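As a web-crawler engineer, it helps to know the basics of word frequency counting. When processing text for public-opinion analysis, you often need to count how many times each word appears, or find the top 10 words in a text, as a small first step toward simple data analysis later. Let's get started.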

import re

text = '''Which year will be the turning point for the world's most populous country in which its population experiences negative growth? Chinese demographers differ in their answers.
Experts with the Chinese Academy of Social Sciences estimate the turning point could arrive around 2028 after the population peaks to 1.44 billion, says the Green Book of Population and Labor co-released by the Chinese Academy of Social Sciences and Social Sciences Academic Press on Thursday. 
However, Huang Wenzheng, a demographics expert, told the Global Times on Friday that this estimate is too optimistic. He estimated the year 2024 or 2025 will be the threshold for population negative growth.
According to Huang, the prediction in the green book is based on the fertility rate that could remain at 1.6, which is hard to realize. 
In 2016, China's fertility rate was 1.7, but in 2017, the number of births was less, according to media reports. 
The births in 2016 and 2017 were high compared to years before, said Huang. "This was due to the introduction of two-child policy for all families [in 2016] which encouraged those who had the willingness to have a second child before the policy. So they hastened to give birth in these two years."
"But the overall trend is that people are no longer willing to have more children."
Huang elaborated that people's concept of raising children has changed. Urban people care about quality, rather than quantity. "They want to provide the best resources they have to bring up their children. This won't be possible if they have several," he said. 
With rapid urbanization, many people from rural areas come to work in the city and also follow this practice. 
"Previously people thought that having two or three children is normal. But now they are accustomed to having only one child. They find this normal," Huang said.
Yi Fuxian, a research fellow at the University of Wisconsin-Madison, holds a more pessimistic view. He told the Global Times that 2018 has seen negative growth based on his own research and analysis. 
Both Yi and Huang believe that China will abandon the two-child policy this year, putting an end to family planning, in order to stimulate births. They also warned that the sharp decline in population could have negative influence on the economy.
China has introduced a series of new measures to stimulate fertility. This year, the country's tax cuts also favor families with children. Families are able to deduct 12,000 yuan ($1,748) a year from their taxable income for children's education.
Huang said this is still far from enough. He suggested the government provide free upbringing of children aged 0 to 3 and make kindergarten education compulsory to further ease the burden of educating children. 


'''
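# Word frequency count
def word_count(string):
    if isinstance(string, str):
        new_text = string.strip()
        # Split on runs of whitespace (raw string avoids an invalid-escape warning)
        str_list = re.split(r'\s+', new_text)
        word_dict = {}
        for str_word in str_list:
            # If the key already exists, add 1 to its value; otherwise start at 1
            word_dict[str_word] = word_dict.get(str_word, 0) + 1
        return word_dict
    else:
        # Python 3 cannot raise a plain string; raise an exception object instead
        raise TypeError('Please enter a string')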


word = word_count(string=text)
#print(word)

# Sort by count in descending order and take the top 10
word_list = sorted(word.items(), key=lambda x: x[1], reverse=True)[:10]
print(word_list)

The output (a screenshot in the original post) lists the top 10 words in the text along with their counts.
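One limitation of splitting on whitespace alone is that punctuation stays attached to words and case is preserved, so "Huang," and "Huang" (or "The" and "the") are counted as different words. A minimal sketch of one way to handle this, assuming lowercasing and dropping punctuation are acceptable for your analysis, uses `collections.Counter` from the standard library. The function name `top_words` and the sample sentence are illustrative, not part of the original code.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    # Lowercase first so "The" and "the" are counted together, then keep
    # runs of letters, digits and apostrophes; other punctuation is dropped.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words).most_common(n)


sample = "The cat saw the dog. The dog saw the cat, and the cat ran."
print(top_words(sample, 2))  # [('the', 5), ('cat', 3)]
```

`Counter.most_common(n)` does the sorting and slicing in one call, which avoids off-by-one mistakes when taking the top 10.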

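That covers word frequency counting for English. Now let's look at how to count words in Chinese text.
Chinese is written without spaces between words, so we first need a third-party library, the jieba word segmenter.
Install it: pip install jieba
Segment the text and count: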
import jieba
content_text = '''然而,我们并没有时间去探索数据集中的数千个案例。我们应该做的则是在测试案例的典型范例上继续运行LIME,看看哪些词的占有率仍能位居前列。通过这种方法,我们可以获得像以前模型那样的单词的重要性分数,并验证模型的预测'''

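def get_(string):
    # cut_all=True selects jieba's full mode, which lists every word it can find
    tokens = list(jieba.cut(string, cut_all=True))
    word_dict = {}  # renamed so the built-ins dict and str are not shadowed
    for token in tokens:
        # Skip empty strings and newlines
        if token != '' and token != '\n':
            word_dict[token] = word_dict.get(token, 0) + 1
    return word_dict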

word = get_(string=content_text)
# Take the top 10 words
word_list = sorted(word.items(), key=lambda x: x[1], reverse=True)[:10]
print(word_list)

The output (a screenshot in the original post) shows the word frequency result for the Chinese text.
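The filter above only skips empty strings and newlines, so punctuation tokens such as "," or "。" can still end up in the counts, depending on what the segmenter emits. A rough sketch of a stricter filter follows, assuming that any token with no letter, digit or CJK character should be ignored. The helper name `top_tokens` and the hand-written token list are illustrative; the block uses only the standard library, so it runs without jieba.

```python
from collections import Counter


def top_tokens(tokens, n=10):
    """Count already-segmented tokens, skipping punctuation and whitespace."""
    # Keep a token only if it contains at least one letter, digit or CJK
    # character; str.isalnum() is False for punctuation such as "," and "。".
    kept = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
    return Counter(kept).most_common(n)


# Hand-written tokens standing in for a segmenter's output.
tokens = ["我们", ",", "模型", "我们", "。", "\n", "模型", "我们"]
print(top_tokens(tokens, 2))  # [('我们', 3), ('模型', 2)]
```

With jieba installed, the tokens could come from `jieba.lcut(content_text)`, which segments in the default (precise) mode. Full mode (`cut_all=True`) can emit overlapping candidate words, so the same characters may be counted under more than one word; which mode suits you depends on the analysis.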

That's all for today's notes. If you're interested, feel free to send me a private message.
