Two Simple Ways to Count Word Frequencies

Author: 盗花 | Published: 2018-03-10 20:46

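After scraping a web page, you sometimes need a word-frequency analysis, that is, a count of which phrases appear most often and which appear least.
Suppose I already have the following 100 bigrams (two-word phrases). How do I count how often each one occurs? There are two simple ways to do it.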

    words = ['called from',
         'from a',
         'a retirement',
         'retirement which',
         'which i',
         'i had',
         'had supposed',
         'supposed was',
         'was to',
         'to continue',
         'continue for',
         'for the',
         'the residue',
         'residue of',
         'of my',
         'my life',
         'life to',
         'to fill',
         'fill the',
         'the chief',
         'chief executive',
         'executive office',
         'office of',
         'of this',
         'this great',
         'great and',
         'and free',
         'free nation',
         'nation i',
         'i appear',
         'appear before',
         'before you',
         'you fellow-citizens',
         'fellow-citizens to',
         'to take',
         'take the',
         'the oaths',
         'oaths which',
         'which the',
         'the constitution',
         'constitution prescribes',
         'prescribes as',
         'as a',
         'a necessary',
         'necessary qualification',
         'qualification for',
         'the performance',
         'performance of',
         'of its',
         'its duties',
         'duties and',
         'and in',
         'in obedience',
         'obedience to',
         'to a',
         'a custom',
         'custom coeval',
         'coeval with',
         'with our',
         'our government',
         'government and',
         'and what',
         'what i',
         'i believe',
         'believe to',
         'to be',
         'be your',
         'your expectations',
         'expectations i',
         'i proceed',
         'proceed to',
         'to present',
         'present to',
         'to you',
         'you a',
         'a summary',
         'summary of',
         'was the',
         'the principles',
         'principles which',
         'which will',
         'will govern',
         'govern me',
         'me in',
         'i shall',
         'the discharge',
         'discharge of',
         'the duties',
         'duties which',
         'i shall',
         'shall be',
         'be called',
         'called upon',
         'upon to',
         'to perform',
         'perform it',
         'it was',
         'was the',
         'the remark',
         'was the']
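
For readers wondering where such a list comes from, here is a minimal sketch of building bigrams from raw text. The original post does not show this step, so the helper name `make_bigrams`, the tokenizing regular expression, and the sample sentence are all assumptions made for illustration.

```python
import re

def make_bigrams(text):
    """Lowercase the text, split it into words, and pair each word with the next one."""
    # Hypothetical tokenizer: runs of letters/digits, allowing inner hyphens
    # so that a word like "fellow-citizens" stays in one piece.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    # zip(tokens, tokens[1:]) walks over consecutive word pairs.
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

sample = "Called from a retirement which I had supposed was to continue"
bigrams = make_bigrams(sample)
print(bigrams[:3])  # ['called from', 'from a', 'a retirement']
```

A text with n words yields n - 1 bigrams, so a single word produces an empty list.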

1. The Conventional Approach

The idea is to build a dictionary that uses each bigram as a key and its frequency as the value.

words_dict = {}  # maps each bigram to its count
for word in words:
    if word not in words_dict:  # first time this bigram is seen
        words_dict[word] = 1  # start its count at 1
    else:
        words_dict[word] += 1  # seen before, so add 1 to its count
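
The same dictionary idea can be written without the if/else branch. The sketch below is not from the original post: it uses a short made-up list, `sample_words`, and shows two common variants, `dict.get` with a default of 0 and `collections.defaultdict(int)`.

```python
from collections import defaultdict

# Made-up sample data, much shorter than the 100-bigram list.
sample_words = ['was the', 'i shall', 'was the', 'it was', 'i shall', 'was the']

# Variant A: dict.get returns 0 for a bigram that has not been seen yet.
counts_get = {}
for word in sample_words:
    counts_get[word] = counts_get.get(word, 0) + 1

# Variant B: defaultdict(int) creates a missing entry with the value 0 on first access.
counts_default = defaultdict(int)
for word in sample_words:
    counts_default[word] += 1

print(counts_get)  # {'was the': 3, 'i shall': 2, 'it was': 1}
```

Both variants produce the same counts as the explicit if/else loop; which one reads best is largely a matter of taste.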

Next comes sorting, by frequency from highest to lowest.

sorted_words_dict = sorted(words_dict.items(), key=lambda x: x[1], reverse=True)

Besides the lambda above, the sort key can also be written with the operator module.

import operator
sorted_words_dict = sorted(words_dict.items(), key=operator.itemgetter(1), reverse=True)

The two sorting approaches produce the same result.
The final sorted output is:

In [22]: sorted_words_dict
Out[22]: 
[('was the', 3),
 ('i shall', 2),
 ('called from', 1),
 ('from a', 1),
 ('a retirement', 1),
 ('retirement which', 1),
 ('which i', 1),
 ('i had', 1),
 ('had supposed', 1),
 ('supposed was', 1),
 ('was to', 1),
 ('to continue', 1),
 ('continue for', 1),
 ('for the', 1),
 ('the residue', 1),
 ('residue of', 1),
 ('of my', 1),
 ('my life', 1),
 ('life to', 1),
 ('to fill', 1),
 ('fill the', 1),
 ('the chief', 1),
 ('chief executive', 1),
 ('executive office', 1),
 ('office of', 1),
 ('of this', 1),
 ('this great', 1),
 ('great and', 1),
 ('and free', 1),
 ('free nation', 1),
 ('nation i', 1),
 ('i appear', 1),
 ('appear before', 1),
 ('before you', 1),
 ('you fellow-citizens', 1),
 ('fellow-citizens to', 1),
 ('to take', 1),
 ('take the', 1),
 ('the oaths', 1),
 ('oaths which', 1),
 ('which the', 1),
 ('the constitution', 1),
 ('constitution prescribes', 1),
 ('prescribes as', 1),
 ('as a', 1),
 ('a necessary', 1),
 ('necessary qualification', 1),
 ('qualification for', 1),
 ('the performance', 1),
 ('performance of', 1),
 ('of its', 1),
 ('its duties', 1),
 ('duties and', 1),
 ('and in', 1),
 ('in obedience', 1),
 ('obedience to', 1),
 ('to a', 1),
 ('a custom', 1),
 ('custom coeval', 1),
 ('coeval with', 1),
 ('with our', 1),
 ('our government', 1),
 ('government and', 1),
 ('and what', 1),
 ('what i', 1),
 ('i believe', 1),
 ('believe to', 1),
 ('to be', 1),
 ('be your', 1),
 ('your expectations', 1),
 ('expectations i', 1),
 ('i proceed', 1),
 ('proceed to', 1),
 ('to present', 1),
 ('present to', 1),
 ('to you', 1),
 ('you a', 1),
 ('a summary', 1),
 ('summary of', 1),
 ('the principles', 1),
 ('principles which', 1),
 ('which will', 1),
 ('will govern', 1),
 ('govern me', 1),
 ('me in', 1),
 ('the discharge', 1),
 ('discharge of', 1),
 ('the duties', 1),
 ('duties which', 1),
 ('shall be', 1),
 ('be called', 1),
 ('called upon', 1),
 ('upon to', 1),
 ('to perform', 1),
 ('perform it', 1),
 ('it was', 1),
 ('the remark', 1)]

As you can see, 'was the' appears 3 times, 'i shall' appears twice, and every other bigram appears once.
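
The bigrams that occur once keep the order in which they first appeared. That follows from two facts: `sorted` is stable (also with `reverse=True`), and dictionaries preserve insertion order (guaranteed from Python 3.7). The sketch below, on a small made-up dictionary `sample_counts`, checks that the two key functions agree and shows one possible alternative tie-break, alphabetical order, which the original post does not use.

```python
import operator

# Made-up counts; 'from a' is deliberately inserted before 'called from'.
sample_counts = {'from a': 1, 'was the': 3, 'called from': 1, 'i shall': 2}

by_lambda = sorted(sample_counts.items(), key=lambda x: x[1], reverse=True)
by_itemgetter = sorted(sample_counts.items(), key=operator.itemgetter(1), reverse=True)
print(by_lambda == by_itemgetter)  # True
print(by_lambda)
# [('was the', 3), ('i shall', 2), ('from a', 1), ('called from', 1)]

# Alternative tie-break: count descending, then bigram in alphabetical order.
by_count_then_alpha = sorted(sample_counts.items(), key=lambda x: (-x[1], x[0]))
print(by_count_then_alpha)
# [('was the', 3), ('i shall', 2), ('called from', 1), ('from a', 1)]
```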

2. Counter

With the Counter class from the collections module in Python's standard library, ranking by frequency is even easier.

from collections import Counter

sorted_words_dict = Counter(words)

Calling most_common() returns the counts sorted from highest to lowest:

In [32]: sorted_words_dict.most_common()
Out[32]: 
[('was the', 3),
 ('i shall', 2),
 ('called from', 1),
 ('from a', 1),
 ('a retirement', 1),
 ('retirement which', 1),
 ('which i', 1),
 ('i had', 1),
 ('had supposed', 1),
 ('supposed was', 1),
 ('was to', 1),
 ('to continue', 1),
 ('continue for', 1),
 ('for the', 1),
 ('the residue', 1),
 ('residue of', 1),
 ('of my', 1),
 ('my life', 1),
 ('life to', 1),
 ('to fill', 1),
 ('fill the', 1),
 ('the chief', 1),
 ('chief executive', 1),
 ('executive office', 1),
 ('office of', 1),
 ('of this', 1),
 ('this great', 1),
 ('great and', 1),
 ('and free', 1),
 ('free nation', 1),
 ('nation i', 1),
 ('i appear', 1),
 ('appear before', 1),
 ('before you', 1),
 ('you fellow-citizens', 1),
 ('fellow-citizens to', 1),
 ('to take', 1),
 ('take the', 1),
 ('the oaths', 1),
 ('oaths which', 1),
 ('which the', 1),
 ('the constitution', 1),
 ('constitution prescribes', 1),
 ('prescribes as', 1),
 ('as a', 1),
 ('a necessary', 1),
 ('necessary qualification', 1),
 ('qualification for', 1),
 ('the performance', 1),
 ('performance of', 1),
 ('of its', 1),
 ('its duties', 1),
 ('duties and', 1),
 ('and in', 1),
 ('in obedience', 1),
 ('obedience to', 1),
 ('to a', 1),
 ('a custom', 1),
 ('custom coeval', 1),
 ('coeval with', 1),
 ('with our', 1),
 ('our government', 1),
 ('government and', 1),
 ('and what', 1),
 ('what i', 1),
 ('i believe', 1),
 ('believe to', 1),
 ('to be', 1),
 ('be your', 1),
 ('your expectations', 1),
 ('expectations i', 1),
 ('i proceed', 1),
 ('proceed to', 1),
 ('to present', 1),
 ('present to', 1),
 ('to you', 1),
 ('you a', 1),
 ('a summary', 1),
 ('summary of', 1),
 ('the principles', 1),
 ('principles which', 1),
 ('which will', 1),
 ('will govern', 1),
 ('govern me', 1),
 ('me in', 1),
 ('the discharge', 1),
 ('discharge of', 1),
 ('the duties', 1),
 ('duties which', 1),
 ('shall be', 1),
 ('be called', 1),
 ('called upon', 1),
 ('upon to', 1),
 ('to perform', 1),
 ('perform it', 1),
 ('it was', 1),
 ('the remark', 1)]
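
Counter has a few other conveniences that are handy for frequency work. The sketch below goes beyond what the original post shows and uses a short made-up list, `sample_words`: `most_common(n)` returns only the top n entries, looking up a missing key returns 0 instead of raising KeyError, and `update()` adds further observations to the existing counts.

```python
from collections import Counter

# Made-up sample data, much shorter than the 100-bigram list.
sample_words = ['was the', 'i shall', 'was the', 'it was', 'i shall', 'was the']
counts = Counter(sample_words)

# Only the two most frequent bigrams.
top_two = counts.most_common(2)
print(top_two)  # [('was the', 3), ('i shall', 2)]

# A bigram that never occurred has a count of 0 (no KeyError).
print(counts['not present'])  # 0

# update() adds new observations to the existing counts.
counts.update(['it was', 'it was'])
print(counts['it was'])  # 3
```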

Both approaches are simple and practical, and both are worth knowing.

Article link: https://www.haomeiwen.com/subject/zqvbfftx.html