Two Simple Ways to Count Word Frequencies

Author: 盗花 | Published: 2018-03-10 20:46 | Views: 63

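After scraping the content of a web page, it is often useful to run a word-frequency analysis, that is, to count which phrases occur most often and which occur least.
Suppose I already have the following 100 bigrams (two-word phrases). How can I count how many times each one occurs? There are two simple ways to do it.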

    words = ['called from',
         'from a',
         'a retirement',
         'retirement which',
         'which i',
         'i had',
         'had supposed',
         'supposed was',
         'was to',
         'to continue',
         'continue for',
         'for the',
         'the residue',
         'residue of',
         'of my',
         'my life',
         'life to',
         'to fill',
         'fill the',
         'the chief',
         'chief executive',
         'executive office',
         'office of',
         'of this',
         'this great',
         'great and',
         'and free',
         'free nation',
         'nation i',
         'i appear',
         'appear before',
         'before you',
         'you fellow-citizens',
         'fellow-citizens to',
         'to take',
         'take the',
         'the oaths',
         'oaths which',
         'which the',
         'the constitution',
         'constitution prescribes',
         'prescribes as',
         'as a',
         'a necessary',
         'necessary qualification',
         'qualification for',
         'the performance',
         'performance of',
         'of its',
         'its duties',
         'duties and',
         'and in',
         'in obedience',
         'obedience to',
         'to a',
         'a custom',
         'custom coeval',
         'coeval with',
         'with our',
         'our government',
         'government and',
         'and what',
         'what i',
         'i believe',
         'believe to',
         'to be',
         'be your',
         'your expectations',
         'expectations i',
         'i proceed',
         'proceed to',
         'to present',
         'present to',
         'to you',
         'you a',
         'a summary',
         'summary of',
         'was the',
         'the principles',
         'principles which',
         'which will',
         'will govern',
         'govern me',
         'me in',
         'i shall',
         'the discharge',
         'discharge of',
         'the duties',
         'duties which',
         'i shall',
         'shall be',
         'be called',
         'called upon',
         'upon to',
         'to perform',
         'perform it',
         'it was',
         'was the',
         'the remark',
         'was the']
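The list above is taken as given here. For reference, a list of bigrams like this can be built by pairing each word with the word that follows it. This is only a minimal sketch and not necessarily how the list above was produced; the helper name make_bigrams and the sample text are hypothetical, and tokenization is assumed to be a plain lowercase split on whitespace.

```python
def make_bigrams(tokens):
    """Pair each token with its successor and join the pair with a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical sample text, lowercased and split on whitespace only.
text = "It was the remark"
tokens = text.lower().split()
bigrams = make_bigrams(tokens)
print(bigrams)  # ['it was', 'was the', 'the remark']
```

A sequence of n tokens yields n - 1 bigrams, so a single token yields an empty list.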

1. The conventional method

The idea is to build a dictionary that uses each bigram as a key and its frequency as the value.

words_dict = {}  # maps each bigram to its number of occurrences
for word in words:
    if word not in words_dict:  # first time this bigram is seen
        words_dict[word] = 1  # start its count at 1
    else:
        words_dict[word] += 1  # seen before: increment its count
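The if/else branch in the loop above can be folded into one line. The sketch below shows two common variants, one using dict.get with a default of 0 and one using collections.defaultdict(int). The function names and the short sample list are made up for illustration; they are not part of the original code.

```python
from collections import defaultdict


def count_with_get(items):
    """Count occurrences using dict.get with a default of 0."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def count_with_defaultdict(items):
    """Count occurrences using defaultdict(int), where missing keys start at 0."""
    counts = defaultdict(int)
    for item in items:
        counts[item] += 1
    return dict(counts)  # convert back to a plain dict


# Hypothetical short sample instead of the full 100-item list.
sample = ['was the', 'i shall', 'was the', 'the remark', 'i shall', 'was the']
print(count_with_get(sample))          # {'was the': 3, 'i shall': 2, 'the remark': 1}
print(count_with_defaultdict(sample))  # {'was the': 3, 'i shall': 2, 'the remark': 1}
```

Both variants give the same counts as the explicit if/else version; which one reads better is mostly a matter of taste.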

The next step is to sort the entries by frequency, from highest to lowest.

sorted_words_dict = sorted(words_dict.items(), key=lambda x: x[1], reverse=True)

Besides the lambda used above, the sort key can also be supplied by the operator module.

import operator
sorted_words_dict = sorted(words_dict.items(), key=operator.itemgetter(1), reverse=True)

In terms of the result, the two sorting approaches are equivalent.
The final sorted result is:

In [22]: sorted_words_dict
Out[22]: 
[('was the', 3),
 ('i shall', 2),
 ('called from', 1),
 ('from a', 1),
 ('a retirement', 1),
 ('retirement which', 1),
 ('which i', 1),
 ('i had', 1),
 ('had supposed', 1),
 ('supposed was', 1),
 ('was to', 1),
 ('to continue', 1),
 ('continue for', 1),
 ('for the', 1),
 ('the residue', 1),
 ('residue of', 1),
 ('of my', 1),
 ('my life', 1),
 ('life to', 1),
 ('to fill', 1),
 ('fill the', 1),
 ('the chief', 1),
 ('chief executive', 1),
 ('executive office', 1),
 ('office of', 1),
 ('of this', 1),
 ('this great', 1),
 ('great and', 1),
 ('and free', 1),
 ('free nation', 1),
 ('nation i', 1),
 ('i appear', 1),
 ('appear before', 1),
 ('before you', 1),
 ('you fellow-citizens', 1),
 ('fellow-citizens to', 1),
 ('to take', 1),
 ('take the', 1),
 ('the oaths', 1),
 ('oaths which', 1),
 ('which the', 1),
 ('the constitution', 1),
 ('constitution prescribes', 1),
 ('prescribes as', 1),
 ('as a', 1),
 ('a necessary', 1),
 ('necessary qualification', 1),
 ('qualification for', 1),
 ('the performance', 1),
 ('performance of', 1),
 ('of its', 1),
 ('its duties', 1),
 ('duties and', 1),
 ('and in', 1),
 ('in obedience', 1),
 ('obedience to', 1),
 ('to a', 1),
 ('a custom', 1),
 ('custom coeval', 1),
 ('coeval with', 1),
 ('with our', 1),
 ('our government', 1),
 ('government and', 1),
 ('and what', 1),
 ('what i', 1),
 ('i believe', 1),
 ('believe to', 1),
 ('to be', 1),
 ('be your', 1),
 ('your expectations', 1),
 ('expectations i', 1),
 ('i proceed', 1),
 ('proceed to', 1),
 ('to present', 1),
 ('present to', 1),
 ('to you', 1),
 ('you a', 1),
 ('a summary', 1),
 ('summary of', 1),
 ('the principles', 1),
 ('principles which', 1),
 ('which will', 1),
 ('will govern', 1),
 ('govern me', 1),
 ('me in', 1),
 ('the discharge', 1),
 ('discharge of', 1),
 ('the duties', 1),
 ('duties which', 1),
 ('shall be', 1),
 ('be called', 1),
 ('called upon', 1),
 ('upon to', 1),
 ('to perform', 1),
 ('perform it', 1),
 ('it was', 1),
 ('the remark', 1)]

As you can see, 'was the' occurs 3 times, 'i shall' occurs twice, and every other bigram occurs once.
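Many bigrams share the same count, so their relative order matters. Python's sorted is stable, even with reverse=True, so entries with equal counts stay in the order they were inserted into the dictionary (Python 3.7+ dictionaries preserve insertion order). If alphabetical order among ties is preferred instead, a compound key is one option. The following is a small sketch on a made-up dictionary, not on the data above.

```python
# Hypothetical counts, inserted in a deliberately non-alphabetical order.
counts = {'the remark': 1, 'was the': 3, 'i shall': 2, 'a custom': 1}

# Stable sort: ties ('the remark', 'a custom') keep their insertion order.
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(by_count)
# [('was the', 3), ('i shall', 2), ('the remark', 1), ('a custom', 1)]

# Compound key: count descending (negated), then bigram ascending.
by_count_then_alpha = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(by_count_then_alpha)
# [('was the', 3), ('i shall', 2), ('a custom', 1), ('the remark', 1)]
```

Negating the count lets a single ascending sort handle both criteria without reverse=True.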

2. Counter

The Counter class in the collections module of the Python standard library makes counting and ranking frequencies even easier.

from collections import Counter

sorted_words_dict = Counter(words)

The result is:

In [32]: sorted_words_dict.most_common()
Out[32]: 
[('was the', 3),
 ('i shall', 2),
 ('called from', 1),
 ('from a', 1),
 ('a retirement', 1),
 ('retirement which', 1),
 ('which i', 1),
 ('i had', 1),
 ('had supposed', 1),
 ('supposed was', 1),
 ('was to', 1),
 ('to continue', 1),
 ('continue for', 1),
 ('for the', 1),
 ('the residue', 1),
 ('residue of', 1),
 ('of my', 1),
 ('my life', 1),
 ('life to', 1),
 ('to fill', 1),
 ('fill the', 1),
 ('the chief', 1),
 ('chief executive', 1),
 ('executive office', 1),
 ('office of', 1),
 ('of this', 1),
 ('this great', 1),
 ('great and', 1),
 ('and free', 1),
 ('free nation', 1),
 ('nation i', 1),
 ('i appear', 1),
 ('appear before', 1),
 ('before you', 1),
 ('you fellow-citizens', 1),
 ('fellow-citizens to', 1),
 ('to take', 1),
 ('take the', 1),
 ('the oaths', 1),
 ('oaths which', 1),
 ('which the', 1),
 ('the constitution', 1),
 ('constitution prescribes', 1),
 ('prescribes as', 1),
 ('as a', 1),
 ('a necessary', 1),
 ('necessary qualification', 1),
 ('qualification for', 1),
 ('the performance', 1),
 ('performance of', 1),
 ('of its', 1),
 ('its duties', 1),
 ('duties and', 1),
 ('and in', 1),
 ('in obedience', 1),
 ('obedience to', 1),
 ('to a', 1),
 ('a custom', 1),
 ('custom coeval', 1),
 ('coeval with', 1),
 ('with our', 1),
 ('our government', 1),
 ('government and', 1),
 ('and what', 1),
 ('what i', 1),
 ('i believe', 1),
 ('believe to', 1),
 ('to be', 1),
 ('be your', 1),
 ('your expectations', 1),
 ('expectations i', 1),
 ('i proceed', 1),
 ('proceed to', 1),
 ('to present', 1),
 ('present to', 1),
 ('to you', 1),
 ('you a', 1),
 ('a summary', 1),
 ('summary of', 1),
 ('the principles', 1),
 ('principles which', 1),
 ('which will', 1),
 ('will govern', 1),
 ('govern me', 1),
 ('me in', 1),
 ('the discharge', 1),
 ('discharge of', 1),
 ('the duties', 1),
 ('duties which', 1),
 ('shall be', 1),
 ('be called', 1),
 ('called upon', 1),
 ('upon to', 1),
 ('to perform', 1),
 ('perform it', 1),
 ('it was', 1),
 ('the remark', 1)]
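In practice, only the top few entries are usually of interest. most_common takes an optional argument n that limits the result to the n most frequent items, and a Counter can also be queried or updated like a dictionary. A short sketch on a made-up sample list (not the 100 bigrams above):

```python
from collections import Counter

# Hypothetical short sample instead of the full list.
sample = ['was the', 'i shall', 'was the', 'the remark', 'i shall', 'was the']
counts = Counter(sample)

print(counts.most_common(2))   # [('was the', 3), ('i shall', 2)]
print(counts['was the'])       # 3
print(counts['not present'])   # 0, missing keys count as zero instead of raising KeyError

# update() adds further observations to the existing counts.
counts.update(['the remark', 'the remark'])
print(counts['the remark'])    # 3
```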

Both methods are simple and practical, and both are worth knowing.