Advanced Python: The Frequency-Counting Problem

Author: 与蟒唯舞 | Published 2017-02-16 10:29 | 216 views

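    Given the following list:
    [1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
    we want to count how many times each element occurs and end up with a result like {8: 2, 9: 2, ...}, that is, {element: number of occurrences, ...}.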

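    • Method 1:
      Use the elements as dictionary keys to build a dictionary whose values all start at 0, then walk the list and add 1 to the count of each element encountered: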
    >>> from random import randint
    >>> data = [randint(1, 10) for _ in xrange(10)]  # Python 2; use range() in Python 3
    >>> data
    [1, 7, 10, 4, 9, 10, 9, 8, 5, 8]
    >>> d = dict.fromkeys(data, 0)
    >>> d
    {1: 0, 4: 0, 5: 0, 7: 0, 8: 0, 9: 0, 10: 0}
    >>> for x in data:
    ...     d[x] += 1
    ...
    >>> d
    {1: 1, 4: 1, 5: 1, 7: 1, 8: 2, 9: 2, 10: 2}
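
Wrapped in a function, Method 1 might look like the sketch below. It is written for Python 3 (the session above is Python 2), and the name `count_with_dict` is an illustrative choice, not something from the article.

```python
def count_with_dict(items):
    """Count occurrences using a plain dict pre-filled with zeros."""
    items = list(items)               # the data is traversed twice, so materialize it
    counts = dict.fromkeys(items, 0)  # every distinct element starts at 0
    for item in items:
        counts[item] += 1             # safe: the key is guaranteed to exist
    return counts


data = [1, 7, 10, 4, 9, 10, 9, 8, 5, 8]  # the sample list from the article
counts = count_with_dict(data)
print(counts)
```

Because every key is created up front, the loop never needs a membership check or a `KeyError` handler.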
    
    • Method 2:
      Use Counter from the collections module. Counter is a simple counter:
    >>> from collections import Counter
    >>> c = Counter(data)
    >>> c
    Counter({8: 2, 9: 2, 10: 2, 1: 1, 4: 1, 5: 1, 7: 1})
    >>> isinstance(c, dict)
    True
    # A Counter object is a subclass of dict, so values can be looked up by key
    >>> c[1]
    1
    # most_common(n) returns the n most frequent elements together with their counts
    >>> c.most_common(2)
    [(8, 2), (9, 2)]
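
Since the topic is frequency counting, the same calls apply directly to words. The following is a small Python 3 sketch; the sample sentence is made up for illustration.

```python
from collections import Counter

# Hypothetical sample text; any iterable of hashable items works the same way.
text = "the quick brown fox jumps over the lazy dog the fox"

word_counts = Counter(text.split())   # split on whitespace, then count each word
top_two = word_counts.most_common(2)  # the two most frequent words with their counts
missing = word_counts["cat"]          # absent keys report 0 instead of raising KeyError

print(top_two)   # [('the', 3), ('fox', 2)]
print(missing)   # 0
```

When several elements tie at the cutoff, as 8, 9 and 10 do in the article's list, `most_common(2)` can only return two of them, and which two may differ between Python versions.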
    

    Reference documentation, from help(Counter):

    class Counter(__builtin__.dict)
     |  Dict subclass for counting hashable items.  Sometimes called a bag
     |  or multiset.  Elements are stored as dictionary keys and their counts
     |  are stored as dictionary values.
     |
     |  >>> c = Counter('abcdeabcdabcaba')  # count elements from a string
     |
     |  >>> c.most_common(3)                # three most common elements
     |  [('a', 5), ('b', 4), ('c', 3)]
     |  >>> sorted(c)                       # list all unique elements
     |  ['a', 'b', 'c', 'd', 'e']
     |  >>> ''.join(sorted(c.elements()))   # list elements with repetitions
     |  'aaaaabbbbcccdde'
     |  >>> sum(c.values())                 # total of all counts
     |  15
     |
     |  >>> c['a']                          # count of letter 'a'
     |  5
     |  >>> for elem in 'shazam':           # update counts from an iterable
     |  ...     c[elem] += 1                # by adding 1 to each element's count
     |  >>> c['a']                          # now there are seven 'a'
     |  7
     |  >>> del c['b']                      # remove all 'b'
     |  >>> c['b']                          # now there are zero 'b'
     |  0
     |
     |  >>> d = Counter('simsalabim')       # make another counter
     |  >>> c.update(d)                     # add in the second counter
     |  >>> c['a']                          # now there are nine 'a'
     |  9
     |
     |  >>> c.clear()                       # empty the counter
     |  >>> c
     |  Counter()
     |
     |  Note:  If a count is set to zero or reduced to zero, it will remain
     |  in the counter until the entry is deleted or the counter is cleared:
     |
     |  >>> c = Counter('aaabbc')
     |  >>> c['b'] -= 2                     # reduce the count of 'b' by two
     |  >>> c.most_common()                 # 'b' is still in, but its count is zero
     |  [('a', 3), ('c', 1), ('b', 0)]
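
The note about zero counts can be checked with a short Python 3 sketch. It also shows two ways to get rid of such entries; the unary plus operator on Counter is assumed to be available, which holds for Python 3.3 and later but not for the Python 2 session above.

```python
from collections import Counter

c = Counter('aaabbc')
c['b'] -= 2                   # 'b' drops to 0 but stays in the counter
still_present = 'b' in c      # True: a zero count is still an entry

positive_only = +c            # unary plus builds a new Counter without zero or negative counts
print(positive_only)          # Counter({'a': 3, 'c': 1})

del c['b']                    # alternatively, delete the entry explicitly
print('b' in c)               # False
print(c['b'])                 # 0, because missing keys read as zero
```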
    

      Reader comments

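      • 5c39c691b65a: Hello, teacher.

        My code:
        print 'All digit counts:', collections.Counter(all_nums).most_common()

        The output is:

        All digit counts: [(u'0', 10), (u'9', 10), (u'8', 10), (u'2', 7)]


        I don't want the counts, only the digits in front of them: 0 9 8 2 (without the counts that follow).

        zpx=[]
        zpx = collections.Counter(all_nums).most_common()
        print zpx[0][0]
        This prints: 0

        How should I write it to get the combined output 0982?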
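        5c39c691b65a: @与蟒唯舞
        Hello, teacher. I tested it and got the following result:
        [u'0', u'9', u'8', u'2']
        How can I go one step further and print this list as: 0982

        I added the following code myself as a test:

        zpx=collections.Counter(all_nums).most_common()
        print zpx[0][0]+zpx[1][0]+zpx[2][0]+zpx[3][0]

        It does produce: 0982
        But this code looks clumsy, and it has a "bug":
        when the data contains fewer than 4 distinct digits (only 1 to 3), the code raises an error.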




        5c39c691b65a: @与蟒唯舞 OK, thank you, teacher. I'll give it a try.
        与蟒唯舞: You could try a list comprehension: [x[0] for x in collections.Counter(all_nums).most_common()]
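
The list comprehension in the reply yields the digits as a list; passing them to `str.join` gives the single string the commenter asked for, and it works for any number of distinct digits. The sketch below is Python 3, and both the helper name `join_by_frequency` and the sample data are hypothetical stand-ins, since the commenter's `all_nums` is not shown.

```python
from collections import Counter


def join_by_frequency(items):
    """Concatenate the distinct items, most frequent first, into one string."""
    ranked = Counter(items).most_common()  # list of (item, count) pairs, highest count first
    return ''.join(str(item) for item, _count in ranked)


# Hypothetical data with counts 4, 3, 2 and 1, so the ranking has no ties.
all_nums = ['0'] * 4 + ['9'] * 3 + ['8'] * 2 + ['2']

print(join_by_frequency(all_nums))         # 0982
print(join_by_frequency(['5', '5', '3']))  # 53
```

Because `join` simply consumes however many items there are, the `IndexError` from the hard-coded `zpx[3][0]` cannot occur.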

      Article link: https://www.haomeiwen.com/subject/vjnqwttx.html