Computing word frequencies on Linux


Author: dming1024 | Published: 2019-06-14 17:28

    References:
    https://blog.csdn.net/herecles/article/details/8152054
    https://www.cnblogs.com/standby/p/8309994.html

    The sample text is as follows:

    cat words.txt
    The Zen of Python, by Tim Peters
     
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!
    

    1. Counting word frequencies with AWK

     cat words.txt | awk '{for(i=1;i<=NF;i++){if($i ~ /\w/) valid++;\
    count[$i]++}}END{print "valid words:"valid"\n";for(j in count)\
    print j,count[j]}'
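    # An if was added to filter for "word" characters (\w is a GNU awk extension), but the result is not ideal.
    # In END, a for loop prints the contents of the count hash.
    
    valid words:143   # Perl reports 113 valid words (see section 2), so the two counts disagree.
    # The test $i ~ /\w/ only requires a field to contain at least one word character,
    # so tokens such as "ugly." and "aren't" still count as valid. Of the 144
    # whitespace-separated words in the sample, only the bare "--" contains none.
    # Mangled lines such as " 1unts." below are what a trailing carriage return produces,
    # which most likely means words.txt has Windows (CRLF) line endings.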
    
    -- 1
     19
     hard 1
     1unts.
    one 2
    only 1
    is 10
    it 1
    If 2
     1nse.
    special 1
    aren't 1
    are 1
    ambiguity, 1
    honking 1
    Readability 1
    way 2
    of 3
    In 1
     1w.
    easy 1
    one-- 1
    than 8
    Special 1
    *right* 1
    refuse 1
    preferably 1
    that 1
    be 3
    Errors 1
    Sparse 1
    Complex 1
    explain, 2
     1ver.
     1tch.
     1rity.
    bad 1
    you're 1
    Beautiful 1
    There 1
     1sted.
    do 2
    Unless 1
    by 1
    cases 1
    better 8
    Now 1
    Explicit 1
    face 1
    often 1
    unless 1
    not 1
    more 1
    a 2
     1ters
    implementation 2
    Tim 1
    obvious 1
    Although 3
    let's 1
     1.
     1lently.
    practicality 1
    Namespaces 1
    should 2
     1mplex.
    those! 1
    great 1
     2ea.
    it's 1
    Simple 1
     1les.
    enough 1
    idea 1
    explicitly 1
     1lenced.
    pass 1
    Zen 1
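
The gap between the two counts can be reproduced on a tiny input. The following is a minimal sketch, not the author's method: `count_words` is a hypothetical helper name, the two-line sample is made up for illustration, and the character class `[A-Za-z0-9_]` stands in for `\w` so that it should also work in awk implementations other than GNU awk. It strips a trailing carriage return first, then applies both tests to every field.

```shell
count_words() {
  awk '{
    sub(/\r$/, "")                               # drop the CR left by CRLF line endings
    for (i = 1; i <= NF; i++) {
      total++
      if ($i ~ /[A-Za-z0-9_]/) contains++        # at least one word character (the AWK test above)
      if ($i ~ /^[A-Za-z0-9_]+$/) only++         # nothing but word characters (Perl-style filter)
    }
  }
  END { printf "total=%d contains=%d only=%d\n", total, contains, only }'
}

# Hypothetical sample with CRLF line endings
printf 'Readability counts.\r\nlet us -- do it\r\n' | count_words
# prints: total=7 contains=6 only=5
```

Here "counts." passes the first test but not the second, while "--" fails both, which is the same pattern that yields 143 versus 113 on the full text.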
    

    2. Counting word frequencies with Perl

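Perl handles this task better, but one point took me a while to work out. The command passes two -e code fragments that play different roles and cannot both run once per line. The first relies on -alne: -n wraps the code in an implicit while (<>) loop, so it visits every word in words.txt line by line (-a autosplits each line into @F, which goes unused here because the code calls split itself). The second fragment only has to print the count hash with foreach after that loop has finished, so it is placed in an END block, which Perl runs once after all input has been read. Note that -l appends a newline to every print; combined with the explicit \n, this is why the output below has a blank line after each entry.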

     # Count every whitespace-separated token; tokens containing a non-word character are skipped as invalid.
     cat words.txt|perl -alne '{foreach(split){$total++;next if /\W/;
    $valid++;$count{$_}++;}}' -e  'END{print"total:$total words,valid:$valid words\n";
    foreach $word (sort keys %count)
    {print " $word ==> $count{$word}\n"}}'
    
    total:144 words,valid:113 words
    
     Although ==> 3
    
     Beautiful ==> 1
    
     Complex ==> 1
    
     Errors ==> 1
    
     Explicit ==> 1
    
     Flat ==> 1
    
     If ==> 2
    
     There ==> 1
    
     Tim ==> 1
    
     Unless ==> 1
    
     Zen ==> 1
    
     a ==> 2
    
     and ==> 1
    
     are ==> 1
    
     at ==> 1
    
     bad ==> 1
    
     enough ==> 1
    
     explicitly ==> 1
    
     face ==> 1
    
     first ==> 1
    
     good ==> 1
    
     great ==> 1
    
     hard ==> 1
    
     honking ==> 1
    
     idea ==> 1
    
     implementation ==> 2
    
     is ==> 10
    
     it ==> 1
    
     may ==> 2
    
     more ==> 1
    
     never ==> 2
    
     not ==> 1
    
     obvious ==> 1
    
     of ==> 3
    
     often ==> 1
    
     one ==> 2
    
     only ==> 1
    
     pass ==> 1
    
     practicality ==> 1
    
     preferably ==> 1
    
     refuse ==> 1
    
