[Text Mining] Class 1

Author: caokai001 | Published 2019-03-04 01:10

    Computing text complexity with Linux tools and visualizing the result

    Course link
    Text link

    The first homework

    1. The Linux script is as follows

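    #!/bin/bash
    # Usage: fun.txt <text file> <number of lines per step>
    if [ ! -e "$1" ] || [ -z "$2" ]; then
        echo "!!! Please provide the input file and the number of lines"
        exit 1
    fi
    rm -f "$1.output"

    # One lowercase word per line: every run of non-alphanumerics becomes a newline
    tr -sc '[:alnum:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' > "$1.a.pure.txt"
    total=$(wc -l < "$1.a.pure.txt")
    for i in $(seq 1 $((total / $2)))
    do
        n=$((i * $2))
        head -n "$n" "$1.a.pure.txt" > tmp_file
        words=$(wc -l < tmp_file)              # all words in the first n lines
        token=$(sort tmp_file | uniq | wc -l)  # distinct words among them
        tmp=$(echo "scale=4; $token / $words" | bc)
        printf '%s\t%s\t%s\n' "$n" "$tmp" "$token" >> "$1.output"
    done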
    

    2. The R script is as follows

    library(ggplot2)

    # Columns: V1 = words read so far, V2 = distinct/total ratio, V3 = distinct words
    Article <- read.table("brown_a.txt.output", header = FALSE)
    # Dividing V3 by 11866 puts both curves on the left axis; sec_axis undoes it on the right
    p <- ggplot(Article, aes(x = V1))
    p <- p + geom_line(aes(y = V2, colour = "ratio"))
    p <- p + geom_line(aes(y = V3 / 11866, colour = "words"))
    p <- p + scale_colour_manual(values = c("blue", "red"))
    p <- p + scale_y_continuous(sec.axis = sec_axis(~ . * 11866, name = "words"))
    p <- p + labs(y = "ratio", x = "length of article", colour = "Type")
    p <- p + theme(legend.position = c(0.2, 0.9))
    p
    

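    3. Running it

    [kcao@h1 test]$ sh fun.txt brown_a.txt 100 # produces brown_a.txt.output

    brown_a.txt.output

    Import into RStudio and visualize

    The text complexity (the ratio of distinct words to total words) keeps falling as more of the text is read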
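
One way to see why the ratio falls: common words repeat, so the distinct-word count grows more slowly than the total. The sketch below tracks the running ratio over a made-up nine-word input in a single `awk` pass. This is a different technique from the `head` loop in the homework script, and the function name `running_ratio` and the sample words are assumptions for illustration only, not Brown corpus data.

```shell
#!/bin/bash
# Print "<words so far> <distinct so far> <ratio>" after each input line,
# expecting one word per line on stdin.
running_ratio() {
    awk '{
        if (!($0 in seen)) { seen[$0] = 1; distinct++ }
        printf "%d %d %.4f\n", NR, distinct, distinct / NR
    }'
}

printf '%s\n' the cat saw the other cat and the dog | running_ratio
```

The first line of output is `1 1 1.0000` and the last is `9 6 0.6667`. The ratio can rise briefly whenever a new word appears, but the overall trend is downward. Note that `awk` rounds to four decimals, while the `bc` call in the homework script truncates.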

    Appendix:

    1. Plotting with dual Y axes
    2. Dual Y axes in ggplot2
    3. Showing decimals in shell division
