Study Notes: Common Linux Commands cut, sort, and grep


Author: 打工仔小刘 | Published: 2021-11-29 16:36

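    1.cut
    cut extracts parts of each line of a file: it splits every line into fields so that individual fields can be handled separately, much like working with columns in Excel.
    -b selects by bytes; -c selects by characters; -f selects by fields; -d sets the field delimiter (the default is Tab).
    -d and -f are used most often. Depending on the file, the delimiter may be a space, a comma, a Tab, and so on; -f chooses which column or columns to keep.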
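The three selection modes described above can be compared on a single line. This is only a minimal sketch: the sample line and the variable names are made up for illustration, and -b and -c give the same result here only because the text is plain ASCII.

```shell
line='SRR776504,RNA-Seq,646'

# -c: characters 1-3 of the line
first_chars=$(printf '%s\n' "$line" | cut -c 1-3)

# -b: bytes 4-9 (identical to characters 4-9 for plain ASCII text)
run_number=$(printf '%s\n' "$line" | cut -b 4-9)

# -d/-f: split on commas and keep field 2
library=$(printf '%s\n' "$line" | cut -d ',' -f 2)

printf '%s\n' "$first_chars" "$run_number" "$library"
```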

    View the file with cat; its fields are separated by commas.
    cat a.txt
    SRR776504,RNA-Seq,646,8308777,PRJNA192864,SAMN01978938,24124470
    SRR776504,RNA-Seq,646,8308777,PRJNA192864,SAMN01978938,24124470
    SRR776505,RNA-Seq,579,26432223,PRJNA192864,SAMN01978939,78862401
    SRR776505,RNA-Seq,652,26432336,PRJNA192864,SAMN01978939,78862611
    SRR776505,RNA-Seq,652,26432254,PRJNA192864,SAMN01978939,78862001
    SRR776506,RNA-Seq,577,24314163,PRJNA192864,SAMN01978940,72169712
    cut -d ',' -f 1 a.txt
    Each line splits on commas into 7 fields; this takes the first field.
    SRR776504
    SRR776504
    SRR776505
    SRR776505
    SRR776505
    SRR776506
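-f also accepts a comma-separated list of field numbers and open-ended ranges. The sketch below uses one row copied from a.txt; the variable names are hypothetical. Note that cut joins the selected fields with the same delimiter it split on.

```shell
row='SRR776505,RNA-Seq,579,26432223,PRJNA192864,SAMN01978939,78862401'

# Fields 1 and 3 (a list of field numbers)
id_and_len=$(printf '%s\n' "$row" | cut -d ',' -f 1,3)

# Field 5 through the last field (an open-ended range)
tail_fields=$(printf '%s\n' "$row" | cut -d ',' -f 5-)

printf '%s\n' "$id_and_len" "$tail_fields"
```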
    sed 's/,/\t/g' a.txt | cut -f 1-4 
     sed replaces every comma with a Tab, then cut takes fields 1 through 4 (GNU sed understands \t in the replacement text).
    SRR776504       RNA-Seq 646     8308777
    SRR776504       RNA-Seq 646     8308777
    SRR776505       RNA-Seq 579     26432223
    SRR776505       RNA-Seq 652     26432336
    SRR776505       RNA-Seq 652     26432254
    SRR776506       RNA-Seq 577     24314163
     | is the pipe operator, used to chain commands: it passes the output of the command on its left to the command on its right as input.
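As a small illustration of chaining, the following sketch counts how many distinct run accessions appear in the first column. The rows mirror the first three columns of a.txt; the variable names are assumptions made for this example.

```shell
# Sample rows shaped like a.txt (only the first three columns are kept here)
data='SRR776504,RNA-Seq,646
SRR776504,RNA-Seq,646
SRR776505,RNA-Seq,579
SRR776505,RNA-Seq,652
SRR776505,RNA-Seq,652
SRR776506,RNA-Seq,577'

# Step 1: cut extracts the accession; step 2: sort -u drops duplicate lines
unique_ids=$(printf '%s\n' "$data" | cut -d ',' -f 1 | sort -u)

# Step 3: wc -l counts the remaining lines (tr strips padding that some wc versions add)
n_unique=$(printf '%s\n' "$unique_ids" | wc -l | tr -d ' ')

printf '%s\n' "$unique_ids"
printf 'distinct runs: %s\n' "$n_unique"
```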
    

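    2.sort
    sort orders lines. By default it compares them as text, character by character (the exact order depends on the locale). -k selects the field to sort on; -n sorts by numeric value; -u removes duplicate lines, equivalent to sort | uniq; -r reverses the order (the default is ascending); -t sets the field separator.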
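The difference between the default text comparison and -n is easiest to see when the numbers have different lengths. In this sketch the value 1200 is made up for illustration, and LC_ALL=C is set only to make the text ordering predictable across systems.

```shell
rows='SRR776505,579
SRR776504,646
SRR776506,1200'

# Text comparison on field 2: "1200" sorts before "579" because "1" < "5"
lexical=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k2,2)

# -n compares field 2 as a number
numeric=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k2,2n)

# Adding r reverses that key: largest number first
descending=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k2,2nr)

printf '%s\n' "$lexical" '--' "$numeric" '--' "$descending"
```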
    The examples here work on the four fields extracted earlier with cut.

    Sort by the first column, then by the third column numerically in descending order.
    sed 's/,/\t/g' a.txt | cut -f 1-4 | sort -k1,1 -k3,3nr 
    SRR776504       RNA-Seq 646     8308777
    SRR776504       RNA-Seq 646     8308777
    SRR776505       RNA-Seq 652     26432254
    SRR776505       RNA-Seq 652     26432336
    SRR776505       RNA-Seq 579     26432223
    SRR776506       RNA-Seq 577     24314163
    The first 2 lines are identical.
    sed 's/,/\t/g' a.txt | cut -f 1-4 | sort -k1,1 -k3,3nr | uniq -c
          2 SRR776504       RNA-Seq 646     8308777
          1 SRR776505       RNA-Seq 652     26432254
          1 SRR776505       RNA-Seq 652     26432336
          1 SRR776505       RNA-Seq 579     26432223
          1 SRR776506       RNA-Seq 577     24314163
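    #uniq collapses adjacent duplicate lines, which is why the input is sorted first; -c prefixes each line with the number of times it occurred; -d prints only the lines that are repeated.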
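Because uniq only compares neighboring lines, unsorted input can slip through unchanged. A minimal sketch with made-up input (the variable names are hypothetical):

```shell
ids='b
a
b
a'

# Without sorting, no two neighboring lines are equal, so nothing is collapsed
unsorted=$(printf '%s\n' "$ids" | uniq)

# Sorting first brings equal lines together
sorted_uniq=$(printf '%s\n' "$ids" | sort | uniq)

# sort -u gives the same result in one step
sort_u=$(printf '%s\n' "$ids" | sort -u)

printf '%s\n' "$unsorted" '--' "$sorted_uniq" '--' "$sort_u"
```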
    

    Another example:

    cat <<END | uniq -c
    > a
    > a
    > a
    > b
    > c
    > d
    > d
    > END
          3 a
          1 b
          1 c
          2 d
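    After cat <<END you can type input at the keyboard; the shell shows a > prompt on each line. Entering END finishes the input, and uniq -c then reports how many times each line occurs.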
    cat <<END | uniq -d
    > a
    > a
    > a
    > b
    > c
    > d
    > d
    > END
    a
    d
    #the output is the repeated lines, a and d
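The same here-document works inside a script, where no > prompts appear. The sketch below wraps it in a hypothetical helper function named letters, then ranks the lines by how often they occur. The sed step only strips the padding that uniq -c puts in front of each count.

```shell
# Hypothetical helper: prints the sample lines from a here-document.
# Quoting the delimiter ('END') keeps the body literal (no variable expansion).
letters() {
cat <<'END'
a
a
a
b
c
d
d
END
}

# Count each line, then sort by the count (field 1) numerically, largest first
counts=$(letters | uniq -c | LC_ALL=C sort -k1,1nr | sed 's/^ *//')

# Only the lines that occur more than once
repeated=$(letters | uniq -d)

printf '%s\n' "$counts" '--' "$repeated"
```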
    

    3.grep
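    grep searches for lines that match a pattern. -c counts the matching lines; -v inverts the match and prints the lines that do not match.

    head t.gtf #show the beginning of the file; -n sets how many lines to show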
    #!genome-version TAIR10
    #!genome-date 2008-04
    #!genome-build-accession GCA_000001735.1
    #!genebuild-last-updated 2010-09
    1       araport11       gene    10942648        10944727        .       -       .       gene_id "AT1G30814"; gene_source "araport11"; gene_biotype "protein_coding";
    1       araport11       transcript      10942648        10944727        .       -       .       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
    1       araport11       exon    10944317        10944727        .       -       .       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G30814.1.exon1";
    1       araport11       exon    10944078        10944229        .       -       .       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "2"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G30814.1.exon2";
    1       araport11       CDS     10944078        10944225        .       -       0       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "2"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G30814.1";
    grep "CDS" t.gtf
    1       araport11       CDS     10944078        10944225        .       -       0       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "2"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G30814.1";
    1       araport11       CDS     10943868        10943984        .       -       2       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "3"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G30814.1";
    #matches of "CDS" are highlighted in color when grep's color output is enabled
    grep -c "CDS" t.gtf
    2
    grep -v "^#" t.gtf | head -3 #skip comment lines (those that start with #)
    1       araport11       gene    10942648        10944727        .       -       .       gene_id "AT1G30814"; gene_source "araport11"; gene_biotype "protein_coding";
    1       araport11       transcript      10942648        10944727        .       -       .       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
    1       araport11       exon    10944317        10944727        .       -       .       gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G30814.1.exon1";
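The three commands from these notes combine naturally to summarize a GTF file by feature type. This is a sketch on made-up, shortened records (only five tab-separated columns, produced by a hypothetical helper named sample), not on the real t.gtf.

```shell
# Hypothetical helper: one header line plus four simplified tab-separated records
sample() {
  printf '#!genome-version TAIR10\n'
  printf '1\taraport11\tgene\t10942648\t10944727\n'
  printf '1\taraport11\texon\t10944317\t10944727\n'
  printf '1\taraport11\texon\t10944078\t10944229\n'
  printf '1\taraport11\tCDS\t10944078\t10944225\n'
}

# Number of records that are not comment lines
n_records=$(sample | grep -v -c '^#')

# Drop comments, keep column 3 (feature type), then count each type
feature_counts=$(sample | grep -v '^#' | cut -f 3 | LC_ALL=C sort | uniq -c | sed 's/^ *//')

printf 'records: %s\n' "$n_records"
printf '%s\n' "$feature_counts"
```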
    cat b.txt
    snpeff
    snppppef
    exper
    database
    data2
    123
    *123
    grep "s" b.txt
    snpeff
    snppppef
    database
    grep "^s" b.txt #lines that start with s
    snpeff
    snppppef
    grep "f$" b.txt #lines that end with f
    snpeff
    snppppef
    grep "snp*ef" b.txt #p* matches p zero or more times
    snpeff
    snppppef
    grep -E "^s|d" b.txt #-E enables extended regular expressions; | means "or", so this matches lines that start with s or contain d
    snpeff
    snppppef
    database
    data2
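b.txt also contains the line *123, where * is a literal character rather than a repetition operator. One way to search for it, sketched on a few lines taken from b.txt (the variable names are hypothetical):

```shell
words='snpeff
snppppef
123
*123'

# -F treats the pattern as a fixed string, so * is an ordinary character
fixed=$(printf '%s\n' "$words" | grep -F '*123')

# In a regular expression, a backslash makes * literal; ^ anchors it to the line start
escaped=$(printf '%s\n' "$words" | grep '^\*')

# Unescaped, 3* means "zero or more 3", so this pattern matches both 123 and *123
either=$(printf '%s\n' "$words" | grep '123*')

printf '%s\n' "$fixed" '--' "$escaped" '--' "$either"
```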
    


    Article link: https://www.haomeiwen.com/subject/fljlxrtx.html