Comparing RNA-seq data before and after processing

Author: javen_spring | Published 2020-05-29 12:17

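    This note compares the raw fastq data with the fq.gz (fastq) data produced by trim_galore. Running the trimming itself requires the rna conda environment; here we only inspect files, so the default conda (base) environment is enough.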

    Commands
    cd ${workdir}/04.clean
    zcat SRR1039510_1.fastq.gz | paste - - - - > raw.txt
    zcat SRR1039510_1_val_1.fq.gz |paste - - - - > trim.txt
    awk '(length($4)<63){print$1}' trim.txt > ID
    head -n 100 ID > ID100
    grep -w -f ID100 trim.txt | awk '{print$1,$4}' > trim.sm
    grep -w -f ID100 raw.txt | awk '{print$1,$4}' > raw.sm
    paste raw.sm trim.sm | awk '{print$2,$4}' | tr ' ' '\n' |less -S
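The commands above rely on the sequence landing in column 4 after `paste - - - -`. A minimal sketch on a made-up two-read FASTQ shows why. The read names and the header layout (three space-separated tokens, as is common for SRA downloads) are assumptions for illustration, not data from the airway files.

```shell
# Toy FASTQ with two hypothetical reads; the header layout is an assumption.
printf '%s\n' \
  '@toy.1 1 length=8' 'ACGTACGT' '+toy.1 1 length=8' 'IIIIIIII' \
  '@toy.2 2 length=8' 'TTTTGGGG' '+toy.2 2 length=8' 'JJJJJJJJ' > toy.fastq

# paste - - - - reads four lines at a time and joins them with tabs,
# so each read becomes a single line.
paste - - - - < toy.fastq > toy.txt

# awk splits on any whitespace by default: the header contributes fields 1-3,
# the sequence is field 4, the "+" line is fields 5-7, and the quality is field 8.
awk '{print $1, $4, $8}' toy.txt

# Splitting on tabs only keeps the four FASTQ lines as four fields,
# whatever the header contains; the sequence is then field 2.
awk -F'\t' '{print length($2)}' toy.txt
```

If the headers in a given file carry a different number of tokens, the whitespace-based column numbers shift, whereas the tab-based form stays at four fields per read.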
    

    Example run:

    (base) May5 10:51:27 ~
    $ workdir=$HOME/project/airway
    (base) May5 10:58:59 ~
    $ cd /trainee2/May5/project/airway/04.clean/
    (base) May5 11:00:34 ~/project/airway/04.clean
    $ ls
    clean_qc                                   SRR1039510.trim.log                        SRR1039512_1.fastq.gz
    filter.sh                                  SRR1039511_1.fastq.gz                      SRR1039512_1.fastq.gz_trimming_report.txt
    SRR1039510_1.fastq.gz                      SRR1039511_1.fastq.gz_trimming_report.txt  SRR1039512_1_val_1.fq.gz
    SRR1039510_1.fastq.gz_trimming_report.txt  SRR1039511_1_val_1.fq.gz                   SRR1039512_2.fastq.gz
    SRR1039510_1_val_1.fq.gz                   SRR1039511_2.fastq.gz                      SRR1039512_2.fastq.gz_trimming_report.txt
    SRR1039510_2.fastq.gz                      SRR1039511_2.fastq.gz_trimming_report.txt  SRR1039512_2_val_2.fq.gz
    SRR1039510_2.fastq.gz_trimming_report.txt  SRR1039511_2_val_2.fq.gz                   SRR1039512.trim.log
    SRR1039510_2_val_2.fq.gz                   SRR1039511.trim.log
    (base) May5 11:01:01 ~/project/airway/04.clean
    $ zcat SRR1039510_1.fastq.gz |paste - - - - >raw.txt  # join each 4-line record of the raw fastq data into one line
    (base) May5 11:05:58 ~/project/airway/04.clean
    $ wc -l raw.txt
    25000 raw.txt   # number of raw reads
    (base) May5 11:06:11 ~/project/airway/04.clean
    $ zcat SRR1039510_1_val_1.fq.gz |paste - - - - >trim.txt   # join each 4-line record of the trim_galore-trimmed fq (fastq) data into one line
    (base) May5 11:11:30 ~/project/airway/04.clean
    $ wc -l trim.txt 
    24448 trim.txt  # number of reads remaining after trim_galore
    (base) May5 11:11:41 ~/project/airway/04.clean
    $ less -S trim.txt 
    (base) May5 11:19:34 ~/project/airway/04.clean
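    $ awk '(length($4)<63){print $1}' trim.txt >ID   # for every line of trim.txt whose column-4 sequence is shorter than 63 bases, print column 1 (the read name starting with '@', e.g. @SRR1039510.8) and redirect it to the file ID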
    (base) May5 11:52:08 ~/project/airway/04.clean
    $ wc -l ID   # line count of ID, i.e. how many of the surviving reads were shortened by trim_galore
    1282 ID
    (base) May5 11:28:38 ~/project/airway/04.clean
    $ less -N ID
    (base) May5 11:33:11 ~/project/airway/04.clean
    $ head -n 100 ID >ID100   # take the first 100 read names from ID and redirect them to ID100
    (base) May5 11:34:41 ~/project/airway/04.clean
    $ wc -l ID100
    100 ID100
    (base) May5 11:34:49 ~/project/airway/04.clean
    $ grep -w -f ID100 trim.txt |  awk '{print $1,$4}' >trim.sm  # match the names in ID100 against trim.txt, print columns 1 and 4 of the matching lines, and redirect to trim.sm
    (base) May5 11:37:23 ~/project/airway/04.clean
    $ grep -w -f ID100 raw.txt |  awk '{print $1,$4,$8}' >raw.sm  # match the names in ID100 against raw.txt, print columns 1, 4, and 8 (read name, sequence, quality string) of the matching lines, and redirect to raw.sm
    (base) May5 11:39:13 ~/project/airway/04.clean
    $ head -n 5 *.sm   # print the first 5 lines of raw.sm and trim.sm
    ==> raw.sm <==
    @SRR1039510.8 CTCATTTTCATCTTCACCATCAACAGAGAGAGCAGCATACTTGCTTGCAGAACTGAACTTAGA HIIIJJJJIIIIIJIJIGIIJJJJIJHIIIIIIGIGJJIIIIJJJJJIJJJJJJIGGIGJJIJ
    @SRR1039510.60 AACCTTGGATTTAGCGGCTGAGTACTTCCTCTTGTACATGGCCTTTCTGGAATACATGGCAGA HJJJJJJJHJJJIJIJJJJJIJBFHIIIJIJJJGFGIJIIJJHHJJIJJJJGIJHHHHHHFFF
    @SRR1039510.108 GAATTAGCAACTGTGAAACGTCCTCAGGAGAGAAGCTACATGCTGCAGAGGTGGCAAGAAGAT HJJJJJJJJJJJJHIIIJIJHIJJJJJJIJJJJJJJJJJJJJJJJJJJJJJCHGIJJHHHHFF
    @SRR1039510.154 TGGTCAGATAGCCCTTGTCTCCCGCCGCCAATCTCTGGCCCCTAGCAGCACGGAGCAGACGGC HHIABBHGIIJEIIIGGHIHGIGCGHG@DFBGGCCEC;CHHH2?EHFFB@BADBB########
    @SRR1039510.159 TGAAGTCACTTTTATAGAAGCTGTGTTAAATTATGGAAAGTACCTTGGGAGATAAGCTCAAGA HJJJJIIJJJJJJJJIJJJJJJJJIIJJJJJJJJJJJJJJGGIJJJJJJJJJJIJJJIJIIJJ
    
    ==> trim.sm <==
    @SRR1039510.8 CTCATTTTCATCTTCACCATCAACAGAGAGAGCAGCATACTTGCTTGCAGAACTGAACTT
    @SRR1039510.60 AACCTTGGATTTAGCGGCTGAGTACTTCCTCTTGTACATGGCCTTTCTGGAATACATGGC
    @SRR1039510.108 GAATTAGCAACTGTGAAACGTCCTCAGGAGAGAAGCTACATGCTGCAGAGGTGGCAAGA
    @SRR1039510.154 TGGTCAGATAGCCCTTGTCTCCCGCCGCCAATCTCTGGCCCCTAGCAGCACGGAG
    @SRR1039510.159 TGAAGTCACTTTTATAGAAGCTGTGTTAAATTATGGAAAGTACCTTGGGAGATAAGCTCA
    
    (base) May5 11:45:32 ~/project/airway/04.clean
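    $ paste raw.sm trim.sm | awk '{print$2,$3,$5}'  |tr ' ' '\n'  |less -N # paste raw.sm and trim.sm side by side (in that order), take columns 2, 3, and 5 (raw sequence, quality string, trimmed sequence), replace spaces with newlines (\n), and view the result with less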
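The grep and head approach inspects 100 reads by eye. As a complementary sketch, the same before/after comparison can be tabulated for every read with a two-file awk join. The file names toy_raw.txt, toy_trim.txt, and toy_diff.txt and the reads in them are made up for illustration; with the real data one would presumably pass raw.txt and trim.txt instead.

```shell
# Hypothetical one-line-per-read files in the same tab-joined layout as raw.txt and trim.txt.
printf '@toy.1 1 length=8\tACGTACGT\t+\tIIIIIIII\n@toy.2 2 length=8\tTTTTGGGG\t+\tJJJJJJJJ\n@toy.3 3 length=8\tCCCCAAAA\t+\tHHHHHHHH\n' > toy_raw.txt
printf '@toy.1 1 length=8\tACGTA\t+\tIIIII\n@toy.2 2 length=8\tTTTTGGGG\t+\tJJJJJJJJ\n' > toy_trim.txt

# Every line: take the read name as the first token of the header field.
# First file (NR == FNR): remember the raw sequence length for each read name.
# Second file: for reads that got shorter, print name, raw length,
# trimmed length, and the number of bases removed.
awk -F'\t' '
  { split($1, h, " "); id = h[1] }
  NR == FNR { rawlen[id] = length($2); next }
  (id in rawlen) && length($2) < rawlen[id] {
      print id, rawlen[id], length($2), rawlen[id] - length($2)
  }
' toy_raw.txt toy_trim.txt > toy_diff.txt
cat toy_diff.txt
```

In this toy input only @toy.1 is reported: @toy.2 is unchanged and @toy.3 is absent from the trimmed file. Matching on exact read names also avoids treating the `.` in the names as a regular-expression wildcard, which `grep -f` does unless `-F` is given.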
    
    
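The line counts reported in the run also allow a quick tally: reads that disappear between raw.txt and trim.txt are no longer present after trimming, while the entries in ID are reads that survived but got shorter. A minimal sketch of the arithmetic, using only the three counts shown in the run (the variable names are arbitrary):

```shell
# Counts copied from the wc -l output in the run.
raw=25000        # lines in raw.txt
trimmed=24448    # lines in trim.txt
shortened=1282   # lines in ID (surviving reads shorter than 63 bases)

# Reads present in the raw file but absent after trimming.
dropped=$((raw - trimmed))

# awk handles the floating-point percentages; shell arithmetic is integer-only.
summary=$(awk -v r="$raw" -v t="$trimmed" -v s="$shortened" -v d="$dropped" \
  'BEGIN { printf "dropped=%d (%.2f%% of raw), shortened=%d (%.2f%% of surviving)", d, 100*d/r, s, 100*s/t }')
echo "$summary"
```

With the counts from this run, 552 reads are gone after trimming (about 2.2% of the raw reads) and 1282 were shortened (about 5.2% of those remaining).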
