美文网首页生物信息学与算法生物信息学生物信息杂谈
测序数据基本信息统计 | reads,coverage,dept

测序数据基本信息统计 | reads,coverage,dept

作者: fatlady | 来源:发表于2018-04-26 16:32 被阅读78次

    00 写在前面

    一般公司会提供clean data和后续的基础分析结果。
    因为可能需要自己来进行数据的预处理,记得拿回raw data,同时弄清楚公司处理数据时每步用到的软件版本及具体参数。拿到数据后,先进行简单的统计,主要是测序reads数目、比对上的比例、覆盖度和测序深度。现在一般全外要求测序深度在100X,实际深度一般会在100X以上。

    关于测序深度和覆盖度,这里按下图定义。


    depth and coverage

    01 fq文件中到底有多少条reads

    这里使用软件readfq来统计,成熟的小工具最大的好处是,因为功能单一,作者会着重优化计算速度,一般处理速度会比较快。具体有多快,参考作者介绍

    It could deal with ~4M reads (1G base) in less than 40 seconds, ~50M reads (14G base) in less than 5 minutes!!

    下载安装

    #download https://github.com/billzt/readfq
    gcc -lz -o kseq_fastq_base kseq_fastq_base.c
    kseq_fastq_base [input.fq]
    
    #FASTQ files could be gzipped.
    #The statistic results would be printed on STDOUT.
    

    计算

    #01_read.sh
    #! /bin/bash
    cd FQDIR/  #fq所在位置
    readfq="/../../biosoft/readfq-master/readfq-master/kseq_fastq_base"
    ls *.fq >fq.list
    for i in $(cat fq.list)
    do 
      printf $i "\t" >>fq.reads_num
      .$readfq $i >>fq.reads_num
    done
    

    结果文件如下,最后整理成表格即可。

    $ more fq.reads_num
    sample1_1.fq Num reads:3603393    Num Bases: 540508950
    sample1_2.clean.fq Num reads:3603393    Num Bases: 540508950
    sample2_1.clean.fq Num reads:13328100   Num Bases: 1999215000
    sample2_2.clean.fq Num reads:13328100   Num Bases: 1999215000
    

    02 有多少reads比对到参考基因组上了?

    这里从比对后得到的BAM文件开始,利用软件samtools中的flagstat 或者idxstats统计,参考samtools命令详解http://www.cnblogs.com/emanlee/p/4316581.html

    #02_mappedreads.sh
    #! /bin/bash
    cd bamDIR/  #bam文件所在位置
    ls *.bam >bam.list
    for i in $(cat bam.list)
    do 
      printf $i >>bam.mappedreads
      samtools flagstat $i | sed -n '5p'  >>bam.mappedreads
    done
    
    #02_reads.sh
    #! /bin/bash
    cd bamDIR/  #bam文件所在位置
    ls *.bam >bam.list
    for sample in $(cat bam.list)
    do
      export total_reads=$(samtools idxstats $bam_dir/$sample.final.bam |awk -F '\t' '{s+=$3}END{print s}')
      echo $sample number_of_reads $total_reads
    done
    
    #命令解析
    $ samtools flagstat sample1.bam
    3826122 + 0 in total (QC-passed reads + QC-failed reads) #总共的reads数
    0 + 0 secondary
    1658 + 0 supplementary
    343028 + 0 duplicates
    3824649 + 0 mapped (99.96% : N/A) #总体reads的匹配率
    3824464 + 0 paired in sequencing #总共的reads数
    1912442 + 0 read1 #reads1中的reads数
    1912022 + 0 read2 #reads2中的reads数
    3808606 + 0 properly paired (99.59% : N/A) #完美匹配的reads数:比对到同一条参考序列,并且两条reads之间的距离符合设置的阈值
    3821518 + 0 with itself and mate mapped #paired reads中两条都比对到参考序列上的reads数
    1473 + 0 singletons (0.04% : N/A)  #单独一条匹配到参考序列上的reads数,和上一个相加,则是总的匹配上的reads数。
    5882 + 0 with mate mapped to a different chr#paired reads中两条分别比对到两条不同的参考序列的reads数
    4273 + 0 with mate mapped to a different chr (mapQ>=5) #paired reads中两条分别比对到两条不同的参考序列的reads数
    

    03 计算coverage和depth

    这里从比对后得到的BAM文件开始,利用软件统计每个碱基被测序到的次数,再写脚本统计coverage和depth.这里介绍3种方法

    03-1 samtools的mpileup + custome perl :~4h per WES sample

    利用samtools的mpileup计算每个碱基的测序深度,由于遍历每个reads,所以非常慢,100X的全外样本耗时~4h。

    # 01_mpileup.sh
    #! /bin/bash
    cd bamDIR/  #bam文件所在位置
    ls *.bam >bam.list
    for sample in $(cat bam.list)
    do
        samtools mpileup -f reference.fa $bam_dir/$sample >mpileup/$sample.mpileup
        echo $sample is ready!
    done
    

    Perl脚本计算exon、50bp、100bp、150bp侧翼序列的覆盖度及depth;其中CCDS.20160908.exon.hg19.bed为targets.bed,捕获文件,重点是必须要有前3列,TAB分隔。(脚本来自生信技能树JIMMY的分享)

    # 02_exon_dep.pl
    #perl 02_exon_dep.sh CCDS.20160908.exon.hg19.bed  *.mplieup output
    open FH,"@ARGV[0]";
    #chr1    10085036        10097869        Cops5   0       -       10085036        10097869        255,0,0
    while(<FH>){
      chomp;
      @F=split;
      $start=$F[1];$end=$F[2];$chr=$F[0]; $chr=~s/chr//;
      foreach ($start..$end){$hash{"$chr:$_"}=1;}
      foreach ($start-150..$start-101){$exon150{"$chr:$_"}=1;}
      foreach ($start-100..$start-51){$exon100{"$chr:$_"}=1;}
      foreach ($start-50..$start-1){$exon50{"$chr:$_"}=1;}
      foreach ($end+1..$end+50){$exon50{"$chr:$_"}=1;}
      foreach ($end+51..$end+100){$exon100{"$chr:$_"}=1;}
      foreach ($end+101..$end+150){$exon150{"$chr:$_"}=1;}
    }
    close FH;
    
    $tmp=0;
    $tmp++ foreach keys %hash;
    $exon_length=$tmp;
    $tmp=0;
    $tmp++ foreach keys %exon50;
    $exon50_length=$tmp;
    $tmp=0;
    $tmp++ foreach keys %exon100;
    $exon100_length=$tmp;
    $tmp=0;
    $tmp++ foreach keys %exon150;
    $exon150_length=$tmp;
    
    open FH,$ARGV[1] or die "you need to give us a mpileup file!";
    #chrM    1       G       4       ^7.^,.^I.^],    GGGC
    open R,">>$ARGV[2]";
    while(<FH>){
      @F=split;
      $chr=$F[0];$pos=$F[1];$depth=$F[3];$chr=~s/chr//;
      if (exists $hash{"$chr:$pos"}){
          $c_sum++ if $depth >0;$d_sum+=$depth ;
            }
            else{
                    if (exists $exon50{"$chr:$pos"}){
                        $exon50_c_sum++ if $depth >0;$exon50_d_sum+=$depth ;
                    }
                    else{
                            if (exists $exon100{"$chr:$pos"}){
                               $exon100_c_sum++ if $depth >0;$exon100_d_sum+=$depth ;
                            }
                            else{
                                    if (exists $exon150{"$chr:$pos"}){
                                       $exon150_c_sum++ if $depth >0;$exon150_d_sum+=$depth ;
                                    }
                                    else{
                                        $other_c_sum++ if $depth >0;$other_d_sum+=$depth ;
                                    }
                            }
                    }
            }
    }
    close FH;
    if($exon_length>0){$coverage=$c_sum/$exon_length;};
    if($c_sum>0){$avg_depth=$d_sum/$c_sum;}
    
    if($exon50_length>0){$exon50_coverage=$exon50_c_sum/$exon50_length;}
    if($exon50_c_sum>0){$exon50_avg_depth=$exon50_d_sum/$exon50_c_sum;}
    
    if($exon100_length>0){$exon100_coverage=$exon100_c_sum/$exon100_length;}
    if($exon100_c_sum>0){$exon100_avg_depth=$exon100_d_sum/$exon100_c_sum;}
    if($exon150_length>0){$exon150_coverage=$exon150_c_sum/$exon150_length;}
    if($exon150_c_sum>0){$exon150_avg_depth=$exon150_d_sum/$exon150_c_sum;}
    #$other_coverage=$other_c_sum/(3000000000-90362533);
    #$other_avg_depth=$other_d_sum/$other_c_sum;
    
    print R $ARGV[0],"\t","exon","\t","$c_sum\t$d_sum\t$coverage\t$avg_depth\n";
    print R $ARGV[0],"\t","50bp","\t","$exon50_c_sum\t$exon50_d_sum\t$exon50_coverage\t$exon50_avg_depth\n";
    print R $ARGV[0],"\t","100bp","\t","$exon100_c_sum\t$exon100_d_sum\t$exon100_coverage\t$exon100_avg_depth\n";
    print R $ARGV[0],"\t","150bp","\t","$exon150_c_sum\t$exon150_d_sum\t$exon150_coverage\t$exon150_avg_depth\n";
    #print "$other_c_sum\t$other_d_sum\t$other_coverage\t$other_avg_depth\n";
    

    03-2 bedtools的genomecov + custome perl: 比samtools快

    #! /bin/bash
    cd bamDIR/  #bam文件所在位置
    ls *.bam >bam.list
    for sample in $(cat bam.list)
    do
      bedtools genomecov -ibam $bam_dir/$sample -d >$sample.dep_base
      echo depth_base_is_calculated_for $sample
    done
    

    统计脚本同03-2中02_exon_dep.pl。

    03-3 bam2bedGraph + bedtools: 最快

    bam2bedGraph 统计每个碱基的depth,并利用bedtools求overlapped region。

    # 01_bam2bedGraph_bedtools.sh
    #! /bin/bash
    cd bamDIR/  #bam文件所在位置
    ls *.bam >bam.list
    for sample in $(cat bam.list)
    do
      $bam2bed -o $sample.bam2bed $bam_dir/$sample  
      #chr start end depth #zero-based start and a one-based end
      
      cat $sample.bam2bed.bedGraph | awk '{print $1"\t"$2+1"\t"$3"\t"$4}' >$sample.bedGraph
      bedtools intersect -a $sample.bedGraph -b probes.bed >$sample.intersect
      #chr overlapped_start overlapped_end depth 
      echo $sample ok
    done
    
    
    #02_cov_depth.sh:实际测到的区域长度及覆盖度
    #! /bin/bash
    #cal the length and depth of overlapped regions
    for sample in $(cat bam.list)
    do
      perl -alne '{$len=$F[2]-$F[1]+1;$tmp+=$len}END{print $tmp}' $sample.intersect >>intersect.length
      perl -alne '{$len=$F[3]*($F2]-$F[1]+1);$tmp+=$len}END{print $tmp}' $sample.intersect >>intersect.depth
    done
    paste bam.list intersect.length intersect.depth >intersect.cov_depth
    #intersect.cov_depth: sampleID intersect.length intersect.depth
    
    #bed_length:理论应该测到的区域(捕获区域)长度
    #probes.bed  
    #chr start end name
    perl -alne '{$len=$F[2]-$F[1]+1;$tmp+=$len}END{print $tmp}' probes.bed
    
    
    #coverage=intersect.length/bed_length
    #depth=intersect.depth/intersect.length
    

    04 写在最后

    基础的统计可以用来锻炼自己的编程能力,着急的话可以去GitHub上搜索,有许多相似的小工具,而且一般都很注重速度的优化,可以快速得到结果;而且关于原理部分不太理解的话,直接联系作者也能得到蛮及时的回复。(个人体验,仅供参考)

    相关文章

      网友评论

        本文标题:测序数据基本信息统计 | reads,coverage,dept

        本文链接:https://www.haomeiwen.com/subject/wdlmlftx.html