美文网首页生信基础知识
Fastq 格式说明 & (Phred33 or Phred64

Fastq 格式说明 & (Phred33 or Phred64

作者: 陈光辉_山东花生 | 来源:发表于2020-02-19 10:49 被阅读0次

    Fastq格式是一种基于文本的存储生物序列和对应碱基(或氨基酸)质量的文件格式。最初由桑格研究所(Wellcome Trust Sanger Institute)开发出来,现已成为存储高通量测序数据的事实标准。以Illumina Casava 1.8+ 的fastq格式为例,fastq格式的形式如下:

    image

    每条序列由4行字符表示,上述样例显示有两条序列:

    第一行:必须以“@”开头,后面跟着唯一的序列ID标识符,然后跟着可选的序列描述内容,标识符与描述内容用空格分开。

    第二行:序列字符(核酸为[AGCTN]+,蛋白为氨基酸字符)。

    第三行:必须以“+”开头,后面跟着可选的ID标识符和可选的描述内容,如果“+”后面有内容,该内容必须与第一行“@”后的内容相同。

    第四行:碱基质量字符,每个字符对应第二行相应位置碱基或氨基酸的质量,该字符可以按一定规则转换为碱基质量得分,碱基质量得分可以反映该碱基的错误率。这行的字符数与第二行中的字符数必须相同。字符与错误率的具体关系见下文介绍。

    在满足上述要求的前提下,不同的测序仪厂商或数据存储商对第一行和第四行的定义有些差别。

    第一行,即标识行在Illumina和NCBI SRA中的样式如下:

    Illumina casava 1.8+(详细的解释可参考wiki):

    @HWI-ST1276:97:D1DCYACXX:7:1101:1406:2170 1:N:0:CGACGT

    NCBI SRA:

    @SRR387514.1 ILLUMINA-C4D679_0049_FC:1:12:3317:1141 length=40

    对于第四行的编码,最初由Phred程序的开发者定义,一般称为Phred qualitiy. 在Illumina早起版本(v1.3,v1.4)中,因为对quality的定义与Phred的不同,这行应该称为 Solexa quality。但从Illumina v1.5以后,也开始采用Phred的定义。

    碱基质量得分是怎么来的?

    Phred最初是一个从测序仪中产生的荧光记录数据中识别碱基的程序。在早起的荧光染料测序中,每次发生碱基合成时会释放出荧光信号,该信号被CCD图像传感器捕获。记录下荧光信号的峰值,生成一个实时的轨迹数据(chromatogram)。因为不同的碱基用不用的颜色标记,检测这些峰值即可判断出对应的碱基。但由于这些信号的波峰、密度、形状和位置等是不连续或模糊的,有时很难根据波峰判断出正确的碱基。

    image

    图1 chromatogram样图

    Phred计算许多与波峰大小和分辨率相关的参数,根据这些参数,从一个巨大的查询表中找出碱基质量得分。这个查询表是根据对已知序列的测序数据分析得到的(应该是分析得到波峰参数与碱基错误率的关系,再通过公式把错误率转换成质量得分,得到波峰参数与质量得分的直接对应表)。不同的测序试剂和机器用不同的查询表。为了节约磁盘空间,质量得分(可能占用两个字符)按一定规则(Phred+33或Phred+64)被转换为单个字符表示。

    碱基错误率与质量得分的关系有如下两种:

    Qphred = -10log10p

    Qillumina-prior to v.1.4 = -10log10(p/(1-p))

    image

    图 2 质量得分Q和错误率p的关系,红色的为phred,黑色的为Illumina早期版本,虚线表明p=0.05,对应的质量得分为Q≈13

    在不同版本的编码中,除了质量得分与错误率有所差别外,在字符与得分的转换上也有差别。

    image

    图3 不同版本质量得分与质量字符ASCII值的关系

    质量字符的ASCII值和质量得分的关系有如下两种:

    Phred+64 质量字符的ASCII值 - 64

    Phred+33: 质量字符的ASCII值 - 33

    可以粗略分为 Phred+33和Phred+64,这里的33和64就是指ASCII值转换为得分该减去的数值。

    在处理测序数据时,因为一些软件会根据碱基质量得分的不同做不同的处理,常要指定正确的编码方式,有必要对质量字符与质量得分的关系(Phred+33或Phred+64)作出正确的判断。当然,如果处理的是最近两年产生的测序数据,基本上都是Phred+33的,但从NCBI SRA数据库下载的旧数据就不一定了。

    根据图3中Phred+33与Phred+64所使用的质量字符范围的不同,可以对fastq文件中质量得分的编码方式做出判断。图3中显示,ASCII值小于等于58(相应的质量得分小于等于25)对应的字符只有在Phred+33的编码中被使用,所有Phred+64所使用的字符的ASCII值都大于等于59。在通常情况下,ASCII值大于等于74的字符只出现在Phred+64中。利用这些信息即可在程序中进行判断。

    文章末尾是一个对Phred+33或Phred+64做区分的perl脚本。

    该脚本的判断思想如下:

    默认读取1000条序列,在这1000条序列中:

    1. 如果有2个以上的质量字符ASCII值小于等于58(即有两个碱基的得分小于等于25),同时没有任何质量字符的ASCII值大于等于75,即判断是Phred+33。

    2. 如果有2个以上的质量字符ASCII值大于等于75(即有两个碱基的得分大于等于10),同时没有任何质量字符的ASCII值小于等于58,即判断是Phred+64。

    3. 如果所有质量字符的ASCII值介于59到74之间,即判断可能是Phred+33,但建议使用更多的序列做进一步测试(出现这种结果可能有两种情况:1, Phred+33编码,所有碱基质量得分介于26到42之间;2,Phred+64编码,所有碱基质量得分介于-5到10;是前者的可能性大)。

    4. 如果出现上述3种以外的情况,建议打印出质量字符的ASCII值人工判断。

    理解错误的地方欢迎指正。

    附录检测格式的perl:

    
    #!/usr/bin/perl -w
    use strict;
    use Getopt::Long;
    #
    # fastq_phred.pl - Script for judge the fastq's encoding, whether it is phred33 or phred64.
    #
    # Version: 0.3 ( May 19, 2014)
    # Author: Wencai Jie (jiewencai<@>qq.com), NJAU, China.
    #
    # Permission is granted to anyone to use this software for any purpose, without
    # any express or implied warranty. In no event will the authors be held liable 
    # for any damages arising from the use of this software.
    #
    
    #Get options.
    my ($help, $print_score, $detail, $print_ascii, $reads_num, $reads_start_arg, $reads_end_arg);
    my $reads_end_turn;
    GetOptions(
        'help|h!' => \$help,
        'score|s!' => \$print_score,
        'detail|d!' => \$detail,
        'ascii|a!' => \$print_ascii,
        'reads_num|n=i' => \$reads_num,
        'reads_start|b=i' => \$reads_start_arg,
        'reads_end|e=i' => \$reads_end_arg,
    );
    
    my $usage = "
    fastq_phred.pl:
    This program can print fastq file's reads quality scores, ASCII value, and help to judge it's 
    encoding by the ASCII value range, whether it is phred33 or phred64.
    
    Usage:
        perl fastq_phred.pl [options] <file1.fq [file2.fq ...]>
    Options:
        -h|--help         print this help message.
        -s|--score        print scores.                                 [default: Do not print scores] 
        -d|--detail       print detail scores or ASCII value when       [default: Do not print scores in detail] 
                              --score or --ascii set.
        -a|--ascii        print quality character's ASCII value. if     [default: Do not print ASCII vaule] 
                              this option set, the --score will disabled.
        -n|--reads_num    reads number used to test phred encoding      [default: 1000]
                              and print scores. It's advised to use more
                              than 100 reads to do the test.               
        -b|--reads_start  reads start position used to test phred       [default: 1]
                              encoding and print scores.
        -e|--reads_end    reads end position used to test phred         [default: the length of the read]
                              encoding and print scores.
    
    ";
        
    if ($#ARGV < 0 or $help){ 
        print "$usage";
        exit;
    }
    
    #Check parameters.
    unless ($reads_num){
        $reads_num = 1000;
    }
    if ($reads_start_arg && $reads_start_arg  < 0){
        print STDERR "ERROR:The reads start position should great than 0.\n\n";
        exit;
    }
    if ($reads_end_arg && $reads_end_arg  < 0){
        print STDERR "ERROR:The reads end position should great than 0.\n\n";
        exit;
    }
    if ($reads_start_arg && $reads_end_arg && $reads_end_arg < $reads_start_arg){
        print STDERR "ERROR:The reads start position should great than end position.\n\n";
        exit;
    }
    
    &main;
    
    sub main{
        my $filename = '';
        while ($filename = shift @ARGV){
        my @FQ = ();
        my @all_ascii = ();
        my ($file_end, $phred_result) = ('','');
        my ($Q, $count, $lt_58, $gt_75) = (0, 0, 0, 0);
        open FQ,"<$filename" or die "Can not open $filename:$!\n";
        #Read sequences.
        while($count < $reads_num){
            $count++;
            @FQ=(); 
            #read four lines from fastq file.
            for(my $i=0; $i<=3; $i++){
                if (eof(FQ)){
                    $file_end = 'yes';
                    last;
                }
                $FQ[$i]=<FQ>;
                if ($FQ[0] !~ m/^@/){
                    my $line = $count*4-3;
                    print STDERR "ERROR:\n$filename: It's not a correct fastq format.\nline '$line': $FQ[0]\n";
                    exit;
                }
            }
            if ( $file_end eq 'yes'){
                next;
            }
            my @ascii_ref = &cal_ascii($FQ[3], $reads_start_arg, $reads_end_arg);
            push @all_ascii, [@ascii_ref];
        }
    
        #print ASCII.
        if ($print_ascii){
            print "\n","."x50," ASCII Value: $filename ","."x50,"\n";
            &print_array_of_array(\@all_ascii, 0, $detail);
            next;
        }
    
        #Stastic ASCII value range.
        foreach my $ascii_ref (@all_ascii){
            $lt_58 += (grep { $_ <= 58} @{$ascii_ref});
            $gt_75 += (grep { $_ >= 75} @{$ascii_ref});
        }
    
        #Guess the Phred with ASCII value range. 
        if ($lt_58 > 1 && $gt_75 == 0 ){
            $Q = 33;
            $phred_result = "$filename: The encoding should be Phred33.\nThe quality score character number that ASCII value less than 58 : $lt_58\nThe quality score character number that ASCII value great than 75: $gt_75";
        }elsif($lt_58 == 0 && $gt_75 > 1){
            $Q = 64;
            $phred_result = "$filename: The encoding should be Phred64.\nThe quality score character number that ASCII value less than 58 : $lt_58\nThe quality score character number that ASCII value great than 75: $gt_75";
        }elsif($lt_58 == 0 && $gt_75 == 0){
            print STDERR "$filename: The encoding should be Phred33 that all of the nucleotide quality score great than 25 and less than 41, but it's advised to send more reads to be tested with '-n <int>' options.\n";
            exit;
        }else{
            print STDERR "$filename\nWarning: Abnormal endoding, Please test again with more reads or make a judgement by yourself with ASCII value by '-ascii' options.\n"; 
            exit;
        }
    
        #print score.
        if ($print_score){
            print "\n","."x50," Quality Score: $filename ","."x50,"\n";
            &print_array_of_array(\@all_ascii, $Q, $detail);
        }
    
        #Print the phred encoding result.
        print STDERR "$phred_result\n\n";
        }
    }
    
    #Print Score or ASCII value.
    sub print_array_of_array{
        my ($array_of_array_ref, $Q, $detail) = @_;
        my ($average_value, $total_value, $value_num) = (0, 0, 0);
        my @array_of_array = @{$array_of_array_ref};
        my %value_h;
        foreach my $array_ref (@array_of_array){
            for (my $i=0;$i<=$#{$array_ref};$i++){
                my $out_value  = ${$array_ref}[$i] - $Q;
                $value_h{$out_value}++;
                $total_value += $out_value;
                $value_num ++;
                print "$out_value " if ($detail);
            }
            print "\n" if ($detail);
        }
        unless ($detail){
            foreach my $out_value (sort {$a <=> $b} keys %value_h){
                print "$out_value\t$value_h{$out_value}\n";
            }
        }
        $average_value = (int ($total_value/$value_num)*100) /100;
        print "Average: $average_value\n";
    }
    
    #Calculate phred score.
    sub cal_ascii{
        my ($read,$reads_start, $reads_end) = @_;
        my @all_ascii = ();
        my $ascii = 0;
        #The $read string's end is a "\n";
        my $reads_len = length($read) - 1;
        #The $reads_end should be less than the read length.
        if( $reads_end_arg && $reads_end_arg <= $reads_len){
            $reads_end = $reads_end_arg;
        }else{
            $reads_end = $reads_len;
        }
        #If the the reads start position set, the $reads_start equal to it,
        #else the reads start position set to 1.
        if ( $reads_start_arg && $reads_start_arg <= $reads_end){
            $reads_start = $reads_start_arg;
        }else{
            $reads_start = 1;
        }
        #Convert 1 base coordinate system to 0 base coordinate system.
        for(my $j=$reads_start-1; $j<=$reads_end-1; $j++){
            $ascii = ord(substr($read,$j,1));
            push @all_ascii, $ascii;
        }
        return @all_ascii;
    }
    
    
    
    ```
    
    
    
    
    参考资料:
    
    1. [https://en.wikipedia.org/wiki/FASTQ_format](https://en.wikipedia.org/wiki/FASTQ_format)
    
    2. [https://en.wikipedia.org/wiki/Phred_quality_score](https://en.wikipedia.org/wiki/Phred_quality_score)
    
    3. [https://en.wikipedia.org/wiki/Phred_base_calling](https://en.wikipedia.org/wiki/Phred_base_calling)
    
    4. [http://maq.sourceforge.net/fastq.shtml](http://maq.sourceforge.net/fastq.shtml)
    
    5. [http://maq.sourceforge.net/qual.shtml](http://maq.sourceforge.net/qual.shtml)
    
    6. [http://supportres.illumina.com/documents/myillumina/a557afc4-bf0e-4dad-9e59-9c740dd1e751/casava_userguide_15011196d.pdf](http://supportres.illumina.com/documents/myillumina/a557afc4-bf0e-4dad-9e59-9c740dd1e751/casava_userguide_15011196d.pdf)
    

    相关文章

      网友评论

        本文标题:Fastq 格式说明 & (Phred33 or Phred64

        本文链接:https://www.haomeiwen.com/subject/qonofhtx.html