Introduction to FASTQ files

Author: 线断木偶人 | Published 2019-04-19 17:59
    File name
    NT064_S43_L001_R1_001.fastq.gz
    
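    Part 1: the sample name (NT064 here).
    Part 2: the sample number, matching the numbering in Illumina Experiment Manager. In S1 .. S*, the number after S is the sample's position in the Sample Sheet, counting from 1. Reads that cannot be assigned to a specific sample go to S0 (Undetermined_S0).
    Part 3: the lane number.
    Part 4: R1 means read 1 and R2 means read 2. R1 and R2 are the two halves of paired-end reads.
    Part 5: usually 001.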
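The five parts above can be pulled apart with a regular expression. This is a minimal sketch, assuming the file name follows exactly the pattern described here; the function name `parse_fastq_name` and the dictionary keys are made up for illustration and are not part of any Illumina tool.

```python
import re

# Assumed pattern: <sample>_S<number>_L<lane>_R<1|2>_<suffix>.fastq[.gz]
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d+)"
    r"_(?P<read>R[12])_(?P<suffix>\d+)\.fastq(?:\.gz)?$"
)


def parse_fastq_name(filename):
    """Split a FASTQ file name into the five parts described above."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected FASTQ file name: {filename!r}")
    parts = match.groupdict()
    # Sample number and lane are numeric; the last part is kept as text ("001").
    parts["sample_number"] = int(parts["sample_number"])
    parts["lane"] = int(parts["lane"])
    return parts


if __name__ == "__main__":
    print(parse_fastq_name("NT064_S43_L001_R1_001.fastq.gz"))
```

A name such as `Undetermined_S0_L001_R1_001.fastq.gz` would come back with `sample_number` equal to 0, which is how unassigned reads can be recognised.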
    
    

    Viewing the FASTQ file format

    [xmxjy@xmxjy filter]$ less allhpv.fastq.gz | head
    @TPNB500301:48:HHTKNAFXY:1:11101:11715:1039 1:N:0:TCCGGAGA+NGGATAGG
    TGACGNTCTCAATATATGTGTGCTTTTTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGTGTAATGTTTTAATATGTGTAAACATATTGACCAAATCAGGGT
    +
    AA6AA#EEEEE6EEEEEEEEEEEEE6EEEAEEEAAEEEA/E6/EE/EE6EAAEAEEE6EAEEE6EEEEE/EEEEEAEE/EEEEEEAAEAEEEAEE<EEEE/EEAAE/EEEEEEE/EEEEAEEEEEEEEE/AEAEEAEE<<A/EE<<E//EE
    @TPNB500301:48:HHTKNAFXY:1:11101:14066:1039 1:N:0:TCCGGAGA+NGGATAGG
    AGACTNTCGTAATATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGTGTAATGTTTTAATATGTGTACACATATTGACCAAATCAGGGT
    +
    6AAAA#EEEEEEEEEEEEAAEE/E/EEEEEAEE6EAA/EAEEEE</EEEA/EEEEEEEEEE6EEEE/E/EAEEAE/EAAEE6EEEEEEEAEEAEEEEEEEEEAEEAEAEEE/EEAEEEE/EEA/EEEE/EE<EAE/E/<E<E/EA<EAAEE
    

    What this output means:
    Each entry in a FASTQ file consists of four lines:
    • Sequence identifier
    • Sequence
    • Quality score identifier line (consisting of a +)
    • Quality score
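Because every entry is exactly four lines, a file can be read record by record. The following is a rough sketch, assuming each record really occupies four lines (no wrapped sequences); `read_fastq` and `open_fastq` are hypothetical helper names, and the two demo records are toy data, not taken from the file above.

```python
import gzip
import io
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, using gzip when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples, four lines at a time."""
    while True:
        lines = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not any(lines):
            return  # end of file (or only blank lines left)
        if len(lines) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, plus, quality = lines
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


if __name__ == "__main__":
    demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nAA6#\n")
    for name, sequence, quality in read_fastq(demo):
        print(name, sequence, quality)
```

For a real file, `read_fastq(open_fastq("allhpv.fastq.gz"))` would iterate over its records in the same way.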

    @TPNB500301:48:HHTKNAFXY:1:11101:11715:1039 1:N:0:TCCGGAGA+NGGATAGG
    
    The fields are separated by ':' (a single space separates <y-pos> from <read>):
    @<instrument>
    <run number>
    <flowcell ID>
    <lane>
    <tile>
    <x-pos>
    <y-pos>
    <read>
    <is filtered>
    <control number>
    <index sequence>
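The field list above can be applied to the example identifier line in code. This is only a sketch, assuming the header has exactly these eleven fields in this order; `IlluminaHeader` and `parse_header` are illustrative names, not part of any library.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: str
    control_number: int
    index_sequence: str


def parse_header(line):
    """Split a sequence identifier line into the fields listed above."""
    if line.startswith("@"):
        line = line[1:]
    # The space separates the seven position fields from the four read fields.
    left, right = line.strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, is_filtered, control, index = right.split(":")
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), is_filtered, int(control), index,
    )


if __name__ == "__main__":
    header = "@TPNB500301:48:HHTKNAFXY:1:11101:11715:1039 1:N:0:TCCGGAGA+NGGATAGG"
    print(parse_header(header))
```

For the example line, the instrument is TPNB500301, the run number is 48, the flowcell ID is HHTKNAFXY, and the index sequence is TCCGGAGA+NGGATAGG.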
    


    Quality score

    The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):

    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
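Each quality character stands for a number given by its position in this ASCII sequence. A small sketch of the decoding, assuming the common Phred+33 (Sanger) encoding in which '!' (ASCII 33) means quality 0; the offset and the name `decode_quality` are assumptions, and older Illumina files can use a different offset.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding, so '!' decodes to 0


def decode_quality(quality_string, offset=PHRED_OFFSET):
    """Convert a quality line into a list of integer quality scores."""
    return [ord(char) - offset for char in quality_string]


if __name__ == "__main__":
    # First characters of the quality line of the first read shown earlier.
    print(decode_quality("AA6AA#EEEEE"))
    # → [32, 32, 21, 32, 32, 2, 36, 36, 36, 36, 36]
```

Under this assumption the '#' in the example reads decodes to quality 2, which lines up with the N base call at the same position in the sequence.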
    

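    Q value

    The quality value Q is an integer mapping of p, the probability that the corresponding base call is incorrect. Two different formulas have been in use. The first is the standard Sanger variant for assessing the reliability of a base call, also known as the Phred quality score: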

    Q_{\text{sanger}} = -10\log_{10} p

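    The Solexa pipeline (the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, which encodes the odds p/(1-p) instead of the probability p:
    Q_{\text{solexa}} = -10\log_{10}\frac{p}{1-p}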
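To see how the two mappings differ, both formulas can be evaluated for a few error probabilities. This is an illustrative sketch only; the function names are made up, and the values are left unrounded even though stored quality scores are integers.

```python
import math


def phred_quality(p):
    """Sanger / Phred mapping: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Older Solexa mapping: Q = -10 * log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01, 0.001):
        print(f"p={p}: sanger={phred_quality(p):.2f}, solexa={solexa_quality(p):.2f}")
```

The Solexa value is always the lower of the two, because the extra term 10 * log10(1 - p) is negative. The gap is largest for poor base calls (at p = 0.5 the Solexa value is 0) and shrinks towards zero as p gets small.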

    Sequencing quality values and accuracy

    Phred quality score    Probability of incorrect base call    Base call accuracy
    10                     1 in 10                               90%
    20                     1 in 100                              99%
    30                     1 in 1000                             99.9%
    40                     1 in 10000                            99.99%
    50                     1 in 100000                           99.999%
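The rows of this table follow from inverting the Phred formula, p = 10^(-Q/10). A short sketch that recomputes them, with hypothetical function names:

```python
def error_probability(q):
    """Invert Q = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-q / 10)


def base_call_accuracy(q):
    """Probability that the base call is correct, 1 - p."""
    return 1 - error_probability(q)


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = error_probability(q)
        print(f"Q{q}: 1 in {round(1 / p)} incorrect, "
              f"accuracy {100 * base_call_accuracy(q):.3f}%")
```

Every 10 points of Q reduce the error probability by a factor of ten.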

    Wikipedia
    https://en.wikipedia.org/wiki/FASTQ_format#File_extension
    Chinese translation
    http://www.cnblogs.com/yahengwang/p/8973948.html
    Shell operations
    https://www.jianshu.com/p/bc1fe435879c
