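Column meanings of the refseq file used by ANNOVAR annotation (BED-style coordinates: left-closed, right-open, 0-based). I keep forgetting the last few columns, so I am writing them down here:
1: bin, indexing field to speed up chromosome range queries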
2: name (NM ID)
3: chr
4: strand
5: transcription start
6: transcription end
7: translation(CDS) start
8: translation(CDS) end
9: number of exons
10: exon starts (comma-separated, one value per exon)
11: exon ends (comma-separated, one value per exon)
12: score
13: name2 (gene name)
14: cdsStartStat, enum('none','unk','incmpl','cmpl')
15: cdsEndStat, enum('none','unk','incmpl','cmpl')
16: exonFrames, exon frame {0, 1, 2}, or -1 if no frame for exon
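The column list above can be turned into a small parser. The following is a minimal sketch, assuming tab-separated fields and comma-separated exon lists with a trailing comma; the names `RefGeneRecord`, `parse_refgene_line` and `EXAMPLE_LINE` are made up for illustration, and the example line is invented, not a real transcript.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RefGeneRecord:
    """One row of the refGene table, fields in column order (hypothetical helper)."""
    bin_id: int
    name: str            # NM ID
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    score: int
    name2: str           # gene name
    cds_start_stat: str
    cds_end_stat: str
    exon_frames: List[int]


def _int_list(field: str) -> List[int]:
    # List fields look like "100,400,800," so empty pieces are skipped.
    return [int(x) for x in field.split(",") if x]


def parse_refgene_line(line: str) -> RefGeneRecord:
    f = line.rstrip("\n").split("\t")
    if len(f) != 16:
        raise ValueError(f"expected 16 columns, got {len(f)}")
    return RefGeneRecord(
        int(f[0]), f[1], f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        int(f[8]), _int_list(f[9]), _int_list(f[10]),
        int(f[11]), f[12], f[13], f[14], _int_list(f[15]),
    )


# Invented example line (not a real transcript), for illustration only.
EXAMPLE_LINE = "\t".join([
    "585", "NM_EXAMPLE", "chr1", "+", "100", "1000", "150", "902",
    "3", "100,400,800,", "200,500,1000,", "0", "GENE1",
    "cmpl", "cmpl", "0,2,0,",
])

record = parse_refgene_line(EXAMPLE_LINE)
print(record.name, record.name2, record.exon_starts, record.exon_frames)
```

Keeping the exon fields as integer lists makes it easy to pair each exon start with its end and frame by index.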
Meaning of 'none', 'unk', 'incmpl' and 'cmpl' in columns 14 and 15:
"none" - No CDS (non-coding)
"unk" - CDS is unknown (coding, but not known)
"incmpl" - CDS is not complete at this end
"cmpl" - CDS is complete at this end
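Meaning of the numbers in column 16, with an example (note that transcripts can lie on the plus or the minus strand; the example below ignores strand):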
exonFrames: this field describes how neighboring exons join together to form codons.
-1 means that the exon is entirely UTR. When an exon contains CDS sequence and
the nucleotides of one codon are split across two exons, the value for the
second exon is the number of nucleotides of that codon that come from the
preceding exon. Because a codon consists of 3 nucleotides, at most 2 of them
can be contributed across the exon boundary, so the maximum value is 2.
A value of 0 means that the CDS part of the exon starts on a codon boundary.
exonFrames example: the last nucleotide of exon1 joins the first two
nucleotides of exon2 to form one codon. exon2 therefore picks up one
nucleotide from exon1 to encode that amino acid, and the value for exon2 is 1.
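To make the rule concrete, here is a minimal sketch that recomputes the exonFrames values from the exon and CDS coordinates. It rests on two assumptions that the note above does not state: the CDS is complete and starts in frame 0, and for minus-strand transcripts the exons are walked from the last list entry to the first (the direction of transcription). The function name `compute_exon_frames` is made up for illustration, and the coordinates are invented.

```python
from typing import List


def compute_exon_frames(strand: str, cds_start: int, cds_end: int,
                        exon_starts: List[int], exon_ends: List[int]) -> List[int]:
    """Return one frame per exon, in the same (genomic) order as the input lists."""
    n = len(exon_starts)
    frames = [-1] * n                      # -1: exon has no CDS (entirely UTR)
    if cds_start >= cds_end:               # empty CDS interval: non-coding
        return frames
    # Walk exons in the direction of transcription.
    order = range(n) if strand == "+" else range(n - 1, -1, -1)
    cds_bases_so_far = 0
    for i in order:
        # Clip the exon to the CDS interval (0-based, half-open).
        start = max(exon_starts[i], cds_start)
        end = min(exon_ends[i], cds_end)
        if start >= end:
            continue                       # no CDS in this exon
        # Bases of an unfinished codon carried over from earlier exons.
        frames[i] = cds_bases_so_far % 3
        cds_bases_so_far += end - start
    return frames


# Invented coordinates: three exons, CDS from 150 to 902.
starts, ends = [100, 400, 800], [200, 500, 1000]
print(compute_exon_frames("+", 150, 902, starts, ends))  # [0, 2, 0]
print(compute_exon_frames("-", 150, 902, starts, ends))  # [1, 0, 0]
```

On the plus strand, exon1 holds 50 CDS bases (16 codons plus 2 bases), so exon2 gets the value 2. On the minus strand the same coordinates are read starting from the last exon, which is why the values differ.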
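tips: hg19_refGene.txt uses BED-style coordinates, meaning intervals are closed on the left and open on the right. Start values are therefore 1 less than the actual 1-based position on the reference genome, while end values equal the 1-based position of the last base.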
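The coordinate convention in the tip can be sketched as a pair of helpers. The function names `to_one_based` and `interval_length` are made up for illustration.

```python
from typing import Tuple


def to_one_based(start0: int, end0: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval to a 1-based, closed interval."""
    # Only the start shifts by one; the exclusive 0-based end is already
    # the 1-based position of the last base.
    return start0 + 1, end0


def interval_length(start0: int, end0: int) -> int:
    """Length of a 0-based, half-open interval."""
    return end0 - start0


print(to_one_based(100, 200))     # (101, 200)
print(interval_length(100, 200))  # 100
```

A half-open interval has the convenient property that its length is simply end minus start, with no +1 correction.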