Meaning of each column in the hg19_refseq.txt file

Author: 红外线biubiu | Published: 2021-11-09 09:16

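These are the columns of the refseq file used by ANNOVAR for annotation (BED-style coordinates: 0-based, closed on the left and open on the right). I keep forgetting the last few columns, so I am noting them here for reference: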

1: bin, indexing field to speed up chromosome range queries

2: name (NM ID)

3: chr

4: strand

5: transcription start

6: transcription end

7: translation(CDS) start

8: translation(CDS) end

9: number of exons

10: start position of every exon (comma-separated list)

11: end position of every exon (comma-separated list)

12: score

13: name2 (gene name)

14: cdsStartStat, enum('none','unk','incmpl','cmpl')

15: cdsEndStat, enum('none','unk','incmpl','cmpl')

16: exonFrames, exon frame {0, 1, 2}, or -1 if no frame for exon
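The column layout above can be sketched as a small parser. This is a minimal sketch, not ANNOVAR's own code: the names `RefGeneRecord` and `parse_refgene_line` are hypothetical, the example record is made up, and it assumes the file is tab-separated and that the list columns (10, 11, 16) are comma-separated, possibly with a trailing comma.

```python
from typing import List, NamedTuple


class RefGeneRecord(NamedTuple):
    """One line of the file; field order follows columns 1-16 above."""
    bin: int
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    score: int
    name2: str
    cds_start_stat: str
    cds_end_stat: str
    exon_frames: List[int]


def _parse_int_list(text: str) -> List[int]:
    # Assumption: lists look like "100,200," so empty items are skipped.
    return [int(item) for item in text.split(",") if item]


def parse_refgene_line(line: str) -> RefGeneRecord:
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 16:
        raise ValueError(f"expected 16 tab-separated columns, got {len(cols)}")
    return RefGeneRecord(
        bin=int(cols[0]),
        name=cols[1],
        chrom=cols[2],
        strand=cols[3],
        tx_start=int(cols[4]),
        tx_end=int(cols[5]),
        cds_start=int(cols[6]),
        cds_end=int(cols[7]),
        exon_count=int(cols[8]),
        exon_starts=_parse_int_list(cols[9]),
        exon_ends=_parse_int_list(cols[10]),
        score=int(cols[11]),
        name2=cols[12],
        cds_start_stat=cols[13],
        cds_end_stat=cols[14],
        exon_frames=_parse_int_list(cols[15]),
    )


# Made-up record for illustration only (not a real transcript).
EXAMPLE_LINE = "\t".join([
    "585", "NM_EXAMPLE", "chr1", "+", "1000", "5000", "1199", "4798",
    "3", "1000,2000,4000,", "1500,2500,5000,", "0", "GENE1",
    "cmpl", "cmpl", "0,1,0,",
])

record = parse_refgene_line(EXAMPLE_LINE)
print(record.name2, record.exon_starts, record.exon_frames)
```

Using a named tuple keeps the column order visible in one place, so the last few columns do not have to be remembered by index.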

Meaning of 'none', 'unk', 'incmpl' and 'cmpl' in columns 14 and 15:

"none" - No CDS (non-coding)

"unk" - CDS is unknown (coding, but not known)

"incmpl" - CDS is not complete at this end

"cmpl" - CDS is complete at this end

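Meaning of the numbers in column 16, with an example (note that transcripts can lie on the plus or minus strand; the example below is only illustrative and does not take strand into account):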

exonFrames: the exonFrames field tells you how adjacent exons join together. -1 means that the exon is entirely UTR. A value of 0 means the coding sequence in the exon starts on a codon boundary. When nucleotides from two exons are required to form an amino acid together, the value is the number of those nucleotides that come from the first exon. Because a codon consists of 3 nucleotides, that number is at most 2.

exonFrames example: 1 nucleotide at the end of exon1 joins with the first two nucleotides at the start of exon2. Exon2 therefore picks up one nucleotide from exon1 to make the amino acid, so the value for exon2 is 1.

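To restate: this column describes how different exons combine to form amino acids. When an exon is entirely UTR, the value is -1. When an exon contains CDS sequence and bases from different exons must combine to form one amino acid, the value is the number of bases that amino acid takes from the previous exon. For example, if the first two bases of exon 2 combine with the last base of exon 1 to form an amino acid, the value is 1. Since one amino acid is encoded by 3 bases, at most 2 bases can be carried across an exon boundary, so the largest possible value is 2.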
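The rule above can be sketched in code by counting how many coding bases precede each exon and taking the remainder modulo 3. This is a minimal sketch under stated assumptions, not the UCSC implementation: the function name `compute_exon_frames` is hypothetical, coordinates are taken to be 0-based and half-open, and the minus-strand handling (walking the exons in reverse genomic order) is my assumption, since the example in the text ignores strand.

```python
from typing import List


def compute_exon_frames(exon_starts: List[int], exon_ends: List[int],
                        cds_start: int, cds_end: int,
                        strand: str = "+") -> List[int]:
    """Return one frame per exon, in the same order as the input lists."""
    n = len(exon_starts)
    frames = [-1] * n  # -1 marks exons that are entirely UTR
    # Assumption: on the minus strand translation runs from the last
    # genomic exon to the first, so the exons are visited in reverse.
    order = range(n) if strand == "+" else range(n - 1, -1, -1)
    coding_bases_so_far = 0
    for i in order:
        # Clip the exon to the CDS interval.
        start = max(exon_starts[i], cds_start)
        end = min(exon_ends[i], cds_end)
        if start >= end:
            continue  # no coding bases in this exon
        # Leftover bases from earlier exons that open this exon's first codon.
        frames[i] = coding_bases_so_far % 3
        coding_bases_so_far += end - start
    return frames


# Made-up coordinates: exon 1 holds 301 coding bases (100 codons plus 1 base),
# so exon 2 has to borrow that one base.
frames = compute_exon_frames([1000, 2000, 4000], [1500, 2500, 5000], 1199, 4798)
print(frames)
```

In the made-up example, exon 2 gets the value 1, which matches the case described in the text where one base at the end of exon 1 joins the first two bases of exon 2.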

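Tip: hg19_refgene.txt uses BED-style coordinates, meaning each interval is closed on the left and open on the right. A start value is therefore 1 less than the actual 1-based coordinate on the reference genome, while an end value already equals the 1-based coordinate of the last base.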
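As a small illustration of the coordinate convention, the conversion to 1-based, fully closed coordinates only touches the start value. The helper names below are hypothetical and the numbers are made up.

```python
from typing import Tuple


def to_one_based_inclusive(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval to 1-based, closed."""
    # Only the start shifts; the exclusive 0-based end equals the
    # inclusive 1-based end.
    return start + 1, end


def interval_length(start: int, end: int) -> int:
    """Length of a 0-based, half-open interval; no +1 is needed."""
    return end - start


first_base, last_base = to_one_based_inclusive(1000, 1500)
print(first_base, last_base, interval_length(1000, 1500))
```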

References:

1. https://github.com/ucscGenomeBrowser/kent/blob/ae4aa88945e566f60c950226ab06cbd2ee749789/src/hg/hgc/hgc.c#L2443