欢迎关注"生信修炼手册"!
在UCSC网站上,对于不同的参考基因组,提供了许多数据,这些数据中只要是提供了基因组区域的,理论上都可以用于annovar region-based annotation。在选择时,只需要根据目的选择对应的数据库即可。annovar 官方的文档中,给出了以下几种用法。
1. ENCODE annnotated regions
ENCODE项目提供了许多基因组特殊区域的注释,比如转录区,H3K4Me1,H3K4Me3,H3K27Ac 等各种修饰后的组蛋白结合区,DNaseI 超敏感区域,转录因子结合区域等。这些区域都可以用于annovar的注释。
对于不同的组蛋白修饰,常规的认知如下
-
active promoter: H3K4me3, H3K9Ac
-
active enhancer: H3K4me1, H3K27Ac
-
active elongation: H3K36me3, H3K79me2
-
repressed promoters and broad regions: H3K27me3, H3K9me3
H3K4me3和H3K9Ac这两种组蛋白修饰通常富集在激活的启动子区域上; H3K4me1和H3K27Ac这两种组蛋白修饰通常富集在基因的增强子区域;H3K36me3和 H3K79me2这两种组蛋白修饰通常富集在RNA聚合酶延伸区域 ;H3K27me3和H3K9me3这两种组蛋白修饰用于抑制基因表达。
目前,ENCODE的相关注释只支持hg18版本的数据库。以H3K4Me1
组蛋白修饰结合区域为例,说明如下
第一步:下载wgEncodeBroadChipSeqPeaksGm12878H3k4me1 数据库,命令如下
annotate_variation.pl -downdb wgEncodeBroadChipSeqPeaksGm12878H3k4me1 humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/wgEncodeBroadChipSeqPeaksGm12878H3k4me1.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory
数据库文件内容如下
585 chr1 0 625 . 1000 . 18.57 14.6993 -1
585 chr1 14800 18500 . 1000 . 7.21 12.9486 -1
585 chr1 19250 20600 . 1000 . 8.21 13.6344 -1
586 chr1 224700 228000 . 1000 . 12.17 13.2353 -1
587 chr1 367150 369075 . 1000 . 7.42 13.5232 -1
第二步,进行注释,命令如下
annotate_variation.pl -regionanno -buildver hg18 -dbtype wgEncodeBroadChipSeqPeaksGm12878H3k4me1 -scorecolumn 5 ex1.avinput humandb/
NOTICE: Output file is written to ex1.avinput.hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1
NOTICE: Reading annotation database humandb/hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1.txt ... Done with 66675 regions
NOTICE: Finished region-based annotation on 23 genetic variants
输出文件的后缀为hg18_wgEncodeBroadChipSeqPeaksGm12878H3k4me1, 在输入文件的前面新增了两列,内容如下
wgEncodeBroadChipSeqPeaksGm12878H3k4me1 Score=1000;Name=.
wgEncodeBroadChipSeqPeaksGm12878H3k4me1 Score=1000;Name=.
wgEncodeBroadChipSeqPeaksGm12878H3k4me1 Score=1000;Name=.
第一列为数据库的名称,第二列为名称和打分值,-scorecolumn 5
参数指定数据库文件中的第五列作为score值。
2. dbsnp 数据库
第一步:下载snp138数据库,命令如下
annotate_variation.pl -buildver hg19 -downdb snp138 humandb/
第二步,进行注释, 命令如下
annotate_variation.pl -buildver hg19 -regionanno -dbtype snp138 ex1.avinput humandb/
NOTICE: Output file is written to ex1.avinput.hg19_snp138
NOTICE: Reading annotation database humandb/hg19_snp138.txt ... Done with 1000 regions
NOTICE: Finished region-based annotation on 23 genetic variants
由于SNP数据库非常的大,我这里选取了前1000行用于测试,输出结果后缀为hg19_snp138
, 在输入文件的前面新增了两列,内容如下
snp138 Name=rs373328635
snp138 Name=rs368469931
第一列为数据库的名称,第二列为overlap区对应的rs号。
3. 对非编码区的变异位点进行分类
ENCODE项目针对不同的细胞系,对非编码区的变异位点根据位置分成了增强子,启动子等类别。细胞系列表如下:
在注释时,选择对应的细胞系即可。以Gm12878
细胞系进行说明
第一步:下载wgEncodeBroadHmmGm12878HMM 数据库,命令如下
annotate_variation.pl -downdb wgEncodeBroadHmmGm12878HMM humandb/
NOTICE: The --buildver is set as 'hg18' by default
NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/wgEncodeBroadHmmGm12878HMM.txt.gz ... Done
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory
数据库文件列数较多,截取了前5列,内容如下
585 chr1 0 600 15 Repetitive/CNV
585 chr1 1000 1600 8 Insulator
1347 chr1 100000000 100000200 6 Weak Enhancer
1347 chr1 100000200 100001000 2 Weak Promoter
1347 chr1 100001000 100001400 6 Weak Enhancer
1347 chr1 100001400 100001600 2 Weak Promoter
1347 chr1 100001600 100002200 6 Weak Enhancer
1347 chr1 100002200 100002400 2 Weak Promoter
1347 chr1 100002400 100003000 1 Active Promoter
第二步,进行注释,命令如下
annotate_variation.pl -regionanno -buildver hg18 -dbtype wgEncodeBroadHmmGm12878HMM ex1.avinput humandb/
NOTICE: Output file is written to ex1.avinput.hg18_wgEncodeBroadHmmGm12878HMM
NOTICE: Reading annotation database humandb/hg18_wgEncodeBroadHmmGm12878HMM.txt ... Done with 570580 regions
NOTICE: Finished region-based annotation on 23 genetic variants
输出文件的后缀为hg18_wgEncodeBroadHmmGm12878HMM, 在输入文件的前面新增了两列,内容如下
wgEncodeBroadHmmGm12878HMM Name=11 Weak Txn
wgEncodeBroadHmmGm12878HMM Name=10 Txn Elongation
wgEncodeBroadHmmGm12878HMM Name=11 Weak Txn
wgEncodeBroadHmmGm12878HMM Name=13 Heterochrom/lo
wgEncodeBroadHmmGm12878HMM Name=13 Heterochrom/lo
第一列为数据库的名称,第二列为突变位点的分类。在定义分类时,不同数字代表的含义如下
-
State 1 - Active Promoter
-
State 2 - Weak Promoter
-
State 3 - Inactive/poised Promoter
-
State 4 - Strong enhancer
-
State 5 - Strong enhancer
-
State 6 - Weak/poised enhancer
-
State 7 - Weak/poised enhancer
-
State 8 - Insulator
-
State 9 - Transcriptional transition
-
State 10 - Transcriptional elongation
-
State 11 - Weak transcribed
-
State 12 - Polycomb-repressed
-
State 13 - Heterochromatin; low signal
-
State 14 - Repetitive/Copy Number Variation
-
State 15 - Repetitive/Copy Number Variation
除了采用别人定义好的现成数据库,也可以采用自定义的数据库。自定义的数据库支持以下两种格式
1. 自定义的GFF3文件
关于GFF3格式的详细定义可以参考如下链接
http://www.sequenceontology.org/gff3.shtml
第一步,准备数据库,文件名称为hg19_tfbs_gff3.txt
, 内容如下
gff-version 3
chr1 tfloc TF_binding_site 83024 83039 849 - . ID=pos83031;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 229352 229367 849 - . ID=pos229359;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 405674 405689 849 + . ID=pos405681;Name=V$FREAC3_01
chr1 tfloc TF_binding_site 564503 564518 849 - . ID=pos564510;Name=V$FREAC3_01
文件名称自己定义,第一行必须为##gff-version 3
, 正文部分为\t
分隔的9列内容,数据库文件必须存放在humandb 目录下。
第二步,进行注释,命令如下
annotate_variation.pl -regionanno -buildver hg19 -dbtype gff3 -gff3dbfilehg19_tfbs_gff3.txt ex1.avinput humandb/
NOTICE: Output file is written to ex1.avinput.hg19_gff3
NOTICE: Reading annotation database humandb/hg19_tfbs_gff3.txt ... Done with 4 regions from 4 GFF3 records
NOTICE: Finished region-based annotation on 23 genetic variants
输出文件的后缀为hg19_gff3
, 在输入文件的前面新增了两列,内容如下
gff3 Score=849;Name=pos83031
gff3 Score=849;Name=pos564510
第一列为数据库的名称,第二列为overlap区间的名称和打分。
2. 自定义BED文件
annovar 支持直接使用bed文件作为数据库,需要注意的一点是,必须把bed文件拷贝到对应的数据库目录下,因为annovar 只会在数据库目录下去查找对应的文件。
注释的命令如下
annotate_variation.pl -regionanno -buildver hg19 -dbtype bed -colsWanted 5 -bedfile human_exon.bed ex1.avinput humandb/
NOTICE: Output file is written to ex1.avinput.hg19_bed
NOTICE: Reading annotation database humandb/human_exons.bed ... Done with 10 regions
NOTICE: Finished region-based annotation on 23 genetic variants
输出结果如下
bed Name=NR_024540_8 chr1 18174 18174 0 0
第一列表示数据库,第二列为bed文件中某一列的内容,默认为NA, 可以通过colsWanted
参数指定bed文件的某一列作为输出。
扫描关注微信号,更多精彩内容等着你!
网友评论