Genome annotation methods used in the literature (excluding repeat sequences and ncRNA)

Author: 野生拟南芥 | Published: 2019-07-22 11:39


Genome assembly of a tropical maize inbred line provides insights into structural variation and crop improvement (NG, 2019)

1. Homology-based annotation

For homolog evidence, 744,030 annotated protein sequences of six species (Arabidopsis thaliana, Brachypodium distachyon, Oryza sativa, Setaria italica, Sorghum bicolor, Zea mays) were aligned to the genome using exonerate, and then clustered and filtered to result in the final homolog gene set.
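The "clustered and filtered" step is not detailed in the excerpt. One plausible reading, given here only as a minimal sketch, is to group overlapping protein alignments into loci and keep the best-scoring alignment per locus. The `Hit` record, the `best_hit_per_locus` function and the `min_score` threshold are hypothetical names introduced for illustration; they are not part of exonerate or of the paper's pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One protein-to-genome alignment (hypothetical record, not an exonerate type)."""
    seqid: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str
    score: float
    protein: str


def best_hit_per_locus(hits: List[Hit], min_score: float = 0.0) -> List[Hit]:
    """Group overlapping same-strand hits into loci and keep the top-scoring hit of each."""
    # Drop weak alignments first (threshold is an assumption).
    candidates = [h for h in hits if h.score >= min_score]
    candidates.sort(key=lambda h: (h.seqid, h.strand, h.start))

    kept: List[Hit] = []
    cluster: List[Hit] = []
    cluster_end = 0
    for h in candidates:
        same_locus = (
            cluster
            and h.seqid == cluster[0].seqid
            and h.strand == cluster[0].strand
            and h.start <= cluster_end
        )
        if same_locus:
            cluster.append(h)
            cluster_end = max(cluster_end, h.end)
        else:
            if cluster:
                kept.append(max(cluster, key=lambda x: x.score))
            cluster = [h]
            cluster_end = h.end
    if cluster:
        kept.append(max(cluster, key=lambda x: x.score))
    return kept
```

Real pipelines usually also filter on coverage and identity of the aligned protein; those criteria would slot in next to the score threshold.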

2. Transcriptome-based annotation

Generated 327,904 high-quality full-length transcripts from Iso-seq and 1,795,841 Trinity-assembled transcripts from the RNA-seq. The transcripts from RNA-seq and Iso-Seq were further validated by PASA.

3. De novo prediction

We used Augustus and FGENESH trained on 2,000 homolog genes which were supported by Iso-Seq full-length transcripts and monocots transcripts, respectively.

4. Integration

All the evidence was submitted to MAKER resulting in 40,936 gene models and 48,224 transcripts. The output of MAKER was refined again by PASA only retaining the validated transcripts.

The genome of cultivated peanut provides insight into legume karyotypes, polyploid evolution and crop domestication

1. Homology-based annotation

2. Transcriptome-based annotation

RNA-seq and Iso-Seq reads were mapped onto the reference genome using TopHat and Bowtie 2, respectively. Hints with locations of potential intron–exon boundaries were generated from the alignment files with the software package BAM2 hints in the MAKER package. MAKER with AUGUSTUS was then used to predict genes in the repeat-masked reference genome.
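To make the idea of intron hints concrete, the sketch below derives candidate intron intervals from spliced alignments by reading the `N` (skipped region) operations in SAM CIGAR strings and counting how many reads support each interval. It illustrates the principle only and is not the hint-generation tool used in the paper; `introns_from_cigar`, `intron_hints` and the `min_support` cutoff are hypothetical names and values, and strand is not inferred.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive intron intervals implied by N operations.

    pos is the 1-based leftmost mapping position (SAM POS field).
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            introns.append((ref, ref + n - 1))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref += n
    return introns


def intron_hints(
    alignments: Iterable[Tuple[str, int, str]], min_support: int = 2
) -> List[Tuple[str, int, int, int]]:
    """Count reads per intron and keep introns seen at least min_support times.

    Each alignment is (seqid, pos, cigar); the result rows are
    (seqid, intron_start, intron_end, supporting_reads).
    """
    counts: Counter = Counter()
    for seqid, pos, cigar in alignments:
        for start, end in introns_from_cigar(pos, cigar):
            counts[(seqid, start, end)] += 1
    return sorted(
        (seqid, start, end, n)
        for (seqid, start, end), n in counts.items()
        if n >= min_support
    )
```

Requiring more than one supporting read is a common way to suppress spurious splice junctions, but the right cutoff depends on sequencing depth.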

3. De novo prediction

AUGUSTUS, SNAP and GeneMark were used for ab initio gene prediction, using model training based on coding sequences from A. ipaensis, A. duranensis, G. max and A. thaliana.

4. Integration

Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. (NG, 2018)

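MAKER was run for two rounds, followed by extensive manual adjustment. Only the first round is recorded here; see the paper for details.
1. Homology-based annotation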

2. Transcriptome-based annotation
Trinity assembled transcripts (genome-guided) were fed to PASA. The PASA-assembled transcripts were used for training.

3. De novo prediction

SNAP, GENEMARK and AUGUSTUS were each trained with those selected proteins.

4. Integration

MAKER pipeline was used to integrate multiple tiers of coding evidence, including ab initio gene prediction, transcript evidence and protein evidence and generate a comprehensive set of protein-coding genes.
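In practice, the three evidence tiers are declared in MAKER's options control file (`maker_opts.ctl`, generated with `maker -CTL`), which consists of `key=value #comment` lines. The helper below is a small sketch for filling in such a file programmatically. `set_ctl_options` is a hypothetical function, and the file names used in the usage note are placeholders, not the paper's inputs.

```python
import re
from typing import Dict

# Matches lines of the form `key=value #comment`; section comments start with '#'
# and therefore never match.
CTL_LINE = re.compile(r"^(\w+)=([^#]*)(#.*)?$")


def set_ctl_options(ctl_text: str, overrides: Dict[str, str]) -> str:
    """Return ctl_text with the given keys set, preserving trailing comments."""
    out = []
    seen = set()
    for line in ctl_text.splitlines():
        m = CTL_LINE.match(line)
        if m and m.group(1) in overrides:
            key = m.group(1)
            comment = m.group(3) or ""
            line = f"{key}={overrides[key]} {comment}".rstrip()
            seen.add(key)
        out.append(line)
    missing = set(overrides) - seen
    if missing:
        # Refuse to silently invent options the template does not contain.
        raise KeyError(f"options not present in control file: {sorted(missing)}")
    return "\n".join(out) + "\n"
```

For example, `set_ctl_options(template, {"genome": "genome.fa", "est": "pasa_assemblies.fa", "protein": "proteins.fa", "snaphmm": "species.hmm"})` would point MAKER at a genome, transcript evidence, protein evidence and a trained SNAP model, where all four paths are placeholders.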

Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense

1. Homology-based annotation

For the homolog-based approach, GeMoMa (version 1.3.1) software was applied by using protein sequences from Populus trichocarpa, Arabidopsis thaliana, Vitis vinifera, Theobroma cacao and Gossypium raimondii.

2. Transcriptome-based annotation

For the transcript-based prediction, the Hisat (version 2.0.4) and Stringtie (version 1.2.3) programs were used to carry out reference-based transcriptome assembly (data from NCBI BioProject of PRJNA248163 and PRJNA266265). TransDecoder (version2.0; https://github.com/TransDecoder/TransDecoder/) and GeneMarkS-T (version 5.1) were used to predict genes based on transcripts. The PASA (version 2.0.2) software was used to predict genes based on unigenes and full-length transcripts from the PacBio sequencing.
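Predicting genes from assembled transcripts starts from finding candidate open reading frames. The sketch below shows only that core idea: it scans the three forward frames of a transcript for the longest ATG-to-stop ORF. It is a simplified illustration, not the algorithm of TransDecoder or GeneMarkS-T, which also score coding potential and handle partial ORFs; `longest_orf` and `min_codons` are hypothetical names.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str, min_codons: int = 3) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Returns None when no complete ORF of at least min_codons codons exists.
    """
    seq = seq.upper()
    best: Optional[Tuple[int, int]] = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                length = end - start
                if length // 3 >= min_codons and (
                    best is None or length > best[1] - best[0]
                ):
                    best = (start, end)
                start = None                   # look for the next ORF in this frame
    return best
```

For unstranded RNA-seq assemblies the reverse complement would need to be scanned as well.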

3. De novo prediction

For the de novo prediction, five software programs were used, including Genscan, Augustus (version 2.4), GlimmerHMM (version 3.0.4), GeneID (version 1.4) and SNAP (version 2006-07-28) to scan the repeat-masked genome.

4. Integration

Gene models from these different approaches were combined using the EVM software (version 1.1.1).
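EVM combines evidence according to a weights file with three tab-separated columns: evidence class, source label and numeric weight. The sketch below renders such a file. The source labels and weight values are assumptions chosen for illustration (the excerpt does not report them), and each label must match the source column of the corresponding GFF3 input; `render_weights` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Evidence classes recognised in an EVM weights file.
VALID_CLASSES = {"ABINITIO_PREDICTION", "PROTEIN", "TRANSCRIPT", "OTHER_PREDICTION"}

# (evidence class, source label) -> weight. All values here are illustrative
# assumptions, not the weights used in the cotton paper.
EXAMPLE_WEIGHTS: Dict[Tuple[str, str], int] = {
    ("ABINITIO_PREDICTION", "Augustus"): 1,
    ("ABINITIO_PREDICTION", "SNAP"): 1,
    ("ABINITIO_PREDICTION", "GlimmerHMM"): 1,
    ("PROTEIN", "GeMoMa"): 5,
    ("TRANSCRIPT", "PASA"): 10,
}


def render_weights(weights: Dict[Tuple[str, str], int]) -> str:
    """Render weights as tab-separated `class<TAB>source<TAB>weight` lines."""
    lines = []
    for (evidence_class, source), weight in weights.items():
        if evidence_class not in VALID_CLASSES:
            raise ValueError(f"unknown evidence class: {evidence_class}")
        if weight < 0:
            raise ValueError(f"negative weight for {source}")
        lines.append(f"{evidence_class}\t{source}\t{weight}")
    return "\n".join(lines) + "\n"
```

Giving transcript-derived models a higher weight than ab initio predictions reflects the common practice of trusting direct expression evidence most.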

The rubber tree genome reveals new insights into rubber production and species adaptation (NP, 2016)

1. Homology-based annotation

SPALN was used for protein homologue search with the parameter “-Q4 -O0 -M10 -H180” against proteins in Malpighiales from NRDB and Uniprot.

2. Transcriptome-based annotation

The assembled transcripts from transcriptome sequencing were used to construct gene models by the PASA software for training the predictors, as well as extracting the most possible coding sequences (CDS) with the PASA inner-built Transdecoder program.

3. De novo prediction

Four HMM based predictors for ab initio prediction were used, namely AUGUSTUS, GlimmerHMM, SNAP, and FGENESH++. The first three predictors were trained with PASA-built training sets and the FGENESH++ was run with pre-trained parameters specialized for Hevea.

4. Integration

All results from the three types of prediction were integrated by EVM software.

In addition, the authors wrote scripts to filter the above results using four criteria; see the paper for details.

Finally, all gene models were updated and curated by PASA software to confirm the UTR region and alternative splicing form. Highly repetitive genes, such as “Retro-transposon”, were manually removed from the candidates.
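The paper removed transposon-like genes manually. As an aid to that kind of review, the sketch below flags gene models whose functional description contains transposon-related keywords so that a curator can inspect them. The keyword list, the `split_te_like` function and the input mapping of gene ID to description are all assumptions for illustration, not the authors' procedure.

```python
import re
from typing import Dict, List, Tuple

# Keywords that commonly indicate transposable-element-derived genes.
# This list is an assumption and would need tuning for a real annotation.
TE_PATTERN = re.compile(
    r"retro-?\s?transposon|transposase|reverse transcriptase|gag-pol",
    re.IGNORECASE,
)


def split_te_like(descriptions: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split gene IDs into (kept, flagged) by matching their descriptions."""
    kept: List[str] = []
    flagged: List[str] = []
    for gene_id, description in descriptions.items():
        if TE_PATTERN.search(description):
            flagged.append(gene_id)   # candidate for manual removal
        else:
            kept.append(gene_id)
    return kept, flagged
```

Flagging rather than deleting keeps the final decision with the curator, which matches the manual step described above.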