Foreword
======================================================
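Even Dr. Yang at the Institute of Neuroscience needed more than half a year to publish in Nature, and that was by a rather special route (just kidding, please don't come after me). So me publishing in Cell within 100 days is probably wishful thinking. But if publishing a Cell paper in 100 days is next to impossible, reproducing one in 100 days should at least be doable as a stepping stone, right?
The goal of this 100-day plan is to reproduce the 2019 plant pan-NLRome paper, a joint effort by many leading figures in plant immunity: "A Species-Wide Inventory of NLR Genes and Alleles in Arabidopsis thaliana".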
Abstract
• Assembly: Canu (version 1.3; -pacbio-corrected, trimReadsCoverage = 2, errorRate = 0.01, genomeSize = 2m)
• Annotation: MAKER (version 2.32; pred_flank = 150, keep_preds = 1, split_hit = 3200, ep_score_limit = 95, en_score_limit = 95)
• Gene prediction: AUGUSTUS (version 3.1.0; defaults) and SNAP (version 2006-07-28; defaults). AUGUSTUS used the default 'Arabidopsis' profile for gene prediction, and SNAP used a custom Hidden Markov Model (hmm) based on NB-ARC and/or TIR containing genes.
• Reference sequence: Araport11 (https://www.araport.org/)
• Repeat masking: RepeatMasker (version open-4.0.5; model_org = Arabidopsis)
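• Aligning RNA-seq reads to the reference genome: HISAT2 (version 2.0.5; --no-mixed --no-discordant). Reads came from silique, root, stem, leaf, and flower (PRJNA336053; PE; 100 bp; 5-10 Mb). The aligned reads serve as evidence for gene prediction; the transcriptome can also be assembled from them (Cufflinks; version 2.2.1; defaults).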
• Protein domain prediction: InterProScan (version 5.20-59.0; -dp -iprlookup -appl Pfam,Coils). CC motifs were refined using a majority vote from Coils (2.2.1; InterProScan defaults; Lupas et al., 1991), Paircoil2 (defaults; McDonnell et al., 2006), and NLR parser (v.2; defaults).
• Manual inspection: WebApollo (version 2.0.4; http://ann-nblrrome.tuebingen.mpg.de/apollo/jbrowse/). A track for duplicated and diversified genes was added by aligning transcripts (track = est2genome-50) and proteins (track = protein2genome-50) from the reference gene annotation (Araport11) to each NLRome (--percent 50; exonerate; version 2.2.0).
• Classifying NLRs: We defined as NLR genes those that contained at least an NB, a TIR, or a CCR (RPW8) domain. I.e., LRR or CC motifs alone were not considered sufficient for NLR identification. We defined TNLs (at least a TIR domain), CNLs (CC+NB domain), RNLs (at least an RPW8 domain), and NLs (at least an NB domain). Canonical architectures contain only NB (Pfam accession PF00931), TIR (PF01582), RPW8 (PF05659), LRR (PF00560, PF07725, PF13306, PF13855) domains, or CC motifs (Figure S2). Non-canonical architectures contain at least one integrated domain (ID). In short: whatever is not canonical counts as non-canonical.
• DIAMOND (version 0.9.1.102; --max-target-seqs 13169 --more-sensitive --comp-based-stats)
• orthAgogue (commit 82dcb7aeb67c; --use_scores --strict_coorthologs)
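To make the NLR classification rule in the list above concrete for myself, here is a minimal Python sketch. It is not the authors' code. It assumes the domain calls come from InterProScan's tab-separated output (protein ID in column 1, analysis name in column 4, signature accession in column 5), and it treats any Coils hit as a CC motif, which skips the paper's majority vote with Paircoil2 and NLR parser. The function names and the order in which the classes are checked (TIR first, then RPW8, then CC+NB, then NB) are my own assumptions; the paper's wording does not spell out a precedence.

```python
from collections import defaultdict

# Pfam accessions quoted from the paper's definition of canonical architectures.
NB = {"PF00931"}
TIR = {"PF01582"}
RPW8 = {"PF05659"}
LRR = {"PF00560", "PF07725", "PF13306", "PF13855"}
CANONICAL = NB | TIR | RPW8 | LRR


def read_interproscan_tsv(lines):
    """Collect Pfam accessions per protein, plus the set of proteins with a Coils hit.

    Assumes the standard InterProScan TSV layout: column 1 = protein ID,
    column 4 = analysis (e.g. "Pfam", "Coils"), column 5 = signature accession.
    """
    pfam = defaultdict(set)
    coils = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        protein, analysis, accession = fields[0], fields[3], fields[4]
        if analysis == "Pfam":
            pfam[protein].add(accession)
        elif analysis == "Coils":
            coils.add(protein)
    return pfam, coils


def classify(domains, has_cc=False):
    """Return (class, architecture) for one protein, or None if it is not an NLR.

    The check order TIR > RPW8 > CC+NB > NB is an assumption of this sketch.
    """
    domains = set(domains)
    # LRR or CC alone is not sufficient: require NB, TIR, or RPW8.
    if not domains & (NB | TIR | RPW8):
        return None
    if domains & TIR:
        nlr_class = "TNL"
    elif domains & RPW8:
        nlr_class = "RNL"
    elif has_cc and domains & NB:
        nlr_class = "CNL"
    else:
        nlr_class = "NL"
    # Any Pfam domain outside the canonical set counts as an integrated domain.
    architecture = "canonical" if domains <= CANONICAL else "non-canonical"
    return nlr_class, architecture
```

Typical use would be `pfam, coils = read_interproscan_tsv(open(path))` followed by `classify(pfam[gene], gene in coils)` for each gene, where `path` points at the InterProScan TSV file.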
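Afterword
I chose this paper because:
1. I will be focusing more on genomics from here on, and this paper covers most of the tools commonly used in genomics;
2. Studying a single genome no longer meets current needs, and pan-genomics looks like an inevitable trend;
3. The paper shares part of its code, which can serve as a learning reference.
For various reasons I may not be able to reproduce the paper in full, but I will certainly study the software it uses in depth. For now I am naming this effort the "100-Day Cell Challenge", effective immediately.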
Reference link:
https://www.sciencedirect.com/science/article/pii/S0092867419308372