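Over the past month, preprint platforms such as bioRxiv and medRxiv have, with their characteristic speed and flexibility, continued to serve as the front line of COVID-19 research. The editor is glad to see more and more papers that first appear on preprint servers go on to be submitted to academic journals and published after peer review.
Just last week, Nature published a paper by a joint team led by Yi Guan (Shantou University and the University of Hong Kong) and Yanling Hu (Guangxi Medical University) on SARS-CoV-2-related viruses isolated from Malayan pangolins, a remarkable discovery 【1】. The manuscript was first posted on bioRxiv on February 18 and drew considerable attention at the time. Many readers may have taken it as pointing to, or even demonstrating, that SARS-CoV-2 passed from pangolins to humans. But a simple comparison of the abstract of the February 18 bioRxiv preprint with that of the published Nature paper shows a clear change in the final sentence.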
bioRxiv version:
The discovery of multiple lineages of pangolin coronavirus and their similarity to 2019-nCoV suggests that pangolins should be considered as possible intermediate hosts for this novel human virus.
Nature version:
The discovery of multiple lineages of pangolin coronavirus and their similarity to SARS-CoV-2 suggests that pangolins should be considered as possible hosts in the emergence of novel coronaviruses.
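Note the closing words of each sentence. In the original version the authors wrote that the discovery of multiple lineages of pangolin coronavirus, and their similarity to the novel coronavirus, suggests that pangolins should be considered possible intermediate hosts for this virus. After peer review, this became possible hosts in the emergence of (future) novel coronaviruses. Only a few words changed, yet the conclusion is substantially different.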
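The comparison described above can also be done mechanically. The following is a minimal sketch that uses Python's standard `difflib` module to list the word-level differences between the two closing sentences. The function name `word_changes` is an illustrative choice and does not come from either paper.

```python
import difflib

# Closing sentences of the two abstracts, as quoted above.
biorxiv = ("The discovery of multiple lineages of pangolin coronavirus and their "
           "similarity to 2019-nCoV suggests that pangolins should be considered "
           "as possible intermediate hosts for this novel human virus")
nature = ("The discovery of multiple lineages of pangolin coronavirus and their "
          "similarity to SARS-CoV-2 suggests that pangolins should be considered "
          "as possible hosts in the emergence of novel coronaviruses")


def word_changes(old, new):
    """Return (tag, old_words, new_words) for every span that differs."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = word_changes(biorxiv, nature)
for change in changes:
    print(change)
```

Each printed tuple names the kind of edit (`replace`, `delete`, or `insert`) and the words involved, which makes the virus renaming and the dropped word "intermediate" easy to spot.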
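Last month, researchers at the Federal University of Rio de Janeiro in Brazil compared preprints posted on bioRxiv with the formally published, peer-reviewed versions 【2】 and found that "Overall, our results suggest that editorial peer review has a statistically significant but small impact on improving quality of reporting." The editor has not read that paper closely and so hesitates to comment on it, but intuitively cannot agree; the example at the top of this post already hints at why. Our column, "bioRxiv生信好文速览" (bioRxiv bioinformatics highlights), has always made it its mission to promote preprints as a uniquely appealing way of presenting research results. Still, in the editor's view, the value of peer review must not be overlooked, and keeping the relationship between preprints and peer review in perspective leads to a deeper understanding of a paper. So as you enjoy the ten papers that follow, please keep thinking independently, all the more because the selection reflects only the editor's own limited view.
1. University of Exeter, UK: a large epigenome-wide association meta-analysis of Alzheimer's disease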
Meta-analysis of epigenome-wide association studies in Alzheimer’s disease highlights 220 differentially methylated loci across cortex
Epigenome-wide association studies of Alzheimer’s disease have highlighted neuropathology-associated DNA methylation differences, although existing studies have been limited in sample size and utilized different brain regions. Here, we combine data from six methylomic studies of Alzheimer’s disease (N=1,453 unique individuals) to identify differential methylation associated with Braak stage in different brain regions and across cortex. At an experiment-wide significance threshold (P < 1.238 × 10⁻⁷) we identified 236 CpGs in the prefrontal cortex, 95 CpGs in the temporal gyrus and ten CpGs in the entorhinal cortex, with none in the cerebellum. Our cross-cortex meta-analysis (N=1,408 donors) identified 220 CpGs associated with neuropathology, annotated to 121 genes, of which 96 genes had not been previously reported at experiment-wide significance. Polyepigenic scores derived from these 220 CpGs explain 24.7% of neuropathological variance, whilst polygenic scores accounted for 20.2% of variance in these samples. The meta-analysis summary statistics are available in our online data resource (www.epigenomicslab.com/ad-meta-analysis/).
2. Big tree, big genome: what secrets lie hidden in the 8 Gb of the giant sequoia? (A small gripe: the project itself is valuable, but the work feels a bit rough.)
The giant sequoia genome and proliferation of disease resistance genes
The giant sequoia (Sequoiadendron giganteum) of California are massive, long-lived trees that grow along the U.S. Sierra Nevada mountains. As they grow primarily in isolated groves within a narrow range, conservation of existing trees has been a national goal for over 150 years. Genomic data are limited in giant sequoia, and the assembly and annotation of the first giant sequoia genome has been an important goal to allow marker development for restoration and management. Using Illumina and Oxford Nanopore sequencing combined with Dovetail chromosome conformation capture libraries, 8.125 Gbp of sequence was assembled into eleven chromosome-scale scaffolds. This giant sequoia assembly represents the first genome sequenced in the Cupressaceae family, and lays a foundation for using genomic tools to aid in giant sequoia conservation and management. Beyond conservation and management applications, the giant sequoia assembly is a resource for answering questions about the life history of this enigmatic and robust species. Here we provide an example by taking an inventory of the large and complex family of NLR type disease resistance genes.
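Left: the famous General Sherman tree in Sequoia National Park, California, said to be 2,300-2,700 years old (Wikipedia). Right: Maximum likelihood tree of NB-ARC domains of all NLR-Annotator detected NLR genes.
3. Rayan Chikhi of the Institut Pasteur, France, presents a new method for sequence indexing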
REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ~4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. Availability: https://github.com/kamimrcht/REINDEER
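To make the idea of recording k-mer abundances across a collection of datasets concrete, here is a toy sketch in Python. It is not REINDEER's implementation: it skips the compacted de Bruijn graphs and monotigs described in the abstract, ignores reverse complements, and simply stores one count per dataset for every k-mer in a dictionary. The names `count_kmers`, `build_index`, and `abundances`, as well as the sample reads, are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_index(datasets, k):
    """Map each k-mer to a list of abundances, one entry per dataset."""
    names = sorted(datasets)
    index = {}
    for column, name in enumerate(names):
        for kmer, count in count_kmers(datasets[name], k).items():
            index.setdefault(kmer, [0] * len(names))[column] = count
    return names, index


def abundances(index, names, kmer):
    """Return {dataset: abundance} for a k-mer; absent k-mers get 0 everywhere."""
    return dict(zip(names, index.get(kmer, [0] * len(names))))


# Hypothetical toy collection with two datasets.
datasets = {
    "sampleA": ["ACGTAC", "CGTACG"],
    "sampleB": ["ACGTTT"],
}
names, index = build_index(datasets, k=4)
print(abundances(index, names, "CGTA"))
print(abundances(index, names, "ACGT"))
```

A presence/absence query falls out of the same structure: a k-mer is present in a dataset when its abundance there is nonzero. A plain dictionary like this would not scale to billions of k-mers, which is the problem the compact structures in the abstract are meant to address.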
4. An analysis of the origins of SARS-CoV-2 that points to the role of deep recombination, backed by several leading names in the field
Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic
There are outstanding evolutionary questions on the recent emergence of coronavirus SARS-CoV-2/hCoV-19 in Hubei province that caused the COVID-19 pandemic, including (1) the relationship of the new virus to the SARS-related coronaviruses, (2) the role of bats as a reservoir species, (3) the potential role of other mammals in the emergence event, and (4) the role of recombination in viral emergence. Here, we address these questions and find that the sarbecoviruses -- the viral subgenus responsible for the emergence of SARS-CoV and SARS-CoV-2 -- exhibit frequent recombination, but the SARS-CoV-2 lineage itself is not a recombinant of any viruses detected to date. In order to employ phylogenetic methods to date the divergence events between SARS-CoV-2 and the bat sarbecovirus reservoir, recombinant regions of a 68-genome sarbecovirus alignment were removed with three independent methods. Bayesian evolutionary rate and divergence date estimates were consistent for all three recombination-free alignments and robust to two different prior specifications based on HCoV-OC43 and MERS-CoV evolutionary rates. Divergence dates between SARS-CoV-2 and the bat sarbecovirus reservoir were estimated as 1948 (95% HPD: 1879-1999), 1969 (95% HPD: 1930-2000), and 1982 (95% HPD: 1948-2009). Despite intensified characterization of sarbecoviruses since SARS, the lineage giving rise to SARS-CoV-2 has been circulating unnoticed for decades in bats and been transmitted to other hosts such as pangolins. The occurrence of a third significant coronavirus emergence in 17 years together with the high prevalence and virus diversity in bats implies that these viruses are likely to cross species boundaries again.
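5. Enard and Petrov (Stanford University): genomic analysis shows that ancient RNA virus epidemics had an important influence on human evolutionary history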
Ancient RNA virus epidemics through the lens of recent adaptation in human genomes
Here, we detect the genomic footprints left by ancient viral epidemics that took place in the past ~50,000 years in the 26 human populations represented in the 1,000 Genomes Project. By using the enrichment in signals of adaptation at ~4,500 host loci that interact with specific types of viruses, we provide evidence that RNA viruses have driven a particularly large number of adaptive events across diverse human populations. These results suggest that different types of viruses may have exerted different selective pressures during human evolution. Knowledge of these past selective pressures will provide a deeper evolutionary perspective on current pathogenic threats.
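6. A useful hybrid variant caller from Deming Chen's group at the University of Illinois at Urbana-Champaign (the tool's name is surely the easiest one to remember)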
HELLO: A hybrid variant calling approach
Next Generation Sequencing (NGS) technologies that cost-effectively characterize genomic regions and identify sequence variations using short reads are the current standard for genome sequencing. However, calling small indels in low-complexity regions of the genome using NGS is challenging. Recent advances in Third Generation Sequencing (TGS) provide long reads, which call large-structural variants accurately. However, these reads have context-dependent indel errors in low-complexity regions, resulting in lower accuracy of small indel calls compared to NGS reads. When both small and large-structural variants need to be called, both NGS and TGS reads may be available. Integration of the two data types with unique error profiles could improve robustness of small variant calling in challenging cases. However, there isn’t currently such a method integrating both types of data. We present a novel method that integrates NGS and TGS reads to call small variants. We leverage the Mixture of Experts paradigm which uses an ensemble of Deep Neural Networks (DNN), each processing a different data type to make predictions. We present improvements in our DNN design compared to previous work such as sequence processing using one-dimensional convolutions instead of image processing using two-dimensional convolutions and an algorithm to efficiently process sites with many variant candidates, which help us reduce computations. Using our method to integrate Illumina and PacBio reads, we find a reduction in the number of erroneous small variant calls of up to ~30%, compared to the state-of-the-art using only Illumina data. We also find improvements in calling small indels in low-complexity regions.
7. A new Neandertal genome
A high-coverage Neandertal genome from Chagyrskaya Cave
We sequenced the genome of a Neandertal from Chagyrskaya Cave in the Altai Mountains, Russia, to 27-fold genomic coverage. We estimate that this individual lived ~80,000 years ago and was more closely related to Neandertals in western Eurasia (1,2) than to Neandertals who lived earlier in Denisova Cave (3), which is located about 100 km away. About 12.9% of the Chagyrskaya genome is spanned by homozygous regions that are between 2.5 and 10 centiMorgans (cM) long. This is consistent with that Siberian Neandertals lived in relatively isolated populations of less than 60 individuals. In contrast, a Neandertal from Europe, a Denisovan from the Altai Mountains and ancient modern humans seem to have lived in populations of larger sizes. The availability of three Neandertal genomes of high quality allows a first view of genetic features that were unique to Neandertals and that are likely to have been at high frequency among them. We find that genes highly expressed in the striatum in the basal ganglia of the brain carry more amino acid-changing substitutions than genes expressed elsewhere in the brain, suggesting that the striatum may have evolved unique functions in Neandertals.
8. Jian Ma (Carnegie Mellon University): integrative analysis of 3D genome organization
SPIN reveals genome-wide landscape of nuclear compartmentalization
9. How can reference bias be reduced in population genomics analyses? All it takes is a simple idea
Reducing reference bias using multiple population reference genomes
Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the “reference flow” alignment method that uses information from multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow exhibits a similar level of accuracy and bias avoidance, but with 13% of the memory footprint and 6 times the speed.
10. Melissa Gymrek (University of California San Diego): autism is linked to tandem repeat mutations
The contribution of de novo tandem repeat mutations to autism spectrum disorders
Recent studies have demonstrated a strong contribution of germline de novo mutations to autism spectrum disorders (ASD). Tandem repeats (TRs), consisting of repeated sequence motifs of 1-20bp, comprise one of the largest sources of human genetic variation. Yet, the contribution of TR mutations to ASD has not been assessed on a genome-wide scale. Here, we leverage novel bioinformatics tools and ~35× whole genome sequencing of ~1,600 families from the Simons Simplex Collection (SSC) to analyze germline de novo TR mutations at nearly 1 million loci in ASD-affected and unaffected siblings. We identify an average of 54 high-confidence mutations per child at an estimated true positive rate of 90%. We find novel genome-wide TR mutation patterns, including a bias toward larger mutations from the maternal compared with paternal germlines. We demonstrate a significant genome-wide excess of TR mutations in ASD probands, which are significantly larger than in controls, enriched in promoters of fetal brain expressed genes, and more strongly predicted to alter expression during brain development. Overall, our results indicate a significant, but so far overlooked, contribution of repetitive regions to ASD.
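11. 【bonus】 Is the "smallest dinosaur" a dinosaur at all? The controversy over a recent high-profile Nature paper (from six scholars in China, at a paleontology institute and other institutions)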
Is Oculudentavis a bird or even archosaur?
Recent finding of a fossil, Oculudentavis khaungraae Xing et al. 2020, entombed in a Late Cretaceous amber was claimed to represent a humming bird-sized dinosaur [1]. Regardless the intriguing evolutional hypotheses about the bauplan of Mesozoic dinosaurs (including birds) posited therein, this enigmatic animal, however, demonstrates various lizard-like morphologies, which challenge the fundamental morphological gap between Lepidosauria and Archosauria. Here we reanalyze the original computed tomography scan data of Oculudentavis. A suit of squamate synapomorphies, including pleurodont marginal teeth and an open lower temporal fenestra, overwhelmingly support its squamate affinity, and that the avian or dinosaurian assignment of Oculudentavis is conclusively rejected.
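Note: for further reading, see the Chinese-language news report cited at the end 【3】
12. 【bonus】 Yigong Shi's team: a high-resolution structure of the Xenopus laevis nuclear pore complex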
Structure of the Cytoplasmic Ring of the Xenopus laevis Nuclear Pore Complex
Nuclear pore complex (NPC) exhibits structural plasticity and has only been characterized at local resolutions of up to 15 Å for the cytoplasmic ring (CR). Here we present a single-particle cryo-electron microscopy (cryo-EM) structure of the CR from Xenopus laevis NPC at average resolutions of 5.5-7.9 Å, with local resolutions reaching 4.5 Å. Improved resolutions allow identification and placement of secondary structural elements in the majority of the CR components. The two Y complexes in each CR subunit interact with each other and associate with those from flanking subunits, forming a circular scaffold. Within each CR subunit, the Nup358-containing region wraps around the stems of both Y complexes, likely stabilizing the scaffold. Nup205 connects the short arms of the two Y complexes and associates with the stem of a neighbouring Y complex. The Nup214-containing region uses an extended coiled-coil to link Nup85 of the two Y complexes and protrudes into the axial pore of the NPC. These previously uncharacterized structural features reveal insights into NPC assembly.
References
1. Lam, T.T., Shum, M.H., Zhu, H. et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature (2020). https://doi.org/10.1038/s41586-020-2169-0
2. Carneiro, C.F.D. et al. Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical literature. bioRxiv 581892. https://doi.org/10.1101/581892
3. Hu Minqi, Yuan Yixue. Article questioning the "smallest dinosaur" has been submitted to Nature; the original paper's authors had considered a retraction (in Chinese). 科学网 (2020)
Original work by the author; first published on the 生信人 WeChat public account.