montreal 生信人 2020.2.8
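Since May 2018, 生信人 has run a monthly column, "bioRxiv Bioinformatics Paper Digest", averaging 10 papers per issue, and has so far introduced 200 preprints to readers. We started the column simply to give it a try: on the one hand, to bring good papers on bioRxiv to bioinformatics practitioners as quickly as possible; on the other, to do a little to publicize and promote preprints.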
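From 1,500 reads for the first issue to 2,500 for recent ones, the editor is pleased to see that bioRxiv seems to be gaining attention and recognition from more and more people. Even so, compared with papers formally published in well-known journals, media coverage of preprint results has remained extremely rare. That changed completely at the end of last month: bioRxiv made the headlines of major news outlets several times! In truth, rather than saying that preprints have "defeated" traditional peer-reviewed papers, it is more accurate to say that the impact of the Wuhan pneumonia outbreak has been enormous. Such circumstances highlight the unmatched advantage of preprints in responding to emergencies, although the editor feels that, in the response to this outbreak, even the research results delivered through preprints have at times seemed not timely enough.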
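A few days ago one topic was especially hot. The lab of Bishwajit Kundu at the Indian Institute of Technology claimed that certain sequences in the spike protein of the Wuhan pneumonia virus (2019-nCoV) might have resulted from insertions of HIV proteins [1]. This "remarkable paper" immediately caused an uproar: within just one week, 27 readers had left comments below it, and it had been retweeted 363 times on Twitter. Amid sustained criticism from all quarters, the authors withdrew the paper under pressure within two days. As the saying goes, "成也萧何,败也萧何" (what makes you can also break you): because preprints lack peer review, they rode the wave of attention to the outbreak, but the same feature also allows highly controversial and possibly seriously flawed results like this one to appear, and applying preprint results to clinical guidance calls for particular caution. For this reason, bioRxiv recently added a new reminder to its website.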
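Seen from another angle, it was precisely because this was a preprint that its potential errors were kept out of a formal publication. Bear in mind that, most of the time, a paper needs to satisfy only two or three reviewers to pass peer review and be published, whereas this paper by the Indian researchers received the VIP treatment of having the entire internet serve as its reviewers.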
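We would also like to remind readers that, besides bioRxiv, there are now many other excellent preprint servers, such as medRxiv, a medical preprint server built around the team that developed bioRxiv. In this issue we have specially included two of its latest papers. Let's take a look.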
1. Researchers at the University of Warwick (UK) develop a fast search tool for bacterial genomes
BlastFrost: Fast querying of 100,000s of bacterial genomes in Bifrost graphs
BlastFrost is a highly efficient method for querying 100,000s of genome assemblies. It builds on Bifrost, a recently developed dynamic data structure for compacted and colored de Bruijn graphs from bacterial genomes. BlastFrost queries a Bifrost data structure for sequences of interest, and extracts local subgraphs, thereby enabling the efficient identification of the presence or absence of individual genes or single nucleotide sequence variants. Here we describe the algorithms and implementation of BlastFrost. We also present two exemplar practical applications. In the first, we determined the presence of the individual genes within the SPI-2 Salmonella pathogenicity island within a collection of 926 representative genomes in minutes. In the second application, we determined the existence of known single nucleotide polymorphisms associated with fluoroquinolone resistance in the genes gyrA, gyrB and parE among 190,209 Salmonella genomes. BlastFrost is available for download at https://github.com/nluhmann/BlastFrost.
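To make the idea of presence/absence queries against a colored k-mer structure more concrete, here is a minimal toy sketch. It is not BlastFrost's or Bifrost's actual algorithm or API: it replaces the compacted colored de Bruijn graph with a plain dictionary from k-mer to the set of genomes ("colors") containing it, and all names (`build_colored_index`, `query_presence`, `min_fraction`, the toy genomes) are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_index(genomes, k):
    """Map each k-mer to the set of genome names ("colors") that contain it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def query_presence(index, genomes, query, k, min_fraction=1.0):
    """Return genomes containing at least min_fraction of the query's k-mers."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return set()
    hits = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids adding empty entries to the index for unseen k-mers.
        for name in index.get(kmer, ()):
            hits[name] += 1
    total = len(query_kmers)
    return {name for name in genomes if hits[name] / total >= min_fraction}


# Toy data (hypothetical): genome_B differs from genome_A by one substitution.
genomes = {
    "genome_A": "ACGTACGTGA",
    "genome_B": "ACGTTCGTGA",
    "genome_C": "TTTTTTTTTT",
}
index = build_colored_index(genomes, k=4)

exact = query_presence(index, genomes, "TACGTG", k=4)
relaxed = query_presence(index, genomes, "TACGTG", k=4, min_fraction=0.6)
print(sorted(exact), sorted(relaxed))
```

In this toy setting, lowering `min_fraction` tolerates a query that differs from a genome by a variant, which loosely mirrors why a k-mer based search can report both exact gene presence and near matches; the real tool works on the graph structure and is far more memory-efficient than a flat dictionary.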
2. Snel lab at Utrecht University (the Netherlands): the evolutionary history of eukaryotic kinases
The first eukaryotic kinome tree illuminates the dynamic history of present-day kinases
Eukaryotic Protein Kinases (ePKs) are essential for eukaryotic cell signalling. Several phylogenetic trees of the ePK repertoire of single eukaryotes have been published, including the human kinome tree. However, a eukaryote-wide kinome tree was missing due to the large number of kinases in eukaryotes. Using a pipeline that overcomes this problem, we present here the first eukaryotic kinome tree. The tree reveals that the Last Eukaryotic Common Ancestor (LECA) possessed at least 92 ePKs, much more than previously thought. The retention of these LECA ePKs in present-day species is highly variable. Fourteen human kinases with unresolved placement in the human kinome tree were found to originate from three known ePK superfamilies. Further analysis of ePK superfamilies shows that they exhibit markedly diverse evolutionary dynamics between the LECA and present-day eukaryotes. The eukaryotic kinome tree thus unveils the evolutionary history of ePKs, but the tree also enables the transfer of functional information between related kinases.
3. University of Georgia completes gapless maize chromosome assemblies using PacBio + Nanopore + Bionano
Gapless assembly of maize chromosomes using long read technologies
Creating gapless telomere-to-telomere assemblies of complex genomes is one of the ultimate challenges in genomics. We used long read technologies and an optical map based approach to produce a maize genome assembly composed of only 63 contigs. The B73-Ab10 genome includes gapless assemblies of chromosome 3 (236 Mb) and chromosome 9 (162 Mb), multiple highly repetitive centromeres and heterochromatic knobs, and 53 Mb of the Ab10 meiotic drive haplotype.
4. Robert Edgar releases URMAP, claimed to be an order of magnitude faster than BWA and Bowtie2
URMAP, an ultra-fast read mapper
Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA and Bowtie2 with comparable accuracy on a benchmark test using simulated paired 150nt reads of a well-studied human genome. Software is freely available at https://drive5.com/urmap.
5. EvoGeneX, a tool for comparing gene expression across species
Modeling gene expression evolution with EvoGeneX uncovers differences in evolution of species, organs and sexes
To solve this challenge, we introduce EvoGeneX, a computationally efficient method to uncover the mode of gene expression evolution based on the Ornstein-Uhlenbeck process. Importantly, EvoGeneX in addition to modelling expression variations between species, models within species variation. To estimate the within species variation, EvoGeneX formally incorporates the data from biological replicates as a part of the mathematical model. We show that by modelling the within species diversity EvoGeneX significantly outperforms the currently available computational method. In addition, to facilitate comparative analysis of gene expression evolution, we introduce a new approach to measure the dynamics of evolutionary divergence of a group of genes. We used EvoGeneX to analyse the evolution of expression across different organs, species and sexes of the Drosophila genus. Our analysis revealed differences in the evolutionary dynamics of male and female gonads, and uncovered examples of adaptive evolution of genes expressed in the head and in the thorax.
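For readers unfamiliar with the Ornstein-Uhlenbeck (OU) process mentioned in this abstract, the sketch below simulates a single OU trajectory, dX = alpha * (theta - X) dt + sigma dW, with the Euler-Maruyama scheme. It is only an illustration of the stochastic process, not EvoGeneX's model (which works on a phylogeny and incorporates biological replicates). The parameter names and the interpretation in the comments (theta as an optimal expression level, alpha as the strength of the pull toward it, sigma as random drift) follow common usage in the comparative-expression literature and are assumptions here, not taken from the paper.

```python
import math
import random


def simulate_ou(x0, theta, alpha, sigma, dt, n_steps, seed=0):
    """Euler-Maruyama simulation of dX = alpha * (theta - X) dt + sigma dW.

    x0    : starting value (e.g. ancestral expression level; assumption)
    theta : value the process is pulled toward (assumed "optimum")
    alpha : strength of the pull; alpha = 0 gives plain Brownian motion
    sigma : magnitude of the random fluctuations
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    x = x0
    path = [x]
    for _ in range(n_steps):
        # Brownian increment over a step of length dt has variance dt.
        noise = rng.gauss(0.0, 1.0) * math.sqrt(dt)
        x = x + alpha * (theta - x) * dt + sigma * noise
        path.append(x)
    return path


# Without noise the trajectory relaxes deterministically toward theta.
noiseless = simulate_ou(x0=0.0, theta=5.0, alpha=1.0, sigma=0.0,
                        dt=0.01, n_steps=2000)

# With noise the trajectory fluctuates around theta instead of drifting away.
noisy = simulate_ou(x0=0.0, theta=5.0, alpha=1.0, sigma=0.5,
                    dt=0.01, n_steps=2000, seed=1)
print(round(noiseless[-1], 6), len(noisy))
```

The contrast between alpha = 0 (unconstrained drift) and alpha > 0 (pull toward a fixed value) is what lets OU-based methods distinguish neutral from constrained or adaptive modes of expression evolution.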
6. UK researchers find that human intergenic RNA mainly derives from nascent transcripts of genes
Intergenic RNA mainly derives from nascent transcripts of known genes
Eukaryotic genomes undergo pervasive transcription, leading to the production of many types of stable and unstable RNAs. Transcription is not restricted to regions with annotated gene features but includes almost any genomic context. Currently, the source and function of most RNAs originating from intergenic regions in the human genome remains unclear. We hypothesised that many intergenic RNA can be ascribed to the presence of as-yet unannotated genes or the ‘fuzzy’ transcription of known genes that extends beyond the annotated boundaries. To elucidate the contributions of these two sources, we assembled a dataset of >2.5 billion publicly available RNA-seq reads across 5 human cell lines and multiple cellular compartments to annotate transcriptional units in the human genome. About 80% of transcripts from unannotated intergenic regions can be attributed to the fuzzy transcription of existing genes; the remaining transcripts originate mainly from putative long non-coding RNA loci that are rarely spliced. We validated the transcriptional activity of these intergenic RNA using independent measurements, including transcriptional start sites, chromatin signatures, and genomic occupancies of RNA polymerase II in various phosphorylation states. We also analysed the nuclear localisation and sensitivities of intergenic transcripts to nucleases to illustrate that they tend to be rapidly degraded either ‘on-chromatin’ by XRN2 or ‘off-chromatin’ by the exosome.
7. 42 cannabis genomes reveal copy number variation in cannabinoid synthesis genes
Sequence and annotation of 42 cannabis genomes reveals extensive copy number variation in cannabinoid synthesis and pathogen resistance genes
Cannabis is a diverse and polymorphic species. To better understand cannabinoid synthesis inheritance and its impact on pathogen resistance, we shotgun sequenced and assembled a Cannabis trio (sibling pair and their offspring) utilizing long read single molecule sequencing. This resulted in the most contiguous Cannabis sativa assemblies to date. These reference assemblies were further annotated with full-length male and female mRNA sequencing (Iso-Seq) to help inform isoform complexity, gene model predictions and identification of the Y chromosome. To further annotate the genetic diversity in the species, 40 male, female, and monoecious cannabis and hemp varietals were evaluated for copy number variation (CNV) and RNA expression. This identified multiple CNVs governing cannabinoid expression and 82 genes associated with resistance to Golovinomyces chicoracearum, the causal agent of powdery mildew in cannabis. Results indicated that breeding for plants with low tetrahydrocannabinolic acid (THCA) concentrations may result in deletion of pathogen resistance genes. Low THCA cultivars also have a polymorphism every 51 bases while dispensary grade high THCA cannabis exhibited a variant every 73 bases. A refined genetic map of the variation in cannabis can guide more stable and directed breeding efforts for desired chemotypes and pathogen-resistant cultivars.
8. Steven Salzberg of Johns Hopkins University reports more than 2 million contaminated sequences in GenBank
Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
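9. [medRxiv] How effective was the Wuhan "lockdown"? A joint report by researchers from Beijing Normal University and the University of Oxford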
Early evaluation of the Wuhan City travel restrictions in response to the 2019 novel coronavirus outbreak
An ongoing outbreak of a novel coronavirus (2019-nCoV) was first reported in China and has spread worldwide. On January 23rd 2020 China shut down transit in and out of Wuhan, a major transport hub and conurbation of 11 million inhabitants, to contain the outbreak. By combining epidemiological and human mobility data we find that the travel ban slowed the dispersal of nCoV from Wuhan to other cities in China by 2.91 days (95% CI: 2.54-3.29). This delay provided time to establish and reinforce other control measures that are essential to halt the epidemic. The ongoing dissemination of 2019-nCoV provides an opportunity to examine how travel restrictions impede the spatial dispersal of an emerging infectious disease.
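10. [medRxiv] US researchers argue that current airport screening around the world contributes little to containing the international spread of Wuhan pneumonia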
Estimated effectiveness of traveller screening to prevent international spread of 2019 novel coronavirus (2019-nCoV)
Traveller screening is being used to limit further global spread of 2019 novel coronavirus (nCoV) following its recent emergence. Here, we analyze the expected impact of different travel screening programs given remaining uncertainty around the values of key nCoV life history and epidemiological parameters. Even under best-case assumptions, we estimate that screening will miss around half of infected travellers. Breaking down the factors leading to screening successes and failures, we find that most cases missed by screening are fundamentally undetectable, because they have not yet developed symptoms and are unaware they were exposed. These findings emphasize the need for measures to track travellers who become ill after being missed by a travel screening program. We make our model available for interactive use so stakeholders can explore scenarios of interest using the most up-to-date information. We hope these findings contribute to evidence-based policy to combat the spread of nCoV, and to prospective planning to mitigate future emerging.
References
1. Pradhan et al., Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag. bioRxiv, 2020
Originally published on the 生信人 WeChat official account