Literature reading: 1,520 reference genomes from cultivated human gut bacteria enable ...

Author: 龙star180 | Published: 2022-10-07 16:36

Journal:

Nature Biotechnology (68.164/Q1)

1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses

Abstract

Reference genomes are essential for metagenomic analyses and functional characterization of the human gut microbiota. We present the Culturable Genome Reference (CGR), a collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans. Of the 1,520 genomes, which were chosen to cover all major bacterial phyla and genera in the human gut, 264 are not represented in existing reference genome catalogs. We show that this increase in the number of reference bacterial genomes improves the rate of mapping metagenomic sequencing reads from 50% to >70%, enabling higher-resolution descriptions of the human gut microbiome. We use the CGR genomes to annotate functions of 338 bacterial species, showing the utility of this resource for functional studies. We also carry out a pan-genome analysis of 38 important human gut species, which reveals the diversity and specificity of functional enrichment between their core and dispensable genomes.

Introduction

The human gut microbiota refers to all the microorganisms that inhabit the human gastrointestinal tract. Diverse roles of the gut microbiota in human health and disease have been recognized 1,2. Metagenomic studies have transformed our understanding of the taxonomic and functional diversity of human microbiota, but more than half of the sequencing reads from a typical human fecal metagenome cannot be mapped to existing bacterial reference genomes 3,4. The lack of high-quality reference genomes has become an obstacle for high-resolution analyses of the human gut microbiome.

Although the previously reported Integrated Gene Catalog (IGC) has enabled metagenomic, metatranscriptomic and metaproteomic analyses 3,5,6, the gap between compositional and functional analyses can only be filled by individual bacterial genomes. Genes that co-vary among samples can be clustered into metagenomic linkage groups 7, metagenomic clusters 8 and metagenomic species 9,10, whose annotation depends on alignment to the limited number of existing reference genomes. Other metagenomics-based analyses of the gut microbiome (for example, of single nucleotide polymorphisms (SNPs), indels and copy number variations) rely on the coverage and quality of reference genomes 11–13.

Despite the rapid increase in the number of sequenced bacterial and archaeal genomes, reference genomes for gut bacteria are underrepresented. It is estimated that <4% of the bacterial genomes in the US National Center for Biotechnology Information (NCBI) database belong to the human gut microbiota. Rather, the focus has been on clinically relevant pathogenic bacteria, which are overrepresented in the microbial databases. The first catalog of 178 reference bacterial genomes for the human microbiota was reported by the Human Microbiome Project (HMP) 14 in 2010. To date, the HMP has sequenced >2,000 microbial genomes cultivated from human body sites, 437 of which are gut microbiota (data accessed 8 September 2017). However, the number of reference gut bacterial genomes is still far from saturated.

We present a reference catalog of genomes of cultivated human gut bacteria (named the CGR), established by culture-based isolation of >6,000 bacterial isolates from fecal samples of healthy individuals. The CGR comprises 1,520 nonredundant, high-quality draft bacterial genomes, contributing at least 264 new reference genomes to the gut microbiome. After inclusion of CGR genomes, the mapping rate of selected metagenomic datasets improved from around 50% to over 70%. In addition to improving metagenomic analyses, the CGR will improve functional characterization and pan-genomic analyses of the gut microbiota at high resolution.

Results

Expanded catalog of gut bacterial genomes. We obtained 6,487 bacterial isolates from fresh fecal samples donated by 155 healthy volunteers by using 11 different media under anaerobic conditions (Supplementary Fig. 1a and Supplementary Table 1). Notably, more than half of the isolates were cultured from MPYG medium (Supplementary Fig. 1b). All the isolates were subjected to 16S rRNA gene amplicon sequencing analysis, and 1,759 nonredundant isolates that provided broad coverage of the phylogenetic tree were selected for whole-genome sequencing (Supplementary Fig. 1c and Supplementary Table 2). After de novo assembly of the next-generation sequencing reads, we identified 104 isolates that contained more than one genome. These assembled sequences were then parsed into 212 genomes using our in-house pipeline (Supplementary Table 3). Briefly, multi-genomes were split at scaffold level on the basis of G+C content versus sequencing depth. The closest reference genomes for the split scaffolds were determined on the basis of average nucleotide identity (ANI), and the mis-split scaffolds were mapped back to their closest reference genome to get the final split genome (see Methods). In total, we obtained a collection of 1,867 newly assembled genomes, 1,520 (81.4%) of which fulfilled the HMP’s criteria for high-quality draft genomes and had more than 95% genome completeness and less than 5% contamination as evaluated by CheckM. The genome sizes and G+C contents of CGR ranged from 0.2 to 7.9 Mbp and 26.56% to 64.28%, respectively. A total of 5,749,641 genes were predicted from the annotation data (Supplementary Table 4).
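
The quality cutoffs in the paragraph above can be illustrated with a minimal Python sketch. It assumes that per-genome completeness and contamination percentages have already been produced by a tool such as CheckM; the `Assembly` class, the isolate names and the example values are hypothetical, and the separate HMP draft-genome criteria are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    name: str             # hypothetical isolate label
    completeness: float   # percent, e.g. as reported by CheckM
    contamination: float  # percent, e.g. as reported by CheckM


def is_high_quality(a, min_completeness=95.0, max_contamination=5.0):
    """True when completeness is above and contamination below the cutoffs."""
    return a.completeness > min_completeness and a.contamination < max_contamination


def gc_percent(seq):
    """G+C content of a sequence in percent, ignoring non-ACGT characters."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Made-up example values, not data from the paper.
assemblies = [
    Assembly("isolate_A", 99.1, 0.4),
    Assembly("isolate_B", 93.0, 1.2),   # fails the completeness cutoff
    Assembly("isolate_C", 98.5, 6.3),   # fails the contamination cutoff
]
kept = [a.name for a in assemblies if is_high_quality(a)]

# Share of assemblies retained, using the counts quoted in the text.
fraction_kept = round(100 * 1520 / 1867, 1)
```

With the counts quoted in the text, `fraction_kept` reproduces the reported 81.4%.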

Taxonomic annotation of CGR was carried out using a self-constructed, efficient ANI-based pipeline (Supplementary Fig. 2). The 1,520 high-quality genomes were classified into 338 species-level clusters (ANI ≥ 95%, a species delineation corresponding to 70% DNA–DNA hybridization), which covered all the major phyla of the human gut microbiota, including Firmicutes (211 clusters, 796 genomes), Bacteroidetes (60 clusters, 447 genomes), Actinobacteria (54 clusters, 235 genomes), Proteobacteria (10 clusters, 36 genomes) and Fusobacteria (3 clusters, 6 genomes) (Fig. 1a and Supplementary Table 5). Among these 338 clusters, 134 clusters (corresponding to 264 genomes) were not annotated to any existing reference genomes in NCBI (Fig. 1a), and 50 clusters did not fall within any sequenced genera (Supplementary Table 5). To corroborate the presence of novel species in CGR, we carried out additional taxonomic identification using 16S rRNA gene analysis. A species was recognized as novel if its 16S rRNA gene sequence had <98.7% similarity with known species in the EzBioCloud database (see Methods). Overall, we identified 350 distinct bacterial species (based on operational taxonomic units), including 149 candidate novel species, 42 of which represent candidate novel genera. These results underscore the value of the individual reference genomes provided by the CGR.
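
The idea of grouping genomes into species-level clusters at ANI ≥ 95% can be sketched as follows. This is only an illustration under stated assumptions: it uses single-linkage clustering with a union-find structure, which is not necessarily what the authors' pipeline does, and the genome names and ANI values are made up.

```python
def cluster_by_ani(genomes, ani, threshold=95.0):
    """Single-linkage clustering: genomes joined by any ANI >= threshold share a cluster.

    genomes: list of genome names
    ani: dict mapping (name_a, name_b) pairs to ANI in percent
    """
    parent = {g: g for g in genomes}

    def find(x):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), value in ani.items():
        if value >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


def is_candidate_novel(similarity_16s, cutoff=98.7):
    """16S rRNA rule from the text: below the cutoff counts as candidate novel."""
    return similarity_16s < cutoff


# Hypothetical pairwise ANI values (percent).
example_ani = {
    ("g1", "g2"): 98.2,
    ("g2", "g3"): 96.0,
    ("g1", "g3"): 94.1,
    ("g3", "g4"): 80.0,
}
example_clusters = cluster_by_ani(["g1", "g2", "g3", "g4"], example_ani)
```

Note that single linkage places g1 and g3 in the same cluster even though their direct ANI is below 95%, because both are linked through g2; a stricter pipeline might instead compare each genome against a cluster representative.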

Despite the variation of individual microbiota at the genus level, the CGR identified bacterial populations with broad diversity, covering eight out of nine core genera in the Chinese gut microbiota 15. More than 80 species were novel in comparison with the previously sequenced species from a reported 1,000 cultured bacterial species from the human gastrointestinal tract 16 (Supplementary Fig. 3a). Moreover, the CGR successfully identified 38 genera that were of low relative abundance (<1%) according to the IGC 6, which is a large catalog of reference genes derived from a collection of ~1,250 metagenomic samples from individuals on three continents (Supplementary Fig. 3b). Among them, 7 genera were identified with more than 20 genomes (Bifidobacterium, Collinsella, Coprobacillus, Dorea, Streptococcus, Prevotella and Parabacteroides). The CGR also identified another 9 genera that were not detected by IGC 6 (Butyricicoccus, Butyricimonas, Catenibacterium, Dielma, Erysipelatoclostridium, Megamonas, Melissococcus, Peptoclostridium and Vagococcus) (Supplementary Fig. 3b). These results underscore the contribution of the CGR to the existing database of gut bacterial whole genomes.

Improvement in metagenomic and SNP analyses.

The existing reference genomes for metagenomic sequence mapping are far from saturated. For example, the genomes or draft genomes of bacteria and archaea used in a recent study allowed mapping of less than half of the sequences in the fecal metagenome 3,4. To illustrate the value of the CGR to metagenomic analyses, we performed sequence mapping using previous metagenomic data 6 with or without CGR. For Chinese samples, the read mapping rate in the original study that used the IGCR dataset (3,449 reference genomes from IGC 6) was 52.00%, which was significantly improved to 76.88% after the inclusion of the CGR dataset (Fig. 2a and Supplementary Table 6). Since all the samples in the CGR were from China, it is reasonable to assume that this genome dataset contributes substantially to the Chinese fecal metagenome. To evaluate the contribution of the CGR to the mapping of non-Chinese metagenomes, we carried out a similar analysis using metagenomic data from American, Spanish and Danish fecal samples. Notably, the metagenomic read mapping ratios of these samples all increased substantially (Fig. 2a), although to a lesser extent compared with that of Chinese samples (Supplementary Fig. 4a). The improvement of mapping rates in both Chinese and non-Chinese samples indicates that the CGR covers a considerable number of gut bacterial species shared by people between these countries. To reveal the improvement of gene and protein diversity enabled by the CGR, we compared the gene and protein cumulative curve based on genomes used in a previous IGC study and after addition of the CGR (Supplementary Fig. 4b,c). The number of gene and protein families increased with inclusion of the first 1,500 genomes, but more or less plateaued at around 3,000 genomes. The addition of our CGR genomes led to a substantial increase in the number of added gene and protein families as a function of genome number. A total of 373,555 gene clusters and 149,945 protein clusters were added by inclusion of the CGR, corresponding to a 22% and 16% increase in known gene and protein sequence diversity, respectively.
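
The numbers in the paragraph above can be unpacked with a short Python sketch. The `mapping_rate` helper and the read counts passed to it are hypothetical. The implied baseline sizes are back-calculated from the rounded 22% and 16% figures, so they are only rough estimates and not values reported in the source.

```python
def mapping_rate(mapped_reads, total_reads):
    """Percentage of reads that align to the reference set."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100 * mapped_reads / total_reads


# Mapping rates for Chinese samples quoted in the text (percent).
before, after = 52.00, 76.88
gain_points = round(after - before, 2)                      # absolute gain in percentage points
relative_gain = round(100 * (after - before) / before, 1)   # gain relative to the old rate

# Rough size of the pre-CGR catalogs implied by "added clusters / percent increase".
# Assumption: the percentages are rounded, so these are estimates only.
implied_gene_baseline = round(373_555 / 0.22)
implied_protein_baseline = round(149_945 / 0.16)

example_rate = mapping_rate(7_688, 10_000)  # made-up read counts
```

Here `gain_points` is 24.88 and `relative_gain` is 47.8, i.e. the mapped fraction grew by almost half relative to the original rate; the implied baselines come out at roughly 1.7 million gene clusters and 0.94 million protein clusters.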

To further illustrate the utility of the CGR, we used it to analyze gut microbiome SNPs in a cohort of 250 samples from the TwinsUK registry 17. We generated a new set of 282 nonredundant representative genomes from the CGR (see Methods, Supplementary Fig. 5 and Supplementary Table 7), nearly double the 152 reference genomes used in the original TwinsUK analysis 17. To highlight the new reference genomes identified by analysis with the existing genomes and the CGR genomes, we performed an ANI-based alignment of the 282 genomes with the previously reported 152 genomes. Among the 192 newly added reference genomes, 85 were classified species while 107 were unclassified species (Fig. 2b). A high SNP density was found in Ruminococcus sp. CAG:108 (Clu 21), unclassified Firmicutes (Clu 157), Eubacterium rectale (Clu 6), Escherichia coli (Clu 22), and Ruminococcus sp. CAG:57 (Clu 19), suggesting a high degree of variation in the genomes of these species, while Lactobacillus gasseri (Clu 241), Enterococcus faecalis (Clu 316), Enterococcus durans (Clu 274) and Streptococcus mutans (Clu 217) showed lower SNP density. A total of 9.14 million SNPs were identified. The number of SNPs was increased for some species due to the newly added high-quality reference genomes in the CGR. We conclude that the CGR is a valuable resource for metagenomic studies because of the significant improvement in metagenomic resolution it enables.
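
SNP density, as used in the paragraph above, is sketched here as SNPs per kilobase of reference genome. The exact normalization used in the study is not given in this section, so the per-kb definition and the example numbers are assumptions.

```python
def snp_density_per_kb(snp_count, genome_length_bp):
    """SNPs per 1,000 bp of reference genome (assumed normalization)."""
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    return 1000.0 * snp_count / genome_length_bp


# Made-up example: 30,000 SNPs called against a 3 Mbp reference genome.
example_density = snp_density_per_kb(30_000, 3_000_000)

# Consistency check on the counts quoted in the text.
new_genomes_total = 85 + 107                    # classified + unclassified species
fold_vs_original = round(282 / 152, 2)          # representative genomes vs. original set
```

The two categories sum to the 192 newly added genomes, and 282 genomes is about 1.86 times the original 152, which matches the "nearly double" wording.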

Functions of gut microbiome bacteria. To better elucidate functions of the gut microbiota, we annotated gene functions in 1,520 CGR genomes using KEGG (the Kyoto Encyclopedia of Genes and Genomes) 18. Functional pathways at KEGG level 2 showed that pathways involved in carbohydrate and amino acid metabolism are abundant in all isolated strains, suggesting that these are core functions of the gut microbiota (Supplementary Fig. 6). We also analyzed KEGG level 3 pathways and focused on those enriched at the phylum or genus level (Fig. 3a). We found that lipopolysaccharide biosynthesis (ko00540) genes were widely distributed in the phyla Fusobacteria, Bacteroidetes and Proteobacteria, the main phyla of Gram-negative bacteria. Genes involved in glycan degradation (ko00531 and ko00511) were abundant in the genomes of the Bacteroidetes phylum. This observation is consistent with the notion that members of Bacteroidetes are prominent human gut symbionts that help degrade glycans in the diet and the gut mucosa 19. The members of the Bacteroidetes also possess a high proportion of genes involved in sphingolipid metabolism (ko00600), glycosphingolipid biosynthesis (ko00601, ko00603 and ko00604) and steroid hormone biosynthesis (ko00140). Sphingolipid and hormone biosynthesis pathways are ubiquitous in eukaryotic cells but absent from most bacteria. These results suggest that members of the Bacteroidetes not only participate in energy metabolism in the gut, but may also act in sphingolipid and hormone signaling in mammalian cells. The Proteobacteria showed relatively high abundance in genes involved in degradation of xenobiotics (ko01220), possibly contributing to the degradation of environmental chemicals and pharmaceuticals in the gut.

The signal transduction system (two-component system, ko02020) and xenobiotics degradation (KEGG level 2 pathway) were ubiquitous in the genera Paenibacillus, Bacillus, Klebsiella, Escherichia, Citrobacter and Enterobacter, which are also present in environmental niches, such as soil and water. The abundant signal transduction and xenobiotics degradation systems allow these genera to sense and respond to various stresses and toxic substances present in natural environments. Cell motility (chemotaxis, ko02030; flagellar assembly, ko02040) was conserved in the genera Roseburia, Paenibacillus, Bacillus, Escherichia, Citrobacter and Enterobacter, but varied within the genera Clostridium and Eubacterium.

Next we investigated functions and pathways that are annotated in the KEGG database, but not categorized as KEGG pathways (Fig. 3b and Supplementary Table 9). Virulence factors and antibiotic resistance genes were annotated using the Virulence Factors Database (VFDB) 20 and Comprehensive Antibiotic Resistance Database (CARD) 21, respectively. Virulence factors and antibiotic resistance are clinically relevant and are abundant in the Proteobacteria phylum, suggesting that this phylum may be a reservoir for opportunistic pathogens. We examined the distribution of genes involved in responses to stresses frequently encountered by gut bacteria: oxygen tolerance and acid resistance. Oxygen tolerance was reflected by the number of genes encoding catalase and superoxide dismutase, two detoxification enzymes that scavenge reactive oxygen species produced during aerobic respiration. As expected, the facultative anaerobic bacteria in the genera Paenibacillus, Bacillus, Klebsiella, Escherichia, Citrobacter and Enterobacter were more oxygen tolerant. In addition to the previously reported Bacteroides fragilis 22, other members of Bacteroidetes also showed moderate oxygen tolerance. Notably, bacteria in the Bacteroidetes phylum and the Bifidobacterium genus generally lacked acid resistance genes, suggesting that potential probiotics based on these organisms may suffer impaired survival in the acidic stomach environment after oral administration. Finally, we examined the distribution of six bacterial functions in the CGR that might have beneficial effects on human health. Amino acid and vitamin B synthesis genes were widely present in various gut bacteria, suggesting that gut microbiota might be an alternative source for nutrients that are sparse in vegetarian diets. Genes encoding bacterial bile salt hydrolases, which transform primary bile acids into secondary bile acids in the human intestine, were also ubiquitous in most gut bacteria. Genes encoding β-galactosidases, which might attenuate problems associated with lactose intolerance, were relatively abundant in the phylum Bacteroidetes. Genes involved in bacteriocin synthesis in gut bacteria were relatively rare and did not show phylum- or genus-specific distribution.

Core and pan-genomes of underrepresented gut bacteria.

We carried out a pan-genome analysis of 36 species or clusters that contain more than ten genomes, as well as two other species enriched in healthy controls compared with patients with type 2 diabetes in previous studies 7,23,24, Faecalibacterium prausnitzii (cluster 63, seven genomes) and butyrate-producing bacterium SS3_4 (cluster 45, nine genomes). These clusters covered the phyla Firmicutes, Bacteroidetes, Actinobacteria and Proteobacteria (Supplementary Fig. 7a and Supplementary Table 8a). The pan-genome of a cluster can be defined as the sum of the core genes and dispensable genes (including unique genes and accessory genes) of all the members within that cluster 25. Our pan-genome analysis showed that Eubacterium rectale (cluster 37) contained the lowest proportion of core genes (12%); the remaining genes fell into accessory and unique genomes (38% and 40%, respectively). In contrast, Eubacterium 3_1 (cluster 6) contained the largest proportion of core genes (53%) (Supplementary Fig. 7b). The pan-genome fitting curves showed that most clusters in Bacteroidetes displayed an ‘open’ pan-genome and had a relatively large pan-genome size, with Bacteroides vulgatus having the largest pan-genome size at 14,970 genes (Supplementary Figs. 8 and 9 and Supplementary Table 8b). In contrast, members in the phylum Actinobacteria tend to represent a relatively ‘closed’ pan-genome, which was only slightly expanded by the addition of CGR genomes. These results suggest that gut bacterial genomes are variable in the Bacteroidetes phylum, less variable in the Firmicutes and Proteobacteria, and fairly conserved in the Actinobacteria.
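
The definition of a pan-genome given above (core genes plus dispensable genes, the latter split into accessory and unique genes) can be sketched in Python. This is a minimal illustration that assumes each genome has already been reduced to a set of gene-family identifiers; the function name and the toy data are hypothetical and do not reflect the software used in the study.

```python
from collections import Counter


def partition_pan_genome(genomes):
    """Split gene families into core, accessory and unique sets.

    genomes: dict mapping genome name -> set of gene-family IDs
    core:      families present in every genome
    unique:    families present in exactly one genome (and not core)
    accessory: all remaining families
    """
    n = len(genomes)
    counts = Counter(family for families in genomes.values() for family in families)
    core = {f for f, c in counts.items() if c == n}
    unique = {f for f, c in counts.items() if c == 1} - core
    accessory = set(counts) - core - unique
    return {"core": core, "accessory": accessory, "unique": unique}


# Toy cluster of three genomes with made-up gene families.
toy_cluster = {
    "genome_A": {"a", "b", "c", "d"},
    "genome_B": {"a", "b", "c", "e"},
    "genome_C": {"a", "b", "e", "f"},
}
parts = partition_pan_genome(toy_cluster)
pan_size = sum(len(members) for members in parts.values())
core_fraction = round(100 * len(parts["core"]) / pan_size, 1)
```

In this toy cluster the pan-genome has six families, of which two (33.3%) are core. An "open" pan-genome in the sense of the text is one where `pan_size` keeps growing as more genomes are added.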

We also explored the distribution of genes involved in butyrate synthesis and antibiotic resistance in the pan-genomes of gut bacteria. Functional annotation showed that six clusters contained the complete acetyl-CoA to butyrate biosynthesis pathway (Fig. 4a). Among them, F. prausnitzii, E. rectale, butyrate-producing bacterium SS3_4 and Roseburia sp. CAG:45 harbored the complete pathway in their core genome, suggesting that the butyrate-producing function was highly conserved in these species. This result is consistent with the reported butyrate-producing capacity of these species 26–28. To explore the distribution of antibiotic resistance within the established pan-genomes, we annotated 25 antibiotic resistance genes (ARGs) in each pan-genome. Consistent with a previous report 29, the tetracycline resistance gene was widely present in the dispensable genome of these clusters (Fig. 4b). Notably, Escherichia coli contained almost all ARGs (23 of 25) in its pan-genome, with half of these present in the core genome (Fig. 4b). In contrast, Bifidobacterium species, including B. bifidum, B. adolescentis, B. longum and B. pseudocatenulatum, rarely contained ARGs in their pan-genomes.

To obtain a better understanding of the distribution of bacterial functions in the core and dispensable genomes, we annotated the genomes using the Clusters of Orthologous Groups (COG) database 30. This revealed that several housekeeping functions were significantly enriched in the core genome, including post-translational modification, protein turnover and chaperones (O, P = 7.28 × 10^-12); translation, ribosomal structure and biogenesis (J, P = 7.28 × 10^-12); energy production and conversion (C, P = 7.28 × 10^-12); amino acid transport and metabolism (E, P = 7.28 × 10^-12); nucleotide transport and metabolism (F, P = 7.28 × 10^-12); coenzyme transport and metabolism (H, P = 1.46 × 10^-11); lipid transport and metabolism (I, P = 2.40 × 10^-10); and inorganic ion transport and metabolism (P, P = 2.40 × 10^-10) (Supplementary Fig. 10). By contrast, COG categories enriched in the dispensable genome included cell wall-membrane-envelope biogenesis (M, P = 2.70 × 10^-9); cell motility (N, P = 3.11 × 10^-5); signal transduction mechanisms (T, P = 0.00039); intracellular trafficking, secretion and vesicular transport (U, P = 1.22 × 10^-7); defense mechanisms (V, P = 7.28 × 10^-12); transcription (K, P = 3.64 × 10^-11); replication, recombination and repair (L, P = 7.28 × 10^-12); and function unknown (S, P = 0.03111). The remaining COG categories showed no significant differences between the core and dispensable genomes.
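
The repeated value P = 7.28 × 10^-12 in the paragraph above invites a quick arithmetic check. The statistical test is not named in this section, so the following is an assumption rather than the authors' stated method: if core and dispensable genomes were compared with an exact two-sided paired signed-rank test across the 38 clusters, the smallest attainable P value would be 2 / 2^38, and every other P value would be an even multiple of 1 / 2^38. The sketch below checks that several reported values fit this pattern.

```python
# Assumption: exact two-sided paired signed-rank test over 38 clusters.
n_pairs = 38
total_sign_patterns = 2 ** n_pairs   # equally likely sign assignments under the null


def exact_p(tail_count):
    """Two-sided exact P value when `tail_count` sign patterns are at least as extreme in one tail."""
    return 2 * tail_count / total_sign_patterns


floor_p = format(exact_p(1), ".2e")      # smallest attainable P value
second_p = format(exact_p(2), ".2e")
fifth_p = format(exact_p(5), ".2e")
p_33 = format(exact_p(33), ".2e")
```

Under this assumption `floor_p` is `7.28e-12`, `second_p` is `1.46e-11`, `fifth_p` is `3.64e-11` and `p_33` is `2.40e-10`, which coincide with values reported for categories O, H, K and I. This suggests, without proving, that the many identical P values reflect the floor of an exact test over 38 clusters rather than identical effect sizes.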

Discussion

We used 11 culturing conditions for isolation of gut bacteria and archived more than 6,000 isolates. From this collection of isolates, we generated 1,520 high-quality draft reference genomes. The high coverage of the resulting CGR at the genus and species levels (including low-abundance species) demonstrates the value of culture-based methods for strain isolation from the gut microbiota. In line with this, a large number of gut bacterial species that were previously considered as unculturable have been successfully cultivated in two recent studies 31,32. Although there was some overlap between the novel species archived by CGR and in these two studies, the CGR contains 659 additional genomes (representing 209 clusters or species). Our cultivation methods can be applied to expand the CGR until it is saturated with the genomes of culturable gut bacteria. After that, single-cell sequencing can be used to investigate genomes of unculturable bacteria, with an overall aim of defining a saturated set of reference genomes of gut microbiota to underpin a better understanding of gut microbiome biology.

We applied our CGR genome dataset to assign functions to gut bacteria. For example, we found that virulence factors and antibiotic resistance genes are enriched in Klebsiella, Escherichia, Citrobacter and Enterobacter, which are opportunistic pathogens frequently isolated in clinical samples 33. The abundance of signal transduction and cell motility genes in these bacteria could further contribute to their pathogenicity 34,35. Notably, the Proteobacteria also possess abundant genes for degradation of xenobiotics, which might affect drug metabolism of patients in drug therapy. In line with this, a recent study reported that intratumor Proteobacteria can metabolize chemotherapeutic drugs into inactive forms and thus attenuate the efficacy of cancer therapies 36. The genes involved in beneficial functions such as glycan degradation and vitamin B synthesis are enriched in the Bacteroides genus, consistent with its mutualistic role in the human gut. Notably, we found that Bacteroides species contain a considerable number of genes involved in sphingolipid and steroid hormone synthesis, suggesting their potential for modulating signaling in mammalian cells. In support of this, a recent study reported that Bacteroides fragilis can take advantage of sphingolipid signaling to enable symbiosis in the intestine 37. It is noteworthy that genes involved in glycan degradation and sphingolipid metabolism were also enriched in the genus Bifidobacterium, another well-known gut commensal microbe. However, genes involved in both pathways were not abundant in the Prevotella genus, suggesting a distinct function of Prevotella compared with other members of the Bacteroidetes phylum. This might account for observed negative correlations between the relative abundances of Prevotella and Bacteroides in the gut microbiota 38. The potential role of gut bacteria in metabolism of estrogens has long been recognized 39, but detailed mechanistic studies are still lacking. It will be interesting to explore the implication of this unique function of gut bacteria in hormone-related health or disease. The CGR also enabled the identification of several potential bacteriocin-producing bacteria strains, which merit further verification.

The CGR will improve metagenomic analyses, genome variation analyses, functional characterization and pan-genome analyses. The isolated gut bacteria strains have been deposited in the China National GeneBank (CNGB) and may be useful for studies that aim to alter microbiota functions, as novel probiotics, or for verification of disease-associated bacterial markers.

CGR将改进宏基因组分析、基因组变异分析、功能特征分析和泛基因组分析。分离出的肠道细菌菌株已存入中国国家基因库（CNGB），并可能有助于旨在改变微生物群功能的研究，作为新型益生菌，或用于验证疾病相关的细菌标记。

Methods

Anaerobic cultivation of fecal bacteria.

粪便细菌的厌氧培养。

Fecal samples were collected from 155 healthy human donors who had not taken any drugs during the month before sampling. Detailed information is given in Supplementary Table 2. The samples were immediately transferred to an anaerobic chamber (Bactron Anaerobic Chamber, Bactron IV-2, Shellab, USA), homogenized in pre-reduced phosphate-buffered saline (PBS) supplemented with 0.1% cysteine, and then diluted and spread on agar plates with different growth media (Supplementary Table 1). Plates were incubated under anaerobic conditions in an atmosphere of 90% N2, 5% CO2 and 5% H2 at 37 °C for 2–3 d. Single colonies were picked and streaked onto new plates to obtain single clones. All the strains were stored in a glycerol suspension (20%, v/v) containing 0.1% cysteine at –80 °C. The collection of the 155 samples was approved by the Institutional Review Board on Bioethics and Biosafety of BGI under number BGI-IRB17005-T1. All protocols were in compliance with the Declaration of Helsinki, and explicit informed consent was obtained from all participants. Bacteria in the CGR (Culturable Genome Reference) are deposited in and are available from the E-BioBank (EBB) of the China National GeneBank (http://ebiobank.cngb.org/index.php?g=Content&m=Hql&a=sample_5&id=393).

粪便样本来自155名健康的人类捐赠者，在采样前的最后一个月没有服用任何药物。详细资料见补充表2。样品被立即转移到厌氧室（Bactron厌氧室，Bactron IV-2，Shellab，美国），在补充有0.1%半胱氨酸的预还原磷酸盐缓冲盐水（PBS）中均质，然后稀释并铺在有不同生长介质的琼脂平板上（补充表1）。将平板在厌氧条件下，在90% N 2、5% CO2和5% H2的气氛中，在37℃下培养2-3天。摘取单个菌落，并将其分装到新的平板上以获得单个克隆。所有菌株被保存在含有0.1%半胱氨酸的甘油悬浮液（20%，v/v）中，保存在-80℃。155个样本的收集得到了BGI生物伦理和生物安全机构审查委员会的批准，编号为BGI-IRB17005-T1。所有方案都符合《赫尔辛基宣言》，并获得所有参与者的明确知情同意。CGR（可培养基因组参考）中的细菌被保存在中国国家基因库（http://ebiobank.cngb.org/index.php?g=Content&m=Hql&a=sample_5&id=393）的电子生物库（EBB）中，并可从该库中获取。

Whole-genome sequencing and de novo assembly.

全基因组测序和从头组装。

DNA extraction. Isolates cultivated to stationary phase were centrifuged at 7,227 × g at 4 °C for 10 min, and the resulting pellets were resuspended in 1 ml of Tris-EDTA. For bacterial cell lysis, 50 µl of 10% SDS and 10 µl of proteinase K (20 mg/ml) were added, and the solution was incubated at 55 °C in a water bath for 2 h. The released genomic DNA was extracted using the phenol-chloroform method 40.

DNA提取。培养到静止期的分离物在4℃下以7,227g离心10分钟，并将得到的颗粒重新悬浮在1ml Tris-EDTA中。对于细菌细胞裂解，加入50微升10%的SDS和10微升蛋白酶K（20毫克/毫升），溶液在55℃的水浴中孵育2小时，使用苯酚-氯仿法40提取释放的基因组DNA。
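
The lysis step gives stock concentrations and volumes rather than final concentrations. The short calculation below works out the final concentrations under the assumption that the pellet volume is negligible, so the total volume is simply the sum of the three liquids; the helper name is illustrative.

```python
def final_concentration(stock_conc, stock_vol_ul, total_vol_ul):
    """Concentration after diluting stock_vol_ul of a stock into total_vol_ul."""
    return stock_conc * stock_vol_ul / total_vol_ul


# Volumes from the protocol, in microlitres.
TE_UL = 1000.0           # 1 ml Tris-EDTA used to resuspend the pellet
SDS_UL = 50.0            # 10% SDS stock
PROTEINASE_K_UL = 10.0   # 20 mg/ml proteinase K stock

# Assumption: the pellet itself adds no volume.
TOTAL_UL = TE_UL + SDS_UL + PROTEINASE_K_UL

sds_final_percent = final_concentration(10.0, SDS_UL, TOTAL_UL)
proteinase_k_final_mg_per_ml = final_concentration(20.0, PROTEINASE_K_UL, TOTAL_UL)

print(f"SDS: {sds_final_percent:.2f}%")
print(f"Proteinase K: {proteinase_k_final_mg_per_ml:.3f} mg/ml")
```

Under that assumption the lysis mix contains roughly 0.47% SDS and 0.19 mg/ml proteinase K.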

Genome sequencing and assembly. Paired-end libraries with an insert size of 500 bp were constructed and sequenced on the Illumina HiSeq 2000 platform to obtain about 100× clean data for each sample. The reads were assembled using SOAPdenovo 2.04 41 to form scaffolds, from which the rRNA genes were extracted by RNAmmer 1.2 42. An in-house pipeline was used to obtain the best assembly; it comprised an orthogonal design to investigate L,M,d,D,L,u,G (arguments of SOAPdenovo) and a single-factor design to investigate K (argument of SOAPdenovo), comprehensively considering contig average length, longest scaffold and rRNA score. Libraries with an insert size of 240 bp were constructed and sequenced on the Ion Proton platform, which produced about 100× clean data for each sample. The reads were assembled with SPAdes (version 3.1.0) 43 to form scaffolds.

基因组测序和组装。在Illumina Hiseq 2000平台上构建了插入大小为500bp的成对端文库并进行测序，以获得每个样品的约100×清洁数据。使用SOAPdenovo 2.04 41对读数进行组装，形成支架，用RNAMMer 1.2 42从中提取rRNA基因。一个内部管道被用来获得最佳组装，包含一个正交设计来研究L,M,d,D,L,u,G（SOAPdenovo的参数）和一个单因素设计来研究K（SOAPdenovo的参数），综合考虑Contig平均长度，最长的支架和rRNA得分。构建了插入大小为240bp的文库，并在ionProton平台上进行测序，为每个样品产生了大约100×清洁数据。读数通过SPAdes（3.1.0版）43进行组装，形成支架。
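
"About 100× clean data" refers to sequencing depth: total sequenced bases divided by genome size. The sketch below shows that arithmetic. The 5 Mb genome size and 100 bp read length are illustrative assumptions, since the excerpt gives neither value.

```python
import math


def sequencing_depth(n_reads, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_for_depth(target_depth, read_length_bp, genome_size_bp):
    """Number of reads needed to reach target_depth on average."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Assumed values for illustration: a 5 Mb bacterial genome and 100 bp reads.
GENOME_SIZE_BP = 5_000_000
READ_LENGTH_BP = 100

needed_reads = reads_for_depth(100, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"reads needed for 100x: {needed_reads:,}")
print(f"read pairs needed:     {needed_reads // 2:,}")
print(f"depth check:           {sequencing_depth(needed_reads, READ_LENGTH_BP, GENOME_SIZE_BP):.0f}x")
```

With these assumed values, 100× depth corresponds to 5 million reads, or 2.5 million read pairs for a paired-end library.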

Assessment of genome quality. Six high-quality draft assembly criteria from the Human Microbiome Project (HMP) 14 and rRNA (5S, 16S and 23S) completeness were adopted to ensure the assembly quality. The criteria are (i) 90% of the genome assembly must be included in contigs >500 bp, (ii) 90% of the assembled bases must be at >5× read coverage, (iii) the contig N50 must be >5 kb, (iv) the scaffold N50 must be >20 kb, (v) the average contig length must be >5 kb, and (vi) >90% of the core genes 44,45 must be present in the assembly.

评估基因组质量。采用人类微生物组计划（HMP）14和rRNA（5s、16s和23s）完整性的六个高质量的组装草案标准来确保组装质量。这些标准是：(i)90%的基因组装配必须包括在>500 bp的等位基因组中，(ii)90%的装配碱基必须在>5×读数覆盖率，(iii)等位基因组N50必须>5 kb，(iv)支架N50必须>20 kb，(v)平均等位基因组长度必须>5 kb，以及(vi)>90%的核心基因44,45必须存在于装配中。
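
The six criteria above can be expressed as a simple filter over assembly statistics. The following is a minimal sketch, not the authors' pipeline: the `AssemblyStats` container and its field names are hypothetical, the coverage and core-gene fractions are assumed to be computed elsewhere, and the rRNA completeness check is not modelled.

```python
from dataclasses import dataclass
from typing import List


def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


@dataclass
class AssemblyStats:
    # Hypothetical container; field names are illustrative only.
    contig_lengths: List[int]
    scaffold_lengths: List[int]
    frac_bases_over_5x: float       # fraction of assembled bases at >5x read coverage
    frac_core_genes_present: float  # fraction of core genes found in the assembly


def hmp_checks(stats):
    """Evaluate criteria (i) to (vi) and return one boolean per criterion."""
    contigs = stats.contig_lengths
    total = sum(contigs)
    in_long_contigs = sum(length for length in contigs if length > 500)
    return {
        "i_90pct_in_contigs_gt_500bp": total > 0 and in_long_contigs / total >= 0.90,
        "ii_90pct_bases_gt_5x": stats.frac_bases_over_5x >= 0.90,
        "iii_contig_n50_gt_5kb": n50(contigs) > 5_000,
        "iv_scaffold_n50_gt_20kb": n50(stats.scaffold_lengths) > 20_000,
        "v_mean_contig_gt_5kb": total > 0 and total / len(contigs) > 5_000,
        "vi_core_genes_gt_90pct": stats.frac_core_genes_present > 0.90,
    }


def passes_hmp_criteria(stats):
    return all(hmp_checks(stats).values())


# Made-up numbers, for illustration only.
good = AssemblyStats([60_000, 40_000, 30_000, 8_000, 400], [100_000, 38_400], 0.97, 0.95)
poor = AssemblyStats([60_000, 40_000, 30_000, 8_000, 400], [100_000, 38_400], 0.97, 0.85)

print(passes_hmp_criteria(good))
print([name for name, ok in hmp_checks(poor).items() if not ok])
```

Returning a per-criterion dictionary rather than a single boolean makes it easy to see which threshold a rejected assembly failed.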

。。。。。。
