BMC Genomics (4.547/Q2)
Genome-wide identification of SSR and SNP markers from the non-heading Chinese cabbage for comparative genomic analyses
用于比较基因组分析的不结球大白菜中 SSR 和 SNP 标记的全基因组鉴定
Abstract
Background: Non-heading Chinese cabbage (NHCC), belonging to Brassica, is an important leaf vegetable in Asia. Although genetic analyses have been performed through conventional selection and breeding efforts, the domestication history of NHCC and the genetics underlying its morphological diversity remain unclear. Thus, the reliable molecular markers representative of the whole genome are required for molecular-assisted selection in NHCC.
Results: A total of 20,836 simple sequence repeats (SSRs) were detected in NHCC, containing repeat types from mononucleotide to nonanucleotide. The average density was 62.93 SSRs/Mb. In gene regions, 5,435 SSRs were identified in 4,569 genes. A total of 5,008 primer pairs were designed, and 74 were randomly selected for validation. Among these, 60 (81.08%) were polymorphic in 18 Cruciferae. The number of polymorphic bands ranged from two to five, with an average of 2.70 for each primer. The average values of the polymorphism information content, observed heterozygosity, Hardy-Weinberg equilibrium, and Shannon’s information index were 0.2970, 0.4136, 0.5706, and 0.5885, respectively. Four clusters were classified according to the unweighted pair-group method with arithmetic average cluster analysis of 18 genotypes. In addition, a total of 1,228,979 single nucleotide polymorphisms (SNPs) were identified in the NHCC through a comparison with the genome of Chinese cabbage, and the average SNP density in the whole genome was 4.33/Kb. The number of SNPs ranged from 341,939 to 591,586 in the 10 accessions, and the average heterozygous SNPs ratio was ~42.53%. All analyses showed these markers were high quality and reliable. Therefore, they could be used in the construction of a linkage map and for genetic diversity studies for NHCC in future.
Conclusions: This is the first systematic and comprehensive analysis and identification of SSRs in NHCC and 17 species. The development of a large number of SNP and SSR markers was successfully achieved for NHCC. These novel markers are valuable for constructing genetic linkage maps, comparative genome analysis, quantitative trait locus (QTL) mapping, genome-wide association studies, and marker-assisted selection in NHCC breeding system research.
背景:不结球大白菜(NHCC)属于芸薹属,是亚洲重要的叶菜类蔬菜。尽管通过常规选择和育种工作进行了遗传分析,但 NHCC 的驯化历史及其形态多样性背后的遗传学仍不清楚。因此,NHCC 的分子辅助选择需要代表全基因组的可靠分子标记。
结果:在 NHCC 中共检测到 20,836 个简单序列重复 (SSR),包含从单核苷酸到九核苷酸的重复类型。平均密度为 62.93 SSRs/Mb。在基因区域,4,569 个基因中鉴定出 5,435 个 SSR。共设计引物对5008对,随机抽取74对进行验证。其中60个(81.08%)在18个十字花科中具有多态性。多态性条带的数量从 2 到 5 不等,每个引物平均为 2.70。多态性信息含量、观察到的杂合性、Hardy-Weinberg平衡和香农信息指数的平均值分别为0.2970、0.4136、0.5706和0.5885。根据对18个基因型的算术平均聚类分析,根据未加权对组方法对四个聚类进行分类。此外,通过与大白菜基因组比较,在NHCC中共鉴定出1,228,979个单核苷酸多态性(SNP),全基因组的平均SNP密度为4.33/Kb。 10 份材料的 SNP 数量在 341,939 到 591,586 之间,平均杂合 SNP 比率约为 42.53%。所有分析均表明这些标记物质量高且可靠。因此,它们可用于构建连锁图谱和未来 NHCC 的遗传多样性研究。
结论:这是对 NHCC 和 17 个物种中 SSR 的首次系统和全面的分析和鉴定。 NHCC成功开发了大量SNP和SSR标记。这些新标记对于构建遗传连锁图谱、比较基因组分析、数量性状位点 (QTL) 作图、全基因组关联研究和 NHCC 育种系统研究中的标记辅助选择具有重要意义。
Keywords: Non-heading Chinese cabbage, Comparative genomic analysis, SSR, SNP
关键词:不结球大白菜,比较基因组分析,SSR,SNP
Background
Non-heading Chinese cabbage (Brassica rapa ssp. chinensis, 2n = 2x = 20) is a species belonging to the Brassica genus of Cruciferae family, which contains 338 genera and over 3,700 species, including the model plant Arabidopsis. In 2012, the production of Brassica vegetables reached 70.10 million tons worldwide (http://faostat.fao.org). The six widely cultivated Brassica are described by the classical and famous “U’s triangle”, which includes three diploid species, B. rapa (A genome, 2n = 20), Brassica nigra (B genome, 2n = 16), Brassica oleracea (C genome, 2n = 18), and three allopolyploid species, Brassica juncea (AB genome, 2n = 36), Brassica napus (AC genome, 2n = 38) and Brassica carinata (BC genome, 2n = 34). Brassica rapa contains several subspecies such as Chinese cabbage (B. rapa ssp. pekinensis), NHCC and turnip (B. rapa ssp. rapa). According to the main cultivation specialties, biological properties, and morphological characteristics, NHCC are classified into five varieties, including Pak-choi, Wutatsai, Flowering Chinese cabbage, Taitsai, and Tillering cabbage. NHCC is widely used as a vegetable crop because of the strong adaptability, short growth period, good quality, unique flavor, and rich nutrition. Thus, it is widely cultivated in Southeast Asia, Japan, USA, and Europe, and is gradually becoming an important vegetable worldwide.
不结球大白菜(Brassica rapa ssp. chinensis,2n = 2x = 20)是十字花科芸薹属的一个物种,包括模式植物拟南芥在内,共有338属3700余种。 2012年,全球芸薹类蔬菜产量达到7010万吨(http://faostat.fao.org)。 6 种广泛栽培的芸薹属经典著名的“U 三角”,其中包括三个二倍体物种,油菜(A 基因组,2n = 20)、黑芥(B 基因组,2n = 16)、甘蓝(C基因组,2n = 18)和三个异源多倍体物种,芥菜(AB 基因组,2n = 36)、欧洲油菜(AC 基因组,2n = 38)和油菜(BC 基因组,2n = 34)。油菜包含几个亚种,例如大白菜 (B. rapa ssp. pekinensis)、NHCC 和萝卜 (B. rapa ssp. rapa)。根据主要栽培特性、生物学特性和形态特征,NHCC分为白菜、五他菜、开花大白菜、大菜、分蘖白菜5个品种。 NHCC具有适应性强、生长期短、品质好、风味独特、营养丰富等特点,被广泛用作蔬菜作物。因此,它在东南亚、日本、美国和欧洲广泛种植,并逐渐成为世界范围内的重要蔬菜。
The development of molecular markers for the detection and exploitation of DNA polymorphisms is a significant application in the field of molecular genetics. The detection and analysis of genetic variation can help us to understand the molecular basis of various biological phenomena. Since the advent of restriction fragment length polymorphism (RFLP) markers, a range of other markers, such as random amplified polymorphism DNA (RAPD), amplified fragment length polymorphism (AFLP), sequence tag sites (STSs), SNPs, and SSRs, have been introduced during the 20th century to fulfill various demands of breeders.
开发用于检测和利用 DNA 多态性的分子标记是分子遗传学领域的重要应用。 遗传变异的检测和分析可以帮助我们了解各种生物现象的分子基础。 自限制性片段长度多态性 (RFLP) 标记出现以来,一系列其他标记,如随机扩增多态性 DNA (RAPD)、扩增片段长度多态性 (AFLP)、序列标签位点 (STS)、SNP 和 SSR 在 20 世纪被引入以满足育种者的各种需求。
Assigning molecular markers to linkage groups and constructing genetic maps is an important step for analyzing the genome of species. These linkage maps have been used for marker-assisted breeding, map-based cloning strategies, genome organization and comparative genomics of important species, and the dissection of quantitative traits. Polymerase chain reaction-based markers have been widely used in the construction of genetic linkage maps for B. oleracea, B. nigra, B. juncea, and B. napus. A number of genetic linkage maps based on a range of markers, including RFLP, RAPD, SSR, and AFLP, have been constructed for Chinese cabbage. However, there are few linkage maps for NHCC.
将分子标记分配给连锁群并构建遗传图谱是分析物种基因组的重要步骤。这些连锁图已用于标记辅助育种、基于图的克隆策略、重要物种的基因组组织和比较基因组学以及数量性状的剖析。基于聚合酶链反应的标记已广泛用于构建甘蓝、黑芥、芥菜和欧洲油菜的遗传连锁图谱。基于一系列标记,包括RFLP、RAPD、SSR和AFLP,已经为大白菜构建了许多遗传连锁图谱。然而,NHCC 的连锁图谱很少。
SSR and SNP markers are distributed throughout the genome, and they gradually became preferred markers for many applications in genetics and genomics. They are suitable for the fine mapping of genes and association studies, which aim at identifying alleles potentially affecting important agronomic traits. However, without large numbers of SSR and SNP markers, such studies have not been available in most crop species. Currently, with the development of next-generation sequencing, it is feasible to develop a large number of SSR and SNP markers. Developing a large set of SSR and SNP markers will facilitate the fine mapping of QTLs, improve the identification and exploitation of genes affecting important traits, and enable selective breeding through genomic selection.
SSR 和 SNP 标记分布在整个基因组中,它们逐渐成为许多遗传学和基因组学应用的首选标记。 它们适用于基因的精细定位和关联研究,旨在识别可能影响重要农艺性状的等位基因。 然而,在没有大量 SSR 和 SNP 标记的情况下,大多数作物物种都无法进行此类研究。 目前,随着二代测序的发展,开发大量的SSR和SNP标记是可行的。 开发大量的 SSR 和 SNP 标记将有助于 QTL 的精细定位,改进影响重要性状的基因的识别和利用,并通过基因组选择实现选择性育种。
NHCC originated in China and has rich and diverse germplasm. Given its important economic value and its close relationship to A. thaliana and Chinese cabbage, 10 NHCC accessions were re-sequenced. In addition, the genome of the representative accession, NHCC001, has recently been assembled. Here, we report, for the first time, a survey of whole-genome sequences to develop a large number of SSR and SNP markers. These markers increase the density of the existing NHCC genetic maps and could also be a useful resource for high-throughput QTL mapping and marker-assisted NHCC improvement. Furthermore, breeders could use these markers in marker-assisted selection to introduce beneficial genes and improve genetic diversity.
NHCC起源于中国,具有丰富多样的种质资源。 鉴于其重要的经济价值及其与拟南芥和大白菜的密切关系,对 10 个 NHCC 种质进行了重新测序。 此外,最近还组装了代表种质 NHCC001。 在这里,我们首次报告了对全基因组序列的调查,以开发大量 SSR 和 SSR 标记。 这些标记提高了现有遗传 NHCC 图的密度,这也可能是高通量 QTL 作图和标记辅助 NHCC 改进的有用来源。 此外,育种者可以引入有益基因,提高遗传多样性,使用这些标记进行标记辅助选择。
Results
The development of SSRs in NHCC and a comparative analysis with 17 species
We analyzed the distribution of perfect microsatellites with ≥ 3 repeat units and a minimum total length of 18 bp in ~331.1 Mb of the NHCC genome. All SSRs identified in this study have been submitted to the nhccdata website (http://nhccdata.njau.edu.cn/). Perfect microsatellites were identified in the genomic sequences of NHCC and 17 other species. A total of 20,836 SSRs were detected in the NHCC genome, which corresponds to an overall density across the genome of 62.93 SSRs/Mb. Surprisingly, NHCC had a higher microsatellite density than sorghum (56.00 SSRs/Mb), potato (55.78 SSRs/Mb), and maize (24.85 SSRs/Mb). However, its microsatellite density was far lower than that of watermelon (217.96 SSRs/Mb), moss (241.08 SSRs/Mb), and volvox (377.15 SSRs/Mb) (Table 1, Additional file 1: Table S1).
我们分析了在 NHCC 基因组约 331.1 Mb 中具有 ≥ 3 个重复单元且最小总长度为 18 bp 的完美微卫星的分布。本研究中确定的所有 SSR 均已提交至 nhccdata 网站(http://nhccdata.njau.edu.cn/)。鉴定了 NHCC 和其他 17 个物种的基因组序列中完美微卫星的含量。在 NHCC 基因组中共检测到 20,836 个具有重复序列的 SSR,这意味着整个基因组的总密度为 62.93 SSR/Mb。令人惊讶的是,NHCC 的微卫星密度高于高粱 (56.00 SSR/Mb)、马铃薯 (55.78 SSR/Mb) 和玉米 (24.85 SSR/Mb)。然而,其微卫星密度远低于西瓜(217.96 SSR/Mb)、苔藓(241.08 SSR/Mb)和 团藻(377.15 SSR/Mb)(表 1,附加文件 1:表 S1)。
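The detection criteria above (perfect repeats, motifs of 1 to 9 bp, at least 3 repeat units, at least 18 bp in total) can be sketched with a regular-expression scan. This is a minimal illustration, not the tool used in the study; the function names and the rule of discarding motifs that are themselves repeats of a shorter unit are assumptions made for the example.

```python
import math
import re

MIN_REPEATS = 3      # from the text: at least 3 repeat units
MIN_LENGTH_BP = 18   # from the text: minimum total length of 18 bp
MAX_MOTIF_LEN = 9    # mono- to nonanucleotide motifs


def is_primitive(motif):
    """Return True if the motif is not itself a repeat of a shorter unit."""
    k = len(motif)
    for d in range(1, k):
        if k % d == 0 and motif[:d] * (k // d) == motif:
            return False
    return True


def find_perfect_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Hypothetical helper: the start is 0-based, and non-primitive motifs
    (e.g. "AA" as a dinucleotide) are skipped so each run is reported once.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, MAX_MOTIF_LEN + 1):
        # Enough repeats to satisfy both the unit and the length threshold.
        min_rep = max(MIN_REPEATS, math.ceil(MIN_LENGTH_BP / k))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


def ssr_density_per_mb(n_ssrs, genome_mb):
    """SSR density as counts per megabase."""
    return n_ssrs / genome_mb
```

With the figures reported above, `ssr_density_per_mb(20836, 331.1)` rounds to 62.93 SSRs/Mb.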
Dinucleotides were the most common SSR type in NHCC genomic sequences, representing 39.82% of all SSRs, followed by mono- (23.88%) and trinucleotides (18.22%). Octa- and nonanucleotides were the least frequent repeat types, together representing less than 2% of the total SSRs (Table 2). The distribution of SSR types in NHCC was most similar to those of Chinese cabbage and Arabidopsis, which had comparable relative and absolute frequencies for each SSR type, and it was least similar to the distributions in volvox and watermelon, for which trinucleotides were by far the most frequent repeat type. There were many more dinucleotides and trinucleotides than that in moss and volvox, respectively (Additional file 2: Figure S1, Additional file 1: Table S2).
二核苷酸是 NHCC 基因组序列中最常见的 SSR 类型,占所有 SSR 的 39.82%,其次是单核苷酸(23.88%)和三核苷酸(18.22%)。 八核苷酸和九核苷酸是最不常见的重复类型,合计不到总 SSR 的 2%(表 2)。 NHCC中SSR类型的分布与大白菜和拟南芥的分布最为相似,每种SSR类型的相对频率和绝对频率相当,至少与volvox和西瓜的分布相似,三核苷酸远 最频繁的重复类型。 分别比 moss 和 volvox 中的二核苷酸和三核苷酸多得多(附加文件 2:图 S1,附加文件 1:表 S2)。
We examined the distribution of NHCC microsatellites with regard to the number of repeat units (Figure 1, Additional file 1: Table S3). For all SSR classes, the microsatellite frequency decreased as the number of repeat units increased. However, this decrease was more gradual for mono- and dinucleotides than for longer repeat types, with pentanucleotides (from 919 to 154) to nonanucleotides (from 104 to 10) showing the most dramatic frequency reduction. Moreover, the total length of dinucleotide sequences (288.42 Kb) was much larger than that of any other repeat type. First, the mean number of repeat units in dinucleotides (17.38) was more than twice that in trinucleotides (7.10) and more than four times that in penta- to nonanucleotides (4.04–2.96) (Table 2). Second, dinucleotide repeats (25.06 SSRs/Mb) occurred more frequently than any other repeat type in NHCC. Therefore, dinucleotide repeats contributed more to the genome fraction occupied by SSRs than any other repeat type (Table 1).
我们检查了 NHCC 微卫星在重复单元数量方面的分布(图 1,附加文件 1:表 S3)。对于所有 SSR 类别,微卫星频率随着重复单元数量的增加而降低。然而,单核苷酸和二核苷酸的这种变化速度比更长的重复类型更缓慢,五核苷酸(从 919 到 154)到九核苷酸(从 104 到 10)显示出最显着的频率降低。此外,二核苷酸序列的总长度远大于其他重复序列,总长度为288.42 Kb。一方面,二核苷酸的平均重复单元数(17.38)是三核苷酸重复单元数(7.10)的两倍多,是五至九核苷酸的四倍(4.04- 2.96)(表 2)。另一方面,二核苷酸重复(25.06 SSR/Mb)比 NHCC 中的其他二核苷酸更频繁地发生。因此,与其他二核苷酸相比,二核苷酸重复对 SSR 占据的基因组部分的贡献更大(表 1)。
Of the 20,836 identified SSR markers, 822 were not anchored on chromosomes. In addition, the number of SSR markers differed among chromosomes. The density of SSRs ranged from 67.25 to 73.53 across the chromosomes. Chromosome 9 carried the most SSR markers (2,996, 14.38%), while chromosome 4 carried the fewest (1,337, 6.42%) (Additional file 1: Table S4). A total of 1,685 types of repeat motifs were detected among the NHCC genomic SSRs. The most abundant motif was A/T (4,185, 20.09%), followed by AG/CT (4,107, 19.71%), AT/TA (3,700, 17.76%), and AAG/CTT (1,281, 6.15%), which was similar to other Cruciferous species. Each of the remaining motifs accounted for less than 4% of the SSRs, and the CG/GC repeat motif was not found among the NHCC genomic SSRs (Additional file 1: Table S5).
在已识别的 20,836 个 SSR 标记中,有 822 个未锚定在染色体上。 此外,每条染色体上 SSR 标记的数量也不同。 整个染色体的 SSR 密度范围为 67.25 至 73.53。 最多的 SSR 标记(2,996, 14.38%)出现在 9 号染色体上,而最少的出现在 4 号染色体上(1,337, 6.42%)(附加文件 1:表 S4)。 在 NHCC 基因组 SSR 中共检测到 1,685 种重复基序。 最多的是A/T(4185,占20.09%),其次是AG/CT(4107,19.71%)、AT/TA(3700,17.76%)和AAG/CTT(1281,6.15%), 这与其他十字花科植物相似。 其余类型的重复率低于 4%,在 NHCC 基因组 SSR 中未发现 CG/GC 重复基序(附加文件 1:表 S5)。
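Motif labels such as AG/CT or AAG/CTT group a motif with its reverse complement, because the same repeat reads differently on the two strands and from different starting offsets. One common way to do this grouping, shown here as a sketch with hypothetical function names (the text does not say how the study grouped motifs), is to take the alphabetically smallest string among all rotations of the motif and of its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of a DNA motif."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif):
    """All cyclic rotations of a motif, e.g. AAG -> AAG, AGA, GAA."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical_motif(motif):
    """Smallest string among rotations of the motif and its reverse complement."""
    motif = motif.upper()
    return min(rotations(motif) + rotations(reverse_complement(motif)))


def motif_label(motif):
    """Label in the 'motif/reverse complement' style, e.g. 'AG/CT'."""
    canon = canonical_motif(motif)
    return canon + "/" + reverse_complement(canon)
```

Under this convention `motif_label("TTC")` gives `AAG/CTT` and `motif_label("GGGTTTA")` gives `AAACCCT/AGGGTTT`, so repeats found on either strand are counted under one label.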
The characteristics of SSR markers in NHCC genes and a functional analysis
A total of 5,435 SSRs were identified in NHCC genes, accounting for 26.08% of the total genomic SSRs. Trinucleotides were the most common SSR type, representing 38.47% (2,091) of all genic SSRs. Even though dinucleotides are the most common type in the genome, trinucleotides may be more common in gene regions because a change in the number of trinucleotide repeat units does not shift the reading frame. They were followed by dinucleotides (1,532, 28.19%) and mononucleotides (904, 16.63%) (Table 2). In total, 611 repeat motifs were identified in NHCC genes. The most abundant motif was AG/CT (783, 14.41%), followed by A/T (763, 14.13%), AAG/CTT (631, 11.61%), and AT/TA (582, 10.71%). Each of the other repeat motifs occurred at a rate of less than 10.00%, which was similar to their rates in the genome (Additional file 1: Table S6).
NHCC基因中共鉴定出5435个SSR,占基因组SSR总数的26.08%。 三核苷酸是最常见的 SSR 类型,占所有基因 SSR 的 38.47% (2,091)。 尽管二核苷酸是基因组中最常见的类型,但三核苷酸可能在基因区域中更常见,因为它们不会引起基因翻译变化。 其次是二核苷酸(1,532, 28.19%)和单核苷酸(904, 16.63%)(表 2)。 NHCC基因共鉴定出611个重复基序,其中AG/CT类型最多(783, 14.41%),其次是A/T(763, 14.13%)、AAG/CTT(631, 11.61%)和 AT/TA (582, 10.71%)。 其他重复基序的发生率低于 10.00%,这与它们在基因组中的发生率相似(附加文件 1:表 S6)。
These SSR markers were located in 4,569 genes, accounting for 10.97% of the total number of genes, and 708 genes contained several SSR markers. The functions of 3,141 genes containing SSRs were divided into three classes, cellular location, molecular function, and biological process. They were further subdivided into 38 functional subsets. The greatest number of genes was associated with the binding factors (2,245, 71.47%), followed by the genes involved in metabolic processes, catalytic activities, and cellular processes. This was similar to the classification of the 3,036 genes containing non-synonymous SNPs (Additional file 3: Figure S2).
这些SSR标记位于4569个基因中,占基因总数的10.97%,其中708个基因包含多个SSR标记。 含有 SSR 的 3141 个基因的功能分为细胞定位、分子功能和生物过程三类。 它们进一步细分为 38 个功能子集。 与结合因子相关的基因数量最多(2,245, 71.47%),其次是参与代谢过程、催化活性和细胞过程的基因。 这类似于包含非同义 SNP 的 3,036 个基因的分类(附加文件 3:图 S2)。
SSRs located near important functional genes, such as flower genes and glucosinolate genes, were also identified. Most plants undergo a major physiological change from vegetative to reproductive development before flowering. The formation of flowers is a prerequisite for successful sexual reproduction, and fruits of angiosperm flowers are a staple of human and livestock diets. Glucosinolates are a category of amino acid-derived secondary metabolites found in the Cruciferae family. Glucosinolates and their degradation products play important roles in interactions with pathogens and insects, as well as in human health. Based on their importance, we identified these genes and their related SSR markers in NHCC. In our analysis, 110 and 93 genes showed high homology (>90%) to the Chinese cabbage flower genes and glucosinolate genes, respectively. Finally, 180 and 136 SSRs were found in the vicinity (<40 Kb) of 86 flower genes and 62 glucosinolate genes, respectively. Interestingly, chromosome 9 carried more of these genes and related SSRs than any of the other nine chromosomes. These markers will be useful for marker-assisted selection breeding in the future (Figure 2, Additional file 1: Table S7).
还鉴定了位于重要功能基因附近的 SSR,例如花基因和硫代葡萄糖苷基因。大多数植物在开花前经历了从营养发育到生殖发育的重大生理变化。花的形成是成功有性繁殖的先决条件,被子植物花的果实是人类和牲畜饮食的主食。硫代葡萄糖苷是在十字花科中发现的一类氨基酸衍生的次级代谢物。硫代葡萄糖苷及其降解产物在病原体和昆虫相互作用中发挥重要作用,尤其是在人类健康方面。基于它们的重要性,我们在 NHCC 中鉴定了这些基因及其相关的 SSR 标记。在我们的分析中,110 个和 93 个基因分别与大白菜花基因和硫代葡萄糖苷基因显示出高度同源性(>90%)。最后,在 (<40 Kb) 86 个花基因和 62 个硫代葡萄糖苷基因附近分别发现了 180 个和 136 个 SSR。有趣的是,第 9 号染色体上这些基因和相关 SSR 的数量比其他 9 条染色体上的要多。这些标记将在未来用于标记辅助选择育种(图 2,附加文件 1:表 S7)。
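The search for SSRs within 40 Kb of a gene is a simple interval-distance test. The sketch below assumes hypothetical input tuples (gene ID, chromosome, start, end) and (SSR ID, chromosome, position); the identifiers and coordinates used with it are invented for illustration and are not taken from the study.

```python
def ssrs_near_genes(genes, ssrs, max_dist=40_000):
    """Map each gene ID to the SSR IDs lying closer than max_dist bp.

    genes: list of (gene_id, chrom, start, end)
    ssrs:  list of (ssr_id, chrom, pos)
    An SSR inside the gene has distance 0. Genes without a nearby SSR
    are left out of the result.
    """
    nearby = {}
    for gene_id, g_chrom, g_start, g_end in genes:
        for ssr_id, s_chrom, pos in ssrs:
            if s_chrom != g_chrom:
                continue
            if pos < g_start:
                dist = g_start - pos
            elif pos > g_end:
                dist = pos - g_end
            else:
                dist = 0
            if dist < max_dist:
                nearby.setdefault(gene_id, []).append(ssr_id)
    return nearby
```

This brute-force loop is adequate for a few hundred genes; for genome-scale queries, sorting positions per chromosome and using binary search would be the more usual design.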
The abundance and length frequency analyses of SSR repeat motifs
We conducted a detailed analysis of individual repeat motifs for each type of SSR found in the genomic sequences of NHCC and the other 17 species. The results showed that A/T (84.12%), AG/CT (49.51%), AAG/CTT (33.74%), AAAT/ATTT (27.06%), AAAAT/ATTTT (20.62%), AAAAAT/ATTTTT (9.51%), AAACCCT/AGGGTTT (12.73%), AAAAAAAT/ATTTTTTT (10.66%), and AAAATAAAT/ATTTATTTT (8.26%) were the most frequent motifs within each repeat type, from mono- to nonanucleotides, in the NHCC genome. A/T repeats were not only the predominant mononucleotide, but also the most frequent motif in the entire genome, accounting for 20.09% of the total SSRs, followed by AG/CT (19.71%) and AT/AT (17.76%) repeats. Together, these three motifs accounted for more than half of the total SSRs in NHCC genomic sequences (Additional file 1: Table S5). In addition, the motif density was calculated in the other 17 species for a comparative analysis. The density of A/T repeats was higher than that of C/G repeats in most examined species (14/18). For dinucleotides, all species had a relatively low density (0–0.17) of CG/CG repeats. The number of AT/AT repeats was higher than that of other dinucleotides in 17 species. However, AG/CT repeats (12.40) were slightly more abundant than AT/AT repeats (11.18) in NHCC. Surprisingly, the density of AC/GT repeats (44.64) was far greater than that of other dinucleotides in volvox. The density of AAG/CTT repeats was greater than that of other trinucleotides in the Cruciferous species (Arabidopsis, Chinese cabbage, and NHCC), which differed from the other examined species. Most species had a higher density of AAT/ATT repeats than of other trinucleotide repeats. However, the density of CCG/CGG repeats was higher than that of other trinucleotides in rice and volvox. In NHCC, as well as in most of the other species examined, AAAT/ATTT was the most common tetranucleotide repeat, whereas ACAT/ATGT (36.96) and AGAT/ATCT (1.64) repeats predominated in volvox and rice, respectively. Conversely, GC-rich motifs, such as CCCG/CGGG and CCGG/CCGG, had relatively low densities in most species analyzed. However, the opposite distribution was observed in volvox (Additional file 1: Table S2, Additional file 4: Figure S3a–c).
我们对 NHCC 和其他 17 个物种的基因组序列中发现的每种 SSR 的单个重复基序进行了详细分析。结果显示A/T(84.12%)、AG/CT(49.51%)、AAG/CTT(33.74%)、AAAT/ATTT(27.06%)、AAAAT/ATTTT(20.62%)、AAAAAT/ATTTTT(9.51%) )、AAACCCT/AGGGTTT (12.73%)、AAAAAAAT/ATTTTTTT (10.66%) 和 AAAATAAAT/ATTTATTTT (8.26%) 是 NHCC 基因组中从单核苷酸到九核苷酸最常见的基序。 A/T重复不仅是主要的单核苷酸,而且是整个基因组中最常见的基序,占总SSR的20.09%,其次是AG/CT(19.71%)和AT/AT(17.76%)重复。这三种重复类型占 NHCC 基因组序列中总 SSR 的一半以上(附加文件 1:表 S5)。此外,还计算了其他 17 个物种的基序密度以进行比较分析。结果表明,在大多数受检物种中,A/T 重复序列的密度高于 C/G 重复序列 (14/18)。对于二核苷酸,所有物种的 CG/CG 重复序列密度相对较低(0-0.17)。在 17 个物种中,AT/AT 重复的数量高于其他二核苷酸。然而,在 NHCC 中,AG/CT 重复序列 (12.40) 比 AT/AT 重复序列 (11.18) 略丰富。令人惊讶的是,AC/GT 重复序列的密度 (44.64) 远高于 volvox 中的其他二核苷酸。 AAG/CTT重复序列的密度高于十字花科(拟南芥、大白菜和NHCC)中的其他三核苷酸,这与其他检查的物种不同。大多数物种的 AAT/ATT 重复密度高于其他三核苷酸重复。然而,CCG/CGG 重复序列的密度高于水稻和 volvox 中的其他三核苷酸。在 NHCC 以及所检查的大多数其他物种中,不同四核苷酸的频率显示 AAAT/ATTT 重复是最常见的,而 ACAT/ATGT (36.96) 和 AGAT/ATCT (1.64) 重复在 volvox 和水稻中占主导地位,分别。相反,在大多数分析的物种中,富含 GC 的基序的密度相对较低,例如 CCCG/CGGG 和 CCGG/CCGG。然而,在 volvox 中观察到相反的分布(附加文件 1:表 S2,附加文件 4:图 S3a-c)。
The polymorphism analysis of SSR markers among 18 Cruciferae accessions
A total of 5,008 (92.14%) SSR primer pairs were designed from the 5,435 SSRs in the gene sequences. Of these, 74 primer pairs were selected for validation by SSR loci amplification, and 63 produced a reproducible and clear amplicon of the expected size. The product sizes ranged from 101 to 280 bp. A total of 60 primer pairs (81.08%) were polymorphic among the 18 analyzed Cruciferae accessions, which comprised one Arabidopsis, two broccoli, one Chinese cabbage, and 14 NHCC accessions (Additional file 1: Table S8, Table S9).
从基因序列中的 5,435 个 SSR 中设计了总共 5,008 (92.14%) 个 SSR 引物对。 其中,选择了 74 对引物通过 SSR 基因座扩增进行验证,其中 63 对产生了预期大小的可重复且清晰的扩增子。 产品大小范围为 101 至 280 bp。 在分析的 18 种十字花科植物中,共有 60 种(81.08%)具有多态性,包括一种拟南芥、两种西兰花、一种大白菜和 14 种 NHCC 种质(附加文件 1:表 S8,表 S9)。
A total of 162 polymorphic bands were produced by 60 primer pairs in the 18 accessions. The number of polymorphic bands ranged from two to five, with an average of 2.70 for each primer. The major allele frequency at each locus ranged from 0.4667 to 0.9722. The polymorphism information content (PIC) at each locus ranged from 0.0526 to 0.5802, with an average of 0.2970 per locus. The expected heterozygosity ranged from 0.0556 to 0.6506, and the observed heterozygosity ranged from 0.0000 to 1.0000. Although a limited number of SSR primers were used in this experiment, they produced rich polymorphic bands in the 18 Cruciferae accessions. The gene flow estimated from F-statistics ranged from 0.0000 to 4.5000. A total of 13 SSR primers showed significant deviations from the Hardy–Weinberg equilibrium (P_HW < 0.05). Nei’s genetic identity ranged from 0.5165 to 0.8799, and the genetic distance ranged from 0.1280 to 0.6439. Shannon’s information index ranged from 0.1269 to 1.2203, with an average of 0.5885 (Table 3, Additional file 1: Table S10). These results indicated that the markers captured a large amount of genetic diversity among the 18 Cruciferae accessions. The dendrogram showed that the accessions fell into four distinct clusters. Cluster 1 comprised the 14 NHCC accessions, covering the five varieties of NHCC. Chinese cabbage belonged to Cluster 2, which had a close relationship with the Taitsai variety of NHCC. Clusters 3 and 4 contained broccoli and Arabidopsis, respectively. The principal component analysis (PCA) and population structure analysis corroborated this classification (Figure 3).
18份材料中60对引物共产生162条多态性条带。多态性条带的数量从 2 到 5 不等,每个引物平均为 2.70。每个位点的主要等位基因频率范围为 0.4667 至 0.9722。每个位点的多态性信息含量(PIC)范围为0.0526~0.5802,平均为0.2970/位点。预期杂合度范围为 0.0556 到 0.6506,观察到的杂合度范围为 0.0000 到 1.0000。尽管本实验中使用了数量有限的 SSR 引物,但它们在 18 个十字花科种质中产生了丰富的多态性条带。从 F 统计量估计的基因流是从 0.0000 到 4.5000。共有 13 个 SSR 引物显示出与 Hardy-Weinberg 平衡的显着偏差(PHW < 0.05)。 Nei的遗传同一性范围为0.5165至0.8799,遗传距离范围为0.1280至0.6439。 Shannon 的信息指数介于 0.1269 到 1.2203 之间,平均为 0.5885(表 3,附加文件 1:表 S10)。这些结果表明,已经对 18 个十字花科植物的大量遗传多样性进行了评估。树状图显示它们分为四个不同的集群。集群 1 由 14 个 NHCC 种质组成,包括 5 个 NHCC 品种。大白菜属于Cluster 2,与NHCC的Taitsai品种关系密切。第 3 组和第 4 组分别包含西兰花和拟南芥。主成分分析 (PCA) 和种群结构分析证实了这一分类(图 3)。
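The per-locus statistics reported above can be computed from allele frequencies. The sketch below uses the standard textbook definitions: expected heterozygosity as 1 − Σp², PIC in the Botstein et al. form, and Shannon's index as −Σp ln p. The text does not state which software or exact estimators were used, so these formulas and function names are assumptions for illustration only.

```python
import math


def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2) for allele frequencies summing to 1."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    cross = 0.0
    for i in range(len(freqs)):
        for j in range(i + 1, len(freqs)):
            cross += 2.0 * freqs[i] ** 2 * freqs[j] ** 2
    return expected_heterozygosity(freqs) - cross


def shannon_index(freqs):
    """Shannon's information index I = -sum(p_i * ln p_i)."""
    return -sum(p * math.log(p) for p in freqs if p > 0)
```

For a locus with two equally frequent alleles, these give He = 0.5, PIC = 0.375, and I = ln 2 ≈ 0.693, which puts the reported averages (PIC 0.2970, I 0.5885) in context for mostly two- to three-allele loci.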
Figure 3 Cluster analyses of 18 Cruciferae accessions. (a) Plot of the three principal components from a principal components analysis of SSR variation among 18 genotypes of Cruciferae. Green circles represent non-heading Chinese cabbage accessions; pink pentagrams represent Chinese cabbage; blue triangles represent broccoli accessions; and red squares represent Arabidopsis. (b) Dendrogram for 18 Cruciferae accessions derived from a UPGMA cluster analysis based on 60 SSR markers. Four clusters were obtained according to the genetic identity (~75%). Cluster 1 indicates NHCC accessions; Cluster 2 indicates Chinese cabbage; Cluster 3 indicates broccoli accessions; and Cluster 4 indicates Arabidopsis. The numbers are bootstrap values based on 1,000 iterations. Only bootstrap values larger than 50 are shown. (c) Bayesian clustering (STRUCTURE, K = 4) of 18 Cruciferae accessions.
图 3 对 18 个十字花科种质的聚类分析。 ( a )从十字花科 18 种基因型中 SSR 变异的主成分分析中绘制的三个主成分图。 绿色圆圈代表无标题大白菜种质; 粉红色的五角星代表大白菜; 蓝色三角形代表西兰花种质; 红色方块代表拟南芥。 (b) 来自基于 60 个 SSR 标记的 UPGMA 聚类分析的 18 个十字花科种质的树状图。 根据遗传同一性(~75%)获得了四个簇。 第 1 组表示 NHCC 种质; 第 2 组表示大白菜; 第 3 组表示西兰花种质; 簇 4 表示拟南芥。 这些数字是基于 1,000 次迭代的引导值。 仅指示大于 50 的引导值。 (c) 18 个十字花科种质的贝叶斯聚类 (STRUCTURE, K = 4)。
The identification and characteristic of SNPs in 10 NHCC accessions
A comparison of 10 NHCC accessions of five varieties with the Chinese cabbage genome was used to develop SNPs. To increase accuracy and minimize false-positive SNPs, we eliminated SNP sites that had missing data in any one of the 10 NHCC accessions. Finally, 1,228,979 SNP loci were identified, and the average SNP density in the whole genome was 4.33/Kb. This was greater than in tomato (0.6/Kb) and rice (1.7/Kb), but lower than in citrus (6.1/Kb) and potato (11.5/Kb) [31]. All SNPs identified in this study have been submitted to the nhccdata website (http://nhccdata.njau.edu.cn/).
将五个品种的 10 个 NHCC 种质与大白菜基因组进行比较,用于开发 SNP。 为了提高准确性并最大限度地减少假阳性 SNP,我们消除了 10 个 NHCC 加入中任何一个中缺失数据的 SNP 位点。 最终鉴定出1,228,979个SNP位点,全基因组平均SNP密度为4.33/Kb。 这高于番茄 (0.6/Kb) 和水稻 (1.7/Kb),但低于柑橘 (6.1/Kb) 和马铃薯 (11.5/Kb) [31]。 本研究中鉴定的所有 SNP 均已提交至 nhccdata 网站(http://nhccdata.njau.edu.cn/)。
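The filtering step described above, dropping every SNP site that has missing data in any of the 10 accessions, can be sketched as follows. The site representation, the `"N"` missing-data code, and the function names are hypothetical; the text does not describe the file format or the sequence length used as the denominator for the density.

```python
def drop_sites_with_missing(sites, missing="N"):
    """Keep only sites genotyped in every accession.

    sites: list of (chrom, pos, genotypes), where genotypes is a list
    with one call per accession and `missing` marks an absent call.
    """
    return [site for site in sites if missing not in site[2]]


def snp_density_per_kb(n_snps, covered_bp):
    """SNP density as counts per kilobase of covered sequence."""
    return n_snps / (covered_bp / 1000.0)
```

Requiring complete data across all accessions is a strict filter: it reduces the number of sites but makes per-accession counts directly comparable.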
The number of SNPs for each accession ranged from 341,939 to 591,586. The average heterozygous ratio of the SNPs was ~42.53%, and the heterozygous ratio ranged from 18.92% to 65.07% among the 10 NHCC accessions. An average of 189,666 SNPs was identified in coding domain sequences. The number of non-synonymous SNPs ranged from 47,178 to 85,510, with an average of 66,965 (Table 4). Of the identified SNPs, excluding those that were heterozygous, an average of ~56.88% were transitions in the 10 NHCC accessions. The transition/transversion ratio can be used to measure genetic distances. Generally, the higher the transition/transversion ratio, the lower the genetic divergence between two species. The high ratio of 1.32 between NHCC and Chinese cabbage revealed a relatively low level of polymorphism between them. A relatively high frequency of C/T alleles was identified, which was also observed in citrus, eggplant, and bean (Figure 4, Additional file 1: Table S11).
每次加入的 SNP 数量从 341,939 到 591,586 不等。SNP的平均杂合率为~42.53%,10个NHCC加入的杂合率范围为18.92%至65.07%。在编码域序列中平均鉴定出 189,666 个 SNP。非同义 SNP 的数量从 47,178 到 85,510 不等,平均为 66,965(表 4)。在已识别的 SNP 中,不包括杂合子,平均约 56.88% 的 SNP 属于 10 个 NHCC 中的过渡类型。过渡/颠换比可用于测量遗传距离。一般来说,转换/颠换比率越高,两个物种之间的遗传分歧越低。NHCC和大白菜之间的高比率为1.32,表明它们之间的多态性水平相对较低。确定了相对较高频率的 C/T 等位基因,这也在柑橘、茄子和豆类中观察到(图 4,附加文件 1:表 S11)。
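The link between the ~56.88% transition share and the ratio of 1.32 is simple arithmetic: if a fraction f of SNPs are transitions, the transition/transversion ratio is f / (1 − f). The sketch below classifies SNPs and reproduces that figure; the function names are illustrative and not from the study.

```python
# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transition(ref, alt):
    """True for A<->G and C<->T substitutions, False for transversions."""
    return frozenset((ref.upper(), alt.upper())) in TRANSITIONS


def ts_tv_ratio(snps):
    """Transition/transversion ratio for a list of (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


def ratio_from_transition_fraction(fraction):
    """Convert a transition fraction f into the ratio f / (1 - f)."""
    return fraction / (1.0 - fraction)
```

With the reported share, `ratio_from_transition_fraction(0.5688)` is about 1.32, matching the ratio quoted in the text.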
The excavation of unique SNPs and genes from five NHCC varieties
The five varieties of NHCC have their own morphological characteristics. The variety-related SNPs and genes were quickly and accurately identified using the genomic information of the varieties. Based on the genotypes and phenotypes of the five varieties, the genes associated with variety-related traits were uncovered. For example, by comparing Tillering cabbage with the other four varieties, genes associated with tillering were identified. Similarly, the flowering and early bolting genes were identified by comparing the flowering Chinese cabbage variety with the other varieties. Additionally, we detected the expression of variety-specific genes at the transcriptome level. Functional annotation and metabolic network analyses were also conducted for differentially expressed genes (DEGs).
NHCC的五个变种具有各自的形态特征。 利用品种基因组信息,快速准确地鉴定出与品种相关的 SNP 和基因。 根据5个品种的基因型和表型,揭示了与品种相关性状相关的基因。 例如,通过比较分蘖白菜和其他四个品种,确定了与分蘖相关的基因。 同样,通过比较开花大白菜品种和其他品种,鉴定了开花和早期抽薹基因。 此外,我们在转录组水平检测到品种特异性基因的表达。 还对差异表达基因(DEG)进行了功能注释和代谢网络。
At the genomic level, we identified variety-specific SNPs. Non-synonymous SNPs directly change the encoded amino acid, which could change the function of the protein. Therefore, we surveyed the non-synonymous SNPs in each accession. To better analyze these point mutations, we extracted the variety-specific non-synonymous SNPs, which numbered from 1,133 to 2,104 in the five varieties. These SNPs were located in 710 to 1,107 genes of the five varieties. Transcriptome data were used to identify 897, 651, 970, 1,247, and 699 genes in NHCC001, NHCC006, NHCC008, NHCC009, and NHCC010, respectively (Additional file 1: Table S12). Then, the variety-specific DEGs were identified, defined as genes whose expression level in one variety was at most 0.5 times or at least 2 times the level in each of the other varieties. A total of 189 variety-specific DEGs were discovered, consisting of 28, 1, 45, 26, and 2 low expressing genes and 34, 5, 24, 11, and 13 high expressing genes in the five varieties, respectively.
在基因组水平上,我们确定了品种特异性 SNP。 非同义SNP可以直接改变编码的氨基酸,从而改变蛋白质的功能。 因此,我们调查了每个种质中的非同义 SNP。 为了更好地分析五个品种中从 1,133 到 2,104 的点突变,我们利用了品种特异性非同义 SNP。 这些 SNP 位于五个品种的 710 到 1,107 个基因中。 转录组数据分别用于鉴定 NHCC001、NHCC006、NHCC008、NHCC009 和 NHCC010 中的 897、651、970、1247 和 699 个基因(附加文件 1:表 S12)。 然后,鉴定出品种特异性DEG,其表达水平是其他品种的0.5或2倍。 共发现189个品种特异性DEG,分别由5个品种的28、1、45、26、2个低表达基因和34、5、24、11、13个高表达基因组成。
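The variety-specific DEG rule can be read as a fold-change test against every other variety. The sketch below takes that reading as an assumption: a gene is "high" in a variety if its expression there is at least `fold` times that in each other variety, and "low" if it is at most 1/`fold` of each. The function name and the expression values used with it are hypothetical.

```python
def classify_gene(expr, variety, fold=2.0):
    """Classify one gene as 'high', 'low', or None for a given variety.

    expr: dict mapping variety name -> expression level of the gene.
    A gene with zero expression everywhere is never classified.
    """
    target = expr[variety]
    others = [level for name, level in expr.items() if name != variety]
    if target > 0 and all(target >= fold * level for level in others):
        return "high"
    if all(level > 0 and target <= level / fold for level in others):
        return "low"
    return None
```

Passing a larger `fold` (for example 5, which corresponds to 0.2 and 5 times) gives a stricter filter of the same form.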
To obtain a more intuitive understanding of the relationships among these DEGs, clustering analyses were carried out based on the expression level. The high expressing DEGs could be divided into five groups, corresponding to the five varieties (Figure 5), while the low expressing DEGs did not completely cluster by variety (Additional file 5: Figure S4). Furthermore, the relationships among these genes were studied using Cytoscape software. The absolute Pearson’s correlation coefficients of 1,662 gene pairs were greater than 0.8 in the high expressing DEGs. Most genes had positive relationships, except the Cabbage-G_a_f_g047569, Cabbage-G_a_f_g033595, and Cabbage-G_a_f_g009143 genes. These genes could be divided into four groups, corresponding to four varieties. Only one gene was identified in NHCC006, so it was not involved in the network (Figure 6). The relationships among low expressing genes were complex, with 221 negative- and 373 positive-related gene pairs (Additional file 6: Figure S5). In addition, 673 negative- and 3,377 positive-related gene pairs existed in the high and low expressing genes, respectively (Additional file 7: Figure S6).
为了更直观地了解这些DEG之间的关系,基于表达水平进行了聚类分析。高表达的 DEG 可以分为五组,对应于五个品种(图 5),而低表达的 DEG 并没有完全根据品种进行聚类(附加文件 5:图 S4)。此外,使用 Cytoscape 软件研究了这些基因之间的关系。最后,在高表达的 DEG 中,1,662 个基因对的绝对 Pearson 相关系数大于 0.8。大多数基因具有正相关关系,除了 Cabbage-G_a_f_g047569、CabbageG_a_f_g033595 和 Cabbage-G_a_f_g009143 基因。这些基因可以分为四组,对应四个品种。在 NHCC006 中仅鉴定出一个基因,因此它不参与网络(图 6)。低表达基因之间的关系很复杂,有 221 个阴性和 373 个阳性相关基因对(附加文件 6:图 S5)。此外,高表达基因和低表达基因中分别存在 673 个阴性和 3377 个阳性相关基因对(附加文件 7:图 S6)。
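The gene-pair counts above come from thresholding Pearson's correlation coefficient at an absolute value of 0.8. A minimal sketch of that step is given below, before any network drawing in Cytoscape; the data layout (one expression vector per gene across samples) and the function names are assumptions for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_pairs(expr, threshold=0.8):
    """Split gene pairs into positive and negative pairs with |r| > threshold.

    expr: dict mapping gene ID -> list of expression values per sample.
    """
    positive, negative = [], []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r > threshold:
            positive.append((g1, g2))
        elif r < -threshold:
            negative.append((g1, g2))
    return positive, negative
```

The two returned lists correspond to the positive- and negative-related gene pairs counted in the text and could be exported as an edge table for a network viewer.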
Using strict standards, which defined the expression level of the gene as 0.2 or 5 times the lowest or highest expression, respectively, of the other varieties, 33 variety-specific DEGs were identified. Of these, 15, 9, 8, and 1 genes were found in NHCC001, NHCC009, NHCC008, and NHCC010, respectively, while none was identified in NHCC006. The analysis of Pearson’s correlation coefficients showed that 15 negative- and 94 positive-related gene pairs were present among these genes, and the Cabbage-G_a_f_g013270 gene occurred in more negative gene pairs than any other gene (Additional file 8: Figure S7).
使用严格的标准,将基因的表达水平分别定义为其他品种最低或最高表达水平的 0.2 倍或 5 倍,鉴定出 33 个品种特异性 DEG。其中,NHCC001、NHCC009、NHCC008、NHCC010中分别发现了15、9、8、1个基因,而在NHCC006中没有发现。Pearson 相关系数分析表明,这些基因中存在 15 个负相关基因对和 94 个正相关基因对,并且 Cabbage-G_a_f_g013270 基因存在于比任何其他基因更多的负基因对中(附加文件 8:图 S7)。
For a more intuitive presentation of these non-synonymous SNPs, we plotted their distribution on the chromosomes (Figure 7), revealing that the distributions differed among accessions. This may reflect differential selection during the breeding process. In general, regions with more non-synonymous mutations have often been subject to selection. Across the 10 NHCC accessions, 3,228 regions of 20 Kb in length, each containing more than 20 non-synonymous SNPs, were identified. The number of these regions differed among chromosomes, ranging from 21 (A10) to 720 (A03). In addition, we mapped the density of non-synonymous SNPs on the chromosomes for each accession (Additional file 9: Figure S8).
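The region scan above can be sketched as a window count. The example assumes fixed, non-overlapping 20 Kb windows and 1-based SNP positions; the text does not state whether the windows were fixed or sliding, so this is an assumption. The SNP coordinates are hypothetical and `dense_windows` is an illustrative name.

```python
from collections import Counter


def dense_windows(snps, window=20_000, min_count=20):
    """Return (chrom, start, end, count) for windows with more than min_count SNPs.

    snps is an iterable of (chromosome, 1-based position) tuples.
    """
    counts = Counter((chrom, (pos - 1) // window) for chrom, pos in snps)
    return sorted(
        (chrom, idx * window + 1, (idx + 1) * window, n)
        for (chrom, idx), n in counts.items()
        if n > min_count
    )


# Hypothetical non-synonymous SNP positions.
snps = (
    [("A03", 1_000 + 100 * i) for i in range(25)]    # 25 SNPs in A03:1-20000
    + [("A03", 40_001 + 500 * i) for i in range(21)]  # 21 SNPs in A03:40001-60000
    + [("A10", 100 + 10 * i) for i in range(20)]      # exactly 20 SNPs: not reported
)

regions = dense_windows(snps)
regions_per_chrom = Counter(chrom for chrom, _, _, _ in regions)
```

Counting the reported windows per chromosome, as in `regions_per_chrom`, gives the kind of per-chromosome summary quoted in the text.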
Figure 7
The evolutionary relationship of 10 NHCC accessions by SNP markers
To understand the phylogenetic relationships underlying morphological diversity in NHCC, a neighbor-joining phylogenetic tree was constructed with MEGA5 using the 10 NHCC accessions and Chinese cabbage Chiifu-401-42. The SNPs located in the coding domain sequences, excluding sites with missing data, were used to construct the phylogenetic tree (Figure 8). In the tree, the two Pak-choi accessions (NHCC001 and NHCC026) clustered together, as did the two flowering Chinese cabbage accessions (NHCC008 and NHCC013). The Taitsai (NHCC015) had a close relationship with Chinese cabbage. Although NHCC010 and NHCC029 both belong to the Tillering cabbage, they did not cluster together. The previous classification might have been based only on tillering, which is affected by only a few genes, so the two accessions did not cluster together in a tree built from genome-wide SNPs. NHCC010 and NHCC006, which share a flat, ground-hugging growth habit and short plant height, clustered together. Additionally, NHCC029 clustered with NHCC015, with which it shares similar traits. These results indicate that the morphological classification might be based on one or several distinct external plant characteristics, whereas classification should be determined by the underlying genes, coupled with complex environmental interactions. Therefore, the traditional morphological classification might be erroneous. Whole-genome sequencing and re-sequencing now make it possible to correct traditional morphological classifications, furthering our understanding of NHCC.
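A neighbor-joining tree is built from a pairwise distance matrix. The sketch below shows only that preceding step, under assumptions that are not stated in the text: a simple p-distance (the proportion of differing sites) and complete deletion of any site that is missing in at least one sample. The sample names other than the reference label, the genotype strings, and the `N` missing-data code are hypothetical, and heterozygous calls are ignored. It does not reproduce the MEGA5 settings used in the study.

```python
def complete_sites(genotypes, missing="N"):
    """Indices of sites that are called (not missing) in every sample."""
    names = list(genotypes)
    length = len(genotypes[names[0]])
    return [
        i for i in range(length)
        if all(genotypes[name][i] != missing for name in names)
    ]


def p_distance_matrix(genotypes, missing="N"):
    """Pairwise proportion of differing sites after complete deletion."""
    keep = complete_sites(genotypes, missing)
    dist = {}
    for a in genotypes:
        for b in genotypes:
            diffs = sum(genotypes[a][i] != genotypes[b][i] for i in keep)
            dist[(a, b)] = diffs / len(keep) if keep else 0.0
    return dist


# Hypothetical alleles at 10 coding-region SNP sites; "N" marks a missing call.
genotypes = {
    "Chiifu": "ACGTACGTAC",
    "acc1":   "ACGTACGTTC",
    "acc2":   "ACGAACNTTC",
    "acc3":   "TCGAACGTTG",
}

dist = p_distance_matrix(genotypes)
```

With these toy data, the site missing in `acc2` is dropped for all samples, so every distance is computed over the same nine sites. The resulting matrix is the input that a neighbor-joining implementation would take.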
Figure 8
Discussion
Efficient and strict flow chart for identification of SSR and SNP markers
In this study, our major aims were to find a large set of accurate SSR and SNP markers in NHCC and to gain further insight into the genetic diversity and relationships among representative cultivars and related species. We analyzed the distribution and frequency of microsatellites with mono- to nonanucleotide motifs. To find more accurate SSRs, we applied the strict standard that the total SSR length must be no less than 18 bp. Consequently, our results may differ from those of previous studies for the following reasons: (1) different search parameters, including the minimum length (no less than 18 bp versus 12 bp) and the repeat types considered (mono- to nonanucleotide versus di- to octanucleotide or another range); (2) different software and algorithms used for the SSR search (MISA versus SSRtool); (3) differences in the size and version of the data used for SSR detection; and (4) different analytical methods and ways of reporting results (count/Mb versus length/Mb). These seemingly minor procedural differences can strongly influence microsatellite distributions and comparisons among studies. For the development of SNP markers, errors in sequencing or assembling the NHCC genome might also lead to false SNPs. Therefore, these points should be considered when comparing SSR or SNP frequencies and densities generated from different genome datasets or by different research groups.
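The effect of the search parameters discussed above can be made concrete with a small regular-expression scan. This is a minimal sketch, not the MISA algorithm: it assumes perfect (uninterrupted) repeats only, motif lengths of 1 to 9 bp, and a minimum total length of 18 bp, and it skips motifs that are themselves repeats of a shorter unit. The sequence and the function names are hypothetical.

```python
import re
from math import ceil


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq, min_total=18, max_motif=9):
    """Return (start, end, motif, repeats) for perfect SSRs, 1-based inclusive."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        min_rep = max(2, ceil(min_total / k))  # repeats needed to reach min_total
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # e.g. skip "AGAG", already found as "AG"
                repeats = (m.end() - m.start()) // k
                hits.append((m.start() + 1, m.end(), motif, repeats))
    return sorted(hits)


# Hypothetical sequence: (AG)10 and (ATC)6 pass the 18 bp cutoff, (T)12 does not.
seq = "GCTA" + "AG" * 10 + "CCGT" + "ATC" * 6 + "GGCA" + "T" * 12

ssrs_18 = find_ssrs(seq, min_total=18)
ssrs_12 = find_ssrs(seq, min_total=12)
```

Lowering `min_total` from 18 to 12 reports the 12 bp mononucleotide run as well, which shows how the minimum-length parameter alone changes SSR counts and therefore densities between studies.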
Genetic relationship analysis of 18 Cruciferae species
The 14 NHCC accessions and four other Cruciferae species were analyzed using SSR markers. The dendrogram, population structure, and PCA analyses all revealed four clusters. Although this analysis did not completely distinguish the five NHCC varieties, possibly because of the limited number of SSR markers used, it accurately separated NHCC, Chinese cabbage, broccoli, and Arabidopsis. Thus, a larger number of SNP markers were used to construct the phylogenetic tree. Both the SSR and SNP marker analyses revealed that the Taitsai variety (NHCC015 and NHCC009) had a close relationship with Chinese cabbage. This is also consistent with the theory that Chinese cabbage was derived from a hybrid of Taitsai and turnip. The Pak-choi, flowering Chinese cabbage, and Taitsai varieties could be distinguished using the SNP markers. The classification of Tillering cabbage and Wutatsai might be based on only one or several distinct phenotypic characteristics; thus, we attempted to distinguish them using whole-genome SNPs. Classification based only on morphology may be problematic, and a true classification should be determined using the genes of the whole genome, coupled with complex environmental factors. It is now possible to adjust traditional morphological classifications using genome-wide SSRs and SNPs. Furthermore, the markers developed in our study can be useful for population structure analyses of NHCC and other related species in the future.
Use of new SSR and SNP markers for NHCC and Cruciferous species research
It is important to develop molecular markers to investigate genetic variability and explore genome evolution. Until now, only a few low-density genetic maps have been constructed for NHCC, owing to the lack of highly polymorphic and reliable molecular markers. In addition, most linkage maps with important agronomic trait loci have been developed primarily with low-throughput markers, such as AFLP, RFLP, and RAPD, or with a few SSR markers. The development of these markers is time consuming, labor intensive, and expensive. Thus, only a few economically important genes have been identified using a map-based cloning strategy in NHCC, suggesting that marker-assisted selection breeding is still not well developed compared with that in other horticultural species, such as cucumber. SSR and SNP markers have proven useful in population genetic studies. With the development of bioinformatics and next-generation sequencing technology, it is now convenient and feasible to obtain a large number of SSR and SNP markers by genome sequencing. In this study, we developed a large number of SSR and SNP markers and obtained their exact physical positions in the NHCC genome. We designed primer pairs for NHCC SSRs and verified their polymorphism by polymerase chain reaction (PCR) and gel electrophoresis in some important Cruciferous species. NHCC shows a relatively high level of morphological and genetic polymorphism, and SNPs were identified in different varieties. In our study, the SNPs were classified according to the five varieties. Variety-specific genes were also identified and verified using the transcriptome. These genes might be useful for distinguishing the five varieties of NHCC.
Conclusions
NHCC is an ecologically important vegetable crop in Southeast Asia, Japan, USA, and Europe. However, insufficient genomic and transcriptome data in public databases have limited our understanding of the molecular mechanisms underlying the adaptation of NHCC. With the development of high-throughput genome sequencing technology, it is now possible to uncover large numbers of DNA markers. This work provides a detailed characterization of 20,836 SSRs and 1,228,979 SNPs in NHCC and compares them with markers in other representative species. For the SSR markers, dinucleotide repeats were the most frequent SSRs in the genome, whereas in gene sequences trinucleotide repeats were much more frequent than dinucleotide repeats. Primers for the SSRs in the gene sequences of NHCC were designed, and the SSR polymorphisms were verified using PCR. The results showed that the SSR markers were highly polymorphic among the 18 Cruciferous species. By comparing NHCC with Chinese cabbage, a large number of SNP markers were identified in the five NHCC varieties. The potential variety-specific genes identified lay a solid foundation for further comparative genome analyses among the five varieties. Furthermore, they will be useful for further functional genomic studies in the Brassica genus. These SNP and SSR markers will be valuable genomic resources for future Cruciferous research and breeding applications.