DBGWAS: GWAS based on k-mers and De Bruijn graphs

Author: 胡童远 | Published: 2021-01-06 09:34

Paper information

Title: A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events
Journal: PLOS Genetics
Year: 2018

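Abstract

Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for discovering genetic markers and for assessing marker effects in detail. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and to results that are sometimes hard to interpret. Here the authors introduce DBGWAS, an extended k-mer-based GWAS method that produces interpretable genetic variants associated with distinct phenotypes. The method relies on compacted De Bruijn graphs (cDBG): cDBG nodes identified by the association model are gathered into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or a reference genome. The subgraphs it produces represent phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE), and it offers a graphical framework that helps interpret GWAS results. Importantly, it is also computationally efficient: experiments took one and a half hours on average. The method was validated on antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants, such as mutations in core genes of Mycobacterium tuberculosis and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also made it possible to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature.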

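Background

Several recent in silico studies have described the definition of a genomic antibiogram, and great hopes are placed in this approach as a complement to the classical phenotypic one. Several studies have shown that, in some cases, a genomic antibiogram can be at least as good as a phenotypic antibiogram. In contrast to DBGWAS, these studies require extensive databases of resistance markers. DBGWAS should contribute to the extension of such databases or to the development of agnostic genomic antibiograms.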

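Building the cDBG
DBGWAS builds a compacted De Bruijn graph (cDBG) from all input genomes. Runs of k-mers along unbranched paths are merged into single nodes (unitigs), which reduces genomic redundancy, and variants show up as branching points in the graph.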
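
As a rough illustration of why compaction helps, the sketch below (not DBGWAS code) lists the k-mers that distinguish two toy sequences differing by a single SNP. The function names `kmers` and `kmers_only_in`, the toy sequences and k = 3 are assumptions made for illustration; real tools use a much larger k and usually treat a k-mer and its reverse complement as the same k-mer, which this sketch ignores.

```shell
#!/bin/sh
# Toy sketch, not DBGWAS code: forward-strand k-mers only.

# Print every k-mer of sequence $1 for k = $2, one per line.
kmers() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
    }'
}

# Print the k-mers of sequence $1 that are absent from sequence $2 (k = $3).
kmers_only_in() {
    awk -v a="$1" -v b="$2" -v k="$3" 'BEGIN {
        for (i = 1; i + k - 1 <= length(b); i++) in_b[substr(b, i, k)] = 1
        for (i = 1; i + k - 1 <= length(a); i++) {
            m = substr(a, i, k)
            if (!(m in in_b) && !(m in shown)) { print m; shown[m] = 1 }
        }
    }'
}

ref=ACGTTGCA   # hypothetical strain 1
alt=ACGTAGCA   # hypothetical strain 2: same sequence with one SNP (T -> A)

echo "all k-mers of strain 1:";       kmers "$ref" 3
echo "k-mers specific to strain 1:";  kmers_only_in "$ref" "$alt" 3
echo "k-mers specific to strain 2:";  kmers_only_in "$alt" "$ref" 3
```

In this toy case each allele contributes three allele-specific k-mers (GTT, TTG, TGC versus GTA, TAG, AGC), and all three share the same presence/absence pattern across strains. In a compacted graph such a run of k-mers typically collapses into one unitig, so the association test sees one feature per allele rather than k redundant ones.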

DBGWAS workflow
1 Identify variants
2 Test the variants for association with the phenotype
3 Analyse the significant signals

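Example: bacterial resistance determinants found by DBGWAS

Different types of genetic loci identified by DBGWAS
Promoters, housekeeping genes, core genes, mobile genetic elements

DBGWAS finds new variants without prior knowledge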

Download and installation
Repository: https://gitlab.com/leoisl/dbgwas
Precompiled binary (tested by the author): https://www.dropbox.com/s/s9oojqfl1kgi4l5/DBGWAS-0.5.4-Linux-precompiled.tar.gz?dl=1

Usage
Script source: https://gitlab.com/biomerieux-data-science/clustlasso-dbgwas-integration/-/tree/master/src

## 01.run-dbgwas.R
drug = "meropenem"
# read.csv2 expects a semicolon-separated file
meta = read.csv2("./input/meta-data.csv")

#----------------------------#
# prepare dbgwas config file #
#----------------------------#
# specify strain ids
strain.ids = meta$strain_id
# convert phenotype to binary: S -> 0, R -> 1, any other value -> NA
p.bin = as.numeric(factor(meta[, drug], levels = c("S","R")))
p.bin = p.bin - 1
# define path to assemblies
input.files = paste0("./input/genome_assembly/", meta$fasta)
# create output file (columns: strain ID, phenotype, path to assembly)
X = data.frame("ID"=strain.ids, "pheno"=p.bin, "Path"=input.files)
dbgwas.conf = "./output/01.dbgwas-conf.txt"
dir.create("output", showWarnings = FALSE)
write.table(X, file = dbgwas.conf, row.names = FALSE, quote = FALSE)
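
Before launching a run, it can be worth sanity-checking the strains file that the R script writes. The sketch below is a hypothetical helper (`check_strains_file` is not part of DBGWAS). It assumes a header line followed by three whitespace-separated columns, and it assumes the binary coding used by the script above, where the phenotype is 0, 1 or NA.

```shell
#!/bin/sh
# Hypothetical helper, not part of DBGWAS: check the layout of a strains file.
# Assumes: one header line, then "ID phenotype path" separated by whitespace,
# with the phenotype coded as 0, 1 or NA as in 01.run-dbgwas.R.
check_strains_file() {
    awk 'BEGIN   { bad = 0 }
         NR == 1 { next }                    # skip the header line
         NF == 0 { next }                    # ignore blank lines
         NF != 3 { printf "line %d: expected 3 columns, found %d\n", NR, NF
                   bad = 1; next }
         $2 != "0" && $2 != "1" && $2 != "NA" {
                   printf "line %d: unexpected phenotype %s\n", NR, $2
                   bad = 1 }
         END     { exit bad }' "$1"
}

# Example call (path taken from the R script above):
#   check_strains_file ./output/01.dbgwas-conf.txt && echo "strains file looks fine"
```

The function prints one message per problematic line and returns a non-zero status if any were found, so it can gate the DBGWAS call in a wrapper script.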

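# Run DBGWAS from the shell with the config file written by 01.run-dbgwas.R
# -strains:  strains file (ID, phenotype, path to assembly)
# -output:   output directory
# -nc-db:    nucleotide FASTA database used to annotate the output subgraphs
# -nb-cores: number of CPU cores
./DBGWAS \
-strains ./output/01.dbgwas-conf.txt \
-output ./output/dbgwas2 \
-nc-db ./input/DBGWAS_merged_ResDB.fasta \
-nb-cores 8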

Run log:

Step 1. Building DBG and mapping strains on the DBG...
[DSK: counting kmers                     ]  0    %   elapsed:   0 min 0  sec   r
[DSK: Pass 1/1, Step 1: partitioning     ]  0    %   elapsed:   0 min 0  sec   r
[DSK: Pass 1/1, Step 2: counting kmers   ]  54.1 %   elapsed:   7 min 33 sec   r
[MPHF: populate                          ]  0    %   elapsed:   0 min 0  sec   r
[Bloom: read solid kmers                 ]  0    %   elapsed:   0 min 0  sec   r
[Debloom: build extension                ]  0    %   elapsed:   0 min 0  sec   r
[Debloom: finalization                   ]  12.5 %   elapsed:   0 min 0  sec   r
[Debloom: cascading                      ]  25   %   elapsed:   0 min 1  sec   r
[Graph: build branching nodes            ]  2    %   elapsed:   0 min 0  sec   r
[Graph: building unitigs                 ]  2    %   elapsed:   1 min 14 sec   r
[Loading endpoints of unitigs            ]  0    %   elapsed:   0 min 0  sec   r
[Building .nodes and .edges files 
################################################################################
Stats:
Number of kmers: 15741804
Number of unitigs: 404907
################################################################################
build_dbg
[Starting mapping process... ]
Using 8 cores to map 50 read files.
6680 reads processed.

[Generating bugwas and gemma input]...
[Generating bugwas and gemma input] - Done!
[Generating unitigs2PhenoCounter...]
[Generating unitigs2PhenoCounter...] - Done!

[Mapping process finished!]
map_reads
Done!

Step 2. Running statistical test (bugwas + gemma)...
Executing Rscript --vanilla /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib//DBGWAS.R /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib/ /hwfssz1/ST_META/PN/hutongyuan/clustlasso-dbgwas/./output/dbgwas2/step1 /hwfssz1/ST_META/PN/hutongyuan/clustlasso-dbgwas/./output/dbgwas2/step1/bugwas_input.id_phenotype bugwas_out /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib//gemma.0.93b 0.01 2>&1...
[DBGWAS] Reading unitigs from /hwfssz1/ST_META/PN/hutongyuan/clustlasso-dbgwas/./output/dbgwas2/step1/bugwas_input.unique_rows.binary
[DBGWAS] Reading phenotypes from /hwfssz1/ST_META/PN/hutongyuan/clustlasso-dbgwas/./output/dbgwas2/step1/bugwas_input.id_phenotype
[DBGWAS] Restricting genotype 44191/44192 patterns with MAF >= 0.01.
[DBGWAS] Building kinship matrix
Get kinship matrix
Reading Files ...
## number of total individuals = 50
## number of analyzed individuals = 50
## number of covariates = 1
## number of total SNPs = 364201
## number of analyzed SNPs = 364201
Calculating Relatedness Matrix ...
Reading SNPs  ==================================================100.00%
[DBGWAS] Performing association tests
Reading Files ...
## number of total individuals = 50
## number of analyzed individuals = 50
## number of covariates = 1
## number of total SNPs = 44191
## number of analyzed SNPs = 44191
Start Eigen-Decomposition...
lambda REMLE estimate in the null (linear mixed) model = 7.55791
lambda MLE estimate in the null (linear mixed) model = 7.936
pve estimate in the null (linear mixed) model = 0.453773
se(pve) in the null (linear mixed) model = 0.160796
Reading SNPs  ==================================================100.00%
Read 25 items
Biallelic data processed successfully.
Executing Rscript --vanilla /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib//DBGWAS.R /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib/ /hwfssz1/ST_META/PN/hutongyuan/clustlasso-dbgwas/./output/dbgwas2/step1 /hwfssz1/ST_META/PN/hutongyuan/clustlasso-dbgwas/./output/dbgwas2/step1/bugwas_input.id_phenotype bugwas_out /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib//gemma.0.93b 0.01 2>&1 - Done!
################################################################################
Stats:
Total number of patterns: 44191
################################################################################
statistical_test
Done!
Step 3. Building visualisation around significant unitigs...
Executing /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib///makeblastdb -dbtype nucl -in ./output/dbgwas2/step3/nucl_db_fixed...


Building a new DB, current time: 12/10/2020 15:58:08
New DB name:   /hwfssz1/ST_META/PN/hutongyuan/clustlasso-dbgwas/output/dbgwas2/step3/nucl_db_fixed
New DB title:  ./output/dbgwas2/step3/nucl_db_fixed
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 7792 sequences in 0.389215 seconds.
Executing /hwfssz1/ST_META/PN/hutongyuan/software/dbgwas/bin/DBGWAS_lib///makeblastdb -dbtype nucl -in ./output/dbgwas2/step3/nucl_db_fixed - Done!
[Getting the significant unitigs from the patterns...]
Selected 100 most significant patterns...
[Getting significant unitigs from the patterns...] - Done!
[Reading input and creating BOOST graph...]
[Reading input and creating BOOST graph...] - Done!
[Computing nodes' neighbourhoods...]
[Computing nodes' neighbourhoods...] - Done!
[Generating the visualisation files...]
Rendering comp_0...
Annotating...
Annotating... - Done!
Building Cytoscape graph and textual output...
Building Cytoscape graph and textual output... - Done!
...

Rendering comp_351...
Annotating...
Annotating... - Done!
Building Cytoscape graph and textual output...
Building Cytoscape graph and textual output... - Done!
Rendering comp_351... - Done!

[Generating the visualisation files...] - Done!
[Creating index file...]
[Rendering thumbnails...]
[Rendering thumbnails...] - Done!
[Creating index file...] - Done!


******************************************************************************
We are done. The output can be found at ./output/dbgwas2/visualisations/index.html
******************************************************************************

generate_output
Done!
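
The log line "Restricting genotype 44191/44192 patterns with MAF >= 0.01" can be read with a small sketch. It assumes that MAF here means the frequency of the rarer state (present or absent) of a presence/absence pattern across strains; the function name `maf_filter` and the toy patterns are hypothetical and are not taken from DBGWAS or bugwas.

```shell
#!/bin/sh
# Toy sketch, not DBGWAS code. Reads presence/absence patterns from stdin,
# one string of 0s and 1s per line (one character per strain), and prints
# the patterns whose minor-state frequency is at least the threshold $1.
maf_filter() {
    awk -v t="$1" '{
        p = $0
        n = length(p)                 # number of strains
        ones = gsub(/1/, "", p)       # strains in which the pattern is present
        minor = (ones < n - ones) ? ones : n - ones
        if (n > 0 && minor / n >= t) print $0
    }'
}

# Four hypothetical strains: the constant pattern 1111 is dropped.
printf '%s\n' 0011 1111 0111 | maf_filter 0.01
```

Under that assumption, with 50 strains the rarer state of any non-constant pattern has a frequency of at least 1/50 = 0.02, so a threshold of 0.01 would only remove patterns that are identical in every strain. That would be consistent with the log, where a single pattern out of 44192 is dropped.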

Further reading:
Genome assembly algorithms: De Bruijn Graph
What is a unitig? How does it differ from a contig?
