Genotyping SVs with Paragraph

Author: 斩毛毛 | Published: 2022-06-10 15:15


Paragraph genotypes structural variants (SVs) from short-read resequencing data.

GitHub: https://github.com/Illumina/paragraph
Related paper: Chen, S. et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol. 20, 291 (2019).

1. Installation

See doc/installation.md in the repository.

2. Test data

After installation, the run scripts are in the bin directory.

python3 bin/multigrmpy.py -i share/test-data/round-trip-genotyping/candidates.vcf \
                          -m share/test-data/round-trip-genotyping/samples.txt \
                          -r share/test-data/round-trip-genotyping/dummy.fa \
                          -o test

Where:

candidates.vcf: the candidate SVs, in VCF format
samples.txt: the manifest listing the BAM files, tab- or comma-delimited
dummy.fa: the reference

The output looks like this:

[Figure: output]

If the run succeeds, genotypes.vcf.gz should resemble the expected file shipped with the test data.

3. Input requirements

VCF format

SVs in the VCF can be written with full sequences or as symbolic alleles, as long as the file conforms to VCF 4.0.

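Sample manifest

Tab-delimited, with the following columns.
Required columns:

id: the sample name; must be unique for each sample
path: path to the BAM file
depth: mean depth across the whole genome; can be computed with bin/idxdepth (faster than samtools)
read length: mean read length across the whole genome (bp)
Optional columns:
depth sd
depth variance
sex: M, F, or unknown; affects genotyping on chrX and chrY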
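As a rough illustration of the manifest layout described above, the following sketch writes a tab-delimited manifest from a list of samples. The sample names, BAM paths, and the `write_manifest` helper are hypothetical and only serve the example; the column names are the ones listed above.

```python
import csv
import io

# Required column names, as listed in the manifest description above.
REQUIRED_COLUMNS = ["id", "path", "depth", "read length"]


def write_manifest(samples, handle):
    """Write a tab-delimited manifest; `samples` is a list of dicts."""
    ids = [s["id"] for s in samples]
    if len(ids) != len(set(ids)):
        raise ValueError("sample ids must be unique")
    # Add the optional sex column only if at least one sample provides it.
    columns = list(REQUIRED_COLUMNS)
    with_sex = any("sex" in s for s in samples)
    if with_sex:
        columns.append("sex")
    writer = csv.DictWriter(
        handle, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for s in samples:
        row = {c: s[c] for c in REQUIRED_COLUMNS}
        if with_sex:
            row["sex"] = s.get("sex", "unknown")
        writer.writerow(row)


# Hypothetical samples, for illustration only.
samples = [
    {"id": "sample1", "path": "sample1.bam", "depth": 50, "read length": 150, "sex": "F"},
    {"id": "sample2", "path": "sample2.bam", "depth": 50, "read length": 150, "sex": "M"},
]
buf = io.StringIO()
write_manifest(samples, buf)
manifest_text = buf.getvalue()
print(manifest_text)
```

To produce a real file, pass an open file handle instead of the `io.StringIO` buffer.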

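Run time

To improve efficiency, set the -M option (the maximum read count allowed for a single SV) so that regions with very high depth are skipped. A recommended value for -M is 20 times the mean depth of your sample.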
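The rule of thumb for -M is simple arithmetic, sketched here for clarity. The helper name `max_reads_per_variant` is hypothetical, and rounding to an integer is an assumption about what the option expects.

```python
def max_reads_per_variant(mean_depth, factor=20):
    """Return a value for -M: `factor` times the mean sample depth."""
    if mean_depth <= 0:
        raise ValueError("mean depth must be positive")
    # Rounding to an integer is an assumption, not something the tool documents here.
    return int(round(mean_depth * factor))


m_value = max_reads_per_variant(30)
print(m_value)  # 600
```

For a sample sequenced at a mean depth of 30, this gives -M 600.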

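Population genotyping

To genotype a population efficiently, first genotype each sample on its own, then merge the results:

For each sample:
Run multigrmpy.py, setting -M according to that sample's depth.
Use multithreading (the -t option); this is strongly recommended.
Merge all the genotypes.vcf.gz files into a single VCF containing every sample; bcftools merge can do this.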
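The per-sample workflow above could be scripted roughly as follows. This is a sketch that only builds and prints the commands; it does not run them. The directory layout, the one-sample manifest files, the thread count, and the per-sample output name `genotypes.vcf.gz` are assumptions made for the example; adjust them to whatever your runs actually produce. Only the options mentioned in the text (-i, -m, -r, -o, -M, -t) are used, along with the standard `-O z -o` output options of bcftools merge, which also expects its inputs to be compressed and indexed.

```python
import shlex


def per_sample_command(sample_id, manifest, candidates, reference, depth, threads=4):
    """Build the multigrmpy.py command for a single sample."""
    return [
        "python3", "bin/multigrmpy.py",
        "-i", candidates,
        "-m", manifest,                      # manifest containing only this sample
        "-r", reference,
        "-o", f"out/{sample_id}",            # hypothetical output layout
        "-M", str(int(round(depth * 20))),   # 20x the sample's mean depth
        "-t", str(threads),
    ]


def merge_command(vcfs, output):
    """Build a bcftools merge command writing a bgzipped VCF."""
    return ["bcftools", "merge", "-O", "z", "-o", output, *vcfs]


# Hypothetical samples: id -> mean depth.
depths = {"sample1": 50, "sample2": 32}

commands = [
    per_sample_command(sid, f"manifests/{sid}.txt", "candidates.vcf", "ref.fa", depth)
    for sid, depth in depths.items()
]
# Assumed per-sample output file name; check your own output directories.
vcfs = [f"out/{sid}/genotypes.vcf.gz" for sid in depths]
merge = merge_command(vcfs, "merged.vcf.gz")

for cmd in commands + [merge]:
    print(shlex.join(cmd))
```

Building the commands as lists keeps them safe to hand to `subprocess.run` later without shell quoting issues.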

Additional information

Scripts in the bin directory

BAM depth statistics

$\color{red}{idxdepth}$
Quickly computes genome-wide depth for a BAM file.

bin/idxdepth -b \<bam/cram> -r \<reference fasta> -o \<output>

The output is a JSON file:

{
    "autosome": {
        "contigs": [
            "chr1"
        ],
        "depth": 1
    },
    "bam_path": "fake_path.bam",
    "contigs": [
        // ...
    ],
    "read_length": 50,
    "reference": "fake_reference.fa",
    "unaligned_reads": 0
}
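Since the manifest needs a depth and a read length per sample, one way to use this JSON is to pull those two values out of it. The sketch below does that on a copy of the example above (with the elided `contigs` list left empty so the text is valid JSON). Using the autosome depth as the manifest depth is an assumption, and `manifest_fields` is a hypothetical helper.

```python
import json

# Copy of the example output above; the elided contigs list is left empty.
idxdepth_json = """
{
    "autosome": {"contigs": ["chr1"], "depth": 1},
    "bam_path": "fake_path.bam",
    "contigs": [],
    "read_length": 50,
    "reference": "fake_reference.fa",
    "unaligned_reads": 0
}
"""


def manifest_fields(text, sample_id):
    """Extract manifest columns from idxdepth JSON text."""
    stats = json.loads(text)
    return {
        "id": sample_id,
        "path": stats["bam_path"],
        # Assumption: the autosome depth is the value to put in the manifest.
        "depth": stats["autosome"]["depth"],
        "read length": stats["read_length"],
    }


row = manifest_fields(idxdepth_json, "sample1")
print(row)
```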

Read counts on the sequence graph

$\color{red}{paragraph}$
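The core program behind grmpy; it counts reads on the sequence graph.
Input: a JSON file plus a BAM file and a reference.
Output:
a. read counts for each node and edge
b. the position and read counts for each variant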

Genotyper

$\color{red}{multigrmpy.py}$
As shown above, the basic usage is:

python3 bin/multigrmpy.py -i \<input\> \
  -m \<manifest> \
  -r \<reference fasta> \
  -o \<output directory>

a. input: a VCF or JSON file of variants
b. manifest: a list of BAM files
e.g.

id      path          read length  depth
sample1 sample1.bam   150          50
sample2 sample2.bam   150          50

c. reference: the reference FASTA
d. output: the output directory

[Figure: output]

Other tools

$\color{red}{vcf2paragraph.py}$
$\color{red}{addVariants.py}$
$\color{red}{compare-alignments.py}$
$\color{red}{findgrm.py}$
$\color{red}{msa2vcf.py}$
Converts a multiple sequence alignment in FASTA format to VCF.
$\color{red}{paragraph2dot.py}$
Converts a graph JSON file to DOT format for visualization.


Article link: https://www.haomeiwen.com/subject/uxkbmrtx.html
