Building a bioinformatics analysis pipeline with snakemake


Author: dming1024 | Published 2019-05-12 13:52

    References:
    https://blog.csdn.net/u012110870/article/details/85330457
    https://www.jianshu.com/p/14b9eccc0c0e

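    To make full use of my cloud server, I decided to try building my own bioinformatics analysis pipeline. The first step is to get familiar with how a snakemake workflow is put together, using the official tutorial data.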

    wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
    tar -xf v3.11.0.tar.bz2 --strip 1
    conda env create --name snakemake-tutorial --file environment.yaml
    source activate snakemake-tutorial
    # to leave the environment again when you are done:
    source deactivate
    

    Take a look at the example data in the data/ directory:

    ls -lh data/
    total 652K
    -rw-rw-r-- 1 root root 229K Mar  8  2017 genome.fa
    -rw-rw-r-- 1 root root 2.6K Mar  8  2017 genome.fa.amb
    -rw-rw-r-- 1 root root   83 Mar  8  2017 genome.fa.ann
    -rw-rw-r-- 1 root root 225K Mar  8  2017 genome.fa.bwt
    -rw-rw-r-- 1 root root   18 Mar  8  2017 genome.fa.fai
    -rw-rw-r-- 1 root root  57K Mar  8  2017 genome.fa.pac
    -rw-rw-r-- 1 root root 113K Mar  8  2017 genome.fa.sa
    drwxrwxr-x 2 root root 4.0K Mar  8  2017 samples
    
    1. Mapping reads with bwa

    vim Snakefile
    # add the following content
    rule bwa_map:
        input:
            "data/genome.fa",
            "data/samples/{sample}.fastq"
        output:
            "mapped_reads/{sample}.bam"
        shell:
            """
            bwa mem {input} | samtools view -Sb - > {output}
            """
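
To make the placeholders in the shell command concrete, here is a minimal Python sketch of how the command for sample A could be rendered. It only imitates the substitution with plain `str.format`; the variable names are illustrative and Snakemake's own formatting is richer than this.

```python
# Inputs and output of rule bwa_map once {sample} has been resolved to "A".
inputs = ["data/genome.fa", "data/samples/A.fastq"]
output = "mapped_reads/A.bam"

# The shell template from the rule above.
template = "bwa mem {input} | samtools view -Sb - > {output}"

# Multiple input files are joined with single spaces before substitution.
command = template.format(input=" ".join(inputs), output=output)
print(command)
```

The printed line is the command that would be handed to the shell for this one sample.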
    

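    Try a dry run: -n only prints the execution plan and -p prints the shell commands. (The output below apparently comes from a later version of the Snakefile, since it already shows the log file and the -R read group that are only added in the final workflow.)

    snakemake -np mapped_reads/{A,B,C}.bam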
    
    rule bwa_map:
        input: data/genome.fa, data/samples/C.fastq
        output: mapped_reads/C.bam
        log: logs/bwa_mem/C.log
        jobid: 0
        wildcards: sample=C
    
    
            (bwa mem -R '@RG    ID:C    SM:C' data/genome.fa data/samples/C.fastq|samtools view -Sb - > mapped_reads/C.bam) 2> logs/bwa_mem/C.log
            
    
    rule bwa_map:
        input: data/genome.fa, data/samples/A.fastq
        output: mapped_reads/A.bam
        log: logs/bwa_mem/A.log
        jobid: 1
        wildcards: sample=A
    
    
            (bwa mem -R '@RG    ID:A    SM:A' data/genome.fa data/samples/A.fastq|samtools view -Sb - > mapped_reads/A.bam) 2> logs/bwa_mem/A.log
            
    
    rule bwa_map:
        input: data/genome.fa, data/samples/B.fastq
        output: mapped_reads/B.bam
        log: logs/bwa_mem/B.log
        jobid: 2
        wildcards: sample=B
    
    
            (bwa mem -R '@RG    ID:B    SM:B' data/genome.fa data/samples/B.fastq|samtools view -Sb - > mapped_reads/B.bam) 2> logs/bwa_mem/B.log
            
    Job counts:
        count   jobs
        3   bwa_map
        3
    

    No problems. On to the next step of the workflow.

    2. Sorting the alignments

    vim Snakefile
    rule samtools_sort:
        input:
            "mapped_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam"
        shell:
            "samtools sort -T sorted_reads/{wildcards.sample}"  # wildcards can be used in shell commands via {wildcards.<name>}
            " -O bam {input} > {output}"
    

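    This rule takes the output of the previous mapping step as its input and writes the sorted result to a different directory. wildcards.sample gives the value that the {sample} wildcard matched.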
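
The way a requested file name yields wildcard values can be sketched in plain Python. This is a simplified illustration, not Snakemake's implementation: `match_wildcards` is a hypothetical helper, it assumes each wildcard name appears only once in the pattern, and it lets every wildcard match one or more arbitrary characters.

```python
import re

def match_wildcards(pattern, path):
    """Return a dict of wildcard values if path matches pattern, else None."""
    regex = ""
    pos = 0
    # Turn every {name} into a named regex group and escape the literal parts.
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])
        regex += "(?P<%s>.+)" % m.group(1)
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None

print(match_wildcards("sorted_reads/{sample}.bam", "sorted_reads/A.bam"))
print(match_wildcards("sorted_reads/{sample}.bam", "calls/all.vcf"))
```

The first call returns a dict whose `sample` entry is what `{wildcards.sample}` refers to in the rule; the second returns `None` because the path does not fit the pattern.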

    3. Building the index

    vim Snakefile
    rule samtools_index:
        input:
            "sorted_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam.bai"
        shell:
            "samtools index {input}"
    

    The workflow can be visualized as a graph and saved to dag.svg:

    snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
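
The graph is built by working backwards from the requested file to the rules that can produce it. The toy planner below sketches that idea for the three rules written so far. The `RULES` table and the `plan` function are hypothetical simplifications for illustration and handle only a single `{sample}` wildcard.

```python
import re

# Hypothetical rule table: rule name -> (output pattern, input patterns).
RULES = {
    "bwa_map": ("mapped_reads/{sample}.bam",
                ["data/genome.fa", "data/samples/{sample}.fastq"]),
    "samtools_sort": ("sorted_reads/{sample}.bam",
                      ["mapped_reads/{sample}.bam"]),
    "samtools_index": ("sorted_reads/{sample}.bam.bai",
                       ["sorted_reads/{sample}.bam"]),
}

def plan(target, jobs=None):
    """Walk backwards from target and collect (rule, sample) jobs in run order."""
    jobs = [] if jobs is None else jobs
    for name, (out_pattern, in_patterns) in RULES.items():
        # Escape the literal parts and let {sample} match any non-empty text.
        regex = "(?P<sample>.+)".join(
            re.escape(part) for part in out_pattern.split("{sample}"))
        m = re.fullmatch(regex, target)
        if m:
            sample = m.group("sample")
            # Plan everything this job depends on before the job itself.
            for p in in_patterns:
                plan(p.format(sample=sample), jobs)
            if (name, sample) not in jobs:
                jobs.append((name, sample))
            break
    # Files that no rule produces (e.g. data/genome.fa) are plain source files.
    return jobs

for job in plan("sorted_reads/A.bam.bai"):
    print(job)
```

For the target `sorted_reads/A.bam.bai` this lists the mapping, sorting and indexing jobs in that order, which is the chain the rendered graph shows for each sample.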
    

    4. Calling genomic variants

    vim Snakefile
    rule bcftools_call:
        input:
            fa="data/genome.fa",
            bamA="sorted_reads/A.bam",
            bamB="sorted_reads/B.bam",
            baiA="sorted_reads/A.bam.bai",
            baiB="sorted_reads/B.bam.bai"
        output:
            "calls/all.vcf"
        shell:
            "samtools mpileup -g -f {input.fa} {input.bamA} {input.bamB} | "
            "bcftools call -mv - > {output}"
    

    Writing out every sample path like this is tedious. The input can be simplified with expand:

    SAMPLES=["A","B"]
    rule bcftools_call:
        input:
            fa="data/genome.fa",
            bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
            bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
        output:
            "calls/all.vcf"
        shell:
            "samtools mpileup -g -f {input.fa} {input.bam} | "
            "bcftools call -mv - > {output}"
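
What expand produces can be imitated in a few lines of Python. `expand_sketch` is a hypothetical stand-in written for this post, assuming the default behaviour of combining all given values; it is not the function Snakemake ships.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[n] for n in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

SAMPLES = ["A", "B"]
bam = expand_sketch("sorted_reads/{sample}.bam", sample=SAMPLES)
print(bam)

# Inside the shell command, {input.bam} becomes the space-separated file list.
print(" ".join(bam))
```

With two samples the list has two entries, so the mpileup command receives both BAM files as separate arguments.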
    

    5. (optional) Writing a report with Python

    vim Snakefile
    rule report:
        input:
            "calls/all.vcf"
        output:
            "report.html"
        run:
            from snakemake.utils import report
            with open(input[0]) as vcf:
                n_calls = sum(1 for l in vcf if not l.startswith("#"))
    
            report("""
            An example variant calling workflow
            ===================================
    
            Reads were mapped to the Yeast
            reference genome and variants were called jointly with
            SAMtools/BCFtools.
    
            This resulted in {n_calls} variants (see Table T1_).
            """, output[0], T1=input[0])
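
The counting step in the run block can be tried outside Snakemake. The sketch below applies the same generator expression to a tiny VCF fragment that is made up purely for illustration; `count_calls` is a hypothetical helper name.

```python
import io

# A made-up VCF fragment, for illustration only (not real tutorial output).
vcf_text = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
I\t100\t.\tA\tG\t50\t.\t.
I\t250\t.\tC\tT\t60\t.\t.
"""

def count_calls(handle):
    """Count variant records: every line that is not a '#' header line."""
    return sum(1 for line in handle if not line.startswith("#"))

n_calls = count_calls(io.StringIO(vcf_text))
print(n_calls)
```

Both the `##` metadata lines and the `#CHROM` column header start with `#`, so only the variant records are counted.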
    
    

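    6. (optional) Adding a target rule (I don't fully understand this yet, so I'm just pasting it here)

    When snakemake is run without an explicit target, it builds the first rule in the Snakefile, so rule all belongs at the top.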

    rule all:
        input:
            "report.html"
    

    The final, optimized workflow looks like this:

    configfile: "config.yaml"
    
    
    rule all:
        input:
            "report.html"
    
    
    rule bwa_map:
        input:
            "data/genome.fa",
            lambda wildcards: config["samples"][wildcards.sample]
        output:
            temp("mapped_reads/{sample}.bam")  # temporary file, deleted automatically once the jobs that need it have finished
        params:
            rg=r"@RG\tID:{sample}\tSM:{sample}"  # read group for bwa mem; the raw string keeps \t escaped, which newer bwa versions require
        log:
            "logs/bwa_mem/{sample}.log"
        shell:
            "(bwa mem -R '{params.rg}' -t {threads} {input} | "
            "samtools view -Sb - > {output}) 2> {log}"
    
    
    rule samtools_sort:
        input:
            "mapped_reads/{sample}.bam"
        output:
            protected("sorted_reads/{sample}.bam")
        shell:
            "samtools sort -T sorted_reads/{wildcards.sample} "
            "-O bam {input} > {output}"
    
    
    rule samtools_index:
        input:
            "sorted_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam.bai"
        shell:
            "samtools index {input}"
    
    
    rule bcftools_call:
        input:
            fa="data/genome.fa",
            bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
            bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
        output:
            "calls/all.vcf"
        shell:
            "samtools mpileup -g -f {input.fa} {input.bam} | "
            "bcftools call -mv - > {output}"
    
    
    rule report:
        input:
            "calls/all.vcf"
        output:
            "report.html"
        run:
            from snakemake.utils import report
            with open(input[0]) as vcf:
                n_calls = sum(1 for l in vcf if not l.startswith("#"))
    
            report("""
            An example variant calling workflow
            ===================================
    
            Reads were mapped to the Yeast
            reference genome and variants were called jointly with
            SAMtools/BCFtools.
    
            This resulted in {n_calls} variants (see Table T1_).
            """, output[0], T1=input[0])
    

    config.yaml maps each sample name to the path of its FASTQ file:

    cat config.yaml
    samples:
        A: data/samples/A.fastq
        B: data/samples/B.fastq
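
How the final Snakefile uses this file can be sketched in Python. The dict below is what the YAML above would look like once loaded, and `SimpleNamespace` is only a stand-in for the wildcards object that Snakemake passes to an input function; `bwa_map_input` is an illustrative name.

```python
from types import SimpleNamespace

# config.yaml as it would look after being loaded into a dict.
config = {
    "samples": {
        "A": "data/samples/A.fastq",
        "B": "data/samples/B.fastq",
    }
}

# The input function from rule bwa_map: look up the FASTQ path by sample name.
bwa_map_input = lambda wildcards: config["samples"][wildcards.sample]

# Stand-in for the wildcards object Snakemake would pass for mapped_reads/A.bam.
wildcards = SimpleNamespace(sample="A")
print(bwa_map_input(wildcards))

# Iterating over the dict yields its keys, which is why
# sample=config["samples"] in expand(...) produces one file per sample name.
print(list(config["samples"]))
```

Adding a sample then only means adding one line to config.yaml; the Snakefile itself stays unchanged.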
    

    Then simply run snakemake -s Snakefile (the -s option can be omitted when the file is named Snakefile):

    snakemake
    Provided cores: 1
    Rules claiming more threads will be scaled down.
    Job counts:
        count   jobs
        1   all
        1   bcftools_call
        3   bwa_map
        1   report
        3   samtools_index
        3   samtools_sort
        12
    
    rule bwa_map:
        input: data/genome.fa, data/samples/B.fastq
        output: mapped_reads/B.bam
        log: logs/bwa_mem/B.log
        jobid: 11
        wildcards: sample=B
    
    Finished job 11.
    1 of 12 steps (8%) done
    
    rule samtools_sort:
        input: mapped_reads/B.bam
        output: sorted_reads/B.bam
        jobid: 7
        wildcards: sample=B
    ....# the output is too long to paste in full
    Finished job 1.
    11 of 12 steps (92%) done
    
    localrule all:
        input: report.html
        jobid: 0
    
    Finished job 0.
    12 of 12 steps (100%) done
    

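    After the run finishes there is a calls directory containing the .vcf file, which holds the SNP calling results.
    The graph for part of the workflow can be viewed with: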

    snakemake --dag -s snakefile1 sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
    

    The graph of the whole workflow:

    snakemake --dag | dot -Tsvg > dag.svg
    

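    Key points:
    1. wildcards. Gives access to the parts of a file name matched by the wildcards. For example, if the pattern "{dataset}/file.{group}.txt" matches the file 101/file.A.txt, then {wildcards.dataset} is 101 and {wildcards.group} is A.
    2. expand. expand("sorted_reads/{sample}.bam", sample=SAMPLES) substitutes each value in SAMPLES into {sample} in turn and returns the resulting list of paths.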
