美文网首页
snakemake杂记:多个转录组比对到多个基因组得到多个bam

snakemake杂记:多个转录组比对到多个基因组得到多个bam

作者: 小明的数据分析笔记本 | 来源:发表于2023-05-17 10:32 被阅读0次

    我的需求是:

    我有10个基因组,然后又12个转录组数据,然后将这个12个基因组数据分别比对到这个10个基因组,每个基因组得到12个bam文件,然后将每个基因组的12个bam文件合并 ,最终得到10个合并的bam文件

    前面比对的步骤没有遇到问题

    SAMPLES, = glob_wildcards("../../rnaseq/01.clean.reads/{sample}_clean_1.fq")
    GENOMES, = glob_wildcards("{genome}.fa.masked")
    
    GENOMES = [genome.split("/")[0] for genome in GENOMES if len(genome.split("/")) == 3]
    GENOMES.remove("D1")
    
    print(GENOMES)
    print(SAMPLES)
    
    print("Total genome size: ",len(GENOMES))
    print("Total sample size: ",len(SAMPLES))
    
    
    
    
    rule all:
        input:
            expand("{genome}/{genome}/ref_index.1.ht2",genome=GENOMES),
            expand("{genome}/{genome}/{sample}.sam",genome=GENOMES,sample=SAMPLES),
            expand("{genome}/{genome}/{sample}.sorted.bam.bai",genome=GENOMES,sample=SAMPLES),
            expand("{genome}/{genome}/merged.bam",genome=GENOMES),
            expand("{genome}/{genome}/merged.gtf",genome=GENOMES)
    
    
    rule hisat2_build:
        input:
            ref = "{genome}/{genome}/{genome}.fa.masked"
        output:
            "{genome}/{genome}/ref_index.1.ht2"
        threads:
            8
        resources:
            mem_mb = 48000
        params:
            "{genome}/{genome}/ref_index"
        shell:
            """
            hisat2-build {input.ref} {params} -p {threads}
            """
    
    rule hisat2_align:
        input:
            index = rules.hisat2_build.output,
            r1 = "../../rnaseq/01.clean.reads/{sample}_clean_1.fq",
            r2 = "../../rnaseq/01.clean.reads/{sample}_clean_2.fq"
        output:
            sam = "{genome}/{genome}/{sample}.sam"
        threads:
            8
        resources:
            mem_mb = 64000
        params:
            ref_index = rules.hisat2_build.params
        shell:
            """
            hisat2 -p {threads} --dta -x {params.ref_index} \
            -1 {input.r1} -2 {input.r2} -S {output.sam}
            """
    
    rule samtools_sort:
        input:
            sam = rules.hisat2_align.output.sam
        output:
            bam = "{genome}/{genome}/{sample}.sorted.bam"
        threads:
            8
        resources:
            mem_mb = 64000
        shell:
            """
            samtools sort -@ {threads} -O bam -o {output.bam} {input.sam}
            """
    
    rule samtools_index:
        input:
            rules.samtools_sort.output.bam
        output:
            "{genome}/{genome}/{sample}.sorted.bam.bai"
        threads:
            8
        resources:
            mem_mb = 24000
        shell:
            """
            samtools index {input}
            """
    

    到合并的步骤最开始的写法是

    rule samtools_merge:
        input:
            bams = expand(rules.samtools_sort.output.bam,genome=GENOMES,sample=SAMPLES),
            bai = expand(rules.samtools_index.output,genome=GENOMES,sample=SAMPLES)
        output:
            "{genome}/{genome}/merged.bam"
        threads:
            16
        resources:
            mem_mb = 64000
        shell:
            """
            samtools merge -@ {threads} {output} {input.bams}
            """
    

    这样写的问题是合并的时候每个基因组对应的是120个bam文件

    然后换成下面的写法

    rule samtools_merge:
        input:
            bams = expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES),
            bai = expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)
        output:
            "{genome}/{genome}/merged.bam"
        threads:
            16
        resources:
            mem_mb = 64000
        shell:
            """
            samtools merge -@ {threads} {output} {input.bams}
            """
    

    这个报错提示

    name 'wildcards' is not defined

    然后又换成下面的写法

    def getGenomes(wildcards):
      return wildcards.genome
    
    rule samtools_merge:
        input:
            bams = expand(rules.samtools_sort.output.bam,genome=getGenome,sample=SAMPLES),
            bai = expand(rules.samtools_index.output,genome=getGenome,sample=SAMPLES)
        output:
            "{genome}/{genome}/merged.bam"
        threads:
            16
        resources:
            mem_mb = 64000
        shell:
            """
            samtools merge -@ {threads} {output} {input.bams}
            """
    

    这个还是报错,报错内容忘记截图了,而且报错很诡异

    然后以关键词 搜索,找到了一个链接

    https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command

    这里面提到

    'wildcards' is not "directly" defined in the input of a rule. You need to use a [function of 'wildcards'](http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#functions-as-input-files) instead. I'm not sure I understand exactly what you want to do, but you may try something like that.

    这里提供了一个写法

    def condition2tumorsamples(wildcards):
        return expand(
            "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
            sample=config['conditions'][wildcards.condition]['tumor'])
    
    def condition2normalsamples(wildcards):
        return expand(
            "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
            sample=config['conditions'][wildcards.condition]['normal'])
    
    rule gatk_RealignerTargetCreator:
        input:
            tumor = condition2tumorsamples,
            normal = condition2normalsamples,    
        output:
            "mapped_reads/merged_samples/{condition}.realign.intervals"
        # remainder of the rule here...
    

    安装这个思路我写我自己的

    def getGenome01(wildcards):
        return expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES)
    
    def getGenome02(wildcards):
        return expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)
    
    rule samtools_merge:
        input:
            bams = getGenome01,
            bai = getGenome02
        output:
            "{genome}/{genome}/merged.bam"
        threads:
            16
        resources:
            mem_mb = 64000
        shell:
            """
            samtools merge -@ {threads} {output} {input.bams}
            """
    

    这个是可以正常运行的

    推文记录的是自己的学习笔记,内容可能会存在错误,请大家批判着看,欢迎大家指出其中的错误

    欢迎大家关注我的公众号

    小明的数据分析笔记本

    小明的数据分析笔记本 公众号 主要分享:1、R语言和python做数据分析和数据可视化的简单小例子;2、园艺植物相关转录组学、基因组学、群体遗传学文献阅读笔记;3、生物信息学入门学习资料及自己的学习笔记!

    相关文章

      网友评论

          本文标题:snakemake杂记:多个转录组比对到多个基因组得到多个bam

          本文链接:https://www.haomeiwen.com/subject/fpkisdtx.html