snakemake杂记：多个转录组比对到多个基因组得到多个bam

作者: 小明的数据分析笔记本 | 来源:发表于2023-05-17 10:32 被阅读0次

将Marker与多个物种blast结果输出位CSV文件的pyth
sam文件转换为bam文件——SAMtools
IGV基因组浏览器打开BAM文件查看reads比对情况
生物数据格式 - wig
不同版本坐标转换工具CrossMap
4、RNAseq(4)--Hisat2进行序列比对及Samtoo
使用Mikado进行基因结构注释
如何从BAM文件中提取fastq
sam转bam文件报错
IGV的使用

我的需求是：

我有10个基因组，然后又12个转录组数据，然后将这个12个基因组数据分别比对到这个10个基因组，每个基因组得到12个bam文件，然后将每个基因组的12个bam文件合并，最终得到10个合并的bam文件

前面比对的步骤没有遇到问题

SAMPLES, = glob_wildcards("../../rnaseq/01.clean.reads/{sample}_clean_1.fq")
GENOMES, = glob_wildcards("{genome}.fa.masked")

GENOMES = [genome.split("/")[0] for genome in GENOMES if len(genome.split("/")) == 3]
GENOMES.remove("D1")

print(GENOMES)
print(SAMPLES)

print("Total genome size: ",len(GENOMES))
print("Total sample size: ",len(SAMPLES))




rule all:
    input:
        expand("{genome}/{genome}/ref_index.1.ht2",genome=GENOMES),
        expand("{genome}/{genome}/{sample}.sam",genome=GENOMES,sample=SAMPLES),
        expand("{genome}/{genome}/{sample}.sorted.bam.bai",genome=GENOMES,sample=SAMPLES),
        expand("{genome}/{genome}/merged.bam",genome=GENOMES),
        expand("{genome}/{genome}/merged.gtf",genome=GENOMES)


rule hisat2_build:
    input:
        ref = "{genome}/{genome}/{genome}.fa.masked"
    output:
        "{genome}/{genome}/ref_index.1.ht2"
    threads:
        8
    resources:
        mem_mb = 48000
    params:
        "{genome}/{genome}/ref_index"
    shell:
        """
        hisat2-build {input.ref} {params} -p {threads}
        """

rule hisat2_align:
    input:
        index = rules.hisat2_build.output,
        r1 = "../../rnaseq/01.clean.reads/{sample}_clean_1.fq",
        r2 = "../../rnaseq/01.clean.reads/{sample}_clean_2.fq"
    output:
        sam = "{genome}/{genome}/{sample}.sam"
    threads:
        8
    resources:
        mem_mb = 64000
    params:
        ref_index = rules.hisat2_build.params
    shell:
        """
        hisat2 -p {threads} --dta -x {params.ref_index} \
        -1 {input.r1} -2 {input.r2} -S {output.sam}
        """

rule samtools_sort:
    input:
        sam = rules.hisat2_align.output.sam
    output:
        bam = "{genome}/{genome}/{sample}.sorted.bam"
    threads:
        8
    resources:
        mem_mb = 64000
    shell:
        """
        samtools sort -@ {threads} -O bam -o {output.bam} {input.sam}
        """

rule samtools_index:
    input:
        rules.samtools_sort.output.bam
    output:
        "{genome}/{genome}/{sample}.sorted.bam.bai"
    threads:
        8
    resources:
        mem_mb = 24000
    shell:
        """
        samtools index {input}
        """

到合并的步骤最开始的写法是

rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=GENOMES,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=GENOMES,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这样写的问题是合并的时候每个基因组对应的是120个bam文件

然后换成下面的写法

rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个报错提示

name 'wildcards' is not defined

然后又换成下面的写法

def getGenomes(wildcards):
  return wildcards.genome

rule samtools_merge:
    input:
        bams = expand(rules.samtools_sort.output.bam,genome=getGenome,sample=SAMPLES),
        bai = expand(rules.samtools_index.output,genome=getGenome,sample=SAMPLES)
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个还是报错，报错内容忘记截图了，而且报错很诡异

然后以关键词搜索，找到了一个链接

https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command

这里面提到

'wildcards' is not "directly" defined in the input of a rule. You need to use a [function of 'wildcards'](http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#functions-as-input-files) instead. I'm not sure I understand exactly what you want to do, but you may try something like that.

这里提供了一个写法

def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,    
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...

安装这个思路我写我自己的

def getGenome01(wildcards):
    return expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES)

def getGenome02(wildcards):
    return expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)

rule samtools_merge:
    input:
        bams = getGenome01,
        bai = getGenome02
    output:
        "{genome}/{genome}/merged.bam"
    threads:
        16
    resources:
        mem_mb = 64000
    shell:
        """
        samtools merge -@ {threads} {output} {input.bams}
        """

这个是可以正常运行的