nf-core 一个把各个组学分析流程化发了 NBT 的牛逼东西

作者: 热衷组培的二货潜 | 来源:发表于2020-03-04 14:55 被阅读0次

链接：
The nf-core framework for community-curated bioinformatics pipelines

包括哪些流程？

nascent ：Nascent Transcription Processing Pipeline
proteomicslfq：UNDER CONSTRUCTION - Proteomics label-free quantification (LFQ) analysis pipeline
RNA-seq：RNA sequencing analysis pipeline using STAR, HISAT2 and Salmon with gene counts and quality control
ampliseq: 16S rRNA amplicon sequencing analysis workflow using QIIME2
CAGE-seq: CAGE-sequencing analysis pipeline with trimming, alignment and counting of CAGE tags.
nanoseq: UNDER CONSTRUCTION: Nanopore demultiplexing, QC and alignment pipeline
sRNA-seq: A small-RNA sequencing analysis pipeline
sarek: Analysis pipeline to detect germline or somatic variants from WGS / targeted sequencing
bacass: Simple bacterial assembly and annotation pipeline
ChIP-seq: ChIP-seq peak-calling, QC and differential analysis pipeline.
ATAC-seq: ATAC-seq peak-calling, QC and differential analysis pipeline
MNase-seq: UNDER CONSTRUCTION: MNase-seq analysis pipeline using BWA and DANPOS2.
SmartSeq2: A pipeline for processing single cell RNA-seq data generated with the SmartSeq2 protocol.
scrnaseq: A single-cell RNAseq pipeline for 10X genomics data
RNAfusion: RNA-seq analysis pipeline for detection gene-fusions
BS-seq: Methylation (Bisulfite-Sequencing) analysis pipeline using Bismark or bwa-meth + MethylDackel
lncpipe: UNDER DEVELOPMENT--- Analysis of long non-coding RNAs from RNA-seq datasets
Hi-C: Analysis of Chromosome Conformation Capture data (Hi-C)
GUIDE-Seq: UNDER CONSTRUCTION - CRISPR-Cas off-target identification using GUIDE-Seq
slamseq: UNDER CONSTRUCTION: Slam-seq processing and analysis pipeline
hlatyping: Precision HLA typing from next-generation sequencing data
exoseq : Please consider using/contributing to https://github.com/nf-core/sarek
.....
以上仅列出我想列出的流程，还有很多。下面我拆分一个流程（BS-seq），看看主要步骤：

从上图可以看到主要分为两种方法。一种从 Bismark，一种用 bwa-meth。

分析参考手册：

https://github.com/nf-core/methylseq/blob/master/docs/usage.md

# 输入文件
--reads 'path/to/data/sample_*_{1,2}.fastq' # 双端
--single_end --reads '*.fastq' #单端



# 参考基因组
## 内置的
人类：--genome GRCh37 和 --genome GRCh38
小鼠：--genome GRCm38

## 自己提供参考基因组序列
### Single multifasta for genome
--fasta /path/to/genome.fa

### Bismark index directory
--bismark_index /path/to/ref/BismarkIndex/

### bwa-meth index filename base
### where for example the index files are called:
### /path/to/ref/genome.fa.bwameth.c2t.bwt
--bwa_meth_index /path/to/ref/genome.fa

### Genome Fasta index file
--fasta_index /path/to/genome.fa.fai 

## 保存建立的索引信息
--save_reference



# 接头过滤设置
Parameter   5' R1 Trim  5' R2 Trim  3' R1 Trim  3' R2 Trim
--pbat                 6        9       6       9
--single_cell         6         6   6   6
--epignome        8             8   8   8
--accel           10    15  10  10
--zymo          10  15  10  10
--cegx          6   6   2   2

--rrbs # 征对于 RRBS 建库
--pbat # PBAT 建立，互补链，与参数 --directional （bismark 默认的方式）相反
--non_directional # 输出四种可能的链，注意 --single_cell 和 --zymo 模式自动设置了此参数
--skip_trimming # 如果输入文件为 clean data 可以此参数跳过这一步

--comprehensive # 输出 CHG/CHH 信息

--relax_mismatches and --num_mismatches # Bismark 默认 --num_mismatches 为 0.2，--relax_mismatches 则为 0.6；
# 0.6 will allow a penalty of bp * -0.6 - for 100bp reads, this is -60. Mismatches cost -6, gap opening -5 and gap extension -2. So, -60 would allow 10 mismatches or ~ 8 x 1-2bp indels.

--unmapped # 保存为比对上的 reads
--save_trimmed # 保存过滤掉的 reads

--min_depth # MethylDackel 输出甲基化信息掉的最小覆盖度
--meth_cutoff # Bimark 中 bismark_methylation_extractor 提取甲基化信息的最小覆盖度

--ignore_flags：MethylDackel 处理 BAM 文件时候忽略 SAM 文件 flag 值
--methyl_kit：MethylDackel 输出结果可直接输入 R 包 methylKit 进行后续分析

--known_splices：Bismark 运行 only works with `--aligner bismark_hisat`

--slamseq：Specify to run Bismark with the `--slam` flag to run bismark in SLAM-seq mode (only works with `--aligner bismark_hisat`)

--local_alignment：Specify to run Bismark with the --local flag to allow soft-clipping of reads. This should only be used with care in certain single-cell applications or PBAT libraries, which may produce chimeric read pairs

--max_memory：最大内存，例如 --max_memory '8.GB'
--max_cpus：--max_cpus 1

主要分析步骤提取

参数设置地方：

parameters.settings.json

00、基因组建立 index

# 第一种：Bismark
## $aligner 有两种：'--hisat2' : '--bowtie2'，注意这两的 index 不能互用
aligner = params.aligner == 'bismark_hisat' ? '--hisat2' : '--bowtie2' # 比对软件
slam = params.slamseq ? '--slam' : '' # 测序类型

mkdir BismarkIndex
cp $fasta BismarkIndex/
bismark_genome_preparation $aligner $slam BismarkIndex

# 第二种：bwameth
bwameth.py index $fasta

插播：Bismark 比对参数
https://github.com/FelixKrueger/Bismark
Bismark alignment modes

STEP 1 - FastQC

fastqc --quiet --threads $task.cpus $reads

STEP 2 - Trim Galore!

说明书：https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md

5' Trimming can be accomplished with Trim Galore using:

--clip_r1 <NUMBER> (Read 1) or

--clip_r2 <NUMBER> (Read 2)

3' Trimming can be accomplished with Trim Galore using:

--three_prime_clip_r1 <NUMBER> (Read 1) or

--three_prime_clip_r2 <NUMBER> (Read 2).

c_r1 = params.clip_r1 > 0 ? "--clip_r1 ${params.clip_r1}" : ''
c_r2 = params.clip_r2 > 0 ? "--clip_r2 ${params.clip_r2}" : ''
tpc_r1 = params.three_prime_clip_r1 > 0 ? "--three_prime_clip_r1 ${params.three_prime_clip_r1}" : ''
tpc_r2 = params.three_prime_clip_r2 > 0 ? "--three_prime_clip_r2 ${params.three_prime_clip_r2}" : ''
rrbs = params.rrbs ? "--rrbs" : '' # 有几种模式：`--rrbs`/`--non_directional`
# $reads 输入的 fq 文件
# 单端
trim_galore --fastqc --gzip $rrbs $c_r1 $tpc_r1 $reads

# 双端
trim_galore --paired --fastqc --gzip $rrbs $c_r1 $c_r2 $tpc_r1 $tpc_r2 $reads

使用 Bismark 流程

STEP 3.1 - align with Bismark

aligner = params.aligner == "bismark_hisat" ? "--hisat2" : "--bowtie2"
pbat = params.pbat ? "--pbat" : ''
non_directional = params.single_cell || params.zymo || params.non_directional ? "--non_directional" : ''
unmapped = params.unmapped ? "--unmapped" : ''
mismatches = params.relax_mismatches ? "--score_min L,0,-${params.num_mismatches}" : ''
soft_clipping = params.local_alignment ? "--local" : ''
splicesites = params.aligner == "bismark_hisat" && knownsplices.name != 'null' ? "--known-splicesite-infile <(hisat2_extract_splice_sites.py ${knownsplices})" : ''
multicore = "--multicore $ccore" # 这一步有根据的你的服务器计算能力计算，详情看原文， $ccore 表示线程数

bismark $input \\
            $aligner \\
            --bam $pbat $non_directional $unmapped $mismatches $multicore \\
            --genome $index \\
            $reads \\
            $soft_clipping \\
            $splicesites

STEP 4 - Bismark deduplicate

fq_type = params.single_end ? '-s' : '-p' # 单端 `-s`; 双端 `-p`

deduplicate_bismark $fq_type --bam $bam

STEP 5 - Bismark methylation extraction

插播：https://github.com/FelixKrueger/Bismark/tree/master/Docs

comprehensive = params.comprehensive ? '--comprehensive --merge_non_CpG' : ''
meth_cutoff = params.meth_cutoff ? "--cutoff ${params.meth_cutoff}" : ''
# 测试 CPU
ccore = ((task.cpus as int) / 10) as int
multicore = "--multicore $ccore"
# 测试内存
mbuffer = (task.memory as nextflow.util.MemoryUnit) - 2.GB  
buffer = "--buffer_size ${mbuffer.toGiga()}G"   # if( mbuffer.compareTo(4.GB) == 1 )

## 如果是植物请加上 --CX 不然不会输出 CHG\CHH 甲基化信息
# 单端
bismark_methylation_extractor $comprehensive $meth_cutoff \\
                $multicore $buffer \\
                --bedGraph \\
                --counts \\
                --gzip \\
                -s \\
                --report \\
                $bam


# 双端
bismark_methylation_extractor $comprehensive $meth_cutoff \\
                $multicore $buffer \\
                --ignore_r2 2 \\
                --ignore_3prime_r2 2 \\
                --bedGraph \\
                --counts \\
                --gzip \\
                -p \\
                --no_overlap \\
                --report \\
                $bam

STEP 6 - Bismark Sample Report

# set val(name), file(align_log), file(dedup_log), file(splitting_report), file(mbias) from ch_bismark_logs_for_bismark_report

bismark2report \\
            --alignment_report $align_log \\
            --dedup_report $dedup_log \\
            --splitting_report $splitting_report \\
            --mbias_report $mbias

STEP 7 - Bismark Summary Report

bismark2summary # 直接运行此就行

使用 bwa-meth 流程

STEP 3.1 - align with bwa-meth

fasta = bwa_meth_indices[0].toString() - '.bwameth' - '.c2t' - '.amb' - '.ann' - '.bwt' - '.pac' - '.sa'
prefix = reads[0].toString() - ~/(_R1)?(_trimmed)?(_val_1)?(\.fq)?(\.fastq)?(\.gz)?$/

bwameth.py \\
            --threads ${task.cpus} \\
            --reference $fasta \\
            $reads | samtools view -bS - > ${prefix}.bam

STEP 4.- samtools flagstat on samples

samtools sort $bam \\
            -@ ${task.cpus} $sort_mem \\
            -o ${bam.baseName}.sorted.bam

samtools index ${bam.baseName}.sorted.bam
samtools flagstat ${bam.baseName}.sorted.bam > ${bam.baseName}_flagstat_report.txt
samtools stats ${bam.baseName}.sorted.bam > ${bam.baseName}_stats_report.txt

STEP 5 - Mark duplicates with picard

picard -Xmx${avail_mem}g MarkDuplicates \\
                INPUT=$bam \\
                OUTPUT=${bam.baseName}.markDups.bam \\
                METRICS_FILE=${bam.baseName}.markDups_metrics.txt \\
                REMOVE_DUPLICATES=false \\
                ASSUME_SORTED=true \\
                PROGRAM_RECORD_ID='null' \\
                VALIDATION_STRINGENCY=LENIENT

samtools index ${bam.baseName}.markDups.bam

STEP 6 - extract methylation with MethylDackel

all_contexts = params.comprehensive ? '--CHG --CHH' : ''
min_depth = params.min_depth > 0 ? "--minDepth ${params.min_depth}" : ''
ignore_flags = params.ignore_flags ? "--ignoreFlags" : ''
methyl_kit = params.methyl_kit ? "--methylKit" : ''

MethylDackel extract $all_contexts $ignore_flags $methyl_kit $min_depth $fasta $bam
MethylDackel mbias $all_contexts $ignore_flags $fasta $bam ${bam.baseName} --txt > ${bam.baseName}_methyldackel.txt

STEP 8 - Qualimap

samtools sort $bam \\
        -@ ${task.cpus} $sort_mem \\
        -o ${bam.baseName}.sorted.bam

qualimap bamqc $gcref \\
        -bam ${bam.baseName}.sorted.bam \\
        -outdir ${bam.baseName}_qualimap \\
        --collect-overlap-pairs \\
        --java-mem-size=${task.memory.toGiga()}G \\
        -nt ${task.cpus}

STEP 9 - preseq

samtools sort $bam \\
        -@ ${task.cpus} $sort_mem \\
        -o ${bam.baseName}.sorted.bam

preseq lc_extrap -v -B ${bam.baseName}.sorted.bam -o ${bam.baseName}.ccurve.txt

STEP 10 - MultiQC

rtitle = custom_runName ? "--title \"$custom_runName\"" : ''
rfilename = custom_runName ? "--filename " + custom_runName.replaceAll('\\W','_').replaceAll('_+','_') + "_multiqc_report" : ''

multiqc -f $rtitle $rfilename --config $multiqc_config . \\
        -m custom_content -m picard -m qualimap -m bismark -m samtools -m preseq -m cutadapt -m fastqc

nf-core 一个把各个组学分析流程化发了 NBT 的牛逼东西

包括哪些流程？

分析参考手册：

主要分析步骤提取

参数设置地方：

00、基因组建立 index

STEP 1 - FastQC

STEP 2 - Trim Galore!

使用 Bismark 流程

STEP 3.1 - align with Bismark

STEP 4 - Bismark deduplicate

STEP 5 - Bismark methylation extraction

STEP 6 - Bismark Sample Report

STEP 7 - Bismark Summary Report

使用 bwa-meth 流程

STEP 3.1 - align with bwa-meth

STEP 4.- samtools flagstat on samples

STEP 5 - Mark duplicates with picard

STEP 6 - extract methylation with MethylDackel

STEP 8 - Qualimap

STEP 9 - preseq

STEP 10 - MultiQC

STEP 11 - Output Description HTML

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

BS-seq

生信分析工具包

生信

NGS

微生物信息学

my RNA-seq