1. 概述

高通量测序原则上是测序深度越高越好，但是受制于测序费用，我们就需要在地深度的测序数据中开辟一片最大利用率的分析策略。

分析流程我们主要使用 Snakemake 搭建。

2. Snakemake 简介

Snakemake 流程框架是基于 Python 搭建的，其优秀的通配能力使得我们可以很轻松的运行重复性的工作。
基本框架如下：

Snakemake 结构

Snakemake：主要流程文件
config：配置程序运行环境及所使用的部分文件
sample：配置样本信息
Script：放置流程需要的部分脚本
Result：结果目录

准备工作 Snakemake 代码：

import yaml
##========= Globals ==========
configfile: 'config.yaml'
## Set samples
FILES = yaml.load(open(config['SampleYaml']))
SAMPLES = sorted(FILES.keys())
## set output Dir 
WORKDIR = config["WorkDir"]
## Set reference file
DNA = config["dna"]
GTF = config["gtf"]
VCF = config["vcf"]
BWA_INDEX = config["bwaIndex"]

## ======== Rules ============
rule all:
    input:
        expand( WORKDIR + "Step3.Picard/{sample}.rmdup.bam", sample=SAMPLES ),
        expand( WORKDIR + "Step3.Picard/{sample}.rmdup.mtx", sample=SAMPLES )

## ======== Step 0 Prepare Rename the fastq file ======== 
rule renameFastq:
    input:
        lambda wildcards:FILES[wildcards.sample]
    output:
       R1 = WORKDIR + "Step0.Prepare/{sample}.R1.fq.gz",
       R2 = WORKDIR + "Step0.Prepare/{sample}.R2.fq.gz"
    run:
        shell("ln -sf {input[0]} {output.R1}")
        shell("ln -sf {input[1]} {output.R2}")

3. 数据分析

3.1 低质量数据过滤

从测序公司拿到的原始数据由于含有低质量数据、接头序列以及 PCR Duplicate 等对后续分析有影响的数据，所以需要首先进行过滤。之前的文章有提到，我们使用海普洛斯出品的 fastp 软件进行过滤。过滤后的 Clean Reads 进行下列分析。

DNA 数据与 RNA 数据我们在比对前都使用了相同的过滤标准：

1. 设置分数值 < 20 的碱基为低质量碱基
2. 低质量碱基最大允许比例为 5 %
3. 测序 Reads 中最多允许 5 个 N 碱基
4. 剪切掉接头序列后最短不少于 50 bp
5. 对低质量碱基进行矫正

使用 fastp 过滤的 Snakemake 代码：

##======== Step 1 raw fastq filter ========
rule fastqFilter:
    input:
        R1 = WORKDIR + "Step0.Prepare/{sample}.R1.fq.gz",
        R2 = WORKDIR + "Step0.Prepare/{sample}.R2.fq.gz"
    output:
        R1 = WORKDIR + "Step1.fastqFilter/{sample}/{sample}.R1.fq.gz",
        R2 = WORKDIR + "Step1.fastqFilter/{sample}/{sample}.R2.fq.gz",
        json = WORKDIR + "Step1.fastqFilter/{sample}/{sample}.json",
        html = WORKDIR + "Step1.fastqFilter/{sample}/{sample}.html"
    log:
        WORKDIR + "logs/Step1.fastqFilter/{sample}.fastqFilter.log"
    benchmark:
        WORKDIR + "benchmark/Step1.fastqFilter/{sample}.benchmark"
    params:
        "--detect_adapter_for_pe --qualified_quality_phred 20 --n_base_limit 5"
        "--unqualified_percent_limit 5 --length_required 50 --correction"
    threads:
        8
    shell:
        "fastp {params} -w {threads} -i {input.R1} -I {input.R2} -o {output.R1} "
        "-O {output.R2} -j {output.json} -h {output.html} 2> {log}")

过滤后获得 Clean Reads 用于后续分析。

具体见第二部分。