[10] 6 Peak calling & &

作者: 热衷组培的二货潜 | 来源:发表于2019-07-06 19:52 被阅读10次

[10] 6 Peak calling & &
网页设计 - 收藏集 - 掘金
焦点领动【功能升级】2020年4月第一周：新增全站AMP功能
第十二章这火，你得灭
AMP是如何提升性能？
AE表达式名词解释
AE表达式--抖动
AMP vs MIP
【黑苹果】关于CodecCommander定制过程中耳机杂音修复
在实体引用中，实体名称必须紧跟在'&'后面

6.3.3 实验分辨率

Peak caller 通常将邻近的 Peak 合并到较大的区域中，这可能导致分辨率损失。因此，将两种类型的区域分离开来是很重要的：Broad 与 sharp 区域。一方面，来自组蛋白 ChIP-seq 的数据显示信号的广泛富集，并且需要大区域的识别（示例代码见 章节 6.5 ）。因此，组蛋白修饰类型的 Peak finder 应将重点放在确定区域边界上，而不是确定 Peak summits 。 这是一项非平凡的任务，仍然是一项挑战，目前的 Peak caller 仍然难以在样本之间找到一致的组蛋白结合区域边界。这也可能是由于这种类型的信号的性质。
另一方面，来自于转录因子类型的 Sharp 信号类型富集，需要鉴定 narrow 区域的分析方法（示例代码见 章节 6.4 ）。由于调节区由多个 TF 结合位点组成（ ref8、ref9）， ChIP-seq 数据的分辨率对于区分它们是至关重要的。在最近的方法改进中，如 ChIP-exo （ ref10 ）和 ChIP-nexus （ ref11 ），实验分辨率增加到几个碱基对。因此，应用于TF ChIP-seq 数据的 Peak caller 应适用于这些没有 control 的新方法，并通过聚焦于 Peak summits 而不是整个 Peak regions 来利用实验分辨率。要确定哪种方法对 TF 数据最有效，最好的指标是 Peak summits 之间的最小距离，而不是 Peak regions 的大小（ ref12 ）。

6.3.4 新一代的 Peak caller

最近开发的方法确实考虑了上面讨论的一些细节。MACS 的升级版 MACS2 ，既能鉴定 Broad Peaks 有你鉴定 Sharp Peaks （示例代码见章节 6.5 ）。Peakzilla（ ref12 ）是专门开发用于在高分辨率下从转录因子 ChIP-seq 数据中鉴定 Peak（示例代码见章节 6.4 ）。HOMER （ ref13 ）, 最初设计为在 Peak 区域重新识别motif 的工具（示例代码见 章节 9 ）。也有其他类型的工具，如 findPeaks，用于鉴定不同类型的 ChIP-seq 的 Peak 。JAMM（ ref14 ）使用重复的样本来提高 Peak 宽度的分辨率和精度。GEM（ ref15 ）通过包括关于TF motif 的信息作为附加输入来识别高分辨率的 Peak ，然而，我们更喜欢使用 Motif 信息来验证Peak，而不是在识别步骤。已经专门开发了其他方法来比较不同的样本，在 章节 8 将对此进行讨论。

6.3.5 Post-processing

可以Post-processing处理步骤来去除在 Peak calling 过程中不能过滤的人为造成的 Peak。这样的 Peak 出现在基因组 blacklisted 区域中，这些区域在任何 ChIP-seq 数据集（ ChIP 和 Input ）中显示非常高的 reads 丰度，通常位于着丝粒和端粒，并已被 ENCODE 定义为几个物种（见章节 2.2.2 ）。我们还选择去除位于线粒体染色体（ chrM ）上的 Peak。

6.4 `Peakzilla`：转录因子类型数据

Peakzilla 专为 Sharp 的 TF ChIP-seq 数据而设计，以便在高分辨率下 call peak，即解析由 TF homotypic 结合产生的紧密间隔的Peak summits。Peakzilla 可以使用于没有对照control 的 ChIP-exo数据。简而言之，它使用来自高度富集区域的正向和反向reads 的双重分布来估计片段大小。然后，它使用滑动窗口直接扫描沿基因组的 reads 的双重分布来对区域进行评分。为此，它首先计算 ChIP 样本中的 reads 数减去 control 样本中的 reads 数（ 通过文库中比对的 reads 总数进行归一化 ）。此方法允许比变化倍数更改更好的排序。然后，它用 p-value 对这个原始分数进行加权，p-value值检查数据与预期的正向和反向 reads 的双高斯分布的匹配程度。这允许在不去除重复 reads 的情况下用 PCR 人工产物过滤掉位置，以及更好地识别准确的 Peak summits 位置。最后，通过交换 ChIP 和 control 样本计算经验 FDR，计算每个区域的富集倍数，并通过多重检验进行校正。默认情况下，它返回最小得分为 1 的峰值和 2 的富集倍数。唯一需要的输入是两个 ChIP 和 control 文件。

# Create directory for peaks
mkdir peaks
sample=NRF1_CHIP_WT_1
control=NRF1_INPUT_WT

## Break down of the commands (many intermediate files) 

# Run Peakzilla
bamToBed -i reads/${sample}.bam > reads/${sample}.bed

bamToBed -i reads/${control}.bam > reads/${control}.bed

peakzilla.py reads/${sample}.bed reads/${control}.bed -l peaks/${sample}_peakzilla_report.txt >
peaks/${sample}_peakzilla.tsv

# Removal of peaks on chrM (density is usually abnormally high along the complete chrM)

# Reorganisation of output into BED-like format: chromosome / start (0-based) / end / summit / score / fold_enrichment

awk -v OFS="\t" ’(NR>1&&$1!="chrM"){print $1,$2-1,$3,$5,$6,$9}’ peaks/${sample}_peakzilla.tsv > peaks/${sample}_peaks_peakzilla_tmp.bed

# Removal of peaks in blacklisted regions
intersectBed -v -a peaks/${sample}_peaks_peakzilla_tmp.bed
-b mm10_blacklist.bed.gz > peaks/${sample}_peakzilla.bed

# Remove intermediate files
rm reads/${sample}.bed reads/${control}.bed
peaks/${sample}_peakzilla.tsv
peaks/${sample}_peaks_peakzilla_tmp.bed

## 一行命令 (˜6 min)
peakzilla.py < (bamToBed -i reads/${sample}.bam) <
(bamToBed -i reads/${control}.bam) -l
peaks/${sample}_peakzilla_report.txt | awk -v OFS="\t"
’(NR>1&&$1!="chrM"){print $1,$2-1,$3,$5,$6,$9}’ |
intersectBed -v -a stdin -b mm10_blacklist.bed.gz >
peaks/${sample}_peaks_peakzilla.bed

6.5 `MACS2`：组蛋白修饰类型数据

MACS 是最早用来鉴定 ChIP-seq 数据的 Peaks 软件之一，并且仍然是最受欢迎的一种。最初，它被开发用来识别 sharp 的 Peak，但定义的 Peak 相对较宽。最新版本 MACS2 现在两者都可以鉴定。简而言之，它首先删除所有重复的 reads 。然后，使用来自高度富集区域的正向和反向 reads 的双重分布来估计片段大小。它将所有reads 扩展到估计的片段大小，并将 ChIP 和 Input 样本扩展到相同的测序深度，即标准化为文库中比对上的 reads 的总数。然后，它使用滑动窗口扫描沿基因组的片段分布，对区域进行评分。为此，它将 ChIP 中的富集程度与对照样本进行比较，并使用局部泊松分布计算显著性得分。最后，它使用 Benjamini-Hochberg 过程对多重检验进行校正。默认情况下，它返回最小 q-value 得分为 0.05 的 Peak。输入参数包括 ChIP 和 Input 文件的路径、文件格式、基因组大小、输出目录和样本名称。--broad 是用来鉴定 broad peaks。
```
# Run MACS2 (˜7 min)
# q-value threshold can be adapted with the option  --broad-cutoff (default 0.1)

macs2 callpeak -t reads/${sample}.bam -c
reads/${control}.bam -f BAM -g mm --outdir peaks -n
${sample} --broad 2> peaks/${sample}_macs2_report.txt

# Reorganisation of output into bed-like format: chromosome
/ start (0-based) / end / summit / score / fold_enrichment

cat peaks/${sample}_peaks.xls | grep -v "#" | 
awk -v OFS="\t" '(NR>2&&$1!="chrM"){print $1,$2,$3,$4,$6,$7}' |
intersectBed -v -a stdin -b mm10_blacklist.bed.gz >
peaks/${sample}_peaks_macs.bed

# Removal of intermediate files
rm peaks/${sample}_peaks.xls peaks/${sample}_model.r
peaks/${sample}_peaks.broadPeak
peaks/${sample}_peaks.gappedPeak
```

6.6 饱和度分析

为了检查样品的测序深度是否足以识别大多数结合区域，建议进行饱和分析。它涉及对 reads 的数量进行随机二次取样，并计算使用这些子集识别的峰值数量。然后根据使用的 reads 数绘制峰值数量。如果峰的数量显示饱和并达到一个平台，那么样品的测序足够深。NRF1 ChIP-seq 样本在 2000万 reads 时显示饱和（图 6.2 ）。

图 6.2 NRF1 ChIP-seq 饱和度分析

sample=NRF1_CHIP_WT_1
control=NRF1_INPUT_WT

## Number of peaks found
wc -l peaks/${sample}_peaks_peakzilla.bed

## Saturation analysis
# Does the number of peaks saturate when using all available reads

NB=$(samtools idxstats reads/${sample}.bam | awk '{t=t+$3}END{print t}') 

L=$sample
for subset in ‘seq 1000000 1000000 $NB‘;
do
L=$L"\t"$(peakzilla.py < (bamToBed -i         reads/${sample}.bam | shuf -n $subset
--random-source=reads/${sample}.bam) < (bamToBed -i
reads/${control}.bam) -l /dev/null | wc -l)
done

echo -e $L > peaks/${sample}_peakzilla_saturation_table.txt

# Open R
R

# Load saturation table
d = read.table(paste("peaks/",sample,
"_peakzilla_saturation_table.txt",sep=""),row.names=1)

# Generate the saturation plot
pdf(paste("peaks/",sample,"_peakzilla_saturation_plot.pdf", sep=""))

par(bg="white")
plot(seq(1,ncol(d),1),
     d[1,],
     type="l",
     xlab="Number of reads in million",
     ylab="Number of peakscalled",
     main=c("Saturation analysis",
            rownames(d)[1],
            paste("Total",d[1,ncol(d)],"peaks")))
dev.off()

# Close R
q()

ref1：Practical guidelines for the comprehensive analysis of ChIP-seq data.

ref2：How does multiple testing correction work?

ref3：Evaluation of algorithm performance in ChIP-seq peak detection

ref4：Model-based analysis of ChIP- Seq (MACS).

ref5：Design and analysis of ChIP-seq experiments for DNA-binding proteins.

ref6：Systematic evaluation of factors influencing ChIP-seq fidelity

ref7：Genome-wide mapping of in vivo protein-DNA interactions.

ref8：Homotypic clusters of transcription factor binding sites are a key component of human
promoters and enhancers

ref9：Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome.

ref10：Comprehensive genomewide protein-DNA interactions detected at single-nucleotide resolution.

ref11：ChIP-nexus enables improved detection of in vivo transcription factor binding footprints

ref12：Identification of transcription factor binding sites from ChIP-seq data at high resolution

ref13：Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities.

ref14：JAMM: a peak finder for joint analysis of NGS replicates

ref15：High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints.

[10] 6 Peak calling & &
6.3.3 实验分辨率 Peak caller 通常将邻近的 Peak 合并到较大的区域中，这可能导致分辨率损失。...
网页设计 - 收藏集 - 掘金
免费且又精致的 HTML/CSS 站点模板 - 前端 - 掘金&amp;amp;amp;amp;l...
焦点领动【功能升级】2020年4月第一周：新增全站AMP功能
新增全站AMP功能【AMP简介】AMP：Accelerated Mobile Pages，意思是移动页面加速。A...
第十二章这火，你得灭
“Ich bin Schnappi das kleine Krokodil Komm aus &amp...
AMP是如何提升性能？
AMP简介 Google推出移动网页加速项目AMPAMP介绍 AMP是如何提升性能？以下的优化点是AMP页面能被...
AE表达式名词解释
函数: - wiggle(freq, amp, octaves = 1, amp_mult = .5, t = t...
AE表达式--抖动
表达式:wiggle(freq, amp, octaves = 1, amp_mult = .5, t = tim...
AMP vs MIP
谷歌AMP https://amp.dev/ 百度MIP https://www.mipengine.org/ 谷...
【黑苹果】关于CodecCommander定制过程中耳机杂音修复
查找类似上图中的节点（关键字：Stereo Amp-In Amp-Out 或 IN OUT HP EAPD De...
在实体引用中，实体名称必须紧跟在'&'后面
’&’是非法字符，用 & 替代