I. Background

1. ChIP (Chromatin immunoprecipitation)

(1) ChIP-seq is an experiment that answers the question: is a protein bound to a piece of DNA or not? Put simply, it determines which DNA fragments are bound by the protein.
(2) Main steps:

break up the DNA into small pieces, around 100 base pairs in length (sonication)
wash the DNA with the target protein. The protein will bind to specific sequences of DNA. (enrichment)
use antibodies that bind specifically to the target protein to grab the DNA that has the protein attached to it
make more copies of the DNA (amplification) with the protein attached to it
sequence the DNA
https://noorsiddiqui.com/what-is-chip-seq-atac-seq/

(3) Limitations at the time
First: only the ends of the ChIP fragments are sequenced.

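At the time, high-throughput sequencing was still in its early days. Most runs were single-end (SE) and reads were short, only 25-50 bp (note that the "tags" in the paper are what we now call reads), so only the ends of the ChIP fragments could be sequenced.
As a result, the precise location of the protein binding site cannot be read off directly.
The tags therefore have to be shifted by an appropriate distance, based on the mapping results, to predict the binding site, but a good tag-to-site distance estimate is often unknown to the user.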

[Figure: ends of the ChIP fragments]

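Second: sequencing and mapping biases.
That is, local tag counts in the mapped data can show abnormal, unexpected distributions that are caused by the sequencing process, the chromatin environment, and similar factors rather than by protein binding. For example: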

sequencing and mapping biases;
chromatin structure;
genome copy number variations.

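In theory a control sample can be used for comparison to remove this noise, but the authors found from the literature that the results were not very good.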

2. Poisson Distribution

The Poisson distribution is a probability distribution that models the number of times an event occurs in an interval of time or space. Its formula is:

$$\Pr(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

k: the number of occurrences.
λ: the expected rate of occurrences.
Pr: the probability of k occurrences given λ.

A practical example about selling canned fruit can deepen our understanding of the Poisson distribution.
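
To make the Poisson formula concrete, here is a minimal Python sketch that evaluates Pr(X = k) = λ^k e^(-λ) / k! directly. The function name `poisson_pmf` and the shop numbers (an average of 2 cans sold per day) are hypothetical and chosen only for illustration; they do not come from the canned-fruit example mentioned above.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k occurrences when lam occurrences are expected."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Hypothetical shop that sells 2 cans of fruit per day on average.
lam = 2.0
for k in range(5):
    print(f"P(sell exactly {k} cans) = {poisson_pmf(k, lam):.4f}")
```

Summing `poisson_pmf(k, lam)` over all k gives 1, and the probabilities peak near k = λ.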

II. How MACS works

1. Modeling the shift size of ChIP-Seq tags

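Rationale
Because ChIP-DNA fragments are equally likely to be sequenced from both ends, the tag density around a true binding site should show a bimodal enrichment pattern.
That is, although only the ends of the mapped fragments are sequenced, the tags fall randomly on either end and so form a bimodal enrichment pattern.
So as long as the enrichment at the two ends can be identified, the protein binding site can be inferred fairly precisely.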


As the figure below shows, Watson strand tags are enriched upstream of the binding site and Crick strand tags are enriched downstream.

[Figure: a bimodal enrichment pattern]

The following steps operate on the ChIP-seq tags after they have been mapped to the reference genome.

First, MACS is given a sonication size (bandwidth, 100~300 bp) and a high-confidence fold-enrichment (mfold).
bandwidth: typically the size of the sonicated DNA fragments that carry the protein binding site.
mfold: the threshold used to decide whether a region is enriched.
Then MACS slides 2 × bandwidth windows across the genome to find regions with tags more than mfold enriched relative to a random tag genome distribution.

Because the window size is limited to 2 × bandwidth, a single window generally detects only one peak at a time.
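
The window scan described above can be sketched as follows. This is an illustration under stated assumptions, not MACS's actual code: the function name `enriched_windows`, the step size (one bandwidth per slide) and the way the random expectation is computed (total tags × window size / genome size) are all assumptions, and the toy tag positions are made up.

```python
from bisect import bisect_left


def enriched_windows(tag_positions, genome_size, bandwidth, mfold, step=None):
    """Return (start, end, count) for windows with more than mfold-fold enrichment.

    The window is 2 * bandwidth wide. The step size is an assumption
    (defaults to one bandwidth); the notes do not say how MACS slides.
    """
    window = 2 * bandwidth
    step = step or bandwidth
    tags = sorted(tag_positions)
    # Assumed "random tag genome distribution": tags spread evenly over the genome.
    expected = len(tags) * window / genome_size
    hits = []
    for start in range(0, max(genome_size - window, 0) + 1, step):
        end = start + window
        count = bisect_left(tags, end) - bisect_left(tags, start)
        if count > mfold * expected:
            hits.append((start, end, count))
    return hits


# Toy data: 50 evenly spaced background tags plus a 40-tag cluster at 5000.
background = list(range(0, 10_000, 200))
cluster = list(range(5000, 5040))
hits = enriched_windows(background + cluster, genome_size=10_000,
                        bandwidth=100, mfold=10)
print(hits)
```

With overlapping windows the same cluster is reported by neighbouring windows, so a real implementation would need to merge adjacent hits before treating them as one peak.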

After scanning the whole genome, MACS randomly samples 1,000 of these high-quality peaks, separates their positive and negative strand tags, and aligns them by the midpoint between their positive and negative strand tag centers.
The distance between the modes of the two peaks in the alignment is defined as ‘d’ and represents the estimated fragment length.
MACS shifts all the tags by d/2 toward the 3’ ends to the most likely protein-DNA interaction sites.
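
A rough sketch of estimating d and shifting tags, assuming each sampled peak is given as a pair of position lists (positive strand, negative strand). The names `estimate_d` and `shift_tag`, the use of the mean as the "center" of each strand's tags, and taking the mode from raw counts without smoothing are simplifying assumptions for illustration; the toy positions are invented.

```python
from collections import Counter
from statistics import mean


def estimate_d(peaks):
    """peaks: list of (plus_positions, minus_positions), one pair per sampled peak."""
    plus_offsets, minus_offsets = Counter(), Counter()
    for plus, minus in peaks:
        # Align each peak by the midpoint between its two strand centers.
        midpoint = (mean(plus) + mean(minus)) / 2
        plus_offsets.update(round(p - midpoint) for p in plus)
        minus_offsets.update(round(m - midpoint) for m in minus)
    # d is the distance between the modes of the two aligned distributions.
    plus_mode = plus_offsets.most_common(1)[0][0]
    minus_mode = minus_offsets.most_common(1)[0][0]
    return minus_mode - plus_mode


def shift_tag(position, strand, d):
    """Move a tag by d/2 toward its 3' end."""
    return position + d // 2 if strand == "+" else position - d // 2


# Two toy peaks whose strand modes sit 100 bp apart.
peaks = [
    ([950, 950, 950, 940, 960], [1050, 1050, 1050, 1040, 1060]),
    ([2950, 2950, 2945, 2955], [3050, 3050, 3045, 3055]),
]
d = estimate_d(peaks)
print(d, shift_tag(950, "+", d), shift_tag(1050, "-", d))
```

After the shift, tags from both strands pile up at the same location, which is the presumed protein-DNA interaction site.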

Peak detection

(1) A uniform λBG

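Above we mentioned finding regions with tags more than mfold enriched relative to a random tag genome distribution.
With the current genome coverage of most ChIP-Seq experiments, tag distribution along the genome could be modeled by a Poisson distribution.
As noted earlier, the Poisson formula depends on a single parameter, λ. For the tag distribution, λ is the average sequencing depth (the expected tag count), and λBG is that value computed over the whole genome.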

MACS then finds candidate peaks with a significant tag enrichment (Poisson distribution p-value based on λBG, default 10^-5).
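
As an illustration of the p-value test, the sketch below computes the upper-tail Poisson probability P(X >= k) for a window containing k tags. Deriving λBG as total tags × window size / genome size is an assumption on my part, and the tag count, genome size and window size are hypothetical numbers.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed term by term from k upward."""
    if k <= 0:
        return 1.0
    # First term computed in log space to avoid overflow for large k.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


# Hypothetical experiment: 1,000,000 tags, 100 Mb genome, 200 bp window.
lambda_bg = 1_000_000 * 200 / 100_000_000  # expected tags per window = 2.0
for k in (5, 15):
    p = poisson_upper_tail(k, lambda_bg)
    print(k, p, "candidate" if p < 1e-5 else "not significant")
```

Summing the tail directly, rather than computing 1 minus the cumulative probability, keeps very small p-values from being lost to floating-point rounding.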

(2) A dynamic λlocal

In the control samples, we often observe tag distributions with local fluctuations and biases.
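That is, in a given region of the control sample the tag counts do fluctuate and deviate from the background, yet this cannot be called a peak.
Each region therefore needs to be treated separately, using a dynamic λlocal defined for each candidate peak.
λlocal is computed as λlocal = max(λBG, [λ1k,] λ5k, λ10k), where λ1k, λ5k and λ10k are estimated from the 1 kb, 5 kb and 10 kb windows centered at the peak location in the control sample.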

When there is no control sample, λ1k is not used.


In this way λlocal captures the influence of local biases, and is robust against occasional low tag counts at small local regions.
MACS uses λlocal to calculate the p-value of each candidate peak and removes potential false positives due to local biases (that is, peaks significantly under λBG, but not under λlocal). In other words, λlocal is used to filter out false-positive peaks.
The ratio between the ChIP-Seq tag count and λlocal is reported as the fold_enrichment.
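
The λlocal rule and the fold_enrichment ratio can be sketched in a few lines. The helper names (`scaled_lambda`, `lambda_local`, `fold_enrichment`) are hypothetical, and rescaling each window's control tag count to the peak width by simple proportion is an assumption; the counts in the example are invented.

```python
def scaled_lambda(control_count, window_width, peak_width):
    """Expected tags in a peak-sized region, from a wider control window (assumed proportional)."""
    return control_count * peak_width / window_width


def lambda_local(lambda_bg, lambda_5k, lambda_10k, lambda_1k=None):
    """max(lambda_BG, [lambda_1k,] lambda_5k, lambda_10k); lambda_1k is skipped without a control."""
    candidates = [lambda_bg, lambda_5k, lambda_10k]
    if lambda_1k is not None:
        candidates.append(lambda_1k)
    return max(candidates)


def fold_enrichment(chip_tag_count, lam_local):
    """Ratio between the ChIP tag count and lambda_local."""
    return chip_tag_count / lam_local


# Hypothetical 200 bp candidate peak with 30 ChIP tags.
lam_1k = scaled_lambda(5, 1_000, 200)       # 1.0
lam_5k = scaled_lambda(150, 5_000, 200)     # 6.0
lam_10k = scaled_lambda(225, 10_000, 200)   # 4.5
lam = lambda_local(2.0, lam_5k, lam_10k, lambda_1k=lam_1k)
print(lam, fold_enrichment(30, lam))
```

Here the 5 kb window dominates, so the peak is judged against λ = 6.0 rather than the genome-wide 2.0, which makes it harder for a locally noisy region to pass as a peak.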

2. QC

Scaling libraries (library normalization)
For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count.
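
A minimal sketch of the linear scaling, assuming control tag counts are available per region. The function name `scale_control` and the numbers are hypothetical.

```python
def scale_control(control_counts, chip_total, control_total):
    """Linearly rescale control counts so the control total matches the ChIP total."""
    factor = chip_total / control_total
    return [count * factor for count in control_counts]


# Hypothetical libraries: the control was sequenced twice as deeply as the ChIP sample.
scaled = scale_control([10, 4, 0], chip_total=2_000_000, control_total=4_000_000)
print(scaled)
```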
Removing duplicate tags
Sometimes the same tag can be sequenced repeatedly, more times than expected from a random genome-wide tag distribution.
Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation, and are likely to add noise to the final peak calls.
MACS removes duplicate tags in excess of what is warranted by the sequencing depth.
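
Duplicate removal can be sketched as capping the number of tags kept at each identical position and strand. How MACS derives the cap from the sequencing depth is not covered in these notes, so `max_dup` is simply a caller-supplied, hypothetical parameter here, and the tags are toy data.

```python
from collections import defaultdict


def remove_duplicates(tags, max_dup):
    """Keep at most max_dup tags per identical (chrom, position, strand)."""
    seen = defaultdict(int)
    kept = []
    for tag in tags:
        seen[tag] += 1
        if seen[tag] <= max_dup:
            kept.append(tag)
    return kept


tags = [("chr1", 100, "+")] * 4 + [("chr1", 100, "-"), ("chr1", 250, "+")]
print(remove_duplicates(tags, max_dup=1))
```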

3. Calculating the FDR

For a ChIP-Seq experiment with controls, MACS empirically estimates the false discovery rate (FDR) for each detected peak.
MACS uses the same parameters to find ChIP peaks over control and control peaks over ChIP (that is, a sample swap).
The empirical FDR is defined as Number of control peaks / Number of ChIP peaks.
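
One way to read the definition above is sketched below: given the p-values of peaks called ChIP-over-control and of peaks called control-over-ChIP (the sample swap), the FDR at each ChIP peak is the number of control peaks at least as significant divided by the number of ChIP peaks at least as significant. Applying the ratio at each peak's own p-value is my interpretation, and the function name and p-values are hypothetical.

```python
from bisect import bisect_right


def empirical_fdr(chip_pvalues, control_pvalues):
    """Return, for each ChIP peak, (#control peaks with p <= its p) / (#ChIP peaks with p <= its p)."""
    chip_sorted = sorted(chip_pvalues)
    control_sorted = sorted(control_pvalues)
    fdrs = []
    for p in chip_pvalues:
        n_control = bisect_right(control_sorted, p)
        n_chip = bisect_right(chip_sorted, p)
        fdrs.append(n_control / n_chip)
    return fdrs


# Hypothetical p-values from the two peak-calling runs.
chip = [1e-12, 1e-9, 1e-7, 1e-6]
control = [1e-8, 5e-6]
print(empirical_fdr(chip, control))
```

The two strongest ChIP peaks have no control peak at their significance level, so their estimated FDR is 0.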

Open questions
1. How do the windows slide?
2. After scanning the whole genome, MACS randomly samples 1,000 of these high-quality peaks, separates their positive and negative strand tags, and aligns them by the midpoint between their centers. How exactly is this alignment done?
3. FDR: what exactly is a sample swap?