DMR calling: how comb-p works (2019-12-02)

Author: 机器人会画画 | Published 2019-12-02 21:47


1.(https://www.biostars.org/p/54994/)
https://github.com/brentp/combined-pvalues/
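Various high-throughput technologies generate genome-wide data for studying processes such as DNA binding, methylation status and histone modification. These technologies, including tiling arrays and sequencing-based assays, produce data that are typically autocorrelated across the genome, which makes inference difficult. After multiple-testing correction over potentially millions of sites, the significance of individual regions may be diminished. In such studies, a hypothesis test can be performed at each position to generate a P-value for the effect of interest. To this end, Kechris et al. (2010) developed a method for combining P-values in sliding windows while accounting for spatial correlation across the genome. Here, we build on that method with software that allows for unevenly spaced data across the genome, a more general autocorrelation calculation, and multiple-testing correction of peaks (i.e. enriched genomic regions), and that is applicable to many different technologies.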

All programs within comb-p expect files in simple BED format (Kent et al., 2002) sorted by chromosome and start. Additional columns contain the P-value(s) of interest based on the study design and generated from any software or statistical test.
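To make the input format concrete, here is a minimal sketch of reading such a file in Python. It is not part of comb-p: the function names `read_bed_pvalues` and `is_sorted` are made up for illustration, and it assumes tab-separated lines with the P-value in the fourth column (index 3), which may differ in your data.

```python
def read_bed_pvalues(lines, p_col=3):
    """Parse BED-like lines into (chrom, start, end, pvalue) tuples.

    p_col is the zero-based index of the P-value column (an assumption;
    adjust it to match your own file).
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        records.append(
            (fields[0], int(fields[1]), int(fields[2]), float(fields[p_col]))
        )
    return records


def is_sorted(records):
    """Check that records are ordered by chromosome, then by start."""
    keys = [(chrom, start) for chrom, start, _end, _p in records]
    return keys == sorted(keys)
```

Checking the sort order up front is cheap and avoids confusing results later, since the windowed steps described below rely on neighboring sites being adjacent in the file.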

Autocorrelation
Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It is the same as calculating the correlation between two different time series, except autocorrelation uses the same time series twice: once in its original form and once lagged one or more time periods.
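As a small illustration of this definition, the sketch below computes the sample autocorrelation of a series at a given lag. The function name `autocorrelation` is chosen here for illustration. For genomic data the "lag" would be a distance between sites rather than a time step, which this toy version does not model.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` with itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    # Total variation of the series around its mean.
    var = sum((x - mean) ** 2 for x in series)
    # Co-variation between the series and its lagged copy.
    cov = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return cov / var
```

A lag of 0 always gives 1.0, a slowly drifting series gives positive values at small lags, and a series that flips sign at every step gives a negative value at lag 1.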
This website explains it as follows:
https://www.statisticssolutions.com/autocorrelation/

A common method of testing for autocorrelation is the Durbin-Watson test. Statistical software such as SPSS may include the option of running the Durbin-Watson test when conducting a regression analysis. The Durbin-Watson test produces a test statistic that ranges from 0 to 4. Values close to 2 (the middle of the range) suggest less autocorrelation, and values closer to 0 or 4 indicate greater positive or negative autocorrelation, respectively.
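The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses the illustrative name `durbin_watson` and assumes the residuals from a regression are already available as a list.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    # Squared differences between each residual and the previous one.
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    # Sum of squared residuals.
    den = sum(e * e for e in residuals)
    return num / den
```

Residuals that barely change from one point to the next push the statistic toward 0 (positive autocorrelation), while residuals that alternate in sign push it toward 4 (negative autocorrelation).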

Once the autocorrelation function (ACF) has been calculated, it can be used to perform the Stouffer–Liptak–Kechris correction (slk), where each P-value is adjusted according to adjacent P-values, weighted according to the ACF. The resulting BED file has an additional column containing the corrected P-value. A given P-value will be pulled lower if its neighbors also have low P-values (and little autocorrelation) and will likely remain insignificant if the neighboring P-values are also high.
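The following is a simplified sketch of the idea, not comb-p's actual implementation. It converts P-values to z-scores, sums them over a window, and divides by the standard deviation of that sum, which is the square root of the sum of all entries of the correlation matrix. The names `combine_correlated` and `slk_correct`, the `acf` callable (distance in bases to correlation) and the `max_dist` window are all assumptions made for illustration. P-values must lie strictly between 0 and 1, and the correlation matrix is assumed to have a positive total.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def combine_correlated(pvals, corr):
    """Stouffer-Liptak combination given a correlation matrix `corr`."""
    # Small P-values map to large positive z-scores.
    z = [-_STD_NORMAL.inv_cdf(p) for p in pvals]
    # Var(sum of z) equals the sum of all correlation matrix entries.
    variance = sum(sum(row) for row in corr)
    z_comb = sum(z) / sqrt(variance)
    # Upper-tail probability of the combined z-score.
    return _STD_NORMAL.cdf(-z_comb)


def slk_correct(positions, pvals, acf, max_dist):
    """Adjust each P-value using neighbors within `max_dist` bases.

    `acf(d)` is a hypothetical callable returning the correlation
    expected between two sites that are `d` bases apart.
    """
    corrected = []
    for pos in positions:
        # Indices of all sites in the window (quadratic scan, fine for a sketch).
        idx = [j for j, q in enumerate(positions) if abs(q - pos) <= max_dist]
        corr = [
            [1.0 if a == b else acf(abs(positions[a] - positions[b])) for b in idx]
            for a in idx
        ]
        corrected.append(combine_correlated([pvals[j] for j in idx], corr))
    return corrected
```

With independent neighbors the evidence adds up and low P-values reinforce each other. With perfectly correlated neighbors the combined value equals the single-site value, because the neighbors carry no extra information.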

A q-value score based on the Benjamini–Hochberg false discovery rate (FDR) correction or on a null model from shuffled data may then be calculated. The peak-finding algorithm can then be used to find enrichment regions or peaks on the FDR q-value, the slk-corrected P-value or on the original P-value.
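Both steps can be sketched in a few lines. The first function computes Benjamini-Hochberg q-values in the standard way. The second is a deliberately simple peak finder, not comb-p's algorithm: it keeps sites whose value is at or below a seed threshold and merges those that lie within `max_gap` bases of each other on the same chromosome. The names `bh_qvalues`, `find_peaks`, `seed` and `max_gap` are illustrative assumptions.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    qvals = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonic q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        qvals[i] = running_min
    return qvals


def find_peaks(sites, seed, max_gap):
    """Merge nearby sites with value <= seed into regions.

    `sites` is a list of (chrom, start, end, value) sorted by chromosome
    and start. Returns a list of (chrom, start, end, n_sites).
    """
    peaks = []
    current = None
    for chrom, start, end, value in sites:
        if value > seed:
            continue
        if current and current[0] == chrom and start - current[2] <= max_gap:
            # Close enough to the open region: extend it.
            current[2] = max(current[2], end)
            current[3] += 1
        else:
            if current:
                peaks.append(tuple(current))
            current = [chrom, start, end, 1]
    if current:
        peaks.append(tuple(current))
    return peaks
```

The `value` passed to `find_peaks` can be the q-value, the slk-corrected P-value or the original P-value, matching the three options named above.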

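The region_p program reports the slk-corrected P-value together with a Šidák (1967) one-step multiple-testing correction. For a given region, the number of possible tests used in the Šidák correction is the total number of bases covered by all input probes divided by the size of that region.
In short, we use the FDR q-values to define the extent of each region, and then use the SLK correction of the original P-values to define the significance of the region.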
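The Šidák step described above amounts to computing 1 - (1 - p)^k, where k is the number of possible tests. A minimal sketch follows, with the function and parameter names (`sidak_region`, `region_len`, `total_coverage`) chosen here for illustration rather than taken from comb-p.

```python
from math import expm1, log1p


def sidak_region(p_region, region_len, total_coverage):
    """Sidak-corrected P-value for one region.

    The number of tests is the total number of bases covered by all
    input probes divided by the length of the region.
    """
    if p_region >= 1.0:
        return 1.0
    n_tests = total_coverage / region_len
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -expm1(n_tests * log1p(-p_region))
```

For example, a 100-base region with P = 0.001 in data covering 10,000 bases counts as 100 possible tests, so the corrected value is 1 - 0.999^100, a little under the Bonferroni bound of 0.1.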

The corrected P-value reported by comb-p can be used as a filter to extract regions of interest; we calculated the enrichment ratio of the number of observed to expected Ci target genes at various comb-p-corrected P-value cutoffs. For a cutoff of 0.1, the enrichment is 2.41; it increases to 3.46 and 5.29 for the more stringent cutoffs of 1e−3 and 1e−4, respectively.

Bisulfite-sequencing (BS-Seq) is also used to measure methylation across the genome. As another example of the flexibility of our method, we demonstrate a possible analysis on data described in Hsieh et al. (2009) from Arabidopsis thaliana using MethylCoder (Pedersen et al., 2011) to map the bisulfite-treated reads to the genome. At each site, we use Fisher’s exact test to obtain P-values for the counts of converted and un-converted cytosines between endosperm and embryo. We find DMRs between these two tissues associated with genes enriched for gene ontologies related to the ribosome (P = 1e−3).
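To show what the per-site test looks like, here is a self-contained sketch of a two-sided Fisher's exact test on a 2x2 table of read counts. It is a generic textbook implementation, not code from comb-p or MethylCoder. The assumed table layout is one row per tissue (endosperm, embryo) and one column each for converted and un-converted cytosine counts, and the function name `fisher_exact_two_sided` is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Assumed layout: rows are tissues, columns are converted and
    un-converted cytosine counts at one site.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed.
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)
```

Running this at every cytosine gives one P-value per site, which can then be written out as the extra column of a sorted BED file for the steps described earlier.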