Proteomics Database Searching and FDR Control

Author: MJades | Published 2020-03-07 14:14


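Part 1. How is spectrum database searching implemented in proteomics?

In proteomics, software tools that search mass spectra against a sequence database generally use one of the following three approaches: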

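1. Deriving a partial or complete peptide sequence from the mass information (first implemented by PeptideSearch and graph theory based de novo methods);

2. Cross-correlation between the experimental and calculated spectra (first applied in SEQUEST);

3. Calculating the probability that the observed number of matches between theoretical and measured fragment masses occurred by chance (first used in Mascot).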

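A brief introduction to the Andromeda peptide search engine:
Andromeda, the peptide search engine embedded in MaxQuant, scores peptide-spectrum matches with a probability based on the binomial distribution and uses that score in downstream analysis, for example to rank candidate peptides and to assess the likelihood of peptide modifications. Standalone Andromeda can process small numbers of spectra; for each processed spectrum it returns scored lists of peptides and proteins, but without strict FDR control.
Andromeda's strengths are: 1. identifying multiple modifications on the same peptide; 2. resolving mixed spectra.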
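To make the idea of a binomial-probability score concrete, here is a minimal sketch. It assumes the score is -10·log10 of the probability of observing at least k chance matches among n theoretical fragment ions, with a per-ion chance probability of q/100 when the q most intense peaks per 100 Da are kept. The function and parameter names are hypothetical, and this is not Andromeda's actual implementation, which also optimizes over q and handles modifications and neutral losses.

```python
import math

def binomial_match_score(n_theoretical, k_matched, peaks_per_100da):
    """Return -10*log10(P(at least k_matched chance matches)).

    Assumption: each theoretical ion matches a peak by chance with
    probability peaks_per_100da / 100.
    """
    p = peaks_per_100da / 100.0
    # Upper tail of the binomial distribution: k_matched or more matches.
    tail = sum(
        math.comb(n_theoretical, j) * p**j * (1 - p) ** (n_theoretical - j)
        for j in range(k_matched, n_theoretical + 1)
    )
    if tail <= 0.0:
        return float("inf")  # probability underflowed to zero
    return -10.0 * math.log10(tail)

# More matched ions for the same peptide length give a higher score.
print(binomial_match_score(20, 5, 4), binomial_match_score(20, 10, 4))
```

The logarithmic scale means that every factor of ten in chance probability adds ten points to the score.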

Schematic of the peptide scoring algorithm

transfer to charge=1

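Part 2. FDR control

When multiple hypothesis tests are performed, the significance measure of a single test is not sufficient to assess the overall error rate: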

When multiple independent statistical hypothesis tests are conducted, single hypothesis significance measures (like p-value) are neither sufficient nor amenable to extrapolation to calculate population error rate. This is a classic case of what is called the multiple testing problem.

Under multiple testing, several methods can be used to correct the significance threshold, for example the Benjamini-Hochberg procedure.
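As an illustration of the Benjamini-Hochberg procedure, the following sketch sorts the p-values, finds the largest rank k for which p(k) <= (k/m)·alpha, and rejects every hypothesis up to that rank. The function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank whose p-value is under its threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

print(benjamini_hochberg([0.02, 0.02, 0.03, 0.9]))
```

Note that the smallest p-value in this example fails its own threshold (0.02 > 0.0125) but is still rejected, because a larger rank passes; that is the step-up behaviour that distinguishes this procedure from a simple per-test cutoff.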
False discovery rate (FDR) is a measure of the incorrect PSMs among all accepted PSMs (the rate of false positives in accepted hits).
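Note: FDR is an estimate of the population-level error rate; it does not express the confidence of an individual spectrum match. After FDR correction, the q-value (the minimum FDR at which a given PSM would still be accepted) is the parameter attached to an individual PSM.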

In the context of proteomics, it is a global estimate of the false positives present in the results obtained by a database search algorithm. There are many different strategies to estimate FDR, like the nonparametric simple target-decoy (TD) database searches and the parametric or semi-parametric mixture modeling approaches used in the Trans-Proteomic Pipeline (TPP).

The q-value of a PSM provides a direct measure of significance for a particular PSM with respect to the complete dataset and the risk accrued to the total accepted matches if that hit is deemed significant.

In proteomics, FDR is estimated with a decoy database search. The decoy database is generated from the target database by shuffling, randomizing, or simply reversing its sequences.
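A minimal sketch of decoy generation by reversal or shuffling is shown below. The `DECOY_` accession prefix, the function names, and the example sequence are assumptions for illustration; real search engines each have their own conventions.

```python
import random

def make_reversed_decoys(target_proteins, prefix="DECOY_"):
    """Reverse every protein sequence; target_proteins maps accession -> sequence."""
    return {prefix + acc: seq[::-1] for acc, seq in target_proteins.items()}

def make_shuffled_decoys(target_proteins, prefix="DECOY_", seed=0):
    """Shuffle residues within each protein, keeping amino acid composition."""
    rng = random.Random(seed)  # fixed seed so the decoy database is reproducible
    return {
        prefix + acc: "".join(rng.sample(seq, len(seq)))
        for acc, seq in target_proteins.items()
    }

targets = {"P1": "MKTAYRLLK"}
print(make_reversed_decoys(targets))
print(make_shuffled_decoys(targets))
```

Both variants preserve protein length and composition, so the decoy peptides have a mass distribution similar to that of the targets.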

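The basic assumption made for the target-decoy (TD) approach is that the number of false PSMs in the decoy search will be equal to the number of false PSMs in the target search above a given threshold score.

There are two ways to run a TD search: a concatenated search, in which target and decoy sequences are searched together, and a separate search, in which they are searched independently.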

Target-decoy database search

The number of false positives divided by the total hits allows for easy calculation of FDR.
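The calculation can be sketched as follows for a separate TD search, assuming that higher scores are better, that target scores are distinct, and that the FDR at a threshold is estimated as (decoy hits at or above the threshold) / (target hits at or above the threshold). The q-value of a PSM is then the smallest FDR among all thresholds that would still accept it. Function and variable names are hypothetical.

```python
def target_decoy_q_values(target_scores, decoy_scores):
    """Return (score, fdr, q_value) tuples for target PSMs, best score first."""
    targets = sorted(target_scores, reverse=True)
    decoys = sorted(decoy_scores, reverse=True)
    fdrs = []
    n_decoys = 0
    for n_targets, score in enumerate(targets, start=1):
        # Count decoy hits scoring at least as well as the current threshold.
        while n_decoys < len(decoys) and decoys[n_decoys] >= score:
            n_decoys += 1
        fdrs.append(min(1.0, n_decoys / n_targets))
    # q-value: minimum FDR at this threshold or at any more permissive one.
    q_values = fdrs[:]
    for i in range(len(q_values) - 2, -1, -1):
        q_values[i] = min(q_values[i], q_values[i + 1])
    return list(zip(targets, fdrs, q_values))

for row in target_decoy_q_values([10, 9, 8, 7, 6], [8.5, 6.5, 3]):
    print(row)
```

The raw FDR can go down as the threshold is relaxed (here from 1/3 to 1/4), which is why the q-value takes a running minimum from the worst score upward. For a concatenated search a different estimator is commonly used, for example 2·D/(T+D).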

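The PEP is the probability that a PSM is incorrect. It is also called the local FDR, but unlike the FDR it describes the error probability of a single PSM.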

Posterior error probability (PEP) is the probability of a PSM to be incorrect.

Differences between PEP, q-value, and FDR

While the q-value conveys the risk (error introduced) in the whole dataset if we accept the PSM at hand, the PEP on the other hand informs us whether the PSM is likely to be correct or not.

FDR can be calculated from PEP by integrating (summing up) all the PEPs. PEPs can be accurately calculated by using machine learning to learn the model parameters from labeled (correct and incorrect) training data. For any given score x, the PEP can be predicted from the model parameters. This strategy is used in PeptideProphet and ProteinProphet.
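The relation between PEP and FDR can be sketched as follows: the expected number of false PSMs in an accepted set is the sum of their PEPs, so the FDR of the set is that sum divided by the number of accepted PSMs. The PEP values here are made up, and the sketch assumes the PEPs are already well calibrated; it does not show how PeptideProphet fits its model.

```python
def fdr_from_peps(accepted_peps):
    """Estimated FDR of an accepted set: mean of its PEPs."""
    if not accepted_peps:
        return 0.0
    return sum(accepted_peps) / len(accepted_peps)

def q_values_from_peps(peps):
    """q-value per PSM: FDR of the set of PSMs with PEP at or below its own."""
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    q_values = [0.0] * len(peps)
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += peps[idx]
        # The running mean of ascending PEPs never decreases, so it is
        # already a valid (monotone) q-value.
        q_values[idx] = running_sum / rank
    return q_values

print(fdr_from_peps([0.0, 0.1, 0.2]))
print(q_values_from_peps([0.2, 0.0, 0.1]))
```

A PSM's q-value is never larger than its PEP, because the accepted set always includes better PSMs with lower PEPs.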

FDR Calculation Using ProteoStats. ProteoStats is a program written in Perl. To use it on proteomics data, it must first be installed and configured on the local computer.

ProteoStats requires the data to be searched using the separate TD approach, as it can perform the TD competition after the search, as suggested by Fitzgibbon et al.

TD searches are completed separately, and the results, in the form of target and decoy top hits, are provided as input to ProteoStats. When the searches are conducted separately, all the different FDR methods can be applied a posteriori, but if a concatenated search is used, only the concatenated FDR method can be applied, as the correspondence between TD top hits is lost. ProteoStats removes the peptides identical in decoy and target, considering isoleucine and leucine as identical. The resulting TD sets are sorted separately on the basis of scores/e-values/p-values from best to worst, and depending on the search strategy chosen, the FDR, q-value, and receiver operating characteristic (ROC) curve are calculated.
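The post-search competition described above might look like the following sketch. It is written in Python rather than Perl and is not ProteoStats code. The assumptions are: each search yields one top hit per spectrum as (peptide, score) with higher scores better, decoy hits whose peptide equals any target peptide (treating I and L as identical) are dropped, and ties go to the target. All names and example hits are hypothetical.

```python
def normalize_il(peptide):
    """Treat isoleucine and leucine as the same residue (identical mass)."""
    return peptide.replace("I", "L")

def td_competition(target_hits, decoy_hits):
    """Per spectrum, keep the better of the target and decoy top hits.

    Both arguments map spectrum id -> (peptide, score).
    Returns (winning_target_hits, winning_decoy_hits).
    """
    target_peptides = {normalize_il(pep) for pep, _ in target_hits.values()}
    # Drop decoy hits that are indistinguishable from a target peptide.
    decoys = {
        spec: hit for spec, hit in decoy_hits.items()
        if normalize_il(hit[0]) not in target_peptides
    }
    target_winners, decoy_winners = {}, {}
    for spec in set(target_hits) | set(decoys):
        t, d = target_hits.get(spec), decoys.get(spec)
        if d is None or (t is not None and t[1] >= d[1]):
            target_winners[spec] = t
        else:
            decoy_winners[spec] = d
    return target_winners, decoy_winners

targets = {"s1": ("PEPTIDEK", 50.0), "s2": ("GGGK", 20.0), "s3": ("ALLVR", 10.0)}
decoys = {"s1": ("KEDITPEP", 30.0), "s2": ("KGGG", 25.0), "s3": ("AILVR", 40.0)}
print(td_competition(targets, decoys))
```

In this example the decoy hit for `s3` is discarded because `AILVR` and `ALLVR` differ only by I/L, so the target keeps that spectrum despite its lower score.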

FDR calculation methods

FDR calculation methods
Calculation procedure (1)
Calculation procedure (2)

peptide and protein FDR

The FDR for protein estimation is calculated as the ratio of the expected number of false-positive protein identifications (those that have a hit to the decoy database proteins) to that of the total number of protein identifications mapping to the target database at any threshold protein score. For protein FDR, MAYU software can be used which performs protein identification-level FDR on the basis of peptide identifications.

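The following notes explain some details with reference to the algorithms used in Proteome Discoverer 2.2.

In 2.2, the PSM-level FDR is by default calculated with the target and decoy databases handled separately;
When the number of spectra or the number of proteins to be searched is small, FDR estimation does not work, because too few matches to the database are obtained to give meaningful statistics;
The default decoy database in 2.2 is built by directly reversing the protein sequences, but note that this kind of decoy database is not suitable in the following two cases:
a. peptide mass fingerprinting;
b. no-enzyme MS/MS searches, especially with dynamic modifications.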

Notes on the decoy database in PD 2.2
PD 2.2 offers two ways to set up FDRs: the Percolator node and the Target Decoy PSM Validator node.

Percolator is a superior validation algorithm that uses a machine learning approach, but it requires a sufficient number of target and decoy matches that are not always available. In these cases, you can use the Target Decoy PSM Validator node. This node triggers a target and decoy search and calculates score thresholds to achieve the specified target false discovery rate (FDR). The derived score thresholds for the strict and relaxed FDR separate the identified PSMs into high-, medium-, and low-confidence identifications.
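How score thresholds derived from a strict and a relaxed FDR can split PSMs into three confidence classes is sketched below. This is only an illustration, not the Target Decoy PSM Validator's code: the 0.01 and 0.05 targets, the function names, and the example (score, q-value) pairs are assumptions, and q-values are assumed to be non-increasing as the score improves.

```python
def score_threshold(scored_q_values, target_fdr):
    """Lowest score whose q-value is within target_fdr, or None if none passes.

    scored_q_values is a list of (score, q_value) pairs, higher score = better.
    """
    passing = [score for score, q in scored_q_values if q <= target_fdr]
    return min(passing) if passing else None

def classify(score, strict_threshold, relaxed_threshold):
    """Assign high / medium / low confidence from the two score thresholds."""
    if strict_threshold is not None and score >= strict_threshold:
        return "high"
    if relaxed_threshold is not None and score >= relaxed_threshold:
        return "medium"
    return "low"

psms = [(10, 0.0), (9, 0.0), (8, 0.03), (7, 0.04), (6, 0.2)]
strict = score_threshold(psms, 0.01)   # assumed strict FDR target
relaxed = score_threshold(psms, 0.05)  # assumed relaxed FDR target
for score, _ in psms:
    print(score, classify(score, strict, relaxed))
```

Returning `None` when nothing passes reflects the small-dataset case mentioned above, where no score threshold can reach the requested FDR.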

Limitations of Percolator

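The Maximum Delta Cn parameter can be used to reduce the number of PSMs, which in turn affects the PSM-level FDR. Its default value in 2.2 is 0.05. Normally the top-ranked score is clearly higher than the scores of the other candidate PSMs, but when dynamic modifications are present, the scores of the well-matching PSMs can be very close to each other. Therefore, when studying phosphorylation, the Maximum Delta Cn value should be increased appropriately.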
delta Cn
In addition, PSMs can be filtered by setting the Maximum Rank, Maximum Delta Mass, and Score and Threshold parameters.
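The effect of the Maximum Delta Cn filter can be illustrated with a small sketch. It assumes Delta Cn is the normalized score difference between a candidate PSM and the top-scoring PSM of the same spectrum, (top - score) / top, and that scores are positive. The function names and scores are made up, and this is not Proteome Discoverer code.

```python
def delta_cn(top_score, score):
    """Normalized score difference to the best candidate for the spectrum."""
    return (top_score - score) / top_score

def filter_by_delta_cn(candidate_scores, max_delta_cn=0.05):
    """Keep the candidates of one spectrum whose Delta Cn is within the limit."""
    top = max(candidate_scores)
    return [s for s in candidate_scores if delta_cn(top, s) <= max_delta_cn]

# Candidates for one spectrum, e.g. the same peptide with the phosphate
# placed on different residues, which tend to score similarly.
scores = [4.0, 3.9, 3.0]
print(filter_by_delta_cn(scores, 0.05))
print(filter_by_delta_cn(scores, 0.3))
```

Raising the limit keeps more near-tied candidates per spectrum, which is the behaviour wanted when alternative modification sites score almost the same.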