limma: topTable

Author: 浩瀚之宇 | Published 2018-11-29 16:16


adj.P.Val P-value after adjustment for multiple testing. This column is generally recommended as the primary statistic by which to interpret results. Genes with the smallest P-values will be the most reliable.
P.Value Raw P-value
t Moderated t-statistic (only available when two groups of Samples are defined)
B B-statistic or log-odds that the gene is differentially expressed (only available when two groups of Samples are defined)
logFC Log2-fold change between two experimental conditions (only available when two groups of Samples are defined)
F Moderated F-statistic combines the t-statistics for all the pair-wise comparisons into an overall test of significance for that gene (only available when more than two groups of Samples are defined)


x <- topTable(fit2, coef = "SCC-HCS", number = 10000, adjust.method = "BH", sort.by = "B", resort.by = "M")  # "M" is a permitted synonym for "logFC"
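The call above assumes that fit2 already exists. A minimal sketch of how such an object might be produced is shown here, using simulated data and two groups named HCS and SCC to match the coefficient name. The matrix expr, the sample names and the spiked-in effect are invented for illustration and are not the data behind the original call.

```r
library(limma)

set.seed(1)
# Simulated log2-expression matrix: 200 genes x 6 samples (illustrative only)
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
# Make the first 10 genes higher in the last three samples
expr[1:10, 4:6] <- expr[1:10, 4:6] + 3

# Group labels: 3 HCS samples followed by 3 SCC samples
group <- factor(rep(c("HCS", "SCC"), each = 3))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Contrast named exactly "SCC-HCS" so it can be used as coef below
contrast.matrix <- makeContrasts(contrasts = "SCC-HCS", levels = design)

fit  <- lmFit(expr, design)
fit2 <- eBayes(contrasts.fit(fit, contrast.matrix))

# Select the top 10 genes by B, then order them by logFC
x <- topTable(fit2, coef = "SCC-HCS", number = 10,
              adjust.method = "BH", sort.by = "B", resort.by = "logFC")
print(x)
```

With a plain matrix as input, the gene identifiers end up as row names of the returned data frame rather than as a separate annotation column.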

q-value (i.e., the adj.P.Val value)

toptable
From limma v3.28.14 by Gordon Smyth

Table of Top Genes from Linear Model Fit

Extract a table of the top-ranked genes from a linear model fit.

Keywords
htest

Usage

topTable(fit, coef=NULL, number=10, genelist=fit$genes, adjust.method="BH", sort.by="B", resort.by=NULL, p.value=1, lfc=0, confint=FALSE)
toptable(fit, coef=1, number=10, genelist=NULL, A=NULL, eb=NULL, adjust.method="BH", sort.by="B", resort.by=NULL, p.value=1, lfc=0, confint=FALSE, ...)
topTableF(fit, number=10, genelist=fit$genes, adjust.method="BH", sort.by="F", p.value=1, lfc=0)
topTreat(fit, coef=1, sort.by="p", resort.by=NULL, ...)

Arguments

fit
list containing a linear model fit produced by lmFit, lm.series, gls.series or mrlm. For topTable, fit should be an object of class MArrayLM as produced by lmFit and eBayes.
coef
column number or column name specifying which coefficient or contrast of the linear model is of interest. For topTable, can also be a vector of column subscripts, in which case the gene ranking is by F-statistic for that set of contrasts.
number
maximum number of genes to list
genelist
data frame or character vector containing gene information. For topTable only, this defaults to fit$genes.
A
matrix of A-values or vector of average A-values. For topTable only, this defaults to fit$Amean.
eb
output list from ebayes(fit). If NULL, this will be automatically generated.
adjust.method
method used to adjust the p-values for multiple testing. Options, in increasing conservatism, include "none", "BH", "BY" and "holm". See p.adjust for the complete list of options. A NULL value will result in the default adjustment method, which is "BH".
sort.by
character string specifying statistic to rank genes by. Possible values for topTable and toptable are "logFC", "AveExpr", "t", "P", "p", "B" or "none". (Permitted synonyms are "M" for "logFC", "A" or "Amean" for "AveExpr", "T" for "t" and "p" for "P".) Possibilities for topTableF are "F" or "none". Possibilities for topTreat are as for topTable except for "B".
resort.by
character string specifying statistic to sort the selected genes by in the output data.frame. Possibilities are the same as for sort.by.
p.value
cutoff value for adjusted p-values. Only genes with lower p-values are listed.
lfc
minimum absolute log2-fold-change required. topTable and topTableF include only genes with (at least one) absolute log-fold-changes greater than lfc. topTreat does not remove genes but ranks genes by evidence that their log-fold-change exceeds lfc.
confint
logical, should 95% confidence intervals be output for logFC? Alternatively, can take a numeric value between zero and one specifying the confidence level required.
...
For toptable, other arguments are passed to ebayes (if eb=NULL). For topTreat, other arguments are passed to topTable.

Details

toptable is an earlier interface and is retained only for backward compatibility.

These functions summarize the linear model fit object produced by lmFit, lm.series, gls.series or mrlm by selecting the top-ranked genes for any given contrast. topTable and topTableF assume that the linear model fit has already been processed by eBayes. topTreat assumes that the fit has been processed by treat.

The p-values for the coefficient/contrast of interest are adjusted for multiple testing by a call to p.adjust. The "BH" method, which controls the expected false discovery rate (FDR) below the specified value, is the default adjustment method because it is the most likely to be appropriate for microarray studies. Note that the adjusted p-values from this method are bounds on the FDR rather than p-values in the usual sense. Because they relate to FDRs rather than rejection probabilities, they are sometimes called q-values. See help("p.adjust") for more information.
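As a small illustration of the adjustment step, the sketch below applies p.adjust with method "BH" to a made-up vector of raw p-values and reproduces the result by hand. The p-values are invented for the example. The manual formula (each sorted p-value scaled by n/rank, then made monotone) is the standard Benjamini-Hochberg step-up calculation.

```r
# Hypothetical raw p-values for six genes
p <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.5)

# BH adjustment as done through p.adjust
bh <- p.adjust(p, method = "BH")

# The same calculation spelled out:
# sort p-values from largest to smallest, scale by n / rank,
# take the running minimum, cap at 1, and restore the original order
n  <- length(p)
o  <- order(p, decreasing = TRUE)
ro <- order(o)
manual <- pmin(1, cummin(n / (n:1) * p[o]))[ro]

print(data.frame(p = p, BH = bh, manual = manual))
```

The adjusted values are never smaller than the raw p-values, which is why a gene can look significant in P.Value and not in the adjusted column.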

Note that if there is no good evidence for differential expression in the experiment, it is quite possible for all the adjusted p-values to be large, or even for all of them to be equal to one. The latter can happen when the smallest p-value is no smaller than 1/ngenes, where ngenes is the number of genes with non-missing p-values.
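A toy example of this situation, with five invented p-values whose minimum equals 1/ngenes, is sketched below. Under "holm" every adjusted value becomes one, while under "BH" the adjusted values are large but not necessarily all equal to one.

```r
# Five hypothetical raw p-values; the smallest equals 1/ngenes = 1/5
p <- c(0.2, 0.3, 0.6, 0.9, 0.25)
ngenes <- length(p)

holm <- p.adjust(p, method = "holm")
bh   <- p.adjust(p, method = "BH")

print(min(p) >= 1 / ngenes)          # condition described in the text
print(data.frame(p = p, holm = holm, BH = bh))
```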

The sort.by argument specifies the criterion used to select the top genes. The choices are: "logFC" to sort by the (absolute) coefficient representing the log-fold-change; "A" to sort by average expression level (over all arrays) in descending order; "T" or "t" for absolute t-statistic; "P" or "p" for p-values; or "B" for the lods or B-statistic.

Normally the genes appear in order of selection in the output table. If a different order is wanted, then the resort.by argument may be useful. For example, topTable(fit, sort.by="B", resort.by="logFC") selects the top genes according to log-odds of differential expression and then orders the selected genes by log-ratio in decreasing order. Or topTable(fit, sort.by="logFC", resort.by="logFC") would select the genes by absolute log-fold-change and then sort them from most positive to most negative.
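The select-then-reorder behaviour can be mimicked on an ordinary data frame. The sketch below is not limma's internal code. It uses an invented table tab with hypothetical logFC and B values to show what sort.by="B" followed by resort.by="logFC" amounts to.

```r
# Hypothetical results table (values invented for illustration)
tab <- data.frame(
  gene  = c("g1", "g2", "g3", "g4", "g5"),
  logFC = c(-2.0, 0.5, 1.5, -0.2, 3.0),
  B     = c(4.0, -1.0, 2.5, -3.0, 1.0),
  stringsAsFactors = FALSE
)

n_top <- 3

# Step 1: select the top genes by B (analogous to sort.by = "B")
selected <- tab[order(tab$B, decreasing = TRUE), ][seq_len(n_top), ]

# Step 2: reorder only the selected genes by logFC, most positive first
# (analogous to resort.by = "logFC")
resorted <- selected[order(selected$logFC, decreasing = TRUE), ]

print(resorted)
```

Gene g1 has the highest B and is therefore selected, but it ends up last after resorting because its log-fold-change is the most negative of the three.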

topTableF ranks genes on the basis of moderated F-statistics for one or more coefficients. If topTable is called and coef has two or more elements, then the specified columns will be extracted from fit and topTableF called on the result. topTable with coef=NULL is the same as topTableF, unless the fitted model fit has only one column.

Toptable output for all probes in original (unsorted) order can be obtained by topTable(fit,sort="none",n=Inf). However write.fit or write may be preferable if the intention is to write the results to a file. A related method is as.data.frame(fit) which coerces an MArrayLM object to a data.frame.

By default number probes are listed. Alternatively, by specifying p.value and number=Inf, all genes with adjusted p-values below a specified value can be listed.

The argument lfc gives the ability to filter genes by log-fold change. This argument is not available for topTreat because treat already handles fold-change thresholding in a more sophisticated way.
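In limma the corresponding call would be along the lines of topTable(fit, number=Inf, p.value=0.05, lfc=1). The effect of the two cutoffs can be sketched on a plain data frame, as below. The table and its values are invented, and the example deliberately avoids values that sit exactly on a cutoff.

```r
# Hypothetical results table (values invented for illustration)
tab <- data.frame(
  gene      = c("g1", "g2", "g3", "g4", "g5"),
  logFC     = c(2.1, -1.8, 0.4, 1.2, -0.3),
  adj.P.Val = c(0.001, 0.02, 0.03, 0.20, 0.60),
  stringsAsFactors = FALSE
)

p_cut   <- 0.05  # plays the role of the p.value argument
lfc_cut <- 1     # plays the role of the lfc argument

# Keep genes that pass both the adjusted p-value and the fold-change cutoff
kept <- tab[tab$adj.P.Val < p_cut & abs(tab$logFC) > lfc_cut, ]

print(kept)
```

Gene g3 passes the p-value cutoff but is dropped because its absolute log-fold-change is below 1, and g4 is dropped for the opposite reason.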

Value

A data frame with one row for each of the number top genes and the following columns:
genelist
one or more columns of probe annotation, if genelist was included as input
logFC
estimate of the log2-fold-change corresponding to the effect or contrast (for topTableF there may be several columns of log-fold-changes)
CI.L
left limit of confidence interval for logFC (if confint=TRUE or confint is numeric)
CI.R
right limit of confidence interval for logFC (if confint=TRUE or confint is numeric)
AveExpr
average log2-expression for the probe over all arrays and channels, same as Amean in the MArrayLM object
t
moderated t-statistic (omitted for topTableF)
F
moderated F-statistic (omitted for topTable unless more than one coef is specified)
P.Value
raw p-value
adj.P.Val
adjusted p-value or q-value
B
log-odds that the gene is differentially expressed (omitted for topTreat)

10.1 Summary Top-Tables
Limma provides functions topTable() and decideTests() which summarize the results of the linear model, perform hypothesis tests and adjust the p-values for multiple testing. Results include (log) fold changes, standard errors, t-statistics and p-values. The basic statistic used for significance analysis is the moderated t-statistic, which is computed for each probe and for each contrast. This has the same interpretation as an ordinary t-statistic except that the standard errors have been moderated across genes, i.e., shrunk towards a common value, using a simple Bayesian model. This has the effect of borrowing information from the ensemble of genes to aid with inference about each individual gene [30]. Moderated t-statistics lead to p-values in the same way that ordinary t-statistics do except that the degrees of freedom are increased, reflecting the greater reliability associated with the smoothed standard errors. The effectiveness of the moderated t approach has been demonstrated on test data sets for which the differential expression status of each probe is known [11].
A number of summary statistics are presented by topTable() for the top genes and the selected contrast. The M-value (M) is the value of the contrast. Usually this represents a log2-fold change between two or more experimental conditions although sometimes it represents a log2-expression level. The A-value (A) is the average log2-expression level for that gene across all the arrays and channels in the experiment. Column t is the moderated t-statistic. Column P.Value is the associated p-value and adj.P.Val is the p-value adjusted for multiple testing. The most popular form of adjustment is "BH" which is Benjamini and Hochberg's method to control the false discovery rate [1]. The adjusted values are often called q-values if the intention is to control or estimate the false discovery rate. The meaning of "BH" q-values is as follows. If all genes with q-value below a threshold, say 0.05, are selected as differentially expressed, then the expected proportion of false discoveries in the selected group is controlled to be less than the threshold value, in this case 5%. This procedure is equivalent to the procedure of Benjamini and Hochberg although the original paper did not formulate the method in terms of adjusted p-values.
The B-statistic (lods or B) is the log-odds that the gene is differentially expressed [30, Section 5]. Suppose for example that B = 1.5. The odds of differential expression is exp(1.5)=4.48, i.e., about four and a half to one. The probability that the gene is differentially expressed is 4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this gene is differentially expressed. A B-statistic of zero corresponds to a 50-50 chance that the gene is differentially expressed. The B-statistic is automatically adjusted for multiple testing by assuming that 1% of the genes, or some other percentage specified by the user in the call to eBayes(), are expected to be differentially expressed. The p-values and B-statistics will normally rank genes in the same order. In fact, if the data contains no missing values or quality weights, then the order will be precisely the same.
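The arithmetic in the B = 1.5 example can be checked in a few lines of base R. The helper name b_to_prob is made up for this sketch; plogis is the standard logistic function from the stats package and gives the same conversion.

```r
# Convert a B-statistic (log-odds) into a probability of differential expression
# b_to_prob is a hypothetical helper name used only in this sketch
b_to_prob <- function(B) {
  odds <- exp(B)
  odds / (1 + odds)
}

B    <- 1.5
odds <- exp(B)          # about 4.48, i.e. roughly four and a half to one
prob <- b_to_prob(B)    # about 0.82

print(round(odds, 2))
print(round(prob, 2))
print(b_to_prob(0))     # B = 0 corresponds to a 50-50 chance
print(all.equal(prob, plogis(B)))
```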
As with all model-based methods, the p-values depend on normality and other mathematical assumptions which are never exactly true for microarray data. It has been argued that the p-values are useful for ranking genes even in the presence of large deviations from the assumptions [29, 27]. Benjamini and Hochberg's control of the false discovery rate assumes independence between genes, although Reiner et al [20] have argued that it works for many forms of dependence as well. The B-statistic probabilities depend on the same assumptions but require in addition a prior guess for the proportion of differentially expressed genes. The p-values may be preferred to the B-statistics because they do not require this prior knowledge.
The eBayes() function computes one more useful statistic. The moderated F-statistic (F) combines the t-statistics for all the contrasts into an overall test of significance for that gene. The F-statistic tests whether any of the contrasts are non-zero for that gene, i.e., whether that gene is differentially expressed on any contrast. The denominator degrees of freedom is the same as that of the moderated-t. Its p-value is stored as fit$F.p.value. It is similar to the ordinary F-statistic from analysis of variance except that the denominator mean squares are moderated across genes.
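A minimal sketch of where the F-statistic shows up, assuming a simulated experiment with three hypothetical groups A, B and C and two contrasts. The data, group names and effect size are invented for illustration.

```r
library(limma)

set.seed(2)
# Simulated log2-expression matrix: 300 genes x 9 samples (illustrative only)
expr <- matrix(rnorm(300 * 9), nrow = 300,
               dimnames = list(paste0("gene", 1:300), paste0("s", 1:9)))
# Make the first 5 genes higher in group C
expr[1:5, 7:9] <- expr[1:5, 7:9] + 4

group <- factor(rep(c("A", "B", "C"), each = 3))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Two contrasts against group A
cm <- makeContrasts(contrasts = c("B-A", "C-A"), levels = design)
fit2 <- eBayes(contrasts.fit(lmFit(expr, design), cm))

# Moderated F-statistics and their p-values, one per gene
print(head(fit2$F))
print(head(fit2$F.p.value))

# Passing more than one coefficient ranks genes by F instead of t
tabF <- topTable(fit2, coef = 1:2, number = 5)
print(tabF)
```

The resulting table has one log-fold-change column per contrast and an F column in place of t and B.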
A frequently asked question relates to the occasional occurrence that all of the adjusted p-values are equal to 1. This is not an error situation but rather an indication that there is no evidence of differential expression in the data after adjusting for multiple testing. This can occur even though many of the raw p-values may seem highly significant when taken as individual values. This situation typically occurs when none of the raw p-values are less than 1/G, where G is the number of probes included in the fit. In that case the adjusted p-values are typically equal to 1 using any of the adjustment methods except for adjust="none".
