最近在看差异分析当中原始read counts是如何被校正的,自然就不会放过差异分析的经典之一 —— edgeR
.
edgeR
使用的校正方法称为trimmed mean of M values (TMM),其前提假设为样本对照组和处理组间绝大多数基因表达不发生差异。
如何界定绝大多数基因这一点我个人还没有看到一个量化的指标,是50%还是80%才算绝大多数。
edgeR
的TMM校正方法其实在其分析流程中就是一句命令而已
y <- calcNormFactors(y)
但由于我十分好奇其背后计算的原理和计算过程,我便搜索一番,才发现无论是中文还是英文的帖子对于TMM的具体运算步骤及代码都没有很好的整理。因此,本文记录我通过阅读TMM提出的原文和edgeR
源码所了解到的TMM校正。
TMM校正示例
我们通过airway
数据展示实际TMM校正的过程
# TMM normalization
data("airway", package='airway')
set.seed(123)
counts1 <- as.data.frame(assay(airway)[sample(1:nrow(airway), 5000),1:4])
Step 1: 选择参考样本
TMM normalization首先需要选择一个参考样本,以它为基准进行校正。
默认下,参考样本的选择是通过比较每个样本的CPM (counts per million)的上四分位数与所有样本CPM的平均上四分位数之间的差值,找出差值最小的样本作为参考样本。
# library size of each samples
lib.size <- colSums(counts1)
# library size normalized read count for each gene in each sample
counts1.cpm <- t(t(counts1)/lib.size) * 1e6
# reference library for TMM normalization
f75 <- apply(counts1.cpm, 2, function(x) quantile(x, probs = 0.75))
# select the one with upper quartile closest to mean upper quartile
refColumn <- which.min(abs(f75 - mean(f75)))
> refColumn
SRR1039512
3
在这个例子中refColumn
是SRR1039512
在
edgeR::calcNormFactors()
中,我们也可以通过refColumn=
参数指定参考样本
Step 2: calculation of sample-reference pairwise M and A
接着,在参考样本和非参考样本间两两计算校正因子(normalization factors)。
我们首先需要计算参考样本和非参考样本间的Fold change (M)和平均表达量 (A)
注意这里使用的count都是校正过文库大小的
以下代码用到的变量名含义:
obs
: 非参考样本的原始count
ref
: 参考样本的原始count
libsize.obs
: 非参考样本的原始文库大小
libsize.ref
: 参考样本的原始文库大小
# 第一列是非参考样本
obs <- as.numeric(counts1[, 1])
ref <- as.numeric(counts1[, refColumn])
libsize.obs <- lib.size[1]
libsize.ref <- lib.size[refColumn]
分别计算M value和A value,为了对M value加权,我们还需要通过delta method估计渐近方差
# M value: log ratio of expression, accounting for library size
M <- log2((obs/libsize.obs)/(ref/libsize.ref))
# A value:absolute expression
A <- (log2(obs/libsize.obs) + log2(ref/libsize.ref))/2
# estimated asymptotic variance
v <- (libsize.obs - obs)/libsize.obs/obs + (libsize.ref - ref)/libsize.ref/ref
保留M和A值均为有限值的基因,并过滤极低表达量的基因
Acutoff = -1e10
# remove infinite values, cutoff based on A
fin <- is.finite(M) & is.finite(A) & (A > Acutoff)
M <- M[fin]
A <- A[fin]
v <- v[fin]
Step 3: trimmed mean of M values
接着,我们对M和A值进行双重截值,截掉M值排在前30%和后30%,A值排在前5%和后5%的基因,计算中间这部分基因M值的加权平均值
logratioTrim <- 0.3
sumTrim <- 0.05
# Double trim the upper and lower percentages of the data
# trim M values by 30% and A values by 5%
n <- length(M)
loM <- floor(n * logratioTrim) + 1
hiM <- n + 1 - loM
loA <- floor(n * sumTrim) + 1
hiA <- n + 1 - loA
keep <- (rank(M)>=loM & rank(M)<=hiM) & (rank(A)>=loA & rank(A)<=hiA)
# Weighted mean of M after trimming
f <- sum(M[keep]/v[keep], na.rm=TRUE) / sum(1/v[keep], na.rm=TRUE)
f <- 2^f
# Factors should multiple to one
f <- f/exp(mean(log(f)))
# Output
names(f) <- colnames(counts1)
上述步骤是计算第一个样本与参考样本的校正因子,下面我们通过一个循环计算所有样本的校正因子.
以下循环在选取refColumn
后开始
nsamples <- ncol(counts1)
logratioTrim <- 0.3
sumTrim <- 0.05
Acutoff = -1e10
f <- rep_len(NA_real_, nsamples)
for (i in 1:nsamples) {
obs <- as.numeric(counts1[, i])
ref <- as.numeric(counts1[, refColumn])
libsize.obs <- lib.size[i]
libsize.ref <- lib.size[refColumn]
# M value: log ratio of expression, accounting for library size
M <- log2((obs/libsize.obs)/(ref/libsize.ref))
# A value:absolute expression
A <- (log2(obs/libsize.obs) + log2(ref/libsize.ref))/2
# estimated asymptotic variance
v <- (libsize.obs - obs)/libsize.obs/obs + (libsize.ref - ref)/libsize.ref/ref
# remove infinite values, cutoff based on A
fin <- is.finite(M) & is.finite(A) & (A > Acutoff)
M <- M[fin]
A <- A[fin]
v <- v[fin]
# Double trim the upper and lower percentages of the data
# trim M values by 30% and A values by 5%
n <- length(M)
loM <- floor(n * logratioTrim) + 1
hiM <- n + 1 - loM
loA <- floor(n * sumTrim) + 1
hiA <- n + 1 - loA
keep <- (rank(M)>=loM & rank(M)<=hiM) & (rank(A)>=loA & rank(A)<=hiA)
# Weighted mean of M after trimming
f[i] <- sum(M[keep]/v[keep], na.rm=TRUE) / sum(1/v[keep], na.rm=TRUE)
f[i] <- 2^f[i]
}
# Factors should multiple to one
f <- f/exp(mean(log(f)))
# Output
names(f) <- colnames(counts1)
f
#SRR1039508 SRR1039509 SRR1039512 SRR1039513
# 1.0225375 1.0111616 0.9920515 0.9749133
我们与edgeR::calcNormFactors()
比较一下
library(edgeR)
y <- DGEList(counts = counts1)
y <- calcNormFactors(y)
nf <- y$samples$norm.factors
f == nf
#SRR1039508 SRR1039509 SRR1039512 SRR1039513
# TRUE TRUE TRUE TRUE
Step 4: TMM normalized counts
最后,我们获取TMM校正后的read counts。实际上,上述计算的校正因子是对文库大小的校正,edgeR
再利用校正后的文库大小对read counts进行校正。
# TMM normalized counts
counts1.tmm <- t(t(counts1) / (lib.size * f)) * 1e6
counts1.tmm <- round(counts1.tmm, 4)
head(counts1.tmm)
SRR1039508 | SRR1039509 | SRR1039512 | SRR1039513 | |
---|---|---|---|---|
ENSG00000260166 | 0.0000 | 0.0000 | 0.000 | 0.000 |
ENSG00000266931 | 0.0000 | 0.0000 | 0.000 | 0.000 |
ENSG00000104774 | 1375.7505 | 1510.2195 | 1431.252 | 1301.323 |
ENSG00000267583 | 0.0000 | 0.7924 | 0.000 | 0.000 |
ENSG00000227581 | 0.7095 | 0.0000 | 0.000 | 0.000 |
ENSG00000227317 | 0.0000 | 0.0000 | 0.000 | 0.000 |
# by edgeR
counts.tmm.edger <- round(cpm(y), 4)
identical(counts1.tmm, counts.tmm.edger)
# [1] TRUE
以上就是TMM校正的计算过程。
完。
Ref:
A scaling normalization method for differential expression analysis of RNA-seq data: https://doi.org/10.1186/gb-2010-11-3-r25
calcNormFactors.R: https://rdrr.io/bioc/edgeR/src/R/calcNormFactors.R
Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq: https://www.reneshbedre.com/blog/expression_units.html
edgeR提供的TMM归一化算法详解: https://cloud.tencent.com/developer/article/1625225
网友评论