Computing the median of each treatment group from an expression matrix

Author: 城管大队哈队长 | Published: 2020-01-20 20:44

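In the previous post I used an expression matrix to compute the mean of each treatment group, but I later realized that my code could be simplified further.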

# Same test data as before
set.seed(19960203)
library(magrittr) # loaded only for the %>% pipe operator

expr_df <- data.frame(Control_R1 = rnorm(5,mean = 10),
                      Control_R2 = rnorm(5,mean = 10),
                      Control_R3 = rnorm(5,mean = 10),
                      Treat_R1 = rnorm(5,mean = 20),
                      Treat_R2 = rnorm(5,mean = 20),
                      Other_R1 = rnorm(5,mean = 30),
                      Other_R2 = rnorm(5,mean = 30),
                      Other_R3 = rnorm(5,mean = 30),
                      Other_R4 = rnorm(5,mean = 30))
rownames(expr_df) <- paste0("Gene",1:5)
expr_df

> expr_df
      Control_R1 Control_R2 Control_R3 Treat_R1 Treat_R2 Other_R1 Other_R2 Other_R3 Other_R4
Gene1  10.764324   9.428356  10.044180 20.54518 19.12997 29.55600 29.70370 31.64527 29.70927
Gene2   9.342808   9.345432  10.048242 21.24643 19.99557 29.66273 31.06946 28.56245 30.76298
Gene3   9.642546   8.777341  10.029212 18.77438 17.73295 29.96400 30.16638 29.81993 31.63248
Gene4  10.743541   8.764679   9.297456 20.41032 20.62758 29.48314 30.68438 31.92357 28.57588
Gene5  10.199939   9.814605   8.767410 20.91134 17.49593 28.59100 31.29823 30.26129 29.95323

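Previously I used rowSums to get the total for each treatment. Later it occurred to me that if rowSums exists, rowMeans probably does too, and it does indeed exist in base. So we can simply use rowMeans to get the means directly.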

# Previous approach: sum each group with rowSums, then divide by the number of replicates
tissue <- colnames(expr_df) %>% gsub("_R[0-9]","",.)
tissue_type <- unique(tissue)
sapply(tissue_type, function(x){
  rowSums(expr_df[,x == tissue])
}) -> expr_df_merge

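# tissue_freq (the number of replicates per group) was defined in the previous post and is not shown here.
# Note: the Treat and Other columns below differ from the rowMeans result further down;
# the numbers are consistent with the Treat sums being divided by 4 and the Other sums by 2.
> t(t(expr_df_merge) / tissue_freq)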
        Control     Treat    Other
Gene1 10.078953  9.918787 60.30712
Gene2  9.578828 10.310502 60.02881
Gene3  9.483033  9.126831 60.79140
Gene4  9.601892 10.259474 60.33349
Gene5  9.593985  9.601819 60.05188

# Means with rowMeans
> sapply(tissue_type, function(x){
+   rowMeans(expr_df[,x == tissue])
+ })
        Control    Treat    Other
Gene1 10.078953 19.83757 30.15356
Gene2  9.578828 20.62100 30.01440
Gene3  9.483033 18.25366 30.39570
Gene4  9.601892 20.51895 30.16674
Gene5  9.593985 19.20364 30.02594

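In the comments below, someone asked how to compute the median. rowMeans and rowSums suggest the general idea: as long as we have a function that can operate on every row of a data frame, we can compute many kinds of statistics. That made me think of apply.

The real key is expr_df[,x == tissue], which selects the columns belonging to one treatment. Because of it, the code works whether or not the groups have the same number of replicates, and whether or not the replicates of a group sit in adjacent columns.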
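To make the point about column selection concrete, here is a minimal sketch on a made-up data frame (the name toy and its values are hypothetical, not from the post). The replicate columns are deliberately interleaved and the groups have different sizes, yet the logical mask still picks the right columns.

```r
# Hypothetical toy data: replicate columns interleaved, unequal group sizes
toy <- data.frame(Treat_R1   = c(20, 21),
                  Control_R1 = c(10, 11),
                  Treat_R2   = c(22, 23),
                  Control_R2 = c(12, 13),
                  Control_R3 = c(14, 15))
rownames(toy) <- c("GeneA", "GeneB")

# Strip the replicate suffix to get one group label per column
group <- sub("_R[0-9]+$", "", colnames(toy))

# The comparison yields a logical vector with TRUE at the Control columns,
# wherever they happen to sit in the data frame
group == "Control"
toy[, group == "Control"]

# Same pattern as in the post; drop = FALSE keeps a data frame even if a
# group has only one replicate
group_means <- sapply(unique(group), function(g) {
  rowMeans(toy[, group == g, drop = FALSE])
})
group_means
```

The drop = FALSE argument is an addition to the pattern used in the post: without it, a group with a single replicate would be reduced to a plain vector, and rowMeans (or apply) would fail on it.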

# Means with apply
> sapply(tissue_type, function(x){
+   apply(expr_df[,x == tissue],
+         MARGIN = 1,
+         FUN = mean)
+ })
        Control    Treat    Other
Gene1 10.078953 19.83757 30.15356
Gene2  9.578828 20.62100 30.01440
Gene3  9.483033 18.25366 30.39570
Gene4  9.601892 20.51895 30.16674
Gene5  9.593985 19.20364 30.02594

# This part:
apply(expr_df[,x == tissue],
      MARGIN = 1,
      FUN = mean)
# is equivalent to
rowMeans(expr_df[,x == tissue])
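The equivalence can be checked directly. The sketch below uses a small made-up data frame (the name m and its contents are hypothetical) and compares the two results with all.equal.

```r
# Hypothetical data: 4 genes x 3 replicates
set.seed(1)
m <- as.data.frame(matrix(rnorm(12, mean = 10), nrow = 4))
rownames(m) <- paste0("Gene", 1:4)

via_apply    <- apply(m, MARGIN = 1, FUN = mean)
via_rowMeans <- rowMeans(m)

# Both are named numeric vectors with one value per gene
same <- all.equal(via_apply, via_rowMeans)
same
```

For plain means, rowMeans is usually the faster of the two; apply is worth reaching for when the statistic has no dedicated row-wise function.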

Accordingly, to compute a different statistic we only need to change the FUN argument of apply.

# Medians
> sapply(tissue_type, function(x){
+   apply(expr_df[,x == tissue],
+         MARGIN = 1,
+         FUN = median)
+ })
        Control    Treat    Other
Gene1 10.044180 19.83757 29.70649
Gene2  9.345432 20.62100 30.21286
Gene3  9.642546 18.25366 30.06519
Gene4  9.297456 20.51895 30.08376
Gene5  9.814605 19.20364 30.10726

# Quick check
> median(as.numeric(expr_df[1,1:3]))
[1] 10.04418
> median(as.numeric(expr_df[1,4:5]))
[1] 19.83757
> median(as.numeric(expr_df[1,6:9]))
[1] 29.70649

# Quantiles
# probs = 0.5 is still just the median
# Arguments given after FUN are passed on to FUN, so probs actually goes to quantile
> sapply(tissue_type, function(x){
+   apply(expr_df[,x == tissue],
+         MARGIN = 1,
+         FUN = quantile,
+         probs = c(0.5))
+ })
        Control    Treat    Other
Gene1 10.044180 19.83757 29.70649
Gene2  9.345432 20.62100 30.21286
Gene3  9.642546 18.25366 30.06519
Gene4  9.297456 20.51895 30.08376
Gene5  9.814605 19.20364 30.10726
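Since only FUN and its extra arguments change between these calls, the pattern can be wrapped in a small function. The helper below, group_stat, is a hypothetical name introduced for illustration and is not part of the original post. It is a sketch that assumes the group labels are given as a character vector with one entry per column.

```r
# Hypothetical helper: apply any row-wise summary function to each group
# of columns. Extra arguments in ... are forwarded to FUN through apply.
group_stat <- function(expr, groups, FUN, ...) {
  sapply(unique(groups), function(g) {
    apply(expr[, groups == g, drop = FALSE], MARGIN = 1, FUN = FUN, ...)
  })
}

# Hypothetical toy data with one missing value
toy <- data.frame(A_R1 = c(1, 10), A_R2 = c(2, 20), A_R3 = c(6, NA),
                  B_R1 = c(5, 50), B_R2 = c(7, 70))
rownames(toy) <- c("GeneA", "GeneB")
groups <- sub("_R[0-9]+$", "", colnames(toy))

# Plain median: GeneB in group A is NA because of the missing value
med <- group_stat(toy, groups, median)
med

# na.rm = TRUE travels through ... to median
med_narm <- group_stat(toy, groups, median, na.rm = TRUE)
med_narm
```

With the data from the post, a call such as group_stat(expr_df, tissue, median) would follow the same pattern as the sapply/apply code above.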

To be honest, I do not understand the apply family very deeply, including why sapply returns the result it does here. Take this call, for example:
> sapply(tissue_type, function(x){
+   apply(expr_df[,x == tissue],
+         MARGIN = 1,
+         FUN = quantile,
+         probs = c(0.1,0.5))
+ })
        Control    Treat    Other
 [1,]  9.551521 19.27149 29.60031
 [2,] 10.044180 19.83757 29.70649
 [3,]  9.343333 20.12066 28.89253
 [4,]  9.345432 20.62100 30.21286
 [5,]  8.950382 17.83709 29.86315
 [6,]  9.642546 18.25366 30.06519
 [7,]  8.871235 20.43204 28.84806
 [8,]  9.297456 20.51895 30.08376
 [9,]  8.976849 17.83747 28.99967
[10,]  9.814605 19.20364 30.10726
I cannot fully explain this result either. Once I understand R better, I will probably come back to these questions.
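One possible reading of the ten-row output, offered as a sketch on made-up data rather than a definitive account: when FUN returns two values per row, apply returns a matrix with one row per probability and one column per gene, and sapply then flattens each of those matrices column by column into a single column of its result. The rows therefore alternate between the 10% and 50% quantiles of Gene1, Gene2, and so on, which matches the fact that every second row above equals the medians computed earlier. The names toy, groups, and arr below are hypothetical.

```r
# Hypothetical toy data: 2 genes x 3 replicates, easy to check by hand
toy <- matrix(c(1, 2, 3,
                4, 5, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("GeneA", "GeneB"), c("R1", "R2", "R3")))

# quantile() returns two values per row, so apply() gives a 2 x 2 matrix:
# one row per probability, one column per gene
q <- apply(toy, MARGIN = 1, FUN = quantile, probs = c(0.1, 0.5))
q

# Flattening a matrix goes column by column:
# GeneA 10%, GeneA 50%, GeneB 10%, GeneB 50%
flat <- as.vector(q)
flat

# Asking sapply for an array keeps the dimensions and labels
groups <- list(G1 = toy, G2 = toy * 10)
arr <- sapply(groups, function(m) {
  apply(m, MARGIN = 1, FUN = quantile, probs = c(0.1, 0.5))
}, simplify = "array")
dim(arr)        # quantile x gene x group
arr["50%", , ]  # genes x groups matrix of medians
```

With simplify = "array", each quantile can be pulled out as its own genes-by-groups matrix instead of being read off alternating rows.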