Seurat包的打分函数AddModuleScore

作者: 单细胞空间交响乐 | 来源:发表于2020-09-28 17:30 被阅读0次

Seurat包的打分函数AddModuleScore
Seurat的打分函数AddMouduleScore
使用ggplot2优化Seurat绘图
如何改造inferCNV的热图？
单细胞转录组3大R包系列学习笔记
2021-06-07 R查看版本
使用Shiny搭建基于Seurat包的单细胞数据可视化平台
R包hdf5r安装报错-load filed
单细胞组学系列学习笔记
环境配置笔记12-解决SeuratData包InstallDat

在单细胞数据分析的过程中，Seurat包提供了一个为一个基因集打分的函数AddModuleScore（自定基因集），为基因集进行打分常见的富集分析软件GSVA，今天我们来看看Seurat这个函数的用法和意义。

library(Seurat)
?AddModuleScore

给出的解释说明是Calculate module scores for feature expression programs in single cells，也就是计算单个细胞在一个基因集的score值。
描述是这样的

Calculate the average expression levels of each program (cluster) on 
single cell level, subtracted by the aggregated expression of control 
feature sets. All analyzed features are binned based on averaged 
expression, and the control features are randomly selected from each bin.

看完这个解释，云里雾里，不知所云。
再来看看用法和参数：

Usage:

     AddModuleScore(
       object,
       features,
       pool = NULL,
       nbin = 24,
       ctrl = 100,
       k = FALSE,
       assay = NULL,
       name = "Cluster",
       seed = 1,
       search = FALSE,
       ...
     )
Arguments:

  object: Seurat object

features: Feature expression programs in list

    pool: List of features to check expression levels agains, defaults
          to ‘rownames(x = object)’

    nbin: Number of bins of aggregate expression levels for all
          analyzed features

    ctrl: Number of control features selected from the same bin per
          analyzed feature

       k: Use feature clusters returned from DoKMeans

   assay: Name of assay to use

    name: Name for the expression programs

    seed: Set a random seed. If NULL, seed is not set.
   search: Search for symbol synonyms for features in ‘features’ that
          don't match features in ‘object’? Searches the HGNC's gene
          names database; see ‘UpdateSymbolList’ for more details

     ...: Extra parameters passed to ‘UpdateSymbolList’

计算结果会返回一个数值，有正有负，今天我们的任务就是来参透这个函数，既然看解释不明白，那就只能看原函数了。

function (object, features, pool = NULL, nbin = 24, ctrl = 100, 
    k = FALSE, assay = NULL, name = "Cluster", seed = 1, search = FALSE, 
    ...) 
{
    if (!is.null(x = seed)) {
        set.seed(seed = seed)
    }
    assay.old <- DefaultAssay(object = object)
    assay <- assay %||% assay.old
    DefaultAssay(object = object) <- assay
    assay.data <- GetAssayData(object = object)
    features.old <- features
    if (k) {
        .NotYetUsed(arg = "k")
        features <- list()
        for (i in as.numeric(x = names(x = table(object@kmeans.obj[[1]]$cluster)))) {
            features[[i]] <- names(x = which(x = object@kmeans.obj[[1]]$cluster == 
                i))
        }
        cluster.length <- length(x = features)
    }
    else {
        if (is.null(x = features)) {
            stop("Missing input feature list")
        }
        features <- lapply(X = features, FUN = function(x) {
            missing.features <- setdiff(x = x, y = rownames(x = object))
            if (length(x = missing.features) > 0) {
                warning("The following features are not present in the object: ", 
                  paste(missing.features, collapse = ", "), ifelse(test = search, 
                    yes = ", attempting to find updated synonyms", 
                    no = ", not searching for symbol synonyms"), 
                  call. = FALSE, immediate. = TRUE)
                if (search) {
                  tryCatch(expr = {
                    updated.features <- UpdateSymbolList(symbols = missing.features, 
                      ...)
                    names(x = updated.features) <- missing.features
                    for (miss in names(x = updated.features)) {
                      index <- which(x == miss)
                      x[index] <- updated.features[miss]
                    }
                  }, error = function(...) {
                    warning("Could not reach HGNC's gene names database", 
                      call. = FALSE, immediate. = TRUE)
                  })
                  missing.features <- setdiff(x = x, y = rownames(x = object))
                  if (length(x = missing.features) > 0) {
                    warning("The following features are still not present in the object: ", 
                      paste(missing.features, collapse = ", "), 
                      call. = FALSE, immediate. = TRUE)
                  }
                }
            }
            return(intersect(x = x, y = rownames(x = object)))
        })
        cluster.length <- length(x = features)
    }
    if (!all(LengthCheck(values = features))) {
        warning(paste("Could not find enough features in the object from the following feature lists:", 
            paste(names(x = which(x = !LengthCheck(values = features)))), 
            "Attempting to match case..."))
        features <- lapply(X = features.old, FUN = CaseMatch, 
            match = rownames(x = object))
    }
    if (!all(LengthCheck(values = features))) {
        stop(paste("The following feature lists do not have enough features present in the object:", 
            paste(names(x = which(x = !LengthCheck(values = features)))), 
            "exiting..."))
    }
    pool <- pool %||% rownames(x = object)
    data.avg <- Matrix::rowMeans(x = assay.data[pool, ])
    data.avg <- data.avg[order(data.avg)]
    data.cut <- cut_number(x = data.avg + rnorm(n = length(data.avg))/1e+30, 
        n = nbin, labels = FALSE, right = FALSE)
    names(x = data.cut) <- names(x = data.avg)
    ctrl.use <- vector(mode = "list", length = cluster.length)
    for (i in 1:cluster.length) {
        features.use <- features[[i]]
        for (j in 1:length(x = features.use)) {
            ctrl.use[[i]] <- c(ctrl.use[[i]], names(x = sample(x = data.cut[which(x = data.cut == 
                data.cut[features.use[j]])], size = ctrl, replace = FALSE)))
        }
    }
    ctrl.use <- lapply(X = ctrl.use, FUN = unique)
    ctrl.scores <- matrix(data = numeric(length = 1L), nrow = length(x = ctrl.use), 
        ncol = ncol(x = object))
    for (i in 1:length(ctrl.use)) {
        features.use <- ctrl.use[[i]]
        ctrl.scores[i, ] <- Matrix::colMeans(x = assay.data[features.use, 
            ])
    }
    features.scores <- matrix(data = numeric(length = 1L), nrow = cluster.length, 
        ncol = ncol(x = object))
    for (i in 1:cluster.length) {
        features.use <- features[[i]]
        data.use <- assay.data[features.use, , drop = FALSE]
        features.scores[i, ] <- Matrix::colMeans(x = data.use)
    }
    features.scores.use <- features.scores - ctrl.scores
    rownames(x = features.scores.use) <- paste0(name, 1:cluster.length)
    features.scores.use <- as.data.frame(x = t(x = features.scores.use))
    rownames(x = features.scores.use) <- colnames(x = object)
    object[[colnames(x = features.scores.use)]] <- features.scores.use
    CheckGC()
    DefaultAssay(object = object) <- assay.old
    return(object)
}

我们来一步一步解析这个函数
我们先用它的默认值

library(Seurat)
load(Rdata)
assay.old <- DefaultAssay(object = object)  ###结果为RNA
assay <- assay %||% assay.old   ### %||%的用法可自行查询，这里还是显示RNA
DefaultAssay(object = object) <- assay 
assay.data <- GetAssayData(object = object)
features.old <- features  ## 运行到这里，得到矩阵信息assay.data和gene list。
###这里我们慢慢运行
cluster.length <- length(x = features)
missing.features <- setdiff(x = features, y = rownames(x = object))  ##setdiff函数的用法 我们这里显然是不会有不同的。
###接下来我们就可以直接跳到
pool <- pool %||% rownames(x = object)
data.avg <- Matrix::rowMeans(x = assay.data[pool, ])
data.avg <- data.avg[order(data.avg)]    ###排序，从小到大
data.cut <- cut_number(x = data.avg + rnorm(n = length(data.avg))/1e+30, 
n = nbin, labels = FALSE, right = FALSE)
###rnorm(n, mean = 0, sd = 1) n 为产生随机值个数(长度),mean 是平均数, sd 是标准差 ，也就是说这里将数据切成了nbin份（由小到大）。
names(x = data.cut) <- names(x = data.avg)
ctrl.use <- vector(mode = "list", length = cluster.length)  ##空的列表
 for (i in 1:cluster.length) {
        features.use <- features[[i]]
        for (j in 1:length(x = features.use)) {
            ctrl.use[[i]] <- c(ctrl.use[[i]], names(x = sample(x = data.cut[which(x = data.cut == 
                data.cut[features.use[j]])], size = ctrl, replace = FALSE)))
        }
    }
ctrl.use <- lapply(X = ctrl.use, FUN = unique)
ctrl.scores <- matrix(data = numeric(length = 1L), nrow = length(x = ctrl.use), 
        ncol = ncol(x = object))   ####空的
for (i in 1:length(ctrl.use)) {
        features.use <- ctrl.use[[i]]
        ctrl.scores[i, ] <- Matrix::colMeans(x = assay.data[features.use, 
            ])
    }       #######细胞的平均值
features.scores.use <- features.scores - ctrl.scores
    rownames(x = features.scores.use) <- paste0(name, 1:cluster.length)
    features.scores.use <- as.data.frame(x = t(x = features.scores.use))
    rownames(x = features.scores.use) <- colnames(x = object)
    object[[colnames(x = features.scores.use)]] <- features.scores.use
    CheckGC()
    DefaultAssay(object = object) <- assay.old

走到这里，相信大家应该都明白了，也就是我们感兴趣的基因，抽出来，每一个细胞算一个这些基因表达的平均值，
背景基因的平均值在于找每个基因的所在的bin，在该bin内随机抽取相应的ctrl个基因作为背景，最后所有的目标基因算一个平均值，所有的背景基因算一个平均值，两者相减就是该gene set 的score值。
至于生物学意义，仁者见仁智者见智了。

Seurat包的打分函数AddModuleScore
在单细胞数据分析的过程中，Seurat包提供了一个为一个基因集打分的函数AddModuleScore（自定基因集）...
Seurat的打分函数AddMouduleScore
在读张泽民老师发表的新冠文献的时候看到计算免疫细胞的cytokine score或inflammatory sco...
使用ggplot2优化Seurat绘图
Seurat是分析单细胞数据一个非常好用的包，自带非常优秀的绘图函数，见Seurat绘图函数总结[https://...
如何改造inferCNV的热图？
刘小泽写于2020.1.3关于自己改造函数，之前介绍过如何改造Seurat包的DoHeatmap函数？这次问题又...
单细胞转录组3大R包系列学习笔记
Seurat包学习笔记 Seurat包学习笔记（一）：Guided Clustering TutorialSeur...
2021-06-07 R查看版本
查看R安装包的版本packageVersion("Seurat") 加载包并查看library(Seurat)se...
使用Shiny搭建基于Seurat包的单细胞数据可视化平台
image ▼Seurat包系列学习笔记▼：Seurat包学习笔记（一）：Guided Clustering Tu...
R包hdf5r安装报错-load filed
问题在linux上安装R包Seurat[https://satijalab.org/seurat/]的依赖包 h...
单细胞组学系列学习笔记
scRNA-seq数据分析 Seurat包学习笔记 Seurat包学习笔记（一）：Guided Clusterin...
环境配置笔记12-解决SeuratData包InstallDat
近期Seurat官网国内上不去，因此在服务器上也无法使用SeuratData包的InstallData()函数，有...