Seurat包的打分函数AddModuleScore

作者: 单细胞空间交响乐 | 来源:发表于2020-09-28 17:30 被阅读0次

    在单细胞数据分析的过程中,Seurat包提供了一个为一个基因集打分的函数AddModuleScore(自定基因集),为基因集进行打分常见的富集分析软件GSVA,今天我们来看看Seurat这个函数的用法和意义。

    library(Seurat)
    ?AddModuleScore
    

    给出的解释说明是Calculate module scores for feature expression programs in single cells,也就是计算单个细胞在一个基因集的score值。
    描述是这样的

    Calculate the average expression levels of each program (cluster) on 
    single cell level, subtracted by the aggregated expression of control 
    feature sets. All analyzed features are binned based on averaged 
    expression, and the control features are randomly selected from each bin.
    

    看完这个解释,云里雾里,不知所云。
    再来看看用法和参数:

    Usage:
    
         AddModuleScore(
           object,
           features,
           pool = NULL,
           nbin = 24,
           ctrl = 100,
           k = FALSE,
           assay = NULL,
           name = "Cluster",
           seed = 1,
           search = FALSE,
           ...
         )
    Arguments:
    
      object: Seurat object
    
    features: Feature expression programs in list
    
        pool: List of features to check expression levels agains, defaults
              to ‘rownames(x = object)’
    
        nbin: Number of bins of aggregate expression levels for all
              analyzed features
    
        ctrl: Number of control features selected from the same bin per
              analyzed feature
    
           k: Use feature clusters returned from DoKMeans
    
       assay: Name of assay to use
    
        name: Name for the expression programs
    
        seed: Set a random seed. If NULL, seed is not set.
       search: Search for symbol synonyms for features in ‘features’ that
              don't match features in ‘object’? Searches the HGNC's gene
              names database; see ‘UpdateSymbolList’ for more details
    
         ...: Extra parameters passed to ‘UpdateSymbolList’
    

    计算结果会返回一个数值,有正有负,今天我们的任务就是来参透这个函数,既然看解释不明白,那就只能看原函数了。

    function (object, features, pool = NULL, nbin = 24, ctrl = 100, 
        k = FALSE, assay = NULL, name = "Cluster", seed = 1, search = FALSE, 
        ...) 
    {
        if (!is.null(x = seed)) {
            set.seed(seed = seed)
        }
        assay.old <- DefaultAssay(object = object)
        assay <- assay %||% assay.old
        DefaultAssay(object = object) <- assay
        assay.data <- GetAssayData(object = object)
        features.old <- features
        if (k) {
            .NotYetUsed(arg = "k")
            features <- list()
            for (i in as.numeric(x = names(x = table(object@kmeans.obj[[1]]$cluster)))) {
                features[[i]] <- names(x = which(x = object@kmeans.obj[[1]]$cluster == 
                    i))
            }
            cluster.length <- length(x = features)
        }
        else {
            if (is.null(x = features)) {
                stop("Missing input feature list")
            }
            features <- lapply(X = features, FUN = function(x) {
                missing.features <- setdiff(x = x, y = rownames(x = object))
                if (length(x = missing.features) > 0) {
                    warning("The following features are not present in the object: ", 
                      paste(missing.features, collapse = ", "), ifelse(test = search, 
                        yes = ", attempting to find updated synonyms", 
                        no = ", not searching for symbol synonyms"), 
                      call. = FALSE, immediate. = TRUE)
                    if (search) {
                      tryCatch(expr = {
                        updated.features <- UpdateSymbolList(symbols = missing.features, 
                          ...)
                        names(x = updated.features) <- missing.features
                        for (miss in names(x = updated.features)) {
                          index <- which(x == miss)
                          x[index] <- updated.features[miss]
                        }
                      }, error = function(...) {
                        warning("Could not reach HGNC's gene names database", 
                          call. = FALSE, immediate. = TRUE)
                      })
                      missing.features <- setdiff(x = x, y = rownames(x = object))
                      if (length(x = missing.features) > 0) {
                        warning("The following features are still not present in the object: ", 
                          paste(missing.features, collapse = ", "), 
                          call. = FALSE, immediate. = TRUE)
                      }
                    }
                }
                return(intersect(x = x, y = rownames(x = object)))
            })
            cluster.length <- length(x = features)
        }
        if (!all(LengthCheck(values = features))) {
            warning(paste("Could not find enough features in the object from the following feature lists:", 
                paste(names(x = which(x = !LengthCheck(values = features)))), 
                "Attempting to match case..."))
            features <- lapply(X = features.old, FUN = CaseMatch, 
                match = rownames(x = object))
        }
        if (!all(LengthCheck(values = features))) {
            stop(paste("The following feature lists do not have enough features present in the object:", 
                paste(names(x = which(x = !LengthCheck(values = features)))), 
                "exiting..."))
        }
        pool <- pool %||% rownames(x = object)
        data.avg <- Matrix::rowMeans(x = assay.data[pool, ])
        data.avg <- data.avg[order(data.avg)]
        data.cut <- cut_number(x = data.avg + rnorm(n = length(data.avg))/1e+30, 
            n = nbin, labels = FALSE, right = FALSE)
        names(x = data.cut) <- names(x = data.avg)
        ctrl.use <- vector(mode = "list", length = cluster.length)
        for (i in 1:cluster.length) {
            features.use <- features[[i]]
            for (j in 1:length(x = features.use)) {
                ctrl.use[[i]] <- c(ctrl.use[[i]], names(x = sample(x = data.cut[which(x = data.cut == 
                    data.cut[features.use[j]])], size = ctrl, replace = FALSE)))
            }
        }
        ctrl.use <- lapply(X = ctrl.use, FUN = unique)
        ctrl.scores <- matrix(data = numeric(length = 1L), nrow = length(x = ctrl.use), 
            ncol = ncol(x = object))
        for (i in 1:length(ctrl.use)) {
            features.use <- ctrl.use[[i]]
            ctrl.scores[i, ] <- Matrix::colMeans(x = assay.data[features.use, 
                ])
        }
        features.scores <- matrix(data = numeric(length = 1L), nrow = cluster.length, 
            ncol = ncol(x = object))
        for (i in 1:cluster.length) {
            features.use <- features[[i]]
            data.use <- assay.data[features.use, , drop = FALSE]
            features.scores[i, ] <- Matrix::colMeans(x = data.use)
        }
        features.scores.use <- features.scores - ctrl.scores
        rownames(x = features.scores.use) <- paste0(name, 1:cluster.length)
        features.scores.use <- as.data.frame(x = t(x = features.scores.use))
        rownames(x = features.scores.use) <- colnames(x = object)
        object[[colnames(x = features.scores.use)]] <- features.scores.use
        CheckGC()
        DefaultAssay(object = object) <- assay.old
        return(object)
    }
    

    我们来一步一步解析这个函数
    我们先用它的默认值

    library(Seurat)
    load(Rdata)
    assay.old <- DefaultAssay(object = object)  ###结果为RNA
    assay <- assay %||% assay.old   ### %||%的用法可自行查询,这里还是显示RNA
    DefaultAssay(object = object) <- assay 
    assay.data <- GetAssayData(object = object)
    features.old <- features  ## 运行到这里,得到矩阵信息assay.data和gene list。
    ###这里我们慢慢运行
    cluster.length <- length(x = features)
    missing.features <- setdiff(x = features, y = rownames(x = object))  ##setdiff函数的用法 我们这里显然是不会有不同的。
    ###接下来我们就可以直接跳到
    pool <- pool %||% rownames(x = object)
    data.avg <- Matrix::rowMeans(x = assay.data[pool, ])
    data.avg <- data.avg[order(data.avg)]    ###排序,从小到大
    data.cut <- cut_number(x = data.avg + rnorm(n = length(data.avg))/1e+30, 
    n = nbin, labels = FALSE, right = FALSE)
    ###rnorm(n, mean = 0, sd = 1) n 为产生随机值个数(长度),mean 是平均数, sd 是标准差 ,也就是说这里将数据切成了nbin份(由小到大)。
    names(x = data.cut) <- names(x = data.avg)
    ctrl.use <- vector(mode = "list", length = cluster.length)  ##空的列表
     for (i in 1:cluster.length) {
            features.use <- features[[i]]
            for (j in 1:length(x = features.use)) {
                ctrl.use[[i]] <- c(ctrl.use[[i]], names(x = sample(x = data.cut[which(x = data.cut == 
                    data.cut[features.use[j]])], size = ctrl, replace = FALSE)))
            }
        }
    ctrl.use <- lapply(X = ctrl.use, FUN = unique)
    ctrl.scores <- matrix(data = numeric(length = 1L), nrow = length(x = ctrl.use), 
            ncol = ncol(x = object))   ####空的
    for (i in 1:length(ctrl.use)) {
            features.use <- ctrl.use[[i]]
            ctrl.scores[i, ] <- Matrix::colMeans(x = assay.data[features.use, 
                ])
        }       #######细胞的平均值
    features.scores.use <- features.scores - ctrl.scores
        rownames(x = features.scores.use) <- paste0(name, 1:cluster.length)
        features.scores.use <- as.data.frame(x = t(x = features.scores.use))
        rownames(x = features.scores.use) <- colnames(x = object)
        object[[colnames(x = features.scores.use)]] <- features.scores.use
        CheckGC()
        DefaultAssay(object = object) <- assay.old
    

    走到这里,相信大家应该都明白了,也就是我们感兴趣的基因,抽出来,每一个细胞算一个这些基因表达的平均值,
    背景基因的平均值在于找每个基因的所在的bin,在该bin内随机抽取相应的ctrl个基因作为背景,最后所有的目标基因算一个平均值,所有的背景基因算一个平均值,两者相减就是该gene set 的score值。
    至于生物学意义,仁者见仁智者见智了。

    相关文章

      网友评论

        本文标题:Seurat包的打分函数AddModuleScore

        本文链接:https://www.haomeiwen.com/subject/gnnduktx.html