美文网首页
单细胞数据挖掘3-clustering

单细胞数据挖掘3-clustering

作者: 大吉岭猹 | 来源:发表于2020-05-19 22:13 被阅读0次

    1. 写在前面

    2. Normalization, variance stabilization, and regression of unwanted variation for each sample

    • 使用的是 Seurat 中的方法 sctransform

    2.1. Cell cycle scoring

    • sctransform 会默认抹平测序深度差异,但是细胞周期差异也需要抹除,因此在 sctransform 之前先检查一下细胞周期的情况
    # 先进行一个粗略的归一化(除以每个细胞的总count再取自然对数)
    seurat_phase <- NormalizeData(filtered_seurat)
    
    ## 拿到细胞周期相关基因list
    ## for HS
    # # Load cell cycle markers
    # load("data/cycle.rda")
    
    ## for MM
    # cc_file <-
    #   getURL("https://raw.githubusercontent.com/hbc/tinyatlas/master/cell_cycle/Mus_musculus.csv")
    # cell_cycle_genes <- read.csv(text = cc_file)
    # 可以采用上面的方式直接从网页拿,实际操作的时候服务器的网出了点问题,所以就在PC上直接下好了,很小的文件
    cell_cycle_genes = read.csv("data/Mus_musculus.csv")
    g2m_genes = cell_cycle_genes[cell_cycle_genes[,1]=="G2/M",2]
    s_genes = cell_cycle_genes[cell_cycle_genes[,1]=="S",2]
    
    # ID转换
    library(clusterProfiler)
    library(org.Mm.eg.db)
    keytypes(org.Mm.eg.db)
    g2m_symbol = bitr(g2m_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")
    duplicated(g2m_symbol[,2])
    s_symbol = bitr(s_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")
    
    # Score cells for cell cycle
    seurat_phase <- CellCycleScoring(seurat_phase,
                                     g2m.features = g2m_symbol[,2],
                                     s.features = s_symbol[,2])
    
    # View cell cycle scores and phases assigned to cells
    View(seurat_phase@meta.data)
    
    • 在对每个细胞进行一个细胞周期评分之后,需要判断细胞周期是否是基因表达量差异的主要因素,因此需要进行 PCA
    • 在 PCA 之前,需先选取 most variable features,然后对数据进行 标准化,使得任一基因在所有细胞中的表达量均值为 0,方差为 1
    • 然后对三种细胞周期的细胞分别进行 PCA
    # Identify the most variable genes
    seurat_phase <- FindVariableFeatures(seurat_phase,
                                         selection.method = "vst",
                                         nfeatures = 2000,
                                         verbose = FALSE)
    
    # Scale the counts
    seurat_phase <- ScaleData(seurat_phase)
    
    # Perform PCA
    seurat_phase <- RunPCA(seurat_phase)
    
    # Plot the PCA colored by cell cycle phase
    DimPlot(seurat_phase,
            reduction = "pca",
            group.by= "Phase",
            split.by = "Phase")
    
    • 可以看到 G1 期和非 G1 期有差异,说明细胞周期基因是高变成分,后面的 SCTransform 中需要把细胞周期差异去掉

    2.2. SCTransform

    • 首先将 Seurat 对象按 sample 分开,分别对每个 sample 的细胞进行 NormalizeData, CellCycleScoring, SCTransform
    options(future.globals.maxSize = 4000 * 1024^2)
    # Split seurat object by condition to perform cell cycle scoring and SCT on all samples
    split_seurat <- SplitObject(filtered_seurat, split.by = "sample")
    
    split_seurat <- split_seurat[c("sigaf1", "sigag1", "sigah1")]
    
    for (i in 1:length(split_seurat)) {
      split_seurat[[i]] <- NormalizeData(split_seurat[[i]], verbose = TRUE)
      split_seurat[[i]] <- CellCycleScoring(split_seurat[[i]],
                                            g2m.features = g2m_symbol[,2],
                                            s.features = s_symbol[,2])
      split_seurat[[i]] <- SCTransform(split_seurat[[i]],
                                       vars.to.regress = c("mitoRatio","S.Score","G2M.Score"))
    }
    
    • 该流程中注意:若细胞数较大(如远大于15000),则可以将 SCTransform() 函数中的 variable.features.n 参数设置得大一些(默认3000)
    • 可以看到差别还是很大的,需要整合。

    3. Integration of the samples using shared highly variable genes (optional, but recommended)

    # Select the most variable features to use for integration
    integ_features <- SelectIntegrationFeatures(object.list = split_seurat,
                                                nfeatures = 3000)
    # Prepare the SCT list object for integration
    split_seurat <- PrepSCTIntegration(object.list = split_seurat,
                                       anchor.features = integ_features)
    # Find best buddies - can take a while to run
    integ_anchors <- FindIntegrationAnchors(object.list = split_seurat,
                                            normalization.method = "SCT",
                                            anchor.features = integ_features)
    seurat_integrated <- IntegrateData(anchorset = integ_anchors,
                                       normalization.method = "SCT")
    # Save integrated seurat object
    save(seurat_integrated, file = "data/seurat_integrated.Rdata")
    
    # 可视化看一下效果
    # Run PCA
    seurat_integrated <- RunPCA(object = seurat_integrated)
    
    # Plot PCA
    PCAPlot(seurat_integrated,
            split.by = "sample")
    
    # Run UMAP
    seurat_integrated <- RunUMAP(seurat_integrated,
                                 dims = 1:40,
                                 reduction = "pca")
    
    # Plot UMAP
    DimPlot(seurat_integrated)
    DimPlot(seurat_integrated,
            split.by = "sample")
    

    4. Clustering cells based on top PCs (metagenes)

    4.1. 选择高变 PC

    • 常规两种方法
    • 一是以最高变的 PC 和细胞作为行和列画热图,看从哪个 PC 开始界线变得模糊,这个方法的缺点是如果要看大量 PC 的话很不方便
    # Explore heatmap of PCs
    DimHeatmap(seurat_integrated,
               dims = 1:9,
               cells = 500,
               balanced = TRUE)
    
    • 二是画 elbow plot,看拐点出现的位置,这个方法的缺点是没有明确的拐点标准,非常主观化
    # Plot the elbow plot
    ElbowPlot(object = seurat_integrated,
              ndims = 40)
    
    • 但是其实挑选 PC 这一步只对于一些老的归一化方法来说是必须的,由于我们使用 SCTransform 进行 normalization,已经排除了技术原因(非生物学原因)产生的 PC,理论上我们选择越多的 PC,分群聚类时就会有越多有意义的差异被考虑在内,只是计算时间会更长,所以我们选择前 40 个 PC

    4.2. 分群聚类

    # Determine the K-nearest neighbor graph
    seurat_integrated <- FindNeighbors(object = seurat_integrated,
                                       dims = 1:40)
    
    # Determine the clusters for various resolutions
    seurat_integrated <- FindClusters(object = seurat_integrated,
                                      resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))
    
    # Explore resolutions
    seurat_integrated@meta.data %>%
      View()
    
    • 关于 resolustion 的选择,对于细胞数在 3000-5000 之间的数据集,一般可以采用 0.4-1.4 之间的分辨率
    • 然后进行可视化看聚类效果,两种常用的降维方法:t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) techniques
    # Assign identity of clusters
    Idents(object = seurat_integrated) <- "integrated_snn_res.0.8"
    
    # Plot the UMAP
    DimPlot(seurat_integrated,
            reduction = "umap",
            label = TRUE,
            label.size = 6)
    
    • res = 0.8
    • res = 0.4

    5. Exploration of quality control metrics

    determine whether clusters are unbalanced wrt UMIs, genes, cell cycle, mitochondrial content, samples, etc.

    5.1. 按样本分离聚类

    # Extract identity and sample information from seurat object to determine the number of cells per cluster per sample
    n_cells <- FetchData(seurat_integrated,
                         vars = c("ident", "orig.ident")) %>%
      dplyr::count(ident, orig.ident) %>%
      tidyr::spread(ident, n)
    
    # View table
    View(n_cells)
    
    # UMAP of cells in each cluster by sample
    DimPlot(seurat_integrated,
            label = TRUE,
            split.by = "sample")  + NoLegend()
    

    5.2. 按细胞周期分离聚类

    # Explore whether clusters segregate by cell cycle phase
    DimPlot(seurat_integrated,
            label = TRUE,
            split.by = "Phase")  + NoLegend()
    
    • 结果不是非常理想

    5.3. 其他想剔除的因素

    # Determine metrics to plot present in seurat_integrated@meta.data
    metrics <-  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
    
    FeaturePlot(seurat_integrated,
                reduction = "umap",
                features = metrics,
                pt.size = 0.4,
                sort.cell = TRUE,
                min.cutoff = 'q10',
                label = TRUE)
    
    • 可以看到分布不那么均匀

    5.4. 尝试不做 SCTransform 进行对比

    • filtered_seurat 开始
    ## NO SCT, NO intergration
    # 先进行一个粗略的归一化(除以每个细胞的总count再取自然对数)
    seurat_phase <- NormalizeData(filtered_seurat)
    
    cell_cycle_genes = read.csv("data/Mus_musculus.csv")
    g2m_genes = cell_cycle_genes[cell_cycle_genes[,1]=="G2/M",2]
    s_genes = cell_cycle_genes[cell_cycle_genes[,1]=="S",2]
    
    # ID转换
    library(clusterProfiler)
    library(org.Mm.eg.db)
    keytypes(org.Mm.eg.db)
    g2m_symbol = bitr(g2m_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")
    duplicated(g2m_symbol[,2])
    s_symbol = bitr(s_genes, fromType="ENSEMBL", toType="SYMBOL", OrgDb="org.Mm.eg.db")
    
    # Score cells for cell cycle
    seurat_phase <- CellCycleScoring(seurat_phase,
                                     g2m.features = g2m_symbol[,2],
                                     s.features = s_symbol[,2])
    
    # Identify the most variable genes
    seurat_phase <- FindVariableFeatures(seurat_phase,
                                         selection.method = "vst",
                                         nfeatures = 2000,
                                         verbose = FALSE)
    
    # Scale the counts
    seurat_phase <- ScaleData(seurat_phase)
    
    # Perform PCA
    seurat_phase <- RunPCA(seurat_phase)
    
    # Run UMAP
    seurat_phase <- RunUMAP(seurat_phase,
                                 dims = 1:40,
                                 reduction = "pca")
    # UMAP of cells in each cluster by sample
    DimPlot(seurat_phase,
            label = TRUE,
            split.by = "sample")  + NoLegend()
    
    DimPlot(seurat_phase,
            label = TRUE,
            split.by = "Phase")  + NoLegend()
    
    # Determine metrics to plot present in seurat_integrated@meta.data
    metrics <-  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
    
    FeaturePlot(seurat_phase,
                reduction = "umap",
                features = metrics,
                pt.size = 0.4,
                sort.cell = TRUE,
                min.cutoff = 'q10',
                label = TRUE)
    
    • 可以看到 SCTransform 是有效果的,但效果并不好

    5.5. 查看各 PC 在各亚群中的表达情况

    # Defining the information in the seurat object of interest
    columns <- c(paste0("PC_", 1:16),
                 "ident",
                 "UMAP_1", "UMAP_2")
    
    # Extracting this data from the seurat object
    pc_data <- FetchData(seurat_integrated,
                         vars = columns)
    
    # Adding cluster label to center of cluster on UMAP
    umap_label <- FetchData(seurat_integrated,
                            vars = c("ident", "UMAP_1", "UMAP_2"))  %>%
      group_by(ident) %>%
      summarise(x=mean(UMAP_1), y=mean(UMAP_2))
    
    # Plotting a UMAP plot for each of the PCs
    map(paste0("PC_", 1:16), function(pc){
      ggplot(pc_data,
             aes(UMAP_1, UMAP_2)) +
        geom_point(aes_string(color=pc),
                   alpha = 0.7) +
        scale_color_gradient(guide = FALSE,
                             low = "grey90",
                             high = "blue")  +
        geom_text(data=umap_label,
                  aes(label=ident, x, y)) +
        ggtitle(pc)
    }) %>%
      plot_grid(plotlist = .)
    
    
    • PC1 在 4 群
    • PC2 在 5/8 群
    • PC3 在 6 群
    • PC4 在除了 5/7 外的群
    • PC5 在 6/11/13/15/20 群
    • PC6 在除了 12/14/15 外的群
    • PC9 在 4/8 群
    • PC15 在 14 群

    6. Searching for expected cell types using known cell type-specific gene markers

    • 如果一种细胞类型有多个 marker,则需要找到表达所有 marker 的亚群
    • 如果一个亚群包含多种已知细胞类型,则需要提高分辨率或提高 PC 数来将不同细胞类型区分开

    友情宣传

    相关文章

      网友评论

          本文标题:单细胞数据挖掘3-clustering

          本文链接:https://www.haomeiwen.com/subject/pgniohtx.html