单细胞转录组-2

作者: nnlrl | 来源:发表于2019-07-26 11:44 被阅读37次

    1 介绍

    · 上次练习总结了一下smart seq数据前期分析过程,这次我们来看一下基于10x平台的单细胞测序分析流程

    10x测序过程

    · 10x技术通过向每个细胞添加barcode(细胞或者凝胶微珠身份证号码)和UMI(每个DNA分子标签的身份证号码)的方式对微滴中的单个细胞进行测序。

    2 分析流程

    · 对于10x数据来说上游分析一般选用10x公司自带的Cell Ranger,集成比对定量等很多功能,一步分析即可得到表达矩阵,聚类效果,网页版总结等结果以进行后续分析。后续可视化也可以使用10x公司的windows软件Cell Browser查看细胞分群以及细胞表达量等情况。

    上游结果文件

    Cell Browser可视化输入文件需要.cloupe文件,而后续分析需要使用barcodes.tsv.gz,features.tsv.gz和matrix.mtx.gz这三个矩阵信息。

    · 后续分析使用集成R包Seurat可以直接读取10x Cell Ranger结果文件并进行后续分析。

    · 牛津大学的Rahul Satija等开发的Seurat,最早公布在Nature biotechnology, 2015,文章是;Spatial reconstruction of single-cell gene expression data , 在2017年进行了非常大的改动,所以重新在biorxiv发表了文章在Integrated analysis of single cell transcriptomic data across conditions, technologies, and species 。功能涵盖了scRNA-seq的QC、过滤、标准化、批次效应、PCA、tSNE亚群聚类分析、差异基因分析、
    亚群特异性标志物鉴定等等等。

    2 代码展示

    首先使用linux环境下的Cell Ranger进行上游分析

    2.1 上游分析获得表达矩阵

    首先进行构建索引,我们需要准备相应物种的参考基因组fasta序列,以及基因组注释文件gtf/gff3文件。

    #一步构建索引
    cellranger  mkref --nthreads=4 --memgb=16 --genome=chicken  --fasta=chicken.fna  --genes=chicken.gtf
    ##--genome指定导出的索引文件夹名称,--fasta参考基因组文件,--genes基因组注释文件
    
    #比对定量
    /software/cellranger-3.0.2/cellranger count --id=chicken --fastqs=fastq/chicken/Rawdata/  --sample=CK  --transcriptome=ref/chicken/ 
    --localcores=10  --mempercore=10
    ##--fastqs指定测序文件,--sample指定sample name,--transcriptome指定索引文件路径,--localcores指定cpu,--mempercore内存
    

    CK_S1_L001_R1_001.fastq.gz,CK_S1_L001_R2_001.fastq.gz这里来看一下10x测序文件的命名方式,[sample name]S1_L00[Lane Number][Read Type]_001.fastq.gz。这里sample name指的是CK,Read Type有三种,I1代表cell-barcode,I2代表UMI,R2代表reads。

    2.2 后续分析

    [数据链接fprw]https://pan.baidu.com/s/1NoSPh1lfKsPOnIwdInWmtg
    导入相应软件包

    ###R版本3.5.1
    ###Seurat版本3.0
    setwd('')
    library(Seurat)
    library(dplyr)
    library(Matrix)
    
    # Load the PBMC dataset
    #外周血单个核细胞,是指外周血中具有单个核的细胞,包含淋巴细胞、单核细胞等
    list.files("hg19")
    
    pbmc.data <- Read10X(data.dir = "hg19")
    
    
    ##########################################################################################################
    

    1、数据质控

    # Initialize the Seurat object with the raw (non-normalized data).  Keep all
    # genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at
    # least 200 detected genes
    pbmc <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200, 
                               project = "10X_PBMC")
    pbmc
    
    An object of class Seurat 
    13714 features across 2700 samples within 1 assay 
    Active assay: RNA (13714 features)
    
    ###进行一系列的QC步骤##############################
    VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
    pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 )
    
    #可以看到这里选择的QC标准是 200~2500基因范围内,以及线粒体基因表达占比小于5%的才保留。
    
    ###########################################################################################################
    

    2、标准化-normalization

    pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
    #这里默认根据细胞测序文库大小进行normalization,简单的做一个log转换即可。
    
    
    pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", 
                          scale.factor = 10000)
    
    

    3、去除那些技术噪音,批处理效果(批次效应)

    #rDetection of variable genes across the single cells
    pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
    ##这里seurat使用vst的方法去除批次效应,还可以使用了limma包的removebatcheffect
    
    # Identify the 10 most highly variable genes
    top10 <- head(VariableFeatures(pbmc), 10)
    
    # plot variable features with and without labels
    plot1 <- VariableFeaturePlot(pbmc)
    plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
    CombinePlots(plots = list(plot1, plot2))
    
    #############################################################################################################
    
    2000 variable genes

    4、PCA分析

    Scaling the data
    all.genes <- rownames(pbmc)
    pbmc <- ScaleData(pbmc, features = all.genes)
    pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
    
    print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
    
    PC_ 1 
    Positive:  MALAT1, LTB, IL32, IL7R, CD2 
    Negative:  CST3, TYROBP, LST1, AIF1, FTL 
    PC_ 2 
    Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
    Negative:  NKG7, CST7, PRF1, GZMA, GZMB 
    PC_ 3 
    Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
    Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
    PC_ 4 
    Positive:  HLA-DQA1, HIST1H2AC, PF4, CD79B, CD79A 
    Negative:  VIM, IL7R, S100A6, S100A8, S100A4 
    PC_ 5 
    Positive:  GZMB, NKG7, FGFBP2, GNLY, CCL4 
    Negative:  LTB, IL7R, VIM, AQP3, CYTIP 
    
    #对PCA分析结果可以进行一系列的可视化: VizDimReduction, DimPlot, and DimHeatmap
    # Examine and visualize PCA results a few different ways
    VizDimLoadings(pbmc, dims = 1:6, reduction = "pca")
    
    DimPlot(pbmc, reduction = "pca")
    
    DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
    
    DimHeatmap(pbmc, dims = 1:15, cells = 100, balanced = TRUE)
    
    #确定维度
    #主成分分析结束后需要确定哪些主成分所代表的基因可以进入下游分析,这里可以使用JackStraw做重抽样分析。可以用JackStrawPlot可视化看看哪些主成分可以进行下游分析。
    pbmc <- JackStraw(pbmc, num.replicate = 100)
    pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
    
    JackStrawPlot(pbmc, dims = 1:15)
    
    #当然,也可以用最经典的碎石图来确定主成分。
    ElbowPlot(pbmc)
    
    #这个确定主成分是非常有挑战性的: - The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. - The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. - The third is a heuristic that is commonly used, and can be calculated instantly.
    #在本例子里面,3种方法结果差异不大,可以在PC7~10直接挑选。
    
    ####################################################################
    
    JackStrawPlot

    5、tSNE亚群聚类分析

    #Cluster the cells
    # save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph
    # but with a different resolution value (see docs for full details)
    pbmc <- FindNeighbors(pbmc, dims = 1:10)
    pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.6, print.output = 0, save.SNN = TRUE)
    
    #A useful feature in Seurat v2.0 is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, we provide the function PrintFindClustersParams to print a nicely formatted formatted summary of the parameters that were chosen.
    PrintFindClustersParams(object = pbmc)
    # While we do provide function-specific printing functions, the more general function to 
    # print calculation parameters is PrintCalcParams(). 
    
    # Look at cluster IDs of the first 5 cells
    head(Idents(pbmc), 5)
    
    #Run non-linear dimensional reduction (UMAP/tSNE)
    # If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
    # 'umap-learn')
    pbmc <- RunUMAP(pbmc, dims = 1:10)
    
    # note that you can set `label = TRUE` or use the LabelClusters function to help label
    # individual clusters
    DimPlot(pbmc, reduction = "umap",pt.size =1)
    
    #同样也是一个函数,这个结果也可以像PCA分析一下挑选合适的PC进行下游分析。
    pbmc <- RunTSNE(object = pbmc, dims.use = 1:8, do.fast = TRUE)
    # note that you can set do.label=T to help label individual clusters
    TSNEPlot(object = pbmc,pt.size =3)
    
    #这一步很耗时,可以保存该对象,便于重复,以及分享交流
    saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")
    
    ##########################################################################
    
    UMAP
    官网说UMAP聚类效果要好于t-SNE,这里需要注意UMAP需要安装python

    6、差异基因分析

    #Finding differentially expressed genes (cluster biomarkers)
    #差异分析在seurat包里面被封装成了函数:FindMarkers,有一系列参数可以选择,然后有4种找差异基因的算法:
    #ROC test (“roc”)
    #t-test (“t”)
    #LRT test based on zero-inflated data (“bimod”, default)
    #LRT test based on tobit-censoring models (“tobit”)
    # find all markers of cluster 1
    cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
    head(cluster1.markers, n = 5)
    
                   p_val avg_logFC pct.1 pct.2     p_val_adj
    S100A9  0.000000e+00  3.786159 0.994 0.215  0.000000e+00
    S100A8  0.000000e+00  3.782868 0.971 0.121  0.000000e+00
    LGALS2  0.000000e+00  2.561847 0.898 0.061  0.000000e+00
    FCN1    0.000000e+00  2.363926 0.953 0.148  0.000000e+00
    CD14   4.097329e-294  1.952818 0.662 0.029 5.619077e-290
    
    # find all markers distinguishing cluster 5 from clusters 0 and 3
    cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
    head(cluster5.markers, n = 5)
    
                          p_val avg_logFC pct.1 pct.2     p_val_adj
    FCGR3A        1.090652e-202  2.966785 0.988 0.040 1.495720e-198
    CFD           3.774787e-188  2.369271 0.938 0.038 5.176743e-184
    IFITM3        7.065911e-188  2.675234 0.975 0.052 9.690190e-184
    CD68          3.512271e-184  2.093088 0.920 0.035 4.816729e-180
    RP11-290F20.3 3.783316e-183  1.898087 0.852 0.017 5.188439e-179
    
    # find markers for every cluster compared to all remaining cells, report only the positive ones
    pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
    pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
    
    # A tibble: 20 x 7
    # Groups:   cluster [10]
           p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
           <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
     1 2.25e- 83     0.863 0.455 0.113 3.09e- 79 0       CCR7    
     2 6.91e- 33     0.757 0.259 0.084 9.48e- 29 0       LDLRAP1 
     3 0.            3.79  0.994 0.215 0.        1       S100A9  
     4 0.            3.78  0.971 0.121 0.        1       S100A8  
     5 4.42e- 82     0.865 0.984 0.646 6.06e- 78 2       LTB     
     6 1.19e- 57     0.872 0.428 0.114 1.63e- 53 2       AQP3    
     7 0.            2.98  0.934 0.041 0.        3       CD79A   
     8 2.92e-275     2.49  0.62  0.022 4.01e-271 3       TCL1A   
     9 1.84e-191     2.11  0.619 0.054 2.52e-187 4       GZMK    
    10 2.19e-190     2.03  0.957 0.238 3.00e-186 4       CCL5    
    11 5.03e-192     2.31  0.988 0.132 6.89e-188 5       FCGR3A  
    12 1.07e-125     2.09  1     0.314 1.46e-121 5       LST1    
    13 4.93e-269     3.38  0.986 0.071 6.77e-265 6       GZMB    
    14 1.41e-177     3.40  0.959 0.134 1.93e-173 6       GNLY    
    15 1.61e-  7     0.872 0.128 0.423 2.21e-  3 7       CAP1    
    16 1.01e-  3     1.05  0.103 0.263 1.00e+  0 7       NDUFA2  
    17 1.19e-266     2.66  0.833 0.009 1.63e-262 8       FCER1A  
    18 2.53e- 23     1.93  1     0.507 3.47e- 19 8       HLA-DPB1
    19 1.32e-181     4.94  0.933 0.011 1.82e-177 9       PF4     
    20 2.61e-117     5.87  1     0.024 3.58e-113 9       PPBP    
    ##每一个cluster中top2差异基因
    
    #值得注意的是: The ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).
    cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
    
    #######################################################
    
    #同时,该包提供了一系列可视化方法来检查差异分析的结果的可靠性:
    #VlnPlot (shows expression probability distributions across clusters)
    #FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizations
    #JoyPlot, CellPlot, and DotPlot
    
    VlnPlot(pbmc, features = c("CCR7","S100A9","LTB", "CD79A"))
    
    # you can plot raw counts as well
    VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
    
    FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
                                   "CD8A"))
    #DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
    top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
    DoHeatmap(pbmc, features = top10$gene) + NoLegend()
    
    
    new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
                         "NK", "DC", "Platelet","B")
    names(new.cluster.ids) <- levels(pbmc)
    pbmc <- RenameIdents(pbmc, new.cluster.ids)
    DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 1) + NoLegend()
    
    
    # Further subdivisions within cell types
    # First lets stash our identities for later
    pbmc[["ClusterNames_0.6"]] <- Idents(object = pbmc)
    
    # Note that if you set save.snn=T above, you don't need to recalculate the SNN, and can simply put: 
    # pbmc <- FindClusters(pbmc,resolution = 0.8)
    pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:11, resolution = 0.6, print.output = FALSE)
    
    # Demonstration of how to plot two tSNE plots side by side, and how to color points based on different criteria
    plot1 <- TSNEPlot(object = pbmc, do.return = TRUE, no.legend = TRUE, do.label = TRUE)
    plot2 <- TSNEPlot(object = pbmc, do.return = TRUE, group.by = "ClusterNames_0.6", no.legend = TRUE, do.label = TRUE)
    plot_grid(plot1, plot2)
    
    ################################################################
    

    7、亚群特异性标志物鉴

    # Find discriminating markers
    tcell.markers <- FindMarkers(object = pbmc, ident.1 = 0, ident.2 = 1)
    
    # Most of the markers tend to be expressed in C1 (i.e. S100A4). However, we can see that CCR7 is upregulated in 
    # C0, strongly indicating that we can differentiate memory from naive CD4 cells.
    # cols.use demarcates the color palette from low to high expression
    FeaturePlot(object = pbmc, features = c("S100A4", "CCR7"), cols= c("green", "blue"))
    
    
    
    ##########################################################################################################
    
    

    2.3 伪时间序列分析

    ################################monocle###################################################################
    ################################monocle#############################################################
    ##########################################################################################################
    ##伪时间分析
    
    #加载程辑包:
    source("http://bioconductor.org/biocLite.R")
    biocLite("monocle")
    library(monocle)
    
    ##1、数据导入
    pbmc <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200, 
                               project = "10X_PBMC")
    pbmc
    
    #spleen_monocle <- importCDS(seurat_lung, import_all = TRUE)
    
    data <- as(as.matrix(pbmc@assays$RNA@counts), 'sparseMatrix')
    pd <- new('AnnotatedDataFrame', data = pbmc@meta.data)
    fData <- data.frame(gene_short_name = row.names(data), row.names = row.names(data))
    fd <- new('AnnotatedDataFrame', data = fData)
    
    #Construct monocle cds
    clustered_spleen_monocle <- newCellDataSet(data,
                                  phenoData = pd,
                                  featureData = fd,
                                  lowerDetectionLimit = 0.5,
                                  expressionFamily = negbinomial.size())
    
    head(pData(clustered_spleen_monocle))
    
                   orig.ident nCount_RNA nFeature_RNA Size_Factor
    AAACATACAACCAC   10X_PBMC       2419          779          NA
    AAACATTGAGCTAC   10X_PBMC       4903         1352          NA
    AAACATTGATCAGC   10X_PBMC       3147         1129          NA
    AAACCGTGCTTCCG   10X_PBMC       2639          960          NA
    AAACCGTGTATGCG   10X_PBMC        980          521          NA
    AAACGCACTGGTAC   10X_PBMC       2163          781          NA
    
    names(pData(clustered_spleen_monocle))[names(pData(clustered_spleen_monocle))=="res.0.6"]="Cluster"
    
    clustered_spleen_monocle <- estimateSizeFactors(clustered_spleen_monocle)
    clustered_spleen_monocle <- estimateDispersions(clustered_spleen_monocle)
    
    HSMM<-clustered_spleen_monocle 
    ## 归一化 
    HSMM <- estimateSizeFactors(HSMM)
    HSMM <- estimateDispersions(HSMM)
    #Filtering low-quality cells
    HSMM <- detectGenes(HSMM, min_expr = 3 )
    print(head(fData(HSMM)))
    
                  gene_short_name num_cells_expressed
    AL627309.1         AL627309.1                   0
    AP006222.2         AP006222.2                   0
    RP11-206L10.2   RP11-206L10.2                   0
    RP11-206L10.9   RP11-206L10.9                   0
    LINC00115           LINC00115                   0
    NOC2L                   NOC2L                   1
    
    
    expressed_genes <- row.names(subset(fData(HSMM),num_cells_expressed >= 10))
                                                                          
    print(head(pData(HSMM)))
    
                   orig.ident nCount_RNA nFeature_RNA Size_Factor num_genes_expressed
    AAACATACAACCAC   10X_PBMC       2419          779   1.1204557                 126
    AAACATTGAGCTAC   10X_PBMC       4903         1352   2.2710187                 181
    AAACATTGATCAGC   10X_PBMC       3147         1129   1.4576577                 149
    AAACCGTGCTTCCG   10X_PBMC       2639          960   1.2223574                 153
    AAACCGTGTATGCG   10X_PBMC        980          521   0.4539258                  34
    AAACGCACTGGTAC   10X_PBMC       2163          781   1.0018791                 103
    
    #细胞分类(Classifying)
    CDC20 <- row.names(subset(fData(HSMM), gene_short_name == "CDC20"))
    GABPB2 <- row.names(subset(fData(HSMM),gene_short_name == "GABPB2"))
    
    cth <- newCellTypeHierarchy()
    cth <- addCellType(cth, "CDC20", classify_func =
                         function(x) { x[CDC20,] >= 1 })
    cth <- addCellType(cth, "GABPB2", classify_func = function(x){ x[GABPB2,] >= 1 })
    
    HSMM <- classifyCells(HSMM, cth, 0.1)
    table(pData(HSMM)$CellType)
    
    #Clustering cells without marker genes 
    #红线表示单片基于这种关系对色散的期望。我们标记用于聚类的基因用黑点表示,其他的用灰点表示。
    disp_table <- dispersionTable(HSMM)
    unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)
    HSMM <- setOrderingFilter(HSMM, unsup_clustering_genes$gene_id)
    plot_ordering_genes(HSMM)
    
    # Plots the percentage of variance explained by the each component based on PCA from the normalized expression data using the same procedure used in reduceDimension function.
    # HSMM@auxClusteringData[["tSNE"]]$variance_explained <- NULL
    plot_pc_variance_explained(HSMM, return_all = F) # norm_method='log'
    
    
    HSMM <- reduceDimension(HSMM, max_components = 2, num_dim = 10,reduction_method = 'tSNE', verbose = T)
    #Remove noise by PCA ...
    #Reduce dimension by tSNE ...
    HSMM <- clusterCells(HSMM, num_clusters = 2)
    #Distance cutoff calculated to 2.589424 
    plot_cell_clusters(HSMM, 1, 2, color = "CellType",markers = c("CDC20", "GABPB2"))
    
    ##降维
    HSMM <- reduceDimension(HSMM, max_components = 2,
                            method = 'DDRTree')
    
    # 将细胞按照伪时间排序
    clustered_spleen_monocle<-HSMM
    clustered_spleen_monocle <- orderCells(clustered_spleen_monocle)
    colnames(pData(clustered_spleen_monocle))
    plot_cell_trajectory(clustered_spleen_monocle, color_by = "Cluster")
    plot_cell_trajectory(clustered_spleen_monocle, color_by = "CellType")
    plot_cell_trajectory(clustered_spleen_monocle, color_by = "State")
    
    

    相关文章

      网友评论

        本文标题:单细胞转录组-2

        本文链接:https://www.haomeiwen.com/subject/ixkcrctx.html