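At present, the four mainstream data structures for single-cell data are the Bioconductor-led SingleCellExperiment, the Seurat object format used in Seurat, the AnnData format used in scanpy, and a format intended for storing large datasets.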
1. Main structural components
![](https://img.haomeiwen.com/i28812576/ca4aab199f8b69fa.png)
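One of the main strengths of the Bioconductor project is its use of a common data infrastructure to promote interoperability across packages. Users should be able to analyze their data with functions from different Bioconductor packages without converting between formats. The SingleCellExperiment package serves as the common currency for data exchange among more than 70 single-cell-related Bioconductor packages. It implements a data structure that stores all aspects of single-cell data (per-cell expression values, per-cell metadata, and per-gene annotation) and manipulates them in a synchronized manner. The object has four main components:
1) assays (blue in the figure): the primary data, such as sequencing count matrices (the raw counts as well as normalized versions), where rows correspond to features (genes) and columns to samples (cells);
2) colData (orange): annotation of the samples (cells), such as sample name, batch, group, and per-cell expression summaries;
3) rowData (green): annotation of the features (genes), such as per-gene expression summaries and the different kinds of gene names and IDs;
4) reducedDims (purple): dimensionality reduction results for each cell, mainly PCA, t-SNE, and UMAP.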
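As a rough illustration of how these four components fit together, the sketch below builds a tiny object from made-up data. The gene and cell names, the `batch` and `symbol` columns, and the random counts are all hypothetical; only the SingleCellExperiment constructor and accessors come from the package.

```r
library(SingleCellExperiment)

set.seed(1)
# Hypothetical toy data: 6 genes x 4 cells of Poisson counts
toy_counts <- matrix(rpois(24, lambda = 5), nrow = 6,
                     dimnames = list(paste0("gene", 1:6), paste0("cell", 1:4)))

toy <- SingleCellExperiment(
  assays  = list(counts = toy_counts),                      # 1) assays
  colData = DataFrame(batch = c("b1", "b1", "b2", "b2"),
                      row.names = colnames(toy_counts)),    # 2) colData
  rowData = DataFrame(symbol = toupper(rownames(toy_counts)),
                      row.names = rownames(toy_counts))     # 3) rowData
)

# 4) reducedDims: one row per cell; here a PCA from base R's prcomp()
pca <- prcomp(t(log2(toy_counts + 1)), rank. = 2)
reducedDim(toy, "PCA") <- pca$x

dim(toy)                      # genes x cells
assayNames(toy)
nrow(colData(toy))            # one row per cell
nrow(rowData(toy))            # one row per gene
dim(reducedDim(toy, "PCA"))   # cells x components
```

The point of the sketch is the shape constraints: colData and reducedDims have one row per cell, while rowData has one row per gene.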
2. Building the sce object
2.1 Primary data: assays
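```r
# Install and load the required R packages
# BiocManager::install('SingleCellExperiment')
# BiocManager::install('scater')
library(SingleCellExperiment)
library(scater)

# A single assay is all that is needed to create an sce object.
# Assumption: the first CSV column holds the gene names (row.names = 1);
# check.names = FALSE keeps cell barcodes such as "AAACCTGAGACGCTTT-1" unchanged.
Data <- read.csv("~/scp_gex_matrix_raw.csv", header = TRUE,
                 row.names = 1, check.names = FALSE)
counts <- as.matrix(Data)
sce <- SingleCellExperiment(assays = list(counts = counts))

# With the core assay in place, the object can be extended. Here scater is used
# for normalization (library size factor scaling followed by a log2 transform).
sce <- scater::logNormCounts(sce)
assays(sce)
## List of length 2
## names(2): counts logcounts
```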
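★ The main purpose of normalization is to remove differences in library size between cells or samples, so that all cells or samples become comparable.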
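To make the idea concrete, here is a minimal base-R sketch of library-size normalization followed by a log2 transform. The toy matrix is made up, and the pseudocount of 1 and the rescaling of size factors to a mean of 1 are assumptions of this sketch; they are meant to resemble the default behaviour of logNormCounts(), which should be confirmed against the package documentation.

```r
# Hypothetical toy counts: cell2 is cell1 sequenced twice as deeply
toy_counts <- matrix(c(10, 0, 5,
                       20, 0, 10,
                        1, 3, 0),
                     nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("cell1", "cell2", "cell3")))

# Library size per cell, rescaled so the size factors average to 1
lib_size <- colSums(toy_counts)
size_factor <- lib_size / mean(lib_size)

# Divide each column (cell) by its size factor, then log2-transform
norm_counts <- sweep(toy_counts, 2, size_factor, "/")
log_counts <- log2(norm_counts + 1)

# cell1 and cell2 differ only in depth, so their normalized profiles agree
all.equal(log_counts[, "cell1"], log_counts[, "cell2"])
```

Before normalization cell2 looks twice as "expressed" as cell1 for every gene; afterwards the two columns match, which is what makes cells comparable.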
2.2 Cell information: colData
```r
# Read the cell metadata; the first column (cell barcodes) becomes the row names
cell_metadata <- read.delim("~/scp_meta_updated.txt", sep = "\t",
                            header = TRUE, row.names = 1)

# Option 1: supply it when constructing the object
sce <- SingleCellExperiment(assays = list(counts = counts),
                            colData = cell_metadata)

# Option 2: attach it to an existing object afterwards
colData(sce) <- DataFrame(cell_metadata)

# Inspect
colData(sce)
## DataFrame with 126351 rows and 9 columns
## Cell_Type Cell_State Cohort biosample_id donor_id species species__ontology_label disease disease__ontology_label
## <character> <character> <character> <character> <character> <character> <character> <character> <character>
## AAACCTGAGACGCTTT-1 T TS2 Control CD45 P18F NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGAGCAATCTC-1 B BS1 Leuk-UTI CD45 P20 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGAGTTTAGGA-1 T TS2 Control CD45 P18F NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCAAAGTGCG-1 Mono MS2 Control CD45 P18F NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCAAGTACCT-1 Mono MS2 Control CD45 P17H NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## ... ... ... ... ... ... ... ... ... ...
## TTTGTCAAGAGGGCTT-35 T TS1 Bac-SEP CD45 E16 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## TTTGTCAAGTGCTGCC-35 Mono MS1 Bac-SEP CD45 E16 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## TTTGTCAAGTGTTTGC-35 Mono MS2 Bac-SEP CD45 E16 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## TTTGTCACAGCTTAAC-35 Mono MS1 Bac-SEP CD45 E16 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## TTTGTCAGTTATGCGT-35 Mono MS1 Bac-SEP CD45 E16 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
```
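The rows of the cell metadata have to correspond to the columns of the expression matrix, so it helps to check and align them before attaching anything. A small base-R sketch with made-up cell names and a hypothetical `Cell_Type` column:

```r
# Hypothetical toy inputs: metadata rows are in a different order
# than the columns of the count matrix
toy_counts <- matrix(1:6, nrow = 2,
                     dimnames = list(c("geneA", "geneB"),
                                     c("cell1", "cell2", "cell3")))
toy_meta <- data.frame(Cell_Type = c("B", "T", "Mono"),
                       row.names = c("cell3", "cell1", "cell2"))

# Every cell in the matrix needs a metadata row
missing_cells <- setdiff(colnames(toy_counts), rownames(toy_meta))
stopifnot(length(missing_cells) == 0)

# Reorder the metadata so it follows the matrix columns
toy_meta <- toy_meta[colnames(toy_counts), , drop = FALSE]
identical(rownames(toy_meta), colnames(toy_counts))
```

Once the names line up, the data frame can be passed as `colData` to the constructor or assigned with `colData(sce) <- DataFrame(toy_meta)`.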
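2.3 Gene information: rowData
Like cells/samples, genes have their own annotation. It is stored as a data-frame-like structure and can be added or accessed through rowData(sce).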
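No code is given for this step, so the following is only a sketch on a made-up object. The column names `mean_count` and `is_mito`, and the use of an `MT-` prefix to flag mitochondrial genes, are assumptions for illustration.

```r
library(SingleCellExperiment)

# Hypothetical toy object: 4 genes x 3 cells
toy_counts <- matrix(c(0, 2, 4,
                       1, 1, 1,
                       9, 0, 3,
                       0, 0, 0),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(c("MT-CO1", "GAPDH", "CD3E", "XIST"),
                                     c("cell1", "cell2", "cell3")))
toy <- SingleCellExperiment(assays = list(counts = toy_counts))

# Add per-gene annotation columns to rowData
rowData(toy)$mean_count <- rowMeans(counts(toy))
rowData(toy)$is_mito <- grepl("^MT-", rownames(toy))
rowData(toy)

# Annotation columns can then drive gene filtering
expressed <- toy[rowData(toy)$mean_count > 0, ]
rownames(expressed)
```

Because rowData travels with the object, the filtered `expressed` object keeps the matching annotation rows automatically.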
2.4 Dimensionality reduction results: reducedDims
```r
reducedDims(sce)
## List of length 0
## names(0):

# Run PCA, t-SNE, and UMAP with the scater package
sce <- runPCA(sce)
sce <- runTSNE(sce)
sce <- runUMAP(sce)
reducedDims(sce)
## List of length 3
## names(3): PCA TSNE UMAP

head(reducedDim(sce, "PCA")[, 1:2])
## PC1 PC2
## AAACCTGAGACGCTTT-1 6.086591 -7.4568784
## AAACCTGAGCAATCTC-1 4.711782 -5.2917814
## AAACCTGAGTTTAGGA-1 2.349309 -3.6703625
## AAACCTGCAAAGTGCG-1 -4.814992 -3.2960406
## AAACCTGCAAGTACCT-1 -4.832494 0.9035845
## AAACCTGCACGAAATA-1 -7.219861 5.4233799
```
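3. Subsetting and combining sce objects
In a SingleCellExperiment, operations on the rows or columns of the expression data are kept in sync with the associated annotations.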
```r
sce
## class: SingleCellExperiment
## dim: 22858 126351
## metadata(0):
## assays(2): counts logcounts
## rownames(22858): RP11-34P13.7 RP11-34P13.8 ... AC233755.1 AC240274.1
## rowData names(0):
## colnames(126351): AAACCTGAGACGCTTT-1 AAACCTGAGCAATCTC-1 ... TTTGTCACAGCTTAAC-35 TTTGTCAGTTATGCGT-35
## colData names(9): Cell_Type Cell_State ... disease disease__ontology_label
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):

# Select cells by position
first.5 <- sce[, 1:5]
colData(first.5)
## DataFrame with 5 rows and 9 columns
## Cell_Type Cell_State Cohort biosample_id donor_id species species__ontology_label disease disease__ontology_label
## <character> <character> <character> <character> <character> <character> <character> <character> <character>
## AAACCTGAGACGCTTT-1 T TS2 Control CD45 P18F NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGAGCAATCTC-1 B BS1 Leuk-UTI CD45 P20 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGAGTTTAGGA-1 T TS2 Control CD45 P18F NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCAAAGTGCG-1 Mono MS2 Control CD45 P18F NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCAAGTACCT-1 Mono MS2 Control CD45 P17H NCBITaxon_9606 Homo sapiens PATO_0000461 normal

# Select cells by annotation
Mono.only <- sce[, sce$Cell_Type == "Mono"]
head(colData(Mono.only))
## DataFrame with 6 rows and 9 columns
## Cell_Type Cell_State Cohort biosample_id donor_id species species__ontology_label disease disease__ontology_label
## <character> <character> <character> <character> <character> <character> <character> <character> <character>
## AAACCTGCAAAGTGCG-1 Mono MS2 Control CD45 P18F NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCAAGTACCT-1 Mono MS2 Control CD45 P17H NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCACGAAATA-1 Mono MS4 Leuk-UTI CD45 P18 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCAGGACCCT-1 Mono MS2 Control CD45 P17H NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCAGGGTATG-1 Mono MS4 Leuk-UTI CD45 P20 NCBITaxon_9606 Homo sapiens PATO_0000461 normal
## AAACCTGCATTGGGCC-1 Mono MS4 Leuk-UTI CD45 P20 NCBITaxon_9606 Homo sapiens PATO_0000461 normal

# cbind() combines objects by column, assuming all objects involved have the same
# row annotation values and compatible column annotation fields
# (likewise, rbind() combines objects by row)
ncol(counts(Mono.only))
## [1] 58557
sce2 <- cbind(Mono.only, Mono.only)
ncol(counts(sce2))
## [1] 117114
```
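The synchronized behaviour can also be checked on a small made-up object. In this sketch the cell names, the `Cell_Type` labels, and the random embedding stored under the name "PCA" are hypothetical.

```r
library(SingleCellExperiment)

set.seed(42)
# Hypothetical toy object: 4 genes x 5 cells
toy_counts <- matrix(rpois(20, lambda = 3), nrow = 4,
                     dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
toy <- SingleCellExperiment(
  assays  = list(counts = toy_counts),
  colData = DataFrame(Cell_Type = c("T", "Mono", "B", "Mono", "T"),
                      row.names = colnames(toy_counts))
)
# Placeholder embedding (random numbers, not a real PCA), one row per cell
reducedDim(toy, "PCA") <- matrix(rnorm(10), nrow = 5,
                                 dimnames = list(colnames(toy_counts),
                                                 c("PC1", "PC2")))

# Column subsetting filters assays, colData and reducedDims together
mono <- toy[, toy$Cell_Type == "Mono"]
colnames(mono)
rownames(colData(mono))
rownames(reducedDim(mono, "PCA"))

# Row subsetting only changes the gene dimension
two_genes <- toy[1:2, ]
dim(two_genes)
```

All three calls on `mono` report the same two cells, which is the practical benefit of keeping everything in one container instead of subsetting each table by hand.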
4. Comparison with the Seurat data structure
```r
library(Seurat)
# Convert the SingleCellExperiment object to a Seurat object
scRNA <- as.Seurat(sce)
scRNA
## An object of class Seurat
## 22858 features across 126351 samples within 1 assay
## Active assay: originalexp (22858 features, 0 variable features)
## 3 dimensional reductions calculated: PCA, TSNE, UMAP
```
Data | SingleCellExperiment (accessor) | Seurat (slot) |
---|---|---|
Raw counts | counts(sce) | scRNA@assays$originalexp@counts |
Normalized data | logcounts(sce) | scRNA@assays$originalexp@data |
Cell-level metadata | colData(sce) | scRNA@meta.data |
Dimensionality reduction results | reducedDims(sce) | scRNA@reductions |