美文网首页 生物信息学分析
loomR介绍及使用指南

loomR介绍及使用指南

作者: JeremyL | 来源:发表于2021-01-25 20:57 被阅读0次

    随着单细胞技术的发展,数据量增加使得计算需求呈指数增长。分析单细胞数据时,使用稀100000个细胞的系数矩阵处理对于Seurat 来说就很有挑战性。HDF5 格式现在被用于储存

    生物大数据,单细胞可以储存上百万个细胞的数据。

    Linnarson实验室开发了基于HDF5的数据结构-loom loompy,用于储存单细胞数据以及数据相关的属性信息。并且,他们还发布了一个工具loompy

    satijalab实验室开发了一个基于R的loom工具-loomR

    #安装

    • loomR 需要预安装hdf5r
    System Command
    OS X (using Homebrew) brew install hdf5
    Debian-based systems (including Ubuntu) sudo apt-get install libhdf5-dev
    Systems supporting yum and RPMs sudo yum install hdf5-devel
    install.packages("hdf5r")
    或
    devtools::install_github("hhoeflin/hdf5r")
    
    • 安装loomR
    # Install devtools from CRAN
    install.packages("devtools")
    devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")
    
    # Load loomR
    library(loomR)
    

    #loom对象介绍

    loomR 内置基于R6的loom对象。R6 对象与S4 类似,R6 使用代替@ (i.e. `my.objectfield.name), R6方法也可以直接调用(i.e.my.object$method()`)。

    # 连接loom对象

    标准的R对象时将数据加载到内存中,loom对象只是建立一个与磁盘文件的连接。使用loomR::create可以创建一个loom文件,或者使用Convert将Seurat 对象转换成loom文件。

    # Connect to the loom file in read/write mode
    lfile <- connect(filename = "pbmc.loom", mode = "r+")
    lfile
    
    ## Class: loom
    ## Filename: /home/paul/Documents/Satija/pbmc.loom
    ## Access type: H5F_ACC_RDWR
    ## Attributes: version, chunks
    ## Listing:
    ##        name    obj_type dataset.dims dataset.type_class
    ##   col_attrs   H5I_GROUP         <NA>               <NA>
    ##  col_graphs   H5I_GROUP         <NA>               <NA>
    ##      layers   H5I_GROUP         <NA>               <NA>
    ##      matrix H5I_DATASET 2700 x 13714          H5T_FLOAT
    ##   row_attrs   H5I_GROUP         <NA>               <NA>
    ##  row_graphs   H5I_GROUP         <NA>               <NA>
    

    一个loom包含6各部分,一个数据集(matrix),以及5个组layers, row_attrs, col_attrs, row_graphs, and col_graphs

    • matrix: n个基因m个细胞
    • layersmatrix处理后的数据,例如标准化后的数据。
    • row_attrs:基因得到metadata
    • col_attrs: 细胞metadata
    • row_graphs col_graphs

    # 对loom数据进行操作

    • 使用[[或$符号
    # Viewing the `matrix` dataset with the double subset [[ operator You can
    # also use the $ sigil, i.e. lfile$matrix
    lfile[["matrix"]]
    ## Class: H5D
    ## Dataset: /matrix
    ## Filename: /home/paul/Documents/Satija/pbmc.loom
    ## Access type: H5F_ACC_RDWR
    ## Datatype: H5T_IEEE_F64LE
    ## Space: Type=Simple     Dims=2700 x 13714     Maxdims=Inf x Inf
    ## Chunk: 32 x 32
    
    • 查看组内的数据集
    # Viewing a dataset in the 'col_attrs' group with the double subset [[
    # operator and full UNIX-style path
    lfile[["col_attrs/cell_names"]]
    
    ## Class: H5D
    ## Dataset: /col_attrs/cell_names
    ## Filename: /home/paul/Documents/Satija/pbmc.loom
    ## Access type: H5F_ACC_RDWR
    ## Datatype: H5T_STRING {
    ##       STRSIZE H5T_VARIABLE;
    ##       STRPAD H5T_STR_NULLTERM;
    ##       CSET H5T_CSET_ASCII;
    ##       CTYPE H5T_C_S1;
    ##    }
    ## Space: Type=Simple     Dims=2700     Maxdims=Inf
    ## Chunk: 1024
    
    # Viewing a dataset in the 'row_attrs' group with S3 $ chaining
    lfile$row.attrs$gene_names
    ## Class: H5D
    ## Dataset: /row_attrs/gene_names
    ## Filename: /home/paul/Documents/Satija/pbmc.loom
    ## Access type: H5F_ACC_RDWR
    ## Datatype: H5T_STRING {
    ##       STRSIZE H5T_VARIABLE;
    ##       STRPAD H5T_STR_NULLTERM;
    ##       CSET H5T_CSET_ASCII;
    ##       CTYPE H5T_C_S1;
    ##    }
    ## Space: Type=Simple     Dims=13714     Maxdims=Inf
    ## Chunk: 1024
    

    上面返回都是数据的描述,如果想获取真实的数据,需要使用[;[,]表示返回所有数据,或着使用索引获取对应的数据,这样就可以避免将所有数据导入内存。

    # Access the upper left corner of the data matrix
    lfile[["matrix"]][1:5, 1:5]
    
    ##      [,1] [,2] [,3] [,4] [,5]
    ## [1,]    0    0    0    0    0
    ## [2,]    0    0    0    0    0
    ## [3,]    0    0    0    0    0
    ## [4,]    0    0    0    0    0
    ## [5,]    0    0    0    0    0
    
    # Access the full data matrix (here, using the $ instead of the [[ operator
    # to access the matrix)
    full.matrix <- lfile$matrix[, ]
    dim(x = full.matrix)
    
    ## [1]  2700 13714
    
    # Access all gene names
    gene.names <- lfile[["row_attrs/gene_names"]][]
    head(x = gene.names)
    
    ## [1] "AL627309.1"    "AP006222.2"    "RP11-206L10.2" "RP11-206L10.9"
    ## [5] "LINC00115"     "NOC2L"
    
    • loom对象有一个get.attribute.df()方法,可以获取各种metadata 信息整合成数据框(data frame)
    • get.attribute.df()首先需要一个方向,1(行,基因), 2(列,细胞);然后是一个metadata 数据名字组成的列表。
    # Pull three bits of metadata from the column attributes
    attrs <- c("nUMI", "nGene", "orig.ident")
    attr.df <- lfile$get.attribute.df(MARGIN = 2, attribute.names = attrs)
    head(x = attr.df)
    
    ##                nUMI nGene    orig.ident
    ## AAACATACAACCAC 2419   779 SeuratProject
    ## AAACATTGAGCTAC 4903  1352 SeuratProject
    ## AAACATTGATCAGC 3147  1129 SeuratProject
    ## AAACCGTGCTTCCG 2639   960 SeuratProject
    ## AAACCGTGTATGCG  980   521 SeuratProject
    ## AAACGCACTGGTAC 2163   781 SeuratProject
    

    #loomR的Matrices

    问了提高效率,HDF5库对底层数据矩阵的访问进行了转置。基因储存在列,细胞储存在行;但是LoomR中,row.attrs还是表示基因,col.attrs表示细胞;这点容易让人混淆。

    # Print the number of genes
    lfile[["row_attrs/gene_names"]]$dims
    
    ## [1] 13714
    
    # Is the number of genes the same as the second dimension (typically
    # columns) for the matrix?
    lfile[["row_attrs/gene_names"]]$dims == lfile[["matrix"]]$dims[2]
    
    ## [1] TRUE
    
    # For the sake of consistency within the single-cell community, we've
    # reversed the dimensions for the `shape` field.  As such, the number of
    # genes is stored in `lfile$shape[1]`; the number of cells is stored in the
    # second field
    lfile[["row_attrs/gene_names"]]$dims == lfile$shape[1]
    
    ## [1] TRUE
    
    • 获取部分基因或细胞数据:
    # Pull gene expression data for all genes, for the first 5 cells Note that
    # we're using the row position for cells
    data.subset <- lfile[["matrix"]][1:5, ]
    dim(x = data.subset)
    
    ## [1]     5 13714
    
    # You can transpose this matrix if you wish to restore the standard
    # orientation
    data.subset <- t(x = data.subset)
    dim(x = data.subset)
    
    ## [1] 13714     5
    
    # Pull gene expression data for the gene MS4A1 Note that we're using the
    # column position for genes
    data.gene <- lfile[["matrix"]][, lfile$row.attrs$gene_names[] == "MS4A1"]
    head(x = data.gene)
    
    ## [1] 0 6 0 0 0 0
    

    #添加数据到loom

    • add.layer()
    • add.row.attribute()
    • add.col.attribute()

    这些方法只需要一个命名的矩阵或者向量列表。

    # Generate random ENSEMBL IDs for demonstration purposes
    ensembl.ids <- paste0("ENSG0000", 1:length(x = lfile$row.attrs$gene_names[]))
    # Use add.row.attribute to add the IDs Note that if you want to overwrite an
    # existing value, set overwrite = TRUE
    lfile$add.row.attribute(list(ensembl.id = ensembl.ids), overwrite = TRUE)
    lfile[["row_attrs"]]
    ## Class: H5Group
    ## Filename: /home/paul/Documents/Satija/pbmc.loom
    ## Group: /row_attrs
    ## Listing:
    ##        name    obj_type dataset.dims dataset.type_class
    ##  ensembl.id H5I_DATASET        13714         H5T_STRING
    ##  gene_names H5I_DATASET        13714         H5T_STRING
    
    # Find the ENSEMBL ID for TCL1A
    lfile[["row_attrs/ensembl.id"]][lfile$row.attrs$gene_names[] == "TCL1A"]
    
    ## [1] "ENSG00009584"
    

    #Chunk-based iteration

    处理大文件,可以每次读取部分数据,进行处理(迭代的方法)。loomR 内置了map或apply方法,可以每次对读入的数据进行处理。

    • map方法:不需要一次读取文件所有内容到内存中,但是返回的结果在内存中。
    • 计算每个细胞的UMI数
    # Map rowSums to `matrix`, using 500 cells at a time, returning a vector
    nUMI_map <- lfile$map(FUN = rowSums, MARGIN = 2, chunk.size = 500, dataset.use = "matrix", 
        display.progress = FALSE)
    # How long is nUMI_map? It should be equal to the number of cells in
    # `matrix`
    length(x = nUMI_map) == lfile$matrix$dims[1]
    ## [1] TRUE
    
      • MARGIN参数: 1表示行,或者基因;2表示列,或者细胞
    • apply,与map方法的差异在于,数据处理结果储存到loom 文件中(在磁盘,不在内存);因此消耗的内存只是运算部分。指定结果储存的文件名就可以了。

    # Apply rowSums to `matrix`, using 500 cells at a time, storing in
    # `col_attrs/umi_apply`
    lfile$apply(name = "col_attrs/umi_apply", FUN = rowSums, MARGIN = 2, chunk.size = 500, 
        dataset.use = "matrix", display.progress = FALSE, overwrite = TRUE)
    lfile$col.attrs$umi_apply
    
    ## Class: H5D
    ## Dataset: /col_attrs/umi_apply
    ## Filename: /home/paul/Documents/Satija/pbmc.loom
    ## Access type: H5F_ACC_RDWR
    ## Datatype: H5T_IEEE_F64LE
    ## Space: Type=Simple     Dims=2700     Maxdims=Inf
    ## Chunk: 1024
    
    # Ensure that all values are the same as doing a non-chunked calculation
    all(lfile$col.attrs$umi_apply[] == rowSums(x = lfile$matrix[, ]))
    
    ## [1] TRUE
    
    • Selective-chunking

    上面以到的数据处理要么是部分细胞(所有基因),部分基因(所有细胞)。但是也可以在基因和细胞同时进行选择,使用index.use参数就可以了。

    #Closing loom objects

    loom对象是文件的连接,写入到文件完成之后,必需关闭文件。

    lfile$close_all()
    

    #loomR and Seurat

    loom只是支持gitHub ‘loom’ branch的Seurat

    devtools::install_github(repo = "satijalab/seurat", ref = "loom")
    library(Seurat)
    

    #Creating a loom object from a Seurat object (converting between Seurat and loom)

    • Seurat对象转换为 loom 文件
    # Load the pbmc_small dataset included in Seurat
    data("pbmc_small")
    pbmc_small
    
    ## An object of class seurat in project SeuratProject 
    ##  230 genes across 80 samples.
    
    # Convert from Seurat to loom Convert takes and object in 'from', a name of
    # a class in 'to', and, for conversions to loom, a filename
    pfile <- Convert(from = pbmc_small, to = "loom", filename = "pbmc_small.loom", 
        display.progress = FALSE)
    pfile
    
    ## Class: loom
    ## Filename: /home/paul/Documents/Satija/pbmc_small.loom
    ## Access type: H5F_ACC_RDWR
    ## Attributes: version, chunks
    ## Listing:
    ##        name    obj_type dataset.dims dataset.type_class
    ##   col_attrs   H5I_GROUP         <NA>               <NA>
    ##  col_graphs   H5I_GROUP         <NA>               <NA>
    ##      layers   H5I_GROUP         <NA>               <NA>
    ##      matrix H5I_DATASET     80 x 230          H5T_FLOAT
    ##   row_attrs   H5I_GROUP         <NA>               <NA>
    ##  row_graphs   H5I_GROUP         <NA>               <NA>
    

    #Seurat standard workflow

    # Normalize data, and find variable genes using the typical Seurat workflow
    pbmc_small <- NormalizeData(object = pbmc_small, display.progress = FALSE)
    pbmc_small <- FindVariableGenes(object = pbmc_small, display.progress = FALSE, 
        do.plot = FALSE)
    head(x = pbmc_small@hvg.info)
    
    ##         gene.mean gene.dispersion gene.dispersion.scaled
    ## PCMT1    3.942220        7.751848               2.808417
    ## PPBP     5.555949        7.652876               1.216898
    ## LYAR     4.231004        7.577377               1.528749
    ## VDAC3    4.128322        7.383980               1.296982
    ## KHDRBS1  3.562833        7.367928               2.476809
    ## IGLL5    3.758330        7.319567               2.018088
    
    # Run the same workflow using the loom object
    NormalizeData(object = pfile, overwrite = TRUE, display.progress = FALSE)
    FindVariableGenes(object = pfile, overwrite = TRUE, display.progress = FALSE)
    # Normalized data goes into the 'norm_data' layer, variable gene information
    # goes into 'row_attrs' Are the results equal?
    par(mfrow = c(1, 2))
    plot(x = t(x = pfile$layers$norm_data[, ]), y = pbmc_small@data, main = "Normalized Data", 
        xlab = "loom", ylab = "Seurat")
    plot(x = pfile$row.attrs$gene_means[], y = pbmc_small@hvg.info[pfile$row.attrs$gene_names[], 
        "gene.mean"], main = "Gene Means", xlab = "loom", ylab = "Seurat")
    
    download.png
    • close loom 对象
    pfile$close_all()
    

    #原文

    Introduction to loomR
    mojaveazure/loomR
    Guided Clustering of the Mouse Cell Atlas: loom edition

    相关文章

      网友评论

        本文标题:loomR介绍及使用指南

        本文链接:https://www.haomeiwen.com/subject/menizktx.html