GEO芯片数据分析(1)-GEOquery 包使用指南

作者: JeremyL | 来源:发表于2020-06-01 11:59 被阅读0次

    GEOquery 包使用指南

    # 1 GEO 介绍

    GEO(The NCBI Gene Expression Omnibus)是NCBI专门储存高通量测序的库。如基于芯片数据(mRNA、DNA、蛋白丰度),蛋白质质谱数据和高通量测序数据。
    GEO数据主要有4种基本类型。Sample, Platform 和 Series是由作者上传的数据,dataset是由GEO官方从做和提交的数据整理出来的。

    ## 1.1 Platforms
    GEO 号:GPLxxx。
    芯片的组成信息,例如 cDNAs, oligonucleotide probesets, ORFs, antibodies 。或者其它定量检测平台信息,例如SAGE tags, peptides。

    ## 1.2 Samples
    GEO 号: GSMxxx

    描述单个样本信息,处理步骤、处理条件以及实验测得的结果。一个样本可能属于多个研究(Series)。

    ## 1.3 Series
    GEO 号:GSExxx

    涉及同一个研究的记录,包括处理过的数据、总结和分析;信息可以从GSEMatrix文件解析快速得到。

    ##1.4 Datasets
    GEO 号:GDSxxx

    一套经过整理的GEO 数据集。每套数据都是可以进行生物学或者统计学上比较的样本,是GEO自带工具进行数据分析和展示的基础。一个 GDS数据集来自同一个平台,数据分析和标准化都具有一致性。

    # 2 GEOquery快速上手

    #GEOquery包安装:
    if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    BiocManager::install("GEOquery")
    
    #使用
    library("GEOquery")
    gsm <- getGEO(filename=system.file("extdata/GSM11805.txt.gz",package="GEOquery"))
    

    getGEO函数可以从GEO官网获取数据或者将固定格式数据解析为R格式的数据。

    # 3 GEOquery 数据结构

    GEOquery 数据结构大致分为两类。第一种是GDS, GPL和GSM,他们的操作和数据类型差不多;第二种是GSE,GSE数据是由GSM和GPL整合而成。

    ## 3.1 GDS, GSM 和 GPL

    这些数据类组成

    • 一个元数据头(与the SOFT 格式的头差不多)
    • 一个GEODataTable(数据)。

    可以使用show()查看这些数据类。

    • 查看 gsm metadata
    head(Meta(gsm))
    
    ## $channel_count
    ## [1] "1"
    ## 
    ## $comment
    ## [1] "Raw data provided as supplementary file"
    ## 
    ## $contact_address
    ## [1] "715 Albany Street, E613B"
    ## 
    ## $contact_city
    ## [1] "Boston"
    ## 
    ## $contact_country
    ## [1] "USA"
    ## 
    ## $contact_department
    ## [1] "Genetics and Genomics"
    
    • 查看 gsm data
    Table(gsm)[1:5,]
    ##           ID_REF  VALUE ABS_CALL
    ## 1 AFFX-BioB-5_at  953.9        P
    ## 2 AFFX-BioB-M_at 2982.8        P
    ## 3 AFFX-BioB-3_at 1657.9        P
    ## 4 AFFX-BioC-5_at 2652.7        P
    ## 5 AFFX-BioC-3_at 2019.5        P
    
    • 查看GSM数据列的含义
    Columns(gsm)
    ##     Column
    ## 1   ID_REF
    ## 2    VALUE
    ## 3 ABS_CALL
    ##                                                                  Description
    ## 1                                                                           
    ## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
    ## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
    
    head(Meta(gds))
    Table(gds)[1:5,]
    Columns(gds)
    Columns(gds)[1,]
    

    ##3.2 GSE类

    GSE类组成:

    • 一个元数据头
    • GPLList,列表,包含GPL信息
    • GSMList,列表,包含GSM信息
      • 包含GEODataTable
    gse <- getGEO("GSE781",GSEMatrix=FALSE)
    
    head(Meta(gse))
    ## $contact_address
    ## [1] "715 Albany Street, E613B"
    ## 
    ## $contact_city
    ## [1] "Boston"
    ## 
    ## $contact_country
    ## [1] "USA"
    ## 
    ## $contact_department
    ## [1] "Genetics and Genomics"
    ## 
    ## $contact_email
    ## [1] "mlenburg@bu.edu"
    ## 
    ## $contact_fax
    ## [1] "617-414-1646"
    
    # names of all the GSM objects contained in the GSE
    names(GSMList(gse))
    ##  [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
    ##  [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
    ## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
    ## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
    ## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
    ## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"
    
    # and get the first GSM object on the list
    GSMList(gse)[[1]]
    ## An object of class "GSM"
    ## channel_count 
    ## [1] "1"
    ## comment 
    ## [1] "Raw data provided as supplementary file"
    ## contact_address 
    ## [1] "715 Albany Street, E613B"
    ## contact_city 
    ## [1] "Boston"
    ## contact_country 
    ## [1] "USA"
    ## contact_department 
    ## [1] "Genetics and Genomics"
    ## contact_email 
    ## [1] "mlenburg@bu.edu"
    ## contact_fax 
    ## [1] "617-414-1646"
    ## contact_institute 
    ## [1] "Boston University School of Medicine"
    ## contact_name 
    ## [1] "Marc,E.,Lenburg"
    ## contact_phone 
    ## [1] "617-414-1375"
    ## contact_state 
    ## [1] "MA"
    ## contact_web_link 
    ## [1] "http://gg.bu.edu"
    ## contact_zip/postal_code 
    ## [1] "02130"
    ## data_row_count 
    ## [1] "22283"
    ## description 
    ## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
    ## [2] "Keywords = kidney"                                                                                                                                                                                                                                                                                                                                                                           
    ## [3] "Keywords = renal"                                                                                                                                                                                                                                                                                                                                                                            
    ## [4] "Keywords = RCC"                                                                                                                                                                                                                                                                                                                                                                              
    ## [5] "Keywords = carcinoma"                                                                                                                                                                                                                                                                                                                                                                        
    ## [6] "Keywords = cancer"                                                                                                                                                                                                                                                                                                                                                                           
    ## [7] "Lot batch = 2004638"                                                                                                                                                                                                                                                                                                                                                                         
    ## geo_accession 
    ## [1] "GSM11805"
    ## last_update_date 
    ## [1] "May 28 2005"
    ## molecule_ch1 
    ## [1] "total RNA"
    ## organism_ch1 
    ## [1] "Homo sapiens"
    ## platform_id 
    ## [1] "GPL96"
    ## series_id 
    ## [1] "GSE781"
    ## source_name_ch1 
    ## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
    ## status 
    ## [1] "Public on Nov 25 2003"
    ## submission_date 
    ## [1] "Oct 20 2003"
    ## supplementary_file 
    ## [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM11nnn/GSM11805/GSM11805.CEL.gz"
    ## title 
    ## [1] "N035 Normal Human Kidney U133A"
    ## type 
    ## [1] "RNA"
    ## An object of class "GEODataTable"
    ## ****** Column Descriptions ******
    ##     Column
    ## 1   ID_REF
    ## 2    VALUE
    ## 3 ABS_CALL
    ##                                                                  Description
    ## 1                                                                           
    ## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
    ## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
    ## ****** Data Table ******
    ##           ID_REF  VALUE ABS_CALL
    ## 1 AFFX-BioB-5_at  953.9        P
    ## 2 AFFX-BioB-M_at 2982.8        P
    ## 3 AFFX-BioB-3_at 1657.9        P
    ## 4 AFFX-BioC-5_at 2652.7        P
    ## 5 AFFX-BioC-3_at 2019.5        P
    ## 22278 more rows ...
    
    
    # and the names of the GPLs represented
    names(GPLList(gse))
    ## [1] "GPL96" "GPL97"
    

    # 4 Converting to BioConductor ExpressionSets and limma MALists

    GEO datasets与limma 数据结构MAList 和Biobase数据结构 ExpressionSet比较相似。可以相互转换:

    • GDS2MA 和 GDS2eSet

    ## 4.1 Getting GSE Series Matrix files as an ExpressionSet
    GEO Series是一套实验数据的集合,有SOFT,MINiML格式文件,以及一个 Series Matrix File(s)文本。Series Matrix File是tab-delimited text,getGEO函数可以解析,解析结果就是ExpressionSets。

    # Note that GSEMatrix=TRUE is the default
    # GSEMatrix=TRUE下载GSE2553_series_matrix.txt.gz和GPL1977.soft文件
    # GSEMatrix=F下载GSE2553.soft.gz
    gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
    show(gse2553)
    
    show(gse2553)
    ## $GSE2553_series_matrix.txt.gz
    ## ExpressionSet (storageMode: lockedEnvironment)
    ## assayData: 12600 features, 181 samples 
    ##   element names: exprs 
    ## protocolData: none
    ## phenoData
    ##   sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
    ##   varLabels: title geo_accession ... data_row_count (30 total)
    ##   varMetadata: labelDescription
    ## featureData
    ##   featureNames: 1 2 ... 12600 (12600 total)
    ##   fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
    ##   fvarMetadata: Column Description labelDescription
    ## experimentData: use 'experimentData(object)'
    ##   pubMedIds: 16230383 
    ## Annotation: GPL1977
    
    phenoData(gse2553[[1]]))
    pData(gse2553[[1]]))
    featureData(gse2553[[1]]))
    fData(gse2553[[1]]))
    experimentData(gse2553[[1]]))
    

    一个GSE下如果存在多个GPL测序,筛选特定的GPL数据;GSE会有多个列表gset[[idx]]

    show(pData(gse2553[[1]])[1:5,c(1,6,8)])
    ##                                                                  title type
    ## GSM48681                      Patient sample ST18, Dermatofibrosarcoma  RNA
    ## GSM48682                           Patient sample ST410, Ewing Sarcoma  RNA
    ## GSM48683                            Patient sample ST130, Sarcoma, NOS  RNA
    ## GSM48684 Patient sample ST293, Malignant Peripheral Nerve Sheath Tumor  RNA
    ## GSM48685                             Patient sample ST367, Liposarcoma  RNA
    ##                                  source_name_ch1
    ## GSM48681                     Dermatofibrosarcoma
    ## GSM48682                           Ewing Sarcoma
    ## GSM48683                            Sarcoma, NOS
    ## GSM48684 Malignant Peripheral Nerve Sheath Tumor
    ## GSM48685                             Liposarcoma
    

    ##4.2 Converting GDS to an ExpressionSet

    eset <- GDS2eSet(gds,do.log2=TRUE)
    
    eset
    ## ExpressionSet (storageMode: lockedEnvironment)
    ## assayData: 22645 features, 17 samples 
    ##   element names: exprs 
    ## protocolData: none
    ## phenoData
    ##   sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
    ##   varLabels: sample disease.state individual description
    ##   varMetadata: labelDescription
    ## featureData
    ##   featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
    ##   fvarLabels: ID Gene title ... GO:Component ID (21 total)
    ##   fvarMetadata: Column labelDescription
    ## experimentData: use 'experimentData(object)'
    ##   pubMedIds: 14641932 
    ## Annotation:
    
    pData(eset)[,1:3]
    ##            sample disease.state individual
    ## GSM11815 GSM11815           RCC        035
    ## GSM11832 GSM11832           RCC        023
    ## GSM12069 GSM12069           RCC        001
    ## GSM12083 GSM12083           RCC        005
    ## GSM12101 GSM12101           RCC        011
    ## GSM12106 GSM12106           RCC        032
    ## GSM12274 GSM12274           RCC          2
    ## GSM12299 GSM12299           RCC          3
    ## GSM12412 GSM12412           RCC          4
    ## GSM11810 GSM11810        normal        035
    ## GSM11827 GSM11827        normal        023
    ## GSM12078 GSM12078        normal        001
    ## GSM12099 GSM12099        normal        005
    ## GSM12269 GSM12269        normal          1
    ## GSM12287 GSM12287        normal          2
    ## GSM12301 GSM12301        normal          3
    ## GSM12448 GSM12448        normal          4
    

    ##4.3 Converting GDS to an MAList
    ExpressionSet不包含注释信息,getGEO可以帮助我们获取。

    #get the platform from the GDS metadata
    Meta(gds)$platform
    ## [1] "GPL97"
    
    #So use this information in a call to getGEO
    gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
    

    与ExpressionSet不同,the limma MAList包含基因注释信息。上面的gpl包含注释信息。

    MAList不仅包含数据,还包含样本信息,和注释信息。

    MA <- GDS2MA(gds,GPL=gpl)
    class(MA)## 
    [1] "MAList"
    ## attr(,"package")
    ## [1] "limma"
    

    4.4 Converting GSE to an ExpressionSet
    GSE转换成ExpressionSet

    gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
    head(gsmplatforms)
    
    ## $GSM11805
    ## [1] "GPL96"
    ## 
    ## $GSM11810
    ## [1] "GPL97"
    ## 
    ## $GSM11814
    ## [1] "GPL96"
    ## 
    ## $GSM11815
    ## [1] "GPL97"
    ## 
    ## $GSM11823
    ## [1] "GPL96"
    ## 
    ## $GSM11827
    ## [1] "GPL97"
    

    这个GSE包含两个GPLs,GPL96 和 GPL97。

    筛选使用GPL96 的GSM。

    gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
    length(gsmlist)
    ## [1] 17
    
    Table(gsmlist[[1]])[1:5,]
    ##           ID_REF  VALUE ABS_CALL
    ## 1 AFFX-BioB-5_at  953.9        P
    ## 2 AFFX-BioB-M_at 2982.8        P
    ## 3 AFFX-BioB-3_at 1657.9        P
    ## 4 AFFX-BioC-5_at 2652.7        P
    ## 5 AFFX-BioC-3_at 2019.5        P
    

    the single measurement for each array is called the VALUE column

    # and get the column descriptions
    Columns(gsmlist[[1]])[1:5,]
    

    获取表达矩阵:

    # get the probeset ordering
    probesets <- Table(GPLList(gse)[[1]])$ID
    colnames(Table(GPLList(gse)[[1]]))
    [1] "ID"                               "GB_ACC"                           "SPOT_ID"                         
    [4] "Species Scientific Name"          "Annotation Date"                  "Sequence Type"                   
    [7] "Sequence Source"                  "Target Description"               "Representative Public ID"        
    [10] "Gene Title"                       "Gene Symbol"                      "ENTREZ_GENE_ID"                  
    [13] "RefSeq Transcript ID"             "Gene Ontology Biological Process" "Gene Ontology Cellular Component"
    [16] "Gene Ontology Molecular Function"
    
    # make the data matrix from the VALUE columns from each GSM
    # being careful to match the order of the probesets in the platform
    # with those in the GSMs
    data.matrix <- do.call('cbind',lapply(gsmlist,function(x) 
    {tab <- Table(x)
    mymatch <- match(probesets,tab$ID_REF)
    return(tab$VALUE[mymatch])
    }))
    data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
    data.matrix <- log2(data.matrix)
    data.matrix[1:5,]
    
    ##       GSM11805  GSM11814  GSM11823  GSM11830  GSM12067  GSM12075  GSM12079
    ## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376 11.222795 11.469845
    ## [2,]  5.749534  7.908092  7.093814  7.514122  7.901470  6.407693  5.165912
    ## [3,]  7.066089  7.750205  7.244126  7.962896  7.337176  6.569856  7.477354
    ## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568 12.529870 12.240046
    ## [5,]  6.195741  6.061776  6.565293  6.583459  6.877744  6.652486  3.981853
    ##       GSM12098  GSM12100  GSM12105  GSM12268  GSM12270  GSM12283  GSM12298
    ## [1,] 10.823367 10.835971 10.810893 11.062653 10.323055 11.181028 11.566387
    ## [2,]  6.556123  8.207014  6.816344  6.563768  7.353147  5.770829  6.912889
    ## [3,]  7.708739  7.428779  7.754888  7.126188  8.742815  7.339850  7.602142
    ## [4,] 12.336534 11.762839 11.237509 12.412490 11.213408 12.678380 12.232901
    ## [5,]  5.501439  6.247928  6.017922  6.525129  6.683696  5.918863  5.837943
    ##       GSM12300  GSM12399  GSM12444
    ## [1,] 11.078151 11.535178 11.105450
    ## [2,]  4.812498  7.471675  7.488644
    ## [3,]  7.383704  7.432959  7.381110
    ## [4,] 12.090939 11.421802 12.172834
    ## [5,]  6.281698  5.419539  5.469235
    

    构造ExpressionSet

    require(Biobase)
    # go through the necessary steps to make a compliant ExpressionSet
    rownames(data.matrix) <- probesets
    colnames(data.matrix) <- names(gsmlist)
    pdata <- data.frame(samples=names(gsmlist))
    rownames(pdata) <- names(gsmlist)
    pheno <- as(pdata,"AnnotatedDataFrame")
    eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
    eset2
    ## ExpressionSet (storageMode: lockedEnvironment)
    ## assayData: 22283 features, 17 samples 
    ##   element names: exprs 
    ## protocolData: none
    ## phenoData
    ##   sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
    ##   varLabels: samples
    ##   varMetadata: labelDescription
    ## featureData: none
    ## experimentData: use 'experimentData(object)'
    ## Annotation:
    

    5 Accessing Raw Data from GEO

    • getGEOSuppFiles函数

    6 Use Cases

    ##6.1 Getting all Series Records for a Given Platform

    • 获取GPL相关的所有GSE 和 GSM accessions
    gpl97 <- getGEO('GPL97')
    Meta(gpl97)$title
    ## [1] "[HG-U133B] Affymetrix Human Genome U133B Array"
    head(Meta(gpl97)$series_id)
    ## [1] "GSE362" "GSE473" "GSE620" "GSE674" "GSE781" "GSE907"
    length(Meta(gpl97)$series_id)
    ## [1] 165
    head(Meta(gpl97)$sample_id)
    ## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930" "GSM3932"
    length(Meta(gpl97)$sample_id)
    ## [1] 7917
    
    gsmids <- Meta(gpl97)$sample_id
    gsmlist <- sapply(gsmids[1:5],getGEO)
    names(gsmlist)
    ## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930"
    

    英文版原文见:[Using the GEOquery Package

    相关文章

      网友评论

        本文标题:GEO芯片数据分析(1)-GEOquery 包使用指南

        本文链接:https://www.haomeiwen.com/subject/wtvjzhtx.html