美文网首页GEO-芯片数据
aglient芯片原始数据处理

aglient芯片原始数据处理

作者: 小洁忘了怎么分身 | 来源:发表于2020-07-19 15:47 被阅读0次

    本文讲的是aglient芯片原始数据的处理,参考资料是limma 的userguide文档。GEO数据库下载的表达矩阵不符合预期,比如是空的,或者是有负值的,那我们就处理一下它的原始数据。aglient的芯片应用也很广泛,举个OSCC的栗子:GSE23558,跟着学习学习。

    1.下载和读取数据

    1.1获取临床信息数据

    从前,提到GEO数据下载,我们只有GEOquery,神功盖世,但是死于网速。后来就有了中国人寄几的GEO镜像,AnnoProbe包。还没有正式发表,就已经初露锋芒了,因简单易学,下载迅速,在我们的粉丝圈子里很受欢迎。

    rm(list = ls())
    library(stringr)
    library(AnnoProbe)
    library(GEOquery)
    library(limma)
    gse = "GSE23558"
    geoChina(gse)
    
    ## you can also use getGEO from GEOquery, by 
    ## getGEO("GSE23558", destdir=".", AnnotGPL = F, getGPL = F)
    
    ## $GSE23558_series_matrix.txt.gz
    ## ExpressionSet (storageMode: lockedEnvironment)
    ## assayData: 41000 features, 32 samples 
    ##   element names: exprs 
    ## protocolData: none
    ## phenoData
    ##   sampleNames: GSM577914 GSM577915 ... GSM577945 (32 total)
    ##   varLabels: title geo_accession ... tissue:ch1 (42 total)
    ##   varMetadata: labelDescription
    ## featureData: none
    ## experimentData: use 'experimentData(object)'
    ##   pubMedIds: 22072328
    ## 28433800 
    ## Annotation: GPL6480
    

    提供一个GSE编号就可以下载啦。因为表达矩阵是处理过的,我们不要,所以只提取临床信息表格,从中获得分组信息。

    load("GSE23558_eSet.Rdata")
    pd <- pData(gset[[1]])
    

    调整pd的行名与文件读取的顺序一致,并定义分组信息。

    raw_dir = "rawdata/GSE23558_RAW/"
    raw_datas = paste0(raw_dir,"/",dir(raw_dir))
    
    #调整pd与rawdata的顺序一致
    raw_order = str_extract(raw_datas,"GSM\\d*")
    pd = pd[match(raw_order,rownames(pd)),]
    
    group_list <- ifelse(stringr::str_detect(pd$title,"HealthyControl"),"normal","tumor")
    group_list <- factor(group_list, levels=c("normal","tumor"))
    

    1.2 读取原始数据

    x <- read.maimages(raw_datas,
                       source="agilent", 
                       green.only=TRUE,
                       other.columns = "gIsWellAboveBG")
    
    ## Read rawdata/GSE23558_RAW//GSM577914.txt 
    ## Read rawdata/GSE23558_RAW//GSM577915.txt 
    ## Read rawdata/GSE23558_RAW//GSM577916.txt 
    ## Read rawdata/GSE23558_RAW//GSM577917.txt 
    ## Read rawdata/GSE23558_RAW//GSM577918.txt 
    ## Read rawdata/GSE23558_RAW//GSM577919.txt 
    ## Read rawdata/GSE23558_RAW//GSM577920.txt 
    ## Read rawdata/GSE23558_RAW//GSM577921.txt 
    ## Read rawdata/GSE23558_RAW//GSM577922.txt 
    ## Read rawdata/GSE23558_RAW//GSM577923.txt 
    ## Read rawdata/GSE23558_RAW//GSM577924.txt 
    ## Read rawdata/GSE23558_RAW//GSM577925.txt 
    ## Read rawdata/GSE23558_RAW//GSM577926.txt 
    ## Read rawdata/GSE23558_RAW//GSM577927.txt 
    ## Read rawdata/GSE23558_RAW//GSM577928.txt 
    ## Read rawdata/GSE23558_RAW//GSM577929.txt 
    ## Read rawdata/GSE23558_RAW//GSM577930.txt 
    ## Read rawdata/GSE23558_RAW//GSM577931.txt 
    ## Read rawdata/GSE23558_RAW//GSM577932.txt 
    ## Read rawdata/GSE23558_RAW//GSM577933.txt 
    ## Read rawdata/GSE23558_RAW//GSM577934.txt 
    ## Read rawdata/GSE23558_RAW//GSM577935.txt 
    ## Read rawdata/GSE23558_RAW//GSM577936.txt 
    ## Read rawdata/GSE23558_RAW//GSM577937.txt 
    ## Read rawdata/GSE23558_RAW//GSM577938.txt 
    ## Read rawdata/GSE23558_RAW//GSM577939.txt 
    ## Read rawdata/GSE23558_RAW//GSM577940.txt 
    ## Read rawdata/GSE23558_RAW//GSM577941.txt 
    ## Read rawdata/GSE23558_RAW//GSM577942.txt 
    ## Read rawdata/GSE23558_RAW//GSM577943.txt 
    ## Read rawdata/GSE23558_RAW//GSM577944.txt 
    ## Read rawdata/GSE23558_RAW//GSM577945.txt
    
    dim(x)
    
    ## [1] 45015    32
    

    2.背景校正和标准化

    y <- backgroundCorrect(x, method="normexp")
    
    ## Array 1 corrected
    ## Array 2 corrected
    ## Array 3 corrected
    ## Array 4 corrected
    ## Array 5 corrected
    ## Array 6 corrected
    ## Array 7 corrected
    ## Array 8 corrected
    ## Array 9 corrected
    ## Array 10 corrected
    ## Array 11 corrected
    ## Array 12 corrected
    ## Array 13 corrected
    ## Array 14 corrected
    ## Array 15 corrected
    ## Array 16 corrected
    ## Array 17 corrected
    ## Array 18 corrected
    ## Array 19 corrected
    ## Array 20 corrected
    ## Array 21 corrected
    ## Array 22 corrected
    ## Array 23 corrected
    ## Array 24 corrected
    ## Array 25 corrected
    ## Array 26 corrected
    ## Array 27 corrected
    ## Array 28 corrected
    ## Array 29 corrected
    ## Array 30 corrected
    ## Array 31 corrected
    ## Array 32 corrected
    
    y <- normalizeBetweenArrays(y, method="quantile")
    class(y)
    
    ## [1] "EList"
    ## attr(,"package")
    ## [1] "limma"
    

    3. 基因过滤

    • 去除对照探针
    • 去除匹配不到genesymbol的探针
    • 去除不表达的探针,去除的标准是:至少在一半样本中高于背景,通过y(other)gIsWellAboveBG来判断。
    • 我自己加上了一个,测到多次的基因,只保留一个探针。
    Control <- y$genes$ControlType==1L;table(Control)
    
    ## Control
    ## FALSE  TRUE 
    ## 43529  1486
    
    NoSymbol <- is.na(y$genes$GeneName);table(NoSymbol)
    
    ## NoSymbol
    ## FALSE 
    ## 45015
    
    IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 16;table(IsExpr)
    
    ## IsExpr
    ## FALSE  TRUE 
    ## 13088 31927
    
    Isdup <- duplicated(y$genes$GeneName);table(Isdup)
    
    ## Isdup
    ## FALSE  TRUE 
    ## 30328 14687
    
    yfilt <- y[!Control & !NoSymbol & IsExpr & !Isdup, ]
    dim(yfilt)
    
    ## [1] 20650    32
    

    可以看到,过滤后剩下了2万多个探针。

    4.得到表达矩阵

    exp = yfilt@.Data[[1]]
    boxplot(exp)
    
    image.png
    exp[1:2,1:2]
    
    ##      rawdata/GSE23558_RAW//GSM577914 rawdata/GSE23558_RAW//GSM577915
    ## [1,]                        9.284154                       11.473334
    ## [2,]                        7.341236                        7.474406
    

    得到的表达矩阵没问题,但行名和列名均有问题。行名应该是探针名,列名是样本名,调整一下。

    4.1获得样本名

    colnames(exp) = str_extract(colnames(exp),"GSM\\d*")
    exp[1:2,1:2]
    
    ##      GSM577914 GSM577915
    ## [1,]  9.284154 11.473334
    ## [2,]  7.341236  7.474406
    

    4.2 获得基因名

    limma文档里写的是用了注释R包,在本例的原文件是里有探针注释的,这里直接使用。

    可以直接将exp的行名转为基因名。行名不能重复,所以先去重

    anno = yfilt$genes
    nrow(anno);nrow(exp)
    
    ## [1] 20650
    
    ## [1] 20650
    
    rownames(exp)=rownames(anno)
    ids = unique(anno$GeneName)
    exp = exp[!duplicated(anno$GeneName),]
    rownames(exp) = ids
    exp[1:4,1:4]
    
    ##             GSM577914 GSM577915 GSM577916 GSM577917
    ## APOBEC3B     9.284154 11.473334 10.439071 11.661000
    ## A_32_P77178  7.341236  7.474406  7.310818  7.397149
    ## ATP11B       9.963452  8.915621 10.193873  9.321954
    ## DNAJA1      13.469790 13.201078 12.827357 13.389431
    

    至此,得到了标准的表达矩阵。后面要做什么就看你啦,这就相当于修复了一下数据库里那个被标准化过的表达矩阵。

    5.差异分析

    design <- model.matrix(~group_list)
    fit <- lmFit(exp,design)
    fit <- eBayes(fit,trend=TRUE,robust=TRUE)
    summary(decideTests(fit))
    
    ##        (Intercept) group_listtumor
    ## Down             0            2102
    ## NotSig           0           16928
    ## Up           20650            1620
    
    deg = topTable(fit,coef=2,n=dim(y)[1])
    boxplot(exp[rownames(deg)[1],]~group_list)
    
    image.png
    save(exp,group_list,deg,gse,file = paste0(gse,"deg.Rdata"))
    

    相关文章

      网友评论

        本文标题:aglient芯片原始数据处理

        本文链接:https://www.haomeiwen.com/subject/iipkeqtx.html