美文网首页R语言初学Linux 生信分析相关
一篇数据挖掘文章的图表复现-1

一篇数据挖掘文章的图表复现-1

作者: 小洁忘了怎么分身 | 来源:发表于2022-04-07 15:26 被阅读0次

    0.文章

    文章标题:Characterization of an endoplasmic reticulum stress‐related signature to evaluate immune features and predict prognosis in glioma
    期刊:J Cell Mol Med
    影响因子:5.3
    虽然是个小小的5分文章,但涉及到的分析非常丰富,图表也很多样,我把他的数据拿来做例子进行分析,开启一段更新咯。
    流程图

    构建模型部分的主要结果:


    图A是四个数据集中,用两种算法(log-rank test和单因素cox)p<0.05 选出来的基因。B和C是lasso回归的两张经典图表。


    图D是结合逐步回归法,选出的16个基因构成的多因素cox模型。可以看到C-index值是0.86。
    这篇文章的后续用了两个数据做测试集,效果都很好,那是因为一开始的时候筛选构建lasso回归模型所使用的基因就是从四个数据集、两种算法取出来的交集,这些基因都是与生存关系非常密切的,所以用他们构建出来的模型,在这四个数据集里表现都不会差的,至于这算不算投机取巧嘛,无法定论,自己衡量就好。

    这个系列将会有好几篇,今天先更新第一篇,整理数据。

    1.数据整理的说明

    1.1 下载的数据

    四个数据的表达矩阵、生存信息表格。

    TCGA的LGG和GBM是分开的。后续用代码合并到一起。

    1.2 表达矩阵的要求

    (1)就要是取过log、没有异常值的矩阵,标准化过的也可以,因为不做差异分析。

    (2)如果是转录组数据,要log之后的标准化数据。(因为这里只做生存分析,不做差异分析,所以没有必要找count数据)

    (3)行名都转换成gene symbol

    (4)只要tumor样本,不要normal

    1.3 临床信息的要求

    (1)要有每个样本对应的生存时间和结局事件,并用代码调整顺序,与表达矩阵的样本顺序相同,严格一一对应。

    (2)为了后续代码统一,把生存时间和结局事件列名统一起来,都叫“time”和“event”。

    (3)至于时间的单位,反正是分别计算的,统不统一无所谓。

    1.4 关于数据来源的说明

    数据来源于哪个数据库,不是主要矛盾。重点是分清楚是芯片数据还是转录组数据,是芯片的话要能找到探针注释。如果是转录组数据,则需要分清楚是否标准化,进行了哪种标准化。反正是分别分析,又不需要合并,所以直接使用即可。

    2.TCGA的RNA-seq数据

    Table S1显示样本数量691,TCGA的GBM数据173样本,LGG有529样本。

    2.1 TCGA表达矩阵

    exp_gbm = read.table("TCGA-GBM.htseq_fpkm.tsv.gz",
                         header = T,row.names = 1,check.names = F)
    exp_lgg = read.table("TCGA-LGG.htseq_fpkm.tsv.gz",
                         header = T,row.names = 1,check.names = F)
    exp1 = as.matrix(cbind(exp_gbm,exp_lgg))
    #行名转换为gene symbol
    library(stringr)
    library(AnnoProbe)
    rownames(exp1) = str_split(rownames(exp1),"\\.",simplify = T)[,1];head(rownames(exp1))
    
    ## [1] "ENSG00000242268" "ENSG00000270112" "ENSG00000167578" "ENSG00000273842"
    ## [5] "ENSG00000078237" "ENSG00000146083"
    
    re = annoGene(rownames(exp1),ID_type = "ENSEMBL");head(re)
    
    ##        SYMBOL                           biotypes         ENSEMBL  chr start
    ## 1     DDX11L1 transcribed_unprocessed_pseudogene ENSG00000223972 chr1 11869
    ## 2      WASH7P             unprocessed_pseudogene ENSG00000227232 chr1 14404
    ## 3   MIR6859-1                              miRNA ENSG00000278267 chr1 17369
    ## 4 MIR1302-2HG                             lncRNA ENSG00000243485 chr1 29554
    ## 6     FAM138A                             lncRNA ENSG00000237613 chr1 34554
    ## 7      OR4G4P             unprocessed_pseudogene ENSG00000268020 chr1 52473
    ##     end
    ## 1 14409
    ## 2 29570
    ## 3 17436
    ## 4 31109
    ## 6 36081
    ## 7 53312
    
    library(tinyarray)
    exp1 = trans_array(exp1,ids = re,from = "ENSEMBL",to = "SYMBOL")
    exp1[1:4,1:4]
    
    ##             TCGA-06-0878-01A TCGA-26-5135-01A TCGA-06-5859-01A TCGA-06-2563-01A
    ## DDX11L1            0.0000000        0.0221755       0.00000000       0.01797799
    ## WASH7P             0.6673943        1.7429940       0.65334641       1.57566444
    ## MIR6859-1          1.2006708        2.1880585       1.51229174       2.17610561
    ## MIR1302-2HG        0.0000000        0.0000000       0.02520857       0.00000000
    
    dim(exp1)
    ## [1] 56514   702
    #删除正常样本
    Group = make_tcga_group(exp1)
    table(Group)
    ## Group
    ## normal  tumor 
    ##      5    697
    exp1 = exp1[,Group == "tumor"]
    #过滤表达量0值太多的基因
    exp1 = exp1[apply(exp1, 1, function(x){sum(x>0)>200}),]
    

    2.2 TCGA生存信息

    surv_gbm = readr::read_tsv("TCGA-GBM.survival.tsv")
    surv_gbm$TYPE = "GBM"
    surv_lgg = readr::read_tsv("TCGA-LGG.survival.tsv")
    surv_lgg$TYPE = "LGG"
    surv1 = rbind(surv_gbm,surv_lgg)
    head(surv1)
    ## # A tibble: 6 x 5
    ##   sample              OS `_PATIENT`   OS.time TYPE 
    ##   <chr>            <dbl> <chr>          <dbl> <chr>
    ## 1 TCGA-12-0657-01A     1 TCGA-12-0657       3 GBM  
    ## 2 TCGA-32-1977-01A     0 TCGA-32-1977       3 GBM  
    ## 3 TCGA-19-1791-01A     0 TCGA-19-1791       4 GBM  
    ## 4 TCGA-28-1757-01A     0 TCGA-28-1757       4 GBM  
    ## 5 TCGA-19-2624-01A     1 TCGA-19-2624       5 GBM  
    ## 6 TCGA-41-4097-01A     1 TCGA-41-4097       6 GBM
    table(colnames(exp1) %in% surv1$sample)
    ## 
    ## FALSE  TRUE 
    ##     6   691
    s = intersect(colnames(exp1),surv1$sample)
    exp1 = exp1[,s]
    surv1 = surv1[match(s,surv1$sample),]
    colnames(surv1)[c(2,4)] = c("event","time")
    exp1[1:4,1:4]
    ##            TCGA-06-0878-01A TCGA-26-5135-01A TCGA-06-5859-01A TCGA-06-2563-01A
    ## WASH7P           0.66739425       1.74299400       0.65334641       1.57566444
    ## MIR6859-1        1.20067078       2.18805848       1.51229174       2.17610561
    ## AL627309.1       0.00000000       0.02066275       0.01386984       0.00000000
    ## CICP27           0.09691981       0.01013527       0.00000000       0.02449211
    head(surv1)
    ## # A tibble: 6 x 5
    ##   sample           event `_PATIENT`    time TYPE 
    ##   <chr>            <dbl> <chr>        <dbl> <chr>
    ## 1 TCGA-06-0878-01A     0 TCGA-06-0878   218 GBM  
    ## 2 TCGA-26-5135-01A     1 TCGA-26-5135   270 GBM  
    ## 3 TCGA-06-5859-01A     0 TCGA-06-5859   139 GBM  
    ## 4 TCGA-06-2563-01A     0 TCGA-06-2563   932 GBM  
    ## 5 TCGA-41-2571-01A     1 TCGA-41-2571    26 GBM  
    ## 6 TCGA-28-5207-01A     1 TCGA-28-5207   343 GBM
    identical(colnames(exp1),surv1$sample)
    ## [1] TRUE
    

    共691个有生存信息的tumor样本。

    3.CGGA芯片数据整理

    exp2 = read.table("CGGA/CGGA.mRNA_array_301_gene_level.20200506.txt",header = T,row.names = 1)
    surv2 = read.table("CGGA/CGGA.mRNA_array_301_clinical.20200506.txt",sep = "\t",header = T,check.names = F)
    head(surv2)
    ##    CGGA_ID TCGA_subtypes PRS_type Histology  Grade Gender Age   OS
    ## 1  CGGA_11     Classical  Primary       GBM WHO IV Female  57  155
    ## 2 CGGA_124   Mesenchymal  Primary       GBM WHO IV   Male  53  414
    ## 3 CGGA_126   Mesenchymal  Primary       GBM WHO IV Female  50 1177
    ## 4 CGGA_168   Mesenchymal  Primary       GBM WHO IV   Male  17 3086
    ## 5 CGGA_172   Mesenchymal  Primary       GBM WHO IV Female  57  462
    ## 6 CGGA_195   Mesenchymal  Primary       GBM WHO IV   Male  48  486
    ##   Censor (alive=0; dead=1) Radio_status (treated=1;un-treated=0)
    ## 1                        1                                     1
    ## 2                        1                                     1
    ## 3                        1                                     1
    ## 4                        0                                     1
    ## 5                        1                                     1
    ## 6                        1                                     1
    ##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
    ## 1                                         0            Wildtype
    ## 2                                         0            Wildtype
    ## 3                                         1            Wildtype
    ## 4                                         1            Wildtype
    ## 5                                         0            Wildtype
    ## 6                                         1            Wildtype
    ##   1p19q_Codeletion_status MGMTp_methylation_status
    ## 1                    <NA>            un-methylated
    ## 2                    <NA>            un-methylated
    ## 3                    <NA>            un-methylated
    ## 4                    <NA>            un-methylated
    ## 5                    <NA>            un-methylated
    ## 6                    <NA>            un-methylated
    s = intersect(surv2$CGGA_ID,colnames(exp2))
    exp2 = as.matrix(exp2[,s])
    surv2 = surv2[match(s,surv2$CGGA_ID),]
    colnames(surv2)[c(9,8)] = c("event","time")
    exp2[1:4,1:4]
    ##          CGGA_11  CGGA_124  CGGA_126   CGGA_168
    ## A1BG  -0.3603635 0.5649519 0.3047719  0.1749144
    ## A1CF   2.3413600 1.1707935 2.2081513  0.9652286
    ## A2BP1  0.0345194 1.0964074 0.3024883  0.0949111
    ## A2LD1 -0.9130564 0.5108800 0.4253669 -0.1949377
    head(surv2)
    ##    CGGA_ID TCGA_subtypes PRS_type Histology  Grade Gender Age time event
    ## 1  CGGA_11     Classical  Primary       GBM WHO IV Female  57  155     1
    ## 2 CGGA_124   Mesenchymal  Primary       GBM WHO IV   Male  53  414     1
    ## 3 CGGA_126   Mesenchymal  Primary       GBM WHO IV Female  50 1177     1
    ## 4 CGGA_168   Mesenchymal  Primary       GBM WHO IV   Male  17 3086     0
    ## 5 CGGA_172   Mesenchymal  Primary       GBM WHO IV Female  57  462     1
    ## 6 CGGA_195   Mesenchymal  Primary       GBM WHO IV   Male  48  486     1
    ##   Radio_status (treated=1;un-treated=0)
    ## 1                                     1
    ## 2                                     1
    ## 3                                     1
    ## 4                                     1
    ## 5                                     1
    ## 6                                     1
    ##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
    ## 1                                         0            Wildtype
    ## 2                                         0            Wildtype
    ## 3                                         1            Wildtype
    ## 4                                         1            Wildtype
    ## 5                                         0            Wildtype
    ## 6                                         1            Wildtype
    ##   1p19q_Codeletion_status MGMTp_methylation_status
    ## 1                    <NA>            un-methylated
    ## 2                    <NA>            un-methylated
    ## 3                    <NA>            un-methylated
    ## 4                    <NA>            un-methylated
    ## 5                    <NA>            un-methylated
    ## 6                    <NA>            un-methylated
    identical(surv2$CGGA_ID,colnames(exp2))
    ## [1] TRUE
    

    4.CGGA转录组数据整理

    exp3 = read.table("CGGA/CGGA.mRNAseq_325.RSEM-genes.20200506.txt",header = T,row.names = 1)
    exp3 = as.matrix(log2(exp3+1))
    surv3 = read.table("CGGA/CGGA.mRNAseq_325_clinical.20200506.txt",sep = "\t",header = T,check.names = F)
    head(surv3)
    ##     CGGA_ID  PRS_type Histology   Grade Gender Age   OS
    ## 1 CGGA_1001   Primary       GBM  WHO IV   Male  11 3817
    ## 2 CGGA_1006   Primary        AA WHO III   Male  42  254
    ## 3 CGGA_1007   Primary       GBM  WHO IV Female  57  345
    ## 4 CGGA_1011   Primary       GBM  WHO IV Female  46  109
    ## 5 CGGA_1015   Primary       GBM  WHO IV   Male  62  164
    ## 6 CGGA_1019 Recurrent      rGBM  WHO IV   Male  60  212
    ##   Censor (alive=0; dead=1) Radio_status (treated=1;un-treated=0)
    ## 1                        0                                     0
    ## 2                        1                                     1
    ## 3                        1                                     1
    ## 4                        1                                     1
    ## 5                        1                                     1
    ## 6                        1                                     0
    ##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
    ## 1                                         1            Wildtype
    ## 2                                         1            Wildtype
    ## 3                                         1            Wildtype
    ## 4                                         0            Wildtype
    ## 5                                         0            Wildtype
    ## 6                                         1            Wildtype
    ##   1p19q_codeletion_status MGMTp_methylation_status
    ## 1               Non-codel            un-methylated
    ## 2               Non-codel            un-methylated
    ## 3               Non-codel            un-methylated
    ## 4               Non-codel            un-methylated
    ## 5               Non-codel            un-methylated
    ## 6               Non-codel               methylated
    colnames(surv3)[c(8,7)] = c("event","time")
    
    s = intersect(surv3$CGGA_ID,colnames(exp3))
    exp3 = exp3[,s]
    surv3 = surv3[match(s,surv3$CGGA_ID),]
    exp3[1:4,1:4]
    ##          CGGA_1001 CGGA_1006 CGGA_1007 CGGA_1011
    ## A1BG      3.769772 3.0054000  4.958379  2.933573
    ## A1BG-AS1  1.641546 1.0908534  2.933573  2.411426
    ## A2M       8.826294 6.7487296  7.698357  9.467952
    ## A2M-AS1   2.104337 0.1763228  0.704872  1.384050
    head(surv3)
    ##     CGGA_ID  PRS_type Histology   Grade Gender Age time event
    ## 1 CGGA_1001   Primary       GBM  WHO IV   Male  11 3817     0
    ## 2 CGGA_1006   Primary        AA WHO III   Male  42  254     1
    ## 3 CGGA_1007   Primary       GBM  WHO IV Female  57  345     1
    ## 4 CGGA_1011   Primary       GBM  WHO IV Female  46  109     1
    ## 5 CGGA_1015   Primary       GBM  WHO IV   Male  62  164     1
    ## 6 CGGA_1019 Recurrent      rGBM  WHO IV   Male  60  212     1
    ##   Radio_status (treated=1;un-treated=0)
    ## 1                                     0
    ## 2                                     1
    ## 3                                     1
    ## 4                                     1
    ## 5                                     1
    ## 6                                     0
    ##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
    ## 1                                         1            Wildtype
    ## 2                                         1            Wildtype
    ## 3                                         1            Wildtype
    ## 4                                         0            Wildtype
    ## 5                                         0            Wildtype
    ## 6                                         1            Wildtype
    ##   1p19q_codeletion_status MGMTp_methylation_status
    ## 1               Non-codel            un-methylated
    ## 2               Non-codel            un-methylated
    ## 3               Non-codel            un-methylated
    ## 4               Non-codel            un-methylated
    ## 5               Non-codel            un-methylated
    ## 6               Non-codel               methylated
    identical(surv3$CGGA_ID,colnames(exp3))
    ## [1] TRUE
    

    我没仔细看这个表达矩阵具体是哪种标准化,反正这个数据只是作为例子,我们就不深究到底是啥了,直接拿来用。注意如果是自己要用来发文章的数据,是要看清楚的,没标准化需要自行标准化,cpm,tmp,fpkm都行。

    5.GSE16011的整理

    library(tinyarray)
    geo = geo_download("GSE16011")
    library(stringr)
    #只要tumor样本
    k = str_detect(geo$pd$title,"glioma");table(k)
    ## k
    ## FALSE  TRUE 
    ##     8   276
    geo$exp = geo$exp[,k]
    geo$pd = geo$pd[k,]
    # 把行名转换为gene symbol
    gpl = GEOquery::getGEO(filename = "GPL8542_family.soft.gz",destdir = ".")
    
    ids  = gpl@dataTable@table[,1:2]
    library(clusterProfiler)
    library(org.Hs.eg.db)
    e2s = bitr(ids$ORF,fromType = "ENTREZID",toType = "SYMBOL",OrgDb = "org.Hs.eg.db")
    ids = merge(ids,e2s,by.x = "ORF",by.y = "ENTREZID")
    ids = ids[,2:3]
    colnames(ids) = c("probe_id","symbol")
    exp4 = trans_array(geo$exp,ids)
    surv4 = geo$pd[,c(1,4,7,9)]
    exp4[1:4,1:4]
    ##      GSM405201 GSM405202 GSM405203 GSM405204
    ## A1BG  7.219401  6.031261  6.707681  6.330666
    ## A2M  12.206626 13.260587 12.352693 13.044375
    ## NAT1  5.182989  7.213272  6.570733  6.343490
    ## NAT2  5.239842  5.260373  5.697435  4.859682
    head(surv4)
    ##               title age at diagnosis:ch1  histology:ch1 gender:ch1
    ## GSM405201  glioma 8                44.57 OD (grade III)     Female
    ## GSM405202  glioma 9                28.69 OD (grade III)       Male
    ## GSM405203 glioma 11                38.58 OD (grade III)       Male
    ## GSM405204 glioma 13                33.89 OD (grade III)       Male
    ## GSM405205 glioma 20                48.03 OD (grade III)       Male
    ## GSM405206 glioma 21                31.53 OD (grade III)       Male
    identical(rownames(surv4),colnames(exp4))
    ## [1] TRUE
    

    临床信息表格里没有生存信息,所以从文章附件里找

    从文章附件里提取临床信息表格

    https://aacrjournals.org/cancerres/article/69/23/9065/553005/Intrinsic-Gene-Expression-Profiles-of-Gliomas-Are

    包非常难装,需要配置java,但这个需求不常用,直接跳过这个过程。

    if(!file.exists("exp_surv4.Rdata")){
      library(tabulizer)
      f <- "00085472can092307-sup-stabs_1-6.pdf"
      re <- extract_tables(f,pages = 1:10) #提取前10页的表格。
      str(re)
      re = do.call(rbind,re)
      re[1:4,1:4]
      colnames(re) = re[1,]
      re = re[-1,]
      re = data.frame(re)
      re[re==""]=NA
    
      library(readr)
      re$Survival..years. = parse_double(re$Survival..years.,locale = locale(decimal_mark = ","))
    
      re$Age.at.diagnosis = parse_double(re$Age.at.diagnosis,locale = locale(decimal_mark = ","))
    
      dim(exp4)
      k = re$Reviewed.histological.diagnosis!="control";table(k)
      re = re[k,]
      re$Database.number = paste("glioma",re$Database.number)
      surv4$ID = rownames(surv4)
    
      surv4 = merge(surv4,re,by.x = "title",by.y = "Database.number")
      colnames(surv4)[13:14] = c("event","time")
      table(surv4$event)
      surv4$time = as.integer(surv4$time*365) #天为单位
      surv4$event[surv4$event=="Lost to\rfollow up"]=NA
      table(surv4$event)
    
      surv4$event=ifelse(surv4$event=="Alive",0,1)
      head(surv4)
    
      s = intersect(surv4$ID,colnames(exp4))
      exp4 = exp4[,s]
      surv4 = surv4[match(s,surv4$ID),]
      save(exp4,surv4,file = "exp_surv4.Rdata")
    }
    load("exp_surv4.Rdata")
    exp4[1:4,1:4]
    ##      GSM405234 GSM405235 GSM405236 GSM405237
    ## A1BG  7.613150  6.995666  7.337393  6.287498
    ## A2M  13.166457 13.003225 13.814942 12.991301
    ## NAT1  5.581672  7.470894  7.845210  6.088651
    ## NAT2  5.206464  4.618245  4.881652  5.011908
    head(surv4)
    ##        title age at diagnosis:ch1  histology:ch1 gender:ch1        ID Gender
    ## 1 glioma 101                57.68 GBM (grade IV)     Female GSM405234 Female
    ## 2 glioma 104                71.09 GBM (grade IV)       Male GSM405235   Male
    ## 3 glioma 105                52.20 GBM (grade IV)       Male GSM405236   Male
    ## 4 glioma 107                51.17 GBM (grade IV)       Male GSM405237   Male
    ## 5  glioma 11                38.58 OD (grade III)       Male GSM405203   Male
    ## 6 glioma 111                37.25 GBM (grade IV)       Male GSM405238   Male
    ##   Reviewed.histological.diagnosis Age.at.diagnosis  KPS     Type.of.surgery
    ## 1                  GBM (grade IV)            57.68   70   Partial resection
    ## 2                  GBM (grade IV)            71.09   80   Partial resection
    ## 3                  GBM (grade IV)            52.20   80  Complete resection
    ## 4                  GBM (grade IV)            51.17   70 Stereotactic biopsy
    ## 5                  OD (grade III)            38.58 <NA>  Complete resection
    ## 6                  GBM (grade IV)            37.25   80 Stereotactic biopsy
    ##   Radiotherapy Chemotherapy event time
    ## 1          yes           no     1  226
    ## 2          yes           no     1   76
    ## 3          yes           no     1  375
    ## 4          yes           no     1  149
    ## 5          yes           no     1 3255
    ## 6          yes           no     1  343
    identical(surv4$ID,colnames(exp4))
    ## [1] TRUE
    

    检查和保存数据

    par(mfrow = c(2,2))
    boxplot(exp1[,1:10])
    boxplot(exp2[,1:10])
    boxplot(exp3[,1:10])
    boxplot(exp4[,1:10])
    
    image.png
    save(exp1,surv1,file = "exp_surv1.Rdata")
    save(exp2,surv2,file = "exp_surv2.Rdata")
    save(exp3,surv3,file = "exp_surv3.Rdata")
    save(exp4,surv4,file = "exp_surv4.Rdata")
    

    相关文章

      网友评论

        本文标题:一篇数据挖掘文章的图表复现-1

        本文链接:https://www.haomeiwen.com/subject/ovxssrtx.html