0.文章
文章标题:Characterization of an endoplasmic reticulum stress‐related signature to evaluate immune features and predict prognosis in glioma
期刊:J Cell Mol Med
影响因子:5.3
虽然是个小小的5分文章,但涉及到的分析非常丰富,图表也很多样,我把他的数据拿来做例子进行分析,开启一段更新咯。
流程图
构建模型部分的主要结果:
图A是四个数据集中,用两种算法(log-rank test和单因素cox)p<0.05 选出来的基因。B和C是lasso回归的两张经典图表。
图D是结合逐步回归法,选出的16个基因构成的多因素cox模型。可以看到C-index值是0.86。
这篇文章的后续用了两个数据做测试集,效果都很好,那是因为一开始的时候筛选构建lasso回归模型所使用的基因就是从四个数据集、两种算法取出来的交集,这些基因都是与生存关系非常密切的,所以用他们构建出来的模型,在这四个数据集里表现都不会差的,至于这算不算投机取巧嘛,无法定论,自己衡量就好。
这个系列将会有好几篇,今天先更新第一篇,整理数据。
1.数据整理的说明
1.1 下载的数据
四个数据的表达矩阵、生存信息表格。
TCGA的LGG和GBM是分开的。后续用代码合并到一起。
1.2 表达矩阵的要求
(1)就要是取过log、没有异常值的矩阵,标准化过的也可以,因为不做差异分析。
(2)如果是转录组数据,要log之后的标准化数据。(因为这里只做生存分析,不做差异分析,所以没有必要找count数据)
(3)行名都转换成gene symbol
(4)只要tumor样本,不要normal
1.3 临床信息的要求
(1)要有每个样本对应的生存时间和结局事件,并用代码调整顺序,与表达矩阵的样本顺序相同,严格一一对应。
(2)为了后续代码统一,把生存时间和结局事件列名统一起来,都叫“time”和“event”。
(3)至于时间的单位,反正是分别计算的,统不统一无所谓。
1.4 关于数据来源的说明
数据来源于哪个数据库,不是主要矛盾。重点是分清楚是芯片数据还是转录组数据,是芯片的话要能找到探针注释。如果是转录组数据,则需要分清楚是否标准化,进行了哪种标准化。反正是分别分析,又不需要合并,所以直接使用即可。
2.TCGA的RNA-seq数据
Table S1显示样本数量691,TCGA的GBM数据173样本,LGG有529样本。
2.1 TCGA表达矩阵
exp_gbm = read.table("TCGA-GBM.htseq_fpkm.tsv.gz",
header = T,row.names = 1,check.names = F)
exp_lgg = read.table("TCGA-LGG.htseq_fpkm.tsv.gz",
header = T,row.names = 1,check.names = F)
exp1 = as.matrix(cbind(exp_gbm,exp_lgg))
#行名转换为gene symbol
library(stringr)
library(AnnoProbe)
rownames(exp1) = str_split(rownames(exp1),"\\.",simplify = T)[,1];head(rownames(exp1))
## [1] "ENSG00000242268" "ENSG00000270112" "ENSG00000167578" "ENSG00000273842"
## [5] "ENSG00000078237" "ENSG00000146083"
re = annoGene(rownames(exp1),ID_type = "ENSEMBL");head(re)
## SYMBOL biotypes ENSEMBL chr start
## 1 DDX11L1 transcribed_unprocessed_pseudogene ENSG00000223972 chr1 11869
## 2 WASH7P unprocessed_pseudogene ENSG00000227232 chr1 14404
## 3 MIR6859-1 miRNA ENSG00000278267 chr1 17369
## 4 MIR1302-2HG lncRNA ENSG00000243485 chr1 29554
## 6 FAM138A lncRNA ENSG00000237613 chr1 34554
## 7 OR4G4P unprocessed_pseudogene ENSG00000268020 chr1 52473
## end
## 1 14409
## 2 29570
## 3 17436
## 4 31109
## 6 36081
## 7 53312
library(tinyarray)
exp1 = trans_array(exp1,ids = re,from = "ENSEMBL",to = "SYMBOL")
exp1[1:4,1:4]
## TCGA-06-0878-01A TCGA-26-5135-01A TCGA-06-5859-01A TCGA-06-2563-01A
## DDX11L1 0.0000000 0.0221755 0.00000000 0.01797799
## WASH7P 0.6673943 1.7429940 0.65334641 1.57566444
## MIR6859-1 1.2006708 2.1880585 1.51229174 2.17610561
## MIR1302-2HG 0.0000000 0.0000000 0.02520857 0.00000000
dim(exp1)
## [1] 56514 702
#删除正常样本
Group = make_tcga_group(exp1)
table(Group)
## Group
## normal tumor
## 5 697
exp1 = exp1[,Group == "tumor"]
#过滤表达量0值太多的基因
exp1 = exp1[apply(exp1, 1, function(x){sum(x>0)>200}),]
2.2 TCGA生存信息
surv_gbm = readr::read_tsv("TCGA-GBM.survival.tsv")
surv_gbm$TYPE = "GBM"
surv_lgg = readr::read_tsv("TCGA-LGG.survival.tsv")
surv_lgg$TYPE = "LGG"
surv1 = rbind(surv_gbm,surv_lgg)
head(surv1)
## # A tibble: 6 x 5
## sample OS `_PATIENT` OS.time TYPE
## <chr> <dbl> <chr> <dbl> <chr>
## 1 TCGA-12-0657-01A 1 TCGA-12-0657 3 GBM
## 2 TCGA-32-1977-01A 0 TCGA-32-1977 3 GBM
## 3 TCGA-19-1791-01A 0 TCGA-19-1791 4 GBM
## 4 TCGA-28-1757-01A 0 TCGA-28-1757 4 GBM
## 5 TCGA-19-2624-01A 1 TCGA-19-2624 5 GBM
## 6 TCGA-41-4097-01A 1 TCGA-41-4097 6 GBM
table(colnames(exp1) %in% surv1$sample)
##
## FALSE TRUE
## 6 691
s = intersect(colnames(exp1),surv1$sample)
exp1 = exp1[,s]
surv1 = surv1[match(s,surv1$sample),]
colnames(surv1)[c(2,4)] = c("event","time")
exp1[1:4,1:4]
## TCGA-06-0878-01A TCGA-26-5135-01A TCGA-06-5859-01A TCGA-06-2563-01A
## WASH7P 0.66739425 1.74299400 0.65334641 1.57566444
## MIR6859-1 1.20067078 2.18805848 1.51229174 2.17610561
## AL627309.1 0.00000000 0.02066275 0.01386984 0.00000000
## CICP27 0.09691981 0.01013527 0.00000000 0.02449211
head(surv1)
## # A tibble: 6 x 5
## sample event `_PATIENT` time TYPE
## <chr> <dbl> <chr> <dbl> <chr>
## 1 TCGA-06-0878-01A 0 TCGA-06-0878 218 GBM
## 2 TCGA-26-5135-01A 1 TCGA-26-5135 270 GBM
## 3 TCGA-06-5859-01A 0 TCGA-06-5859 139 GBM
## 4 TCGA-06-2563-01A 0 TCGA-06-2563 932 GBM
## 5 TCGA-41-2571-01A 1 TCGA-41-2571 26 GBM
## 6 TCGA-28-5207-01A 1 TCGA-28-5207 343 GBM
identical(colnames(exp1),surv1$sample)
## [1] TRUE
共691个有生存信息的tumor样本。
3.CGGA芯片数据整理
exp2 = read.table("CGGA/CGGA.mRNA_array_301_gene_level.20200506.txt",header = T,row.names = 1)
surv2 = read.table("CGGA/CGGA.mRNA_array_301_clinical.20200506.txt",sep = "\t",header = T,check.names = F)
head(surv2)
## CGGA_ID TCGA_subtypes PRS_type Histology Grade Gender Age OS
## 1 CGGA_11 Classical Primary GBM WHO IV Female 57 155
## 2 CGGA_124 Mesenchymal Primary GBM WHO IV Male 53 414
## 3 CGGA_126 Mesenchymal Primary GBM WHO IV Female 50 1177
## 4 CGGA_168 Mesenchymal Primary GBM WHO IV Male 17 3086
## 5 CGGA_172 Mesenchymal Primary GBM WHO IV Female 57 462
## 6 CGGA_195 Mesenchymal Primary GBM WHO IV Male 48 486
## Censor (alive=0; dead=1) Radio_status (treated=1;un-treated=0)
## 1 1 1
## 2 1 1
## 3 1 1
## 4 0 1
## 5 1 1
## 6 1 1
## Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1 0 Wildtype
## 2 0 Wildtype
## 3 1 Wildtype
## 4 1 Wildtype
## 5 0 Wildtype
## 6 1 Wildtype
## 1p19q_Codeletion_status MGMTp_methylation_status
## 1 <NA> un-methylated
## 2 <NA> un-methylated
## 3 <NA> un-methylated
## 4 <NA> un-methylated
## 5 <NA> un-methylated
## 6 <NA> un-methylated
s = intersect(surv2$CGGA_ID,colnames(exp2))
exp2 = as.matrix(exp2[,s])
surv2 = surv2[match(s,surv2$CGGA_ID),]
colnames(surv2)[c(9,8)] = c("event","time")
exp2[1:4,1:4]
## CGGA_11 CGGA_124 CGGA_126 CGGA_168
## A1BG -0.3603635 0.5649519 0.3047719 0.1749144
## A1CF 2.3413600 1.1707935 2.2081513 0.9652286
## A2BP1 0.0345194 1.0964074 0.3024883 0.0949111
## A2LD1 -0.9130564 0.5108800 0.4253669 -0.1949377
head(surv2)
## CGGA_ID TCGA_subtypes PRS_type Histology Grade Gender Age time event
## 1 CGGA_11 Classical Primary GBM WHO IV Female 57 155 1
## 2 CGGA_124 Mesenchymal Primary GBM WHO IV Male 53 414 1
## 3 CGGA_126 Mesenchymal Primary GBM WHO IV Female 50 1177 1
## 4 CGGA_168 Mesenchymal Primary GBM WHO IV Male 17 3086 0
## 5 CGGA_172 Mesenchymal Primary GBM WHO IV Female 57 462 1
## 6 CGGA_195 Mesenchymal Primary GBM WHO IV Male 48 486 1
## Radio_status (treated=1;un-treated=0)
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1 0 Wildtype
## 2 0 Wildtype
## 3 1 Wildtype
## 4 1 Wildtype
## 5 0 Wildtype
## 6 1 Wildtype
## 1p19q_Codeletion_status MGMTp_methylation_status
## 1 <NA> un-methylated
## 2 <NA> un-methylated
## 3 <NA> un-methylated
## 4 <NA> un-methylated
## 5 <NA> un-methylated
## 6 <NA> un-methylated
identical(surv2$CGGA_ID,colnames(exp2))
## [1] TRUE
4.CGGA转录组数据整理
exp3 = read.table("CGGA/CGGA.mRNAseq_325.RSEM-genes.20200506.txt",header = T,row.names = 1)
exp3 = as.matrix(log2(exp3+1))
surv3 = read.table("CGGA/CGGA.mRNAseq_325_clinical.20200506.txt",sep = "\t",header = T,check.names = F)
head(surv3)
## CGGA_ID PRS_type Histology Grade Gender Age OS
## 1 CGGA_1001 Primary GBM WHO IV Male 11 3817
## 2 CGGA_1006 Primary AA WHO III Male 42 254
## 3 CGGA_1007 Primary GBM WHO IV Female 57 345
## 4 CGGA_1011 Primary GBM WHO IV Female 46 109
## 5 CGGA_1015 Primary GBM WHO IV Male 62 164
## 6 CGGA_1019 Recurrent rGBM WHO IV Male 60 212
## Censor (alive=0; dead=1) Radio_status (treated=1;un-treated=0)
## 1 0 0
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 0
## Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1 1 Wildtype
## 2 1 Wildtype
## 3 1 Wildtype
## 4 0 Wildtype
## 5 0 Wildtype
## 6 1 Wildtype
## 1p19q_codeletion_status MGMTp_methylation_status
## 1 Non-codel un-methylated
## 2 Non-codel un-methylated
## 3 Non-codel un-methylated
## 4 Non-codel un-methylated
## 5 Non-codel un-methylated
## 6 Non-codel methylated
colnames(surv3)[c(8,7)] = c("event","time")
s = intersect(surv3$CGGA_ID,colnames(exp3))
exp3 = exp3[,s]
surv3 = surv3[match(s,surv3$CGGA_ID),]
exp3[1:4,1:4]
## CGGA_1001 CGGA_1006 CGGA_1007 CGGA_1011
## A1BG 3.769772 3.0054000 4.958379 2.933573
## A1BG-AS1 1.641546 1.0908534 2.933573 2.411426
## A2M 8.826294 6.7487296 7.698357 9.467952
## A2M-AS1 2.104337 0.1763228 0.704872 1.384050
head(surv3)
## CGGA_ID PRS_type Histology Grade Gender Age time event
## 1 CGGA_1001 Primary GBM WHO IV Male 11 3817 0
## 2 CGGA_1006 Primary AA WHO III Male 42 254 1
## 3 CGGA_1007 Primary GBM WHO IV Female 57 345 1
## 4 CGGA_1011 Primary GBM WHO IV Female 46 109 1
## 5 CGGA_1015 Primary GBM WHO IV Male 62 164 1
## 6 CGGA_1019 Recurrent rGBM WHO IV Male 60 212 1
## Radio_status (treated=1;un-treated=0)
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 0
## Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1 1 Wildtype
## 2 1 Wildtype
## 3 1 Wildtype
## 4 0 Wildtype
## 5 0 Wildtype
## 6 1 Wildtype
## 1p19q_codeletion_status MGMTp_methylation_status
## 1 Non-codel un-methylated
## 2 Non-codel un-methylated
## 3 Non-codel un-methylated
## 4 Non-codel un-methylated
## 5 Non-codel un-methylated
## 6 Non-codel methylated
identical(surv3$CGGA_ID,colnames(exp3))
## [1] TRUE
我没仔细看这个表达矩阵具体是哪种标准化,反正这个数据只是作为例子,我们就不深究到底是啥了,直接拿来用。注意如果是自己要用来发文章的数据,是要看清楚的,没标准化需要自行标准化,cpm,tmp,fpkm都行。
5.GSE16011的整理
library(tinyarray)
geo = geo_download("GSE16011")
library(stringr)
#只要tumor样本
k = str_detect(geo$pd$title,"glioma");table(k)
## k
## FALSE TRUE
## 8 276
geo$exp = geo$exp[,k]
geo$pd = geo$pd[k,]
# 把行名转换为gene symbol
gpl = GEOquery::getGEO(filename = "GPL8542_family.soft.gz",destdir = ".")
ids = gpl@dataTable@table[,1:2]
library(clusterProfiler)
library(org.Hs.eg.db)
e2s = bitr(ids$ORF,fromType = "ENTREZID",toType = "SYMBOL",OrgDb = "org.Hs.eg.db")
ids = merge(ids,e2s,by.x = "ORF",by.y = "ENTREZID")
ids = ids[,2:3]
colnames(ids) = c("probe_id","symbol")
exp4 = trans_array(geo$exp,ids)
surv4 = geo$pd[,c(1,4,7,9)]
exp4[1:4,1:4]
## GSM405201 GSM405202 GSM405203 GSM405204
## A1BG 7.219401 6.031261 6.707681 6.330666
## A2M 12.206626 13.260587 12.352693 13.044375
## NAT1 5.182989 7.213272 6.570733 6.343490
## NAT2 5.239842 5.260373 5.697435 4.859682
head(surv4)
## title age at diagnosis:ch1 histology:ch1 gender:ch1
## GSM405201 glioma 8 44.57 OD (grade III) Female
## GSM405202 glioma 9 28.69 OD (grade III) Male
## GSM405203 glioma 11 38.58 OD (grade III) Male
## GSM405204 glioma 13 33.89 OD (grade III) Male
## GSM405205 glioma 20 48.03 OD (grade III) Male
## GSM405206 glioma 21 31.53 OD (grade III) Male
identical(rownames(surv4),colnames(exp4))
## [1] TRUE
临床信息表格里没有生存信息,所以从文章附件里找
从文章附件里提取临床信息表格
包非常难装,需要配置java,但这个需求不常用,直接跳过这个过程。
if(!file.exists("exp_surv4.Rdata")){
library(tabulizer)
f <- "00085472can092307-sup-stabs_1-6.pdf"
re <- extract_tables(f,pages = 1:10) #提取前10页的表格。
str(re)
re = do.call(rbind,re)
re[1:4,1:4]
colnames(re) = re[1,]
re = re[-1,]
re = data.frame(re)
re[re==""]=NA
library(readr)
re$Survival..years. = parse_double(re$Survival..years.,locale = locale(decimal_mark = ","))
re$Age.at.diagnosis = parse_double(re$Age.at.diagnosis,locale = locale(decimal_mark = ","))
dim(exp4)
k = re$Reviewed.histological.diagnosis!="control";table(k)
re = re[k,]
re$Database.number = paste("glioma",re$Database.number)
surv4$ID = rownames(surv4)
surv4 = merge(surv4,re,by.x = "title",by.y = "Database.number")
colnames(surv4)[13:14] = c("event","time")
table(surv4$event)
surv4$time = as.integer(surv4$time*365) #天为单位
surv4$event[surv4$event=="Lost to\rfollow up"]=NA
table(surv4$event)
surv4$event=ifelse(surv4$event=="Alive",0,1)
head(surv4)
s = intersect(surv4$ID,colnames(exp4))
exp4 = exp4[,s]
surv4 = surv4[match(s,surv4$ID),]
save(exp4,surv4,file = "exp_surv4.Rdata")
}
load("exp_surv4.Rdata")
exp4[1:4,1:4]
## GSM405234 GSM405235 GSM405236 GSM405237
## A1BG 7.613150 6.995666 7.337393 6.287498
## A2M 13.166457 13.003225 13.814942 12.991301
## NAT1 5.581672 7.470894 7.845210 6.088651
## NAT2 5.206464 4.618245 4.881652 5.011908
head(surv4)
## title age at diagnosis:ch1 histology:ch1 gender:ch1 ID Gender
## 1 glioma 101 57.68 GBM (grade IV) Female GSM405234 Female
## 2 glioma 104 71.09 GBM (grade IV) Male GSM405235 Male
## 3 glioma 105 52.20 GBM (grade IV) Male GSM405236 Male
## 4 glioma 107 51.17 GBM (grade IV) Male GSM405237 Male
## 5 glioma 11 38.58 OD (grade III) Male GSM405203 Male
## 6 glioma 111 37.25 GBM (grade IV) Male GSM405238 Male
## Reviewed.histological.diagnosis Age.at.diagnosis KPS Type.of.surgery
## 1 GBM (grade IV) 57.68 70 Partial resection
## 2 GBM (grade IV) 71.09 80 Partial resection
## 3 GBM (grade IV) 52.20 80 Complete resection
## 4 GBM (grade IV) 51.17 70 Stereotactic biopsy
## 5 OD (grade III) 38.58 <NA> Complete resection
## 6 GBM (grade IV) 37.25 80 Stereotactic biopsy
## Radiotherapy Chemotherapy event time
## 1 yes no 1 226
## 2 yes no 1 76
## 3 yes no 1 375
## 4 yes no 1 149
## 5 yes no 1 3255
## 6 yes no 1 343
identical(surv4$ID,colnames(exp4))
## [1] TRUE
检查和保存数据
par(mfrow = c(2,2))
boxplot(exp1[,1:10])
boxplot(exp2[,1:10])
boxplot(exp3[,1:10])
boxplot(exp4[,1:10])
image.png
save(exp1,surv1,file = "exp_surv1.Rdata")
save(exp2,surv2,file = "exp_surv2.Rdata")
save(exp3,surv3,file = "exp_surv3.Rdata")
save(exp4,surv4,file = "exp_surv4.Rdata")
网友评论