实战系列（二）CCR数据挖掘建模文章操作

作者: vegene | 来源:发表于2019-06-21 15:27 被阅读177次

实战系列（二）CCR数据挖掘建模文章操作
2018-11-12
2018-12-26
2019-03-04
GEO挖掘实战二、差异分析及富集分析
GEO挖掘实战一、初步探索数据
苏州10月大数据建模与分析挖掘应用实战培训班
2019-07-14
某网价值1800元的Python课程分享
Python数据挖掘与机器学习_通信信用风险评估实战(3)——特

前一篇在这里实战系列（一）手把手复现3分lncRNA经典小文章

微信服务号本月没有推送了，所以第二篇先在这首发了。

我们来看一下今天介绍的文章吧：

2012年发表的结直肠癌7基因Signature，影响因子10.199分，到19年现在基本可以认为是个2-4分水平的文章

目前可以认为这是一个比较基础的建模分析套路，话不多说，上手吧。

结果展示

Figure 1.Workflow for development of a prognostic gene expression signature. From the learning series of 95 stage II and III colorectal cancer samples, a filtered gene expression data set was used as input for survival modeling. From 1,000 iterations of lasso-penalized multivariate modeling, 7 models were reported as optimal for survival prediction more than 50 times (middle bar plot). For each of the gene expression signatures, patients were dichotomized to good and poor prognosis groups according to all the possible stepwise increases in amounts of genes being expressed at levels associated with poor prognosis (bottom). All 28 possible stratifications (pale and dark blue boxes in heatmap) were tested for univariate associations with patient survival, yielding significant associations for 22 (dark blue).

Figure 2.Statistical characteristics of genes in the identified prognostic expression signature. The 7 genes in the expression signature (blue) had (A) low P values from univariate Cox proportional hazards analyses (median P = 0.02, Wald test for predictive potential) and (B) high variances in gene expression signals (median 2.4, log2 scale), compared with the remaining 3,091 genes included for survival modeling (gray; median P = 0.3; median variance 0.3). C, the Pearson correlations of expression signals between the 7 genes in the signature were generally weak (absolute values, 0.006–0.55).

Figure 3.Survival curves for patients with stage II and III colorectal cancer in the 3 independent series stratified by the 7-gene expression signature. A, patients in the learning series assigned to the poor prognosis group had a 10-year relapse-free survival rate of 9%, significantly poorer than the 62% survival rate for patients in the good prognosis group. The corresponding 5-year survival rates for patients in (B) the test series and (C) external validation series, were 49% compared with 78% and 45% compared with 81%, respectively. All survival curves were calculated by Kaplan–Meier statistics and compared by log-rank tests. Relapse or death from colorectal cancer within the 10 or 5 years of follow-up was regarded events, and patients with no events within the indicated period of follow-up were censored.

Figure 4.Survival curves in each stage for patients in the 3 independent series. Patients with stage II (left) and III (right) colorectal cancer were individually stratified according to the 7-gene expression signature. In the learning series, (A) 19% of stage II patients [HR, 6.6 (2.7–16.1)] and (B) 28% of stage III patients [HR, 2.6 (1.2–5.7)] were assigned to the poor prognosis group. In the test series, (C) 16% of stage II patients [HR, 3.3 (0.8–13.3)] and (D) 18% of stage III patients [HR, 2.3 (0.6–8.8)] were assigned to the poor prognosis group. In the external validation series, only (E) 10% of stage II patients were assigned to the poor prognosis group [HR, 1.6 (0.3–6.9)], compared with (F) 21% of stage III patients [HR, 4.1 (2.0–8.2)]. All survival curves were calculated by Kaplan–Meier statistics and compared by log-rank tests. Relapse or death from colorectal cancer within the 10 or 5 years of follow-up was regarded events, and patients with no events within the indicated period of follow-up were censored.

文章整体思路

• 收集结直肠癌II和III期的样本数据，准备训练集、测试集、验证集三套数据

• 随访时间超过十年的按十年活着处理

• 取三套数据中注释到的共同的基因集

• 筛选在训练集中方差大于0.2 单因素生存分析小于0.5的基因

• 一千次lasso回归

• 统计一千次回归后的基因组合出现频率，大于50次的留下

• 拿这些留下的基因集构建分类器。

操作步骤

1.下载GEO、TCGA数据，数据预处理

library(WGCNA)

time<-c(B,…, C)#B = time to recurrence or censoring

event<-c(D,…, E)#D = 1 (recurrence) or 0 (censoring)

datExpr<-t(as.matrix(read.table("M.txt",header=TRUE,sep="\t",row.names=1,

as.is=TRUE))) #M =tab-delimited gene expression

uCox<-standardScreeningCensoredTime(time,event,datExpr,fastCalculation=FALSE)

vars=apply(datExpr,1,var)

最终得到了三千多个满足条件的基因

2.lasso回归

lasso回归是线性回归方法的一种，主要用来对高维数据进行有监督的学习，lasso和岭回归是线性回归的变种。

岭回归与Lasso回归的出现是为了解决线性回归出现的过拟合以及在通过正规方程方法求解θ的过程中出现的x转置乘以x不可逆这两类问题的，这两种回归均通过在损失函数中引入正则化项来达到目的，λ是关键。

岭回归与Lasso回归最大的区别在于岭回归引入的是L2范数惩罚项，Lasso回归引入的是L1范数惩罚项，Lasso回归能够使得损失函数中的许多θ均变成0，这点要优于岭回归，因为岭回归是要所有的θ均存在的，这样计算量Lasso回归将远远小于岭回归。

3.统计1000次回归后的基因组合

library(penalized)

s<-Surv(time,event) #’time’ and ‘event’ are vectors as

exprData<-t(as.matrix(read.table("D.txt",header=TRUE,sep="\t", row.names=1, as.is=TRUE))) #D = tab-delimited gene

opt <-optL1(s, penalized=exprData, fold=10)

#lambda:正则化项参数，cvl：交叉验证的似然值，cvl越低表明此时的lambda越好

p<-penalized(s, penalized=exprData, lambda1=E) #E = chosen λ

coefficients(p)

lasso回归每一次迭代都会得到一个基因组合，这类似于解多元方程式一样，比如二元方程的话会有2个解，一个解，或者无解；对应几千基因的回归，相当于几千维的数据，可能得到千百个解，而在如此高维的数据计算中计算机无法准确的获得所有解，即一次可能能得到一个最优解。即每一次lasso回归会得到一个解。作者一千次回归中得到一千个解，统计相同解的频次，如图出现五十次以上的拿出来如图所示。

4.定义风险分类方法

• 作者定义对于频次高于五十次的基因组合单独拿出来，共有7个组合，然后作者定义对于任意一个组合中的基因，如果这个基因HR大于1，则定义95个样本中的这个基因表达水平最高的20%认为是高风险的，HR小于1，则定义这95个样本中这个基因表达水平最低的20%的样本认为是高风险的。那么对于组合中的每个基因都这么定义，则最后能够得到每个样本在该基因组合中被定义为高风险的次数，根据样本被定义为高风险的次数对样本进行高低风险分类,≥1genes,≥2genes.......

• 对于每个组合都如此计算

• 最后根据得到的预后显著性绘制热图