导读
文章截图写这篇文档的初衷就是一直想把代谢组学数据写成一个小流程,之前关于非靶向代谢组学写了一个分析总结——非靶向 | 靶向代谢组学数据分析总结-纲要。那么今天我将把代谢组数据的具体分析思路做个整理。思路参考的是中科院大连化物所
许国旺
老师不久前发表的一篇文章。附上文章截图如下:
代谢组学研究样本设计,特别是临床样本,一般在有条件的情况下(样本量大),一般采用三步走的方法:
实验群体概况分析流程介绍
样本收集
-
大多数用到的是血清/血浆样本(血清和血浆的最大区别就是血清在凝血的过程中纤维蛋白原转变成纤维蛋白块,也就是说血清中没有纤维蛋白原)。一般在次日凌晨6:00~8:00(经过一晚上的禁食,主要是为了避免饮食对生物流体样本检测结果的干扰)。
-
除了血清等生物流体样本,如果是做组织病变的,还可以采集组织的病变的部分和邻近的正常组织部分。值得注意的是,
在取样使用时一定要在冰上解冻,或放在-4℃冰箱
上机测定
- discovery和test实验集的样本采用伪靶向(pseudotargeted method,目前一种较为新颖的技术,也有一些相关的文章报道)的方法进行测定,而在validation群体中进行靶向验证在discovery和test集合中找到的代谢物biomarker。
数据分析
-
这篇文章的多元统计分析是通过SMICA-P软件来实现的,包括PCA分析(无监督)用来评估整体代谢物在组间的变化及监测整个实验的稳定性。pls-da分析(有监督)用于组间最大化,以及鉴定重要的代谢物(根据投影变量的重要性,variable important in the projection,VIP),并设置排列检验,以防止模型的过拟合。
-
单变量分析:变量如果经过适当转换后符合正态性就可以使用常规的参数检验如 student-t test。如果不转换可以使用非参数检验如wilcoxon test(两组),Kruskal wallis test(适用于三组及以上分组的情况)。当然多重检验还需要进行P值的校正(这里一般采用
p.adjust()
函数,BH方法)。 -
回归模型,疾病表型的响应变量一般为
二分类
变量,即患病 Vs. 健康,或者根据疾病严重程度也可以分为多组,视具体情况而定。所以这里一般采用二元逻辑回归
或者随机森林模型
寻找潜在的代谢物靶标。
总结
-
discovery集合用于发现组间差异显著的代谢物;
-
test集合永和验证(或者说看能重复多少)discovery集合中的差异代谢物,这里有两个要求那就是:a)在discovery里是差异显著的在test中也是差异显著,显著性水平设置要一样;b)差异的方向在两个群体中要一致。选择出在两个群体中重复到的代谢物,进一步通过binary-logistic模型选择代谢物来构建最佳模型。
-
validation集合中用于验证上一步得到的结果
附加代码在最后
# Description: this script was arranged for metabolomics data analysis
# under the guidelines of article "A Large-Scale, Multicenter Serum Metabolite Biomarker Identification
# Study for the Early Detection of Hepatocellular Carcinoma"
# the mock data was from the R package statTarget
rm(list = ls())
setwd("D:\\R\\R-exercise\\metabo_data")
library(statTarget)
library(impute)
datapath <- system.file('extdata',package = 'statTarget')
file <- paste(datapath,'data_example.csv',sep = '/')
example1 <- read.csv(file)
## the second example data
example2 <- read.csv(choose.files(),check.names = F,header = T,row.names = 1)
# to impute the value NA by KNN method in the package impute
group_label <- as.factor(example2[,1])
example2[,1] <- NULL
example3 <- as.data.frame(t(example2))
example4 <- impute.knn(as.matrix(example3),k=9)
example5 <- as.data.frame(t(example4$data))
## PCAj无监督分析用于评估整体代谢物在组间的变化
library("FactoMineR")
library("factoextra")
# 数据间进行标度化处理
Z_score <- function(x){
x <- (x - mean(x))/sd(x)
}
example5 <- apply(example5, 1, Z_score)
example5 <- as.data.frame(t(example5))
data.pca <- PCA(example5,graph = FALSE)
Group <- group_label
data <- cbind(example5, Group)
colnames(data)[ncol(data)] <- "class"
# Scree plot to determine the number of principal components
tiff("meta_scree.tiff",width=2000,height=2000,res=300)
fviz_eig(data.pca, addlabels = TRUE, ylim = c(0, 100))
dev.off()
# individual plot
tiff("meta_PCA.tiff",width=2000,height=2000,res=300)
ind.p <- fviz_pca_ind(data.pca, geom = "point", col.ind = data$class)
ggpubr::ggpar(ind.p,
title = "Principal Component Analysis",
subtitle = "metabolomic sets",
#caption = "Source: factoextra",
xlab = "PC1(53.2%)", ylab = "PC2(7.3%)",
legend.title = "Group", legend.position = "top",
ggtheme = theme_gray(), palette = "npg"
)
dev.off()
## pls-da分析
library(ropls)
example5_oplsda <- opls(example5,Group,predI = 1,orthoI = NA)
# calculate the vip value
VipVn <- getVipVn(opls(example5,Group,predI = 1,orthoI = NA,plot = FALSE))
# extract the variable name which the VIP value > 1
VipVn_great_than_1 <- names(VipVn[VipVn > 1])
# univariate statistics to extract variables differred between groups
pvaVn <- apply(example5,2,function(feaVn) wilcox.test(feaVn ~ Group,paired=FALSE)$p.value)
pvaVn_adj <- p.adjust(pvaVn,method = "BH",n=length(pvaVn))
# extract the adjusted p-value less than 0.05
pvaVn_adj_less_0.05 <- names(pvaVn_adj[pvaVn_adj < 0.05])
# use intersect function to obtain the variables both VIP value great than 1
# and p value less than 0.05
final_variable <- intersect(VipVn_great_than_1,pvaVn_adj_less_0.05)
# 热图展示13个final_variable的丰度
heatmap_df <- example4$data
heatmap_df <- as.data.frame(t(heatmap_df))
heatmap_df <- heatmap_df[,final_variable]
heatmap_df_log10 <- log10(heatmap_df)
heatmap_df_log10 <- as.matrix(t(heatmap_df_log10))
library(pheatmap)
annotation_col <- data.frame(Group=factor(rep(c("N","T"),c(25,61))))
rownames(annotation_col) <- colnames(heatmap_df_log10)
ann_colors <- list(Group=c(N="#4DAF4A",T="#984EA3"))
#plot.new()
tiff("metabo_biomarker_pheatmap.tiff",width = 3000,height = 700,res = 300)
pheatmap(heatmap_df_log10,treeheight_col = 20,annotation_colors = ann_colors,cluster_cols = FALSE,
cluster_rows = TRUE,show_colnames = FALSE,annotation_col=annotation_col,
gaps_col = c(25,25),border_color = "NA",cellwidth = 4.7,cellheight = 10)
dev.off()
## 建立模型
## first transform the Group 'N' and 'T' into '0' and '1'
example6 <- example5[,final_variable]
outcome = Group
outcome <- sub("N",0,outcome) # 将分类变量转为0,1变量
outcome <- sub("T",1,outcome)
# merge the data without classification with outcome
data_logi <- cbind(example6,outcome)
# 按照一定比例选择,训练集70%样本,验证集30%的样本
index <- sample(x=2,size = nrow(data_logi),replace = TRUE,prob = c(0.7,0.3))
traindata <- data_logi[index==1,]
testdata <- data_logi[index==2,]
# 先对训练集数据构建logit模型
logit_lm <- glm(outcome ~.,family = binomial,data=traindata)
# summary(logit_lm)
logit_step <- step(logit_lm,direction = "backward")
# summary(logit_step)
# 模型的显著性检验,不止变量要显著,模型也要显著
anova(object = logit_step,test = "Chisq")
# 模型对样本外数据(验证集)的预测
prob <- predict(object = logit_step,newdata=testdata,type="response")
prediction <- ifelse(prob >=0.5, 1,0)
prediction <- factor(prediction,levels = c(1,0),ordered = TRUE)
f <- table(testdata$outcome,prediction)
f
agreement <- as.vector(prediction) == testdata$outcome
table(agreement)
prop.table(table(agreement))
# ROC curve and AUC value calculation
library(pROC)
roc(testdata$outcome,prob)
roc1 <- roc(testdata$outcome,
prob,
percent=TRUE,
partial.auc.correct=TRUE,
ci=TRUE, boot.n=100, ci.alpha=0.9, stratified=FALSE,
plot=F, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE)
tiff("example_variable_pROC.tiff",width=2000,height=2000,res=300)
roc1 <- roc(testdata$outcome,prob,
ci=TRUE, boot.n=100, ci.alpha=0.9, stratified=FALSE,
plot=TRUE, percent=roc1$percent,col="#F781BF",cex.lab=1.2, cex.axis=1.2,cex.main=1.5)
sens.ci <- ci.se(roc1, specificities=seq(0, 100, 5))
plot(sens.ci, type="shape", col="#FF69B4")
plot(sens.ci, type="bars")
plot(roc1,col="black",add=T)
legend("bottomright",cex=1.2,c(paste("AUC=",round(roc1$ci[2],2),"%"),
paste("95% CI:",round(roc1$ci[1],2),"%-",round(roc1$ci[3],2),"%")))
dev.off()
## 使用随机森林进行潜在biomarker的选择
library(randomForest)
# 首先确定下后面要用到的参数mtry和ntree
n <- length(names(traindata))
set.seed(100)
names(traindata)[1:ncol(traindata)-1] <- paste("M",1:13,sep = "_")
# 选择mtry参数
for(i in 1:(n-1)){
mtry_fit <- randomForest(outcome ~ ., data = traindata, mtry = i)
err <- mean(mtry_fit$err.rate)
print(err) # 当mtry=2的时候误差最小
}
# 选择ntree参数
set.seed(100)
ntree_fit <- randomForest(outcome ~., data = traindata,mtry=2,ntree=1000)
plot(ntree_fit) # 在400左右模型内误差基本稳定,默认值为500,可以选默认值
# 变量重要性选择
set.seed(100)
#' rfcv是随机森林交叉验证函数:Random Forest Cross Validation
result <- rfcv(traindata[,-ncol(traindata)],traindata$outcome,cv.fold = 10)
result$error.cv #' 查看错误率表,21时错误率最低,为最佳模型
#' 绘制验证结果
with(result,plot(n.var,error.cv,log="x",type = "o",lwd=2)) # 结果是选择13个变量误差最小
# 构建模型
set.seed(100)
rf <- randomForest(outcome ~ ., data = traindata, mtry=2,ntree=400,importance=TRUE)
rf
# 验证并预测
names(testdata)[1:ncol(testdata)-1] <- paste("M",1:13,sep = "_")
pred1 <- predict(rf, testdata)
# 查看准确性
test_matrix <- table(pred1, testdata$outcome)
sum(diag(test_matrix))/sum(test_matrix)
# 0.8148
参考
[1] A Large-Scale, Multicenter Serum Metabolite Biomarker Identification Study for the Early Detection of Hepatocellular Carcinoma
网友评论