Gene function analysis and under

作者: Stone_Stan4d | 来源:发表于2018-01-14 17:10 被阅读47次

Oncotarget上的一篇文章。

Esophageal cancer (EC) is one of the most common digestive malignant tumors worldwide. Over the past decades, there have been minimal improvements in outcomes for patients with EC. New targets and novel therapies are needed to improve outcomes for these patients. This study aimed to explore the molecular mechanisms of EC by integrated bioinformatic analyses of the feature genes associated with EC and correlative gene functions which can distinguish cancerous tissues from non-cancerous tissues. Gene expression profile GSE20347 was downloaded from Gene Expression Omnibus (GEO) database, including 17 EC samples and their paired adjacent non- cancerous samples. The differentially expressed genes (DEGs) between EC and normal specimens were identified and then applied to analyze the GO enrichment on gene functions and KEGG pathways. Corresponding Pathway Relation Network (Pathway- net) and Gene Signal Network (signal-net) of DEGs were established based on the data collected from GCBI datasets. The results showed that DEGs mainly participated in the process of cell adhesion, cell proliferation, survival, invasion, metastasis and angiogenesis. Aberrant expression of PTK2, MAPK signaling pathway, PI3K-Akt signaling pathway, p53 signaling pathway and MET were closely associated with EC carcinogenesis. Importantly, Interleukin 8 (IL8) and C-X-C chemokine receptor type 7 (CXCR-7) were predicted to be significantly related to EC. These findings were further validated by analyzing both TCGA database and our clinical samples of EC. Our discovery provides a registry of genes and pathways that are disrupted in EC, which has the potential to be used in clinic for diagnosis and target therapy of EC in future.

直接看方法部分：

The probe-level data in CEL files were converted into expression value matrix by GCBI online platform analysis in the following link: www.gcbi.com.cn. Robust Multi-chip Average (RMA), including background- corrected, normalization and summary, was used to compute the expression value [22]. Quality control of gene expression data was performed using gene-specific probe. The median of NUSE value of each chip was applied to evaluate the feasibility of this design and the reliability of these analysis results. We further assessed the change rule of every probe set in the experiment by computation of RLE [23]. And the unified standard criterion for every sample was as follows: (1-0.2) < median < (1 + 0.2) and (-0.25) < median {RLE} < (0.25). The chips that did not meet this criterion were rejected. Data distribution was presented as box graph. Probe set annotation mainly referred the new version annotation files that were download on affymetrix official
website (http://www.affymetrix.com/support/
technical/annotationfilesmain.affx) and the probes without annotation were filtered.
DEGs

我准备子啊R语言里重复这一段过程，不用GCBI（貌似要钱？）

获取数据：

source("https://bioconductor.org/biocLite.R")
biocLite()
getwd()
setwd("D:/RWork/Rlearning")
library(GEOquery)

x = getGEOSuppFiles("GSE20347")

untar("GSE20347/GSE20347_RAW.tar", exdir = "data")
cels = list.files("data/", pattern = "[gz]")
sapply(paste("data", cels, sep = "/"), gunzip)

创建分组矩阵

phenodata = matrix(rep(list.files("data", pattern= "CEL"), 2), ncol =2)
class(phenodata)
phenodata <- as.data.frame(phenodata) #phenodata的类转换成了data.frame
colnames(phenodata) <- c("Name", "FileName") #为phenodata命名了列名

phenodata$Group <- lapply(phenodata$FileName, function(x) substr(x, 16, 16))
write.table(phenodata, "data/phenodata.txt", quote = F, sep = "\t", 
            row.names = F)

这一步有一个报错：

报错
应该是我用lapply函数造成的一个list类型数据。用下面的代码就可以了
write.table(unlist(phenodata), "data/phenodata.txt", sep = "\t", row.names = F)

生成affBatch对象

install.packages("simpleaffy")
library(affy)
celfiles <- ReadAffy(celfile.path = "data") 
library(RColorBrewer)
cols = palette(rainbow(length(celfiles)))

colnames(eset)
#注意这一步phenodata必须放在"data"路径下面，我也不知道为什么。。。。
#另外在phenodata数据框创建过程中，应该对rownames进行命名，可用>rownames(phenodata <- phenodata$Name)
par(mfrow = c(2, 1))
boxplot(celfiles)
boxplot(celfiles, col = cols)

affy_data <- rma(celfiles)

花式作图

par(mfrow = c(2, 1))
boxplot(celfiles)
boxplot(celfiles, col = cols)

affy_data <- rma(celfiles)
exp.RMA <- exprs(affy_data)
write.exprs(affy_data, "RMA_processedData.txt")

par(mfrow = c(1, 2))
hist(log2(intensity(celfiles[, 4])), breaks = 100, col = "blue")
hist(log2(exp.RMA), breaks = 100, col = "green")

par(mfrow=c(1,2))
boxplot(celfiles, col = rainbow(8))
boxplot(data.frame(exp.RMA), col = rainbow(8))

library(genefilter)
rsd<-rowSds(exp.RMA)
sel<-order(rsd,decreasing=TRUE)[1:100]
heatmap(exp.RMA[sel,],col=rainbow(256))

热图

获取差异基因

library(limma)
samps <- factor(unlist(phenodata$Group))
design <- model.matrix(~0 + samps)
design
fit <- lmFit(exp.RMA, design)
cont.matrix<-makeContrasts(sampsT - sampsN,levels=design)
fit2 <-contrasts.fit(fit,cont.matrix)
fit2 <- eBayes(fit2) 
DEGs <- topTable(fit2, coef = 1, number = 5000)
head(DEGs)

注释和写出数据

library(limma)
samps <- factor(unlist(phenodata$Group))
design <- model.matrix(~0 + samps)
design
fit <- lmFit(exp.RMA, design)
cont.matrix<-makeContrasts(sampsT - sampsN,levels=design)
fit2 <-contrasts.fit(fit,cont.matrix)
fit2 <- eBayes(fit2) 
DEGs <- topTable(fit2, coef = 1, number = 5000)
head(DEGs)

#注释
biocLite(affydb)
library(annotate)
affydb<-annPkgName("hgu133a2",type="db")
library(affydb,character.only=T)
DEGs$Symbols<-getSYMBOL(rownames(DEGs),affydb)
##根据每个探针组的ID获取对应的基因Gene Symbol,并作为一个新的列，加到数据框dif最后

DEGs$EntrezID<-getEG(rownames(DEGs),affydb) 
##根据每个探针组的ID获取对应的基因Entrez ID,同样加到数据框dif最后

write.table(DEGs, paste("Tumor_Normal","_limma.txt"),quote=FALSE,sep="\t")
head(DEGs)

全部代码如下：

source("https://bioconductor.org/biocLite.R")

biocLite()

getwd()

setwd("D:/RWork/Rlearning")

library(GEOquery)

x = getGEOSuppFiles("GSE20347")

untar("GSE20347/GSE20347_RAW.tar", exdir = "data")
cels = list.files("data/", pattern = "[gz]")
sapply(paste("data", cels, sep = "/"), gunzip)

phenodata = matrix(rep(list.files("data", pattern= "CEL"), 2), ncol =2)
class(phenodata)
phenodata <- as.data.frame(phenodata) #phenodata的类转换成了data.frame
colnames(phenodata) <- c("Name", "FileName") #为phenodata命名了列名


phenodata$Group <- lapply(phenodata$FileName, function(x) substr(x, 16, 16))
write.table(unlist(phenodata), "data/phenodata.txt", sep = "\t", row.names = F)

class(phenodata)

install.packages("simpleaffy")
library(affy)
celfiles <- ReadAffy(celfile.path = "data") 


colnames(eset)
#注意这一步phenodata必须放在"data"路径下面，我也不知道为什么。。。。
#另外在phenodata数据框创建过程中，应该对rownames进行命名，可用>rownames(phenodata <- phenodata$Name)
par(mfrow = c(2, 1))
boxplot(celfiles)
boxplot(celfiles, col = cols)

affy_data <- rma(celfiles)
exp.RMA <- exprs(affy_data)
write.exprs(affy_data, "RMA_processedData.txt")

par(mfrow = c(1, 2))
hist(log2(intensity(celfiles[, 4])), breaks = 100, col = "blue")
hist(log2(exp.RMA), breaks = 100, col = "green")

par(mfrow=c(1,2))
boxplot(celfiles, col = rainbow(8))
boxplot(data.frame(exp.RMA), col = rainbow(8))

library(genefilter)
rsd<-rowSds(exp.RMA)
sel<-order(rsd,decreasing=TRUE)[1:100]
heatmap(exp.RMA[sel,],col=rainbow(256))

#获取差异基因
library(limma)
samps <- factor(unlist(phenodata$Group))
design <- model.matrix(~0 + samps)
design
fit <- lmFit(exp.RMA, design)
cont.matrix<-makeContrasts(sampsT - sampsN,levels=design)
fit2 <-contrasts.fit(fit,cont.matrix)
fit2 <- eBayes(fit2) 
DEGs <- topTable(fit2, coef = 1, number = 5000)
head(DEGs)

#注释
biocLite(affydb)
library(annotate)
affydb<-annPkgName("hgu133a2",type="db")
library(affydb,character.only=T)
DEGs$Symbols<-getSYMBOL(rownames(DEGs),affydb)
##根据每个探针组的ID获取对应的基因Gene Symbol,并作为一个新的列，加到数据框dif最后

DEGs$EntrezID<-getEG(rownames(DEGs),affydb) 
##根据每个探针组的ID获取对应的基因Entrez ID,同样加到数据框dif最后

write.table(DEGs, paste("Tumor_Normal","_limma.txt"),quote=FALSE,sep="\t")
head(DEGs)

网友评论

本文标题：Gene function analysis and under

本文链接：https://www.haomeiwen.com/subject/hrikoxtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！