Files available from GEO


1. GPL file (platform information, used to get the probe-to-gene mapping)

library(GEOquery)
GPL <- getGEO(filename = 'GPL6244.soft') # read a locally downloaded platform file
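To make "probe conversion" concrete, here is a minimal base-R sketch. The annotation table, its column names (`ID`, `SYMBOL`), and the probe IDs are hypothetical toy values, not the real contents of GPL6244. Averaging probes that share a symbol is only one common choice and is an assumption here.

```r
# Toy platform annotation: one row per probe (hypothetical values)
annot <- data.frame(
  ID = c("p1", "p2", "p3", "p4"),
  SYMBOL = c("TP53", "BRCA1", NA, "TP53"),
  stringsAsFactors = FALSE
)

# Toy expression matrix with probe IDs as row names
expr <- matrix(
  c(5.1, 6.0,
    7.2, 8.1,
    3.3, 3.9,
    6.4, 6.8),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2"))
)

# Look up the gene symbol for every probe in the matrix
symbols <- annot$SYMBOL[match(rownames(expr), annot$ID)]

# Drop probes without a symbol, then average probes that share a symbol
keep <- !is.na(symbols)
sums <- rowsum(expr[keep, , drop = FALSE], group = symbols[keep])
counts <- table(symbols[keep])
gene_expr <- sums / as.vector(counts[rownames(sums)])
```

The result `gene_expr` has one row per gene symbol instead of one row per probe.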

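2. Expression matrix of a single sample
3. SOFT file (should include both the expression matrix and the platform information)
4. Series matrix file (expression matrix for all samples)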

# metadata lines start with "!" and are skipped as comments
expr.df <- read.table(file = "GSE42589_series_matrix.txt", header = TRUE,
                      sep = "\t", comment.char = "!", row.names = 1)
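The `comment.char = "!"` argument works because every metadata line in a series matrix file starts with `!`. The sketch below shows this on a tiny made-up file written to a temporary path. The file contents and the names `tmp`, `toy.df`, and `toy.mat` are illustrative assumptions.

```r
# Write a tiny, made-up series-matrix-style file to a temporary path
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "!Series_title\t\"toy series\"",
  "!Sample_title\t\"ctrl\"\t\"treated\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM1\"\t\"GSM2\"",
  "\"p1\"\t5.1\t6.0",
  "\"p2\"\t7.2\t8.1",
  "!series_matrix_table_end"
), tmp)

# Lines starting with "!" are treated as comments and skipped,
# so only the header row and the data rows are parsed
toy.df <- read.table(tmp, header = TRUE, sep = "\t",
                     comment.char = "!", row.names = 1)

# Convert to a numeric matrix: probes in rows, samples in columns
toy.mat <- as.matrix(toy.df)
```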

The sample clinical (phenotype) information can also be obtained from the series matrix file:

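library(GEOquery)
library(Biobase)
Data <- getGEO(filename="GSE42872_series_matrix.txt.gz") # returns an ExpressionSet
pData <- pData(phenoData(Data)) # one row per sample, one column per attribute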
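The phenotype table comes from the `!Sample_` header lines of the series matrix file. The base-R sketch below only shows where that information lives. It is not how `getGEO` parses the file, the header lines are made up, and it ignores attribute names that repeat in real files.

```r
# Made-up header lines in the style of a series matrix file
hdr <- c(
  "!Series_title\t\"toy series\"",
  "!Sample_title\t\"ctrl\"\t\"treated\"",
  "!Sample_geo_accession\t\"GSM1\"\t\"GSM2\"",
  "!Sample_source_name_ch1\t\"liver\"\t\"liver\"",
  "!series_matrix_table_begin"
)

# Keep only per-sample metadata lines and split them on tabs
sample_lines <- grep("^!Sample_", hdr, value = TRUE)
fields <- strsplit(sample_lines, "\t", fixed = TRUE)

# First field is the attribute name; the rest are per-sample values
attr_names <- sub("^!Sample_", "", vapply(fields, `[`, character(1), 1))
values <- lapply(fields, function(f) gsub("\"", "", f[-1], fixed = TRUE))

# One row per sample, one column per attribute
pheno <- as.data.frame(setNames(values, attr_names), stringsAsFactors = FALSE)
rownames(pheno) <- pheno$geo_accession
```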

5. CEL files (raw data for all samples)

library(affy)
dir_cels <- 'D:\\test_analysis\\TNBC\\cel_files'
affy_data <- ReadAffy(celfile.path = dir_cels) # read every CEL file in the directory
eset.mas5 <- mas5(affy_data) # MAS5 expression values (not log-transformed)

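Of course, the affy package supports only a limited set of chip platforms,
typically the hgu95 and hgu133 series.
Strictly speaking, the expression matrix obtained from these chips also needs to be filtered,
for example with code like the following: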

setwd('../')
library(affy)
dir_cels <- 'GSE34824_RAW'
data <- ReadAffy(celfile.path = dir_cels)
eset <- rma(data)
calls <- mas5calls(data) # get PMA (present/marginal/absent) calls
calls <- exprs(calls)
absent <- rowSums(calls == 'A') # number of samples in which each probe set is called 'absent'
absent <- which(absent == ncol(calls)) # probe sets that are 'absent' in all samples
# eset[-absent, ] would drop every row if 'absent' were empty, so guard that case
rmaFiltered <- if (length(absent) > 0) eset[-absent, ] else eset
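The filtering logic can be tried without any CEL files. The sketch below uses a made-up character matrix of PMA calls and a logical mask instead of `which()` with negative indexing. The matrix values and the names `all_absent`, `filtered`, `none`, and `dropped_everything` are illustrative assumptions.

```r
# Toy PMA call matrix: rows are probe sets, columns are samples (made-up values)
calls <- matrix(
  c("P", "P", "A",
    "A", "A", "A",
    "M", "P", "P"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ps1", "ps2", "ps3"), c("s1", "s2", "s3"))
)

# Logical mask: TRUE for probe sets called 'A' in every sample
all_absent <- rowSums(calls == "A") == ncol(calls)

# Logical indexing keeps all rows when no probe set is all-absent
filtered <- calls[!all_absent, , drop = FALSE]

# The pitfall with negative integer indexing on an empty index vector
none <- which(rep(FALSE, nrow(calls)))              # integer(0)
dropped_everything <- calls[-none, , drop = FALSE]  # zero rows, not all rows
```

A logical mask behaves the same whether or not any probe set is absent everywhere, which is why it is often preferred over `x[-which(...), ]`.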

Microarray file formats

Common microarray data file formats
Microarray experiment ---- DAT and EXP files ---- CEL files ---- CHP, TXT, and RPT files

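DAT file: the fluorescence signal image
CEL file: intensity values extracted from the processed fluorescence image
CDF file: probe layout on the chip (which probe belongs to which probe set)
probe file: probe sequence information
TXT/CHP file: gene expression matrix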

Microarrays and Bioconductor

eSet is the standard class that Bioconductor defines for gene expression data.

AffyBatch

phenoData: An optional AnnotatedDataFrame containing information about each sample. (clinical information)

featureData: An optional AnnotatedDataFrame containing information about each feature.

annotation: A character describing the platform on which the samples were assayed. (platform annotation, used for probe-to-gene conversion)

assayData: A matrix of expression values, or an environment. (the expression matrix)

experimentData: An optional MIAME instance with meta-data (e.g., the lab and resulting publications from the analysis) about the experiment.