2019-12-30 Downloading RNA-Seq data from UCSC Xena

Author: 王子威PtaYoth | Published 2019-12-30 20:59

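Xena: downloading TCGA data
UCSC Xena
UCSC Xena data download tutorial
Survival curve analysis of the TCGA database with UCSC Xena
TCGA RNA-seq data reanalysis, step 0: data download
On data download issues with UCSC Xena
Other download methods
[r<-package] UCSCXenaTools v1.2.7
[r<-package | datasets | public databases] UCSCXenaTools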

The URL is:
https://xenabrowser.net/datapages/?cohort=TCGA%20Bladder%20Cancer%20(BLCA)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

In this cohort, DNA methylation 450K covers 434 samples and RNA-Seq covers 426 samples.
The site offers three types of data:

For comparing data within this cohort, we recommend to use the "gene expression RNAseq" dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking.

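To compare gene expression within the BLCA cohort, dataset 1, "gene expression RNAseq", is recommended.
To compare expression against other cancer types, dataset 2, the pancan normalized version, is recommended.
To compare against cohorts outside TCGA, provided the non-TCGA data was normalized by percentile ranking, dataset 3, the percentile version, is recommended.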
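
To make "percentile ranking" concrete, here is a minimal sketch in base R. It assumes that the ranking is done within each sample (each column) and uses one common definition, 100 * rank / n; the exact procedure Xena uses is not stated here, and the gene names and values are made up for illustration.

```r
# Toy expression matrix: rows are genes, columns are samples (hypothetical values).
expr <- cbind(S1 = c(5, 1, 3, 9), S2 = c(2, 8, 4, 6))
rownames(expr) <- paste0("GENE_", LETTERS[1:4])

# Percentile rank of each gene within one sample (assumed definition: 100 * rank / n).
percentile_rank <- function(v) 100 * rank(v, ties.method = "average") / length(v)

# Apply the ranking column by column, so every sample ends up on the same 0-100 scale.
pct <- apply(expr, 2, percentile_rank)
print(pct)
```

After this step the values only encode the order of genes within a sample, which is why the percentile version is comparable only with data that was normalized the same way.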

Since I want to compare within the cohort, I chose dataset 1.

Opening it shows the following information:

Rows are identifiers and columns are samples. The values are log2(norm_count+1), i.e. RSEM normalized counts after a log2(x+1) transformation.
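
As a rough sketch of how such a matrix could be read into R and converted back to normalized counts, the snippet below builds a tiny tab-delimited file in the layout just described and reads it back. The file, the "sample" header of the first column, the gene names and the sample IDs are all hypothetical stand-ins, not the real download.

```r
# Build a tiny tab-delimited file in the described layout:
# first column = identifier, remaining columns = samples, values = log2(norm_count + 1).
toy <- data.frame(
  sample = c("GENE_A", "GENE_B", "GENE_C"),
  `TCGA-XX-0001-01` = log2(c(0, 10, 1023) + 1),
  `TCGA-XX-0002-01` = log2(c(3, 10, 255) + 1),
  check.names = FALSE
)
path <- tempfile(fileext = ".tsv")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back as a numeric matrix; check.names = FALSE keeps the dashes in sample IDs.
expr <- as.matrix(read.delim(path, row.names = 1, check.names = FALSE))

# Undo the log2(x + 1) transformation to recover the RSEM normalized counts.
norm_count <- 2^expr - 1
print(round(norm_count))
```

For the real file, only the path passed to read.delim() would change; the back-transformation is the same.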

For differential expression analysis of RSEM data, see this answer:
https://support.bioconductor.org/p/91054/

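One of the answerers, Gordon Smyth, holds that log-transformed data can be analyzed directly with the limma package. It only takes two steps:
first normalize with normalizeQuantiles();
then use the eBayes(trend=TRUE) pipeline (without voom).
He also gave example code: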


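# This package is 399 MB and I have not downloaded it yet, so for now I am just reading the code.
BiocManager::install("curatedCRCData")
library(limma)   # normalizeQuantiles(), lmFit(), eBayes(), topTable(), plotMDS(), plotSA()
library(Biobase) # pData(), exprs()
library(curatedCRCData)
data(TCGA.RNASeqV2_eset) # data() loads the dataset into the workspace; worth checking with class() what kind of object TCGA.RNASeqV2_eset is
targets <- pData(TCGA.RNASeqV2_eset) # pData() extracts the phenoData (sample annotation)
y <- normalizeQuantiles(exprs(TCGA.RNASeqV2_eset))
# normalizeQuantiles() normalizes the columns so that they all have the same quantiles, i.e. the same empirical distribution.
# Its help page notes that users do not normally call this function directly and use limma's normalizeBetweenArrays() instead.
type <- factor(targets$sample_type)
table(type)

keep <- rowSums(y > log2(11)) >= 14 # keep identifiers whose count is > 10 (log2(count+1) > log2(11)) in at least 14 samples
table(keep)
head(keep) # peek at the filter; printing all of keep would list every row

y2 <- y[keep,] # rows of the normalized matrix that pass the filter
design <- model.matrix(~type) # design matrix
fit <- lmFit(y2,design) # fit a linear model for each identifier
fit <- eBayes(fit,robust=TRUE,trend=TRUE)
# I do not understand the robust and trend arguments yet. From the help page:
# trend: logical, should an intensity-trend be allowed for the prior variance? Default is that the prior variance is constant.
# robust: logical, should the estimation of df.prior and var.prior be robustified against outlier sample variances?
topTable(fit,coef=2) # top-ranked identifiers for the second coefficient (the sample type effect)

plotMDS(y2, label=type)
# Plot samples on a two-dimensional scatterplot so that distances on the plot approximate the typical log2 fold changes between the samples.

plotSA(fit)
# Plot residual standard deviation versus average log expression for a fitted microarray linear model.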
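
To see what "columns with the same quantiles" means in the normalizeQuantiles() step, here is a minimal base R sketch of quantile normalization on a toy matrix. It is an illustration only, not limma's implementation: quantile_normalize is a hypothetical helper, the values are made up, and ties and missing values (which limma does handle) are ignored.

```r
# Hypothetical helper: plain quantile normalization without tie or NA handling.
quantile_normalize <- function(x) {
  sorted <- apply(x, 2, sort)                 # sort each column independently
  target <- unname(rowMeans(sorted))          # reference distribution: mean of the k-th smallest values
  ranks <- apply(x, 2, rank, ties.method = "first")
  out <- apply(ranks, 2, function(r) target[r]) # replace each value by the reference value of its rank
  dimnames(out) <- dimnames(x)
  out
}

# Toy matrix: rows are genes, columns are samples (made-up values).
x <- cbind(S1 = c(5, 2, 3, 4), S2 = c(4, 1, 6, 2), S3 = c(3, 4, 6, 8))
rownames(x) <- paste0("g", 1:4)

qn <- quantile_normalize(x)
print(qn)
print(apply(qn, 2, sort)) # every column now contains the same set of values
```

The order of genes within each sample is unchanged; only the scale is forced to be identical across samples.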

Time to go run it and see...