UCSC Xena和TCGAbiolinks的HNSC临床数据不

作者: Stone_Stan4d | 来源:发表于2019-01-12 16:13 被阅读37次

UCSC Xena和TCGAbiolinks的HNSC临床数据不
聊UCSC xena的数据下载问题
「r<-包」UCSCXenaTools v1.2.7
UCSC xena数据下载教程
Xena - TCGA数据下载
利用UCSC Xena做TCGA数据库的生存曲线分析
USCS Xena
单基因TCGA的Cox森林图怎么画才好看
TCGA-RNAseq数据重分析 — step0 数据下载
【r<-包|数据集|公开数据库】UCSCXenaTools

我们先加载数据，数据分别来自TCGAbiolinks（直接下载自TCGA-GDC）和UCSCXena，单看临床数据。

library(TCGAbiolinks)
library(dplyr)
library(DT)
library(stringr)
library(SummarizedExperiment)
rm(list = ls())
load('E:/TCGA/TCGAbiolinks/rdata/step1-HNSC-clinical-rnaseq.rdata')
load('E:/TCGA/UCSCXena/Project/Data/HNSC/meth.HNSC.rdata')

其中Xena的数据重复计算了癌旁样本。

> dim(clinical.biolink)
[1] 528  42
> dim(clinical.Xena)
[1] 604 141
>

Xena

Biolinks

我们把clinical.Xena癌旁样本剔除掉：

> clinical.Xena$sampleID
  [1] "TCGA-4P-AA8J-01" "TCGA-BA-4074-01" "TCGA-BA-4075-01" "TCGA-BA-4076-01" "TCGA-BA-4077-01" "TCGA-BA-4078-01"
  [7] "TCGA-BA-5149-01" "TCGA-BA-5151-01" "TCGA-BA-5152-01" "TCGA-BA-5153-01" "TCGA-BA-5555-01" "TCGA-BA-5556-01"
 [13] "TCGA-BA-5557-01" "TCGA-BA-5558-01" "TCGA-BA-5559-01" "TCGA-BA-6868-01" "TCGA-BA-6869-01" "TCGA-BA-6870-01"

其中14-15位01的是原发肿瘤样本，06是复发癌,11是癌旁。我们做一个统计看看：

> clinical.Xena$sampleID %>% word(sep = '-', 4) %>% table()
.
 01  06  11 
528   2  74 
> ind = (clinical.Xena$sampleID %>% word(sep = '-', 4)) == '01'
> ind %>% table()
.
FALSE  TRUE 
   76   528
> clinical.Xena = clinical.Xena[ind, ]

528例原发肿瘤样本，和2例复发肿瘤，74例癌旁正常样本。
对行名进行赋值：

> rownames(clinical.biolink) = clinical.biolink$submitter_id
> rownames(clinical.Xena) = clinical.Xena$sampleID
> rownames(clinical.Xena)[1:5]
[1] "TCGA-4P-AA8J-01" "TCGA-BA-4074-01" "TCGA-BA-4075-01"
[4] "TCGA-BA-4076-01" "TCGA-BA-4077-01"
> rownames(clinical.biolink)[1:5]
[1] "TCGA-4P-AA8J" "TCGA-BA-4074" "TCGA-BA-4075" "TCGA-BA-4076"
[5] "TCGA-BA-4077"
> rownames(clinical.biolink) %>% 
  substr(., 1, 12) -> rownames(clinical.biolink)
> rownames(clinical.Xena)[1:5]
[1] "TCGA-4P-AA8J" "TCGA-BA-4074" "TCGA-BA-4075" "TCGA-BA-4076"
[5] "TCGA-BA-4077"

貌似行名一致。。。。
我们看看生存事件：

> clinical.biolink$vital_status %>% table()
.
alive  dead 
  304   224 
> clinical.Xena$OS %>% table()
.
  0   1 
302 223

Xena的数据死亡发生例数少1个，存活例数少2个。总数少3个（=528-302-223），仔细看是NA值：

Xena中的OS事件有3个NA值

我们把他们的样本名调出来, 看看Biolinks的数据是怎样的：

> rownames(clinical.Xena)[is.na(clinical.Xena$OS)]
[1] "TCGA-BB-4224" "TCGA-CQ-A4CA" "TCGA-H7-A6C4"
> samples.na = rownames(clinical.Xena)[is.na(clinical.Xena$OS)]
> clinical.biolink[samples.na, ]

image.png

我们看看官网的资料，是否与TCGAbiolinks下载的数据一致：

TCGA-GDC的官网资料

这个是缺失资料的更新。
我们接着看：

> clinical.biolink$tissue_or_organ_of_origin %>% table() %>% sort(decreasing = T)
.
                                       Tongue, NOS 
                                               130 
                                       Larynx, NOS 
                                               116 
Overlapping lesion of lip, oral cavity and pharynx 
                                                70 
                               Floor of mouth, NOS 
                                                54 
                                       Tonsil, NOS 
                                                46 
                               Base of tongue, NOS 
                                                24 
                                        Mouth, NOS 
                                                23 
                                      Cheek mucosa 
                                                19 
                                  Hypopharynx, NOS 
                                                 9 
                                   Oropharynx, NOS 
                                                 9 
                                          Gum, NOS 
                                                 8 
                                       Hard palate 
                                                 4 
                                          Lip, NOS 
                                                 3 
                           Anterior floor of mouth 
                                                 2 
                                         Lower gum 
                                                 2 
                                  Border of tongue 
                                                 1 
                                          Mandible 
                                                 1 
                                       Palate, NOS 
                                                 1 
                                      Pharynx, NOS 
                                                 1 
                      Posterior wall of oropharynx 
                                                 1 
                                   Retromolar area 
                                                 1 
                                      Supraglottis 
                                                 1 
                                         Upper Gum 
                                                 1 
                    Ventral surface of tongue, NOS 
                                                 1

Overlapping lesion of lip, oral cavity and pharynx项太长，重新命名：

> ind = clinical.biolink$tissue_or_organ_of_origin== 
+   'Overlapping lesion of lip, oral cavity and pharynx'
> clinical.biolink$tissue_or_organ_of_origin[ind] = 'Oral, lip and pharynx'
> clinical.Xena$anatomic_neoplasm_subdivision %>% table() %>% sort(decreasing = T)
.
   Oral Tongue         Larynx    Oral Cavity Floor of mouth         Tonsil 
           133            117             73             63             45 
Base of tongue  Buccal Mucosa Alveolar Ridge    Hypopharynx     Oropharynx 
            27             23             18             10              9 
   Hard Palate            Lip 
             7              3

> clinical.biolink$tissue_or_organ_of_origin %>% table() %>% sort(decreasing = T)
.
                   Tongue, NOS                    Larynx, NOS 
                           130                            116 
         Oral, lip and pharynx            Floor of mouth, NOS 
                            70                             54 
                   Tonsil, NOS            Base of tongue, NOS 
                            46                             24 
                    Mouth, NOS                   Cheek mucosa 
                            23                             19 
              Hypopharynx, NOS                Oropharynx, NOS 
                             9                              9 
                      Gum, NOS                    Hard palate 
                             8                              4 
                      Lip, NOS        Anterior floor of mouth 
                             3                              2 
                     Lower gum               Border of tongue 
                             2                              1 
                      Mandible                    Palate, NOS 
                             1                              1 
                  Pharynx, NOS   Posterior wall of oropharynx 
                             1                              1 
               Retromolar area                   Supraglottis 
                             1                              1 
                     Upper Gum Ventral surface of tongue, NOS 
                             1                              1

> clinical.Xena$anatomic_neoplasm_subdivision %>% table() %>% sort(decreasing = T)
.
   Oral Tongue         Larynx    Oral Cavity Floor of mouth         Tonsil 
           133            117             73             63             45 
Base of tongue  Buccal Mucosa Alveolar Ridge    Hypopharynx     Oropharynx 
            27             23             18             10              9 
   Hard Palate            Lip 
             7              3

我们可以清晰的看出，很多不一样的。看看具体什么不一样：

clinical.biolink = clinical.biolink[rownames(clinical.Xena),]#确保行一致
clinical.merge = data.frame(
  subdivision.Xena = clinical.Xena$anatomic_neoplasm_subdivision,
  subdivision.biolink = clinical.biolink$tissue_or_organ_of_origin,
  row.names = rownames(clinical.Xena))

image.png

我们以舌癌（tongue cancer）为例，找出不一致的样本：

> tongue.Xena = grepl('[Tt]ongue', clinical.merge$subdivision.Xena)
> tongue.biolink = grepl('[Tt]ongue', clinical.merge$subdivision.biolink)
> tongue.Xena %>% table()
.
FALSE  TRUE 
  368   160 
> tongue.biolink %>% table()
.
FALSE  TRUE 
  372   156

TCGAbiolink中少了4例。
具体看看：

> clinical.merge[tongue.Xena != tongue.biolink, ]
             subdivision.Xena   subdivision.biolink
TCGA-BA-5557      Oral Tongue Oral, lip and pharynx
TCGA-BA-A4IG   Base of tongue       Oropharynx, NOS
TCGA-BB-7872      Oral Tongue            Mouth, NOS
TCGA-CN-6016   Floor of mouth           Tongue, NOS
TCGA-CN-A642   Floor of mouth           Tongue, NOS
TCGA-CQ-6218   Floor of mouth           Tongue, NOS
TCGA-CQ-6225      Oral Tongue Oral, lip and pharynx
TCGA-CQ-7067      Oral Tongue            Mouth, NOS
TCGA-CQ-7072   Floor of mouth           Tongue, NOS
TCGA-D6-A6EM      Oral Tongue            Mouth, NOS
TCGA-MZ-A6I9   Base of tongue       Oropharynx, NOS
TCGA-T2-A6WX      Oral Tongue   Floor of mouth, NOS

528例中有12例的舌癌诊断记录是有差别的。我们看看最新官网的：

TCGA-BA-5557

TCGA-BA-A4IG

TCGA-CQ-7072
我们发现TCGAbiolinks下载的数据与官网的数据是一致的。
再来看看数据库介绍的文档：
UCSC-Xena

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.

Data on this hub is derived from files download from TCGA Data Coordinating Center (DCC) in Jan 2016, and Broad Firehose(analyses 2016-01-28).

bioconductor中的TCGAbiolinks

TCGAbiolinks: An R/Bioconductor package for integrative analysis with GDC data
Bioconductor version: Release (3.8)

The aim of TCGAbiolinks is : i) facilitate the GDC open-access data retrieval, ii) prepare the data using the appropriate pre-processing strategies, iii) provide the means to carry out different standard analyses and iv) to easily reproduce earlier research results. In more detail, the package provides multiple methods for analysis (e.g., differential expression analysis, identifying differentially methylated regions) and methods for visualization (e.g., survival plots, volcano plots, starburst plots) in order to easily develop complete analysis pipelines.

可以看出UCSC中的数据是来自2016年的TCGA Data Coordinating Center (DCC)和Broad Firehose
而自2016年后，TCGA官网进行了多少次更新呢？

image.png

UCSC Xena和TCGAbiolinks的HNSC临床数据不
我们先加载数据，数据分别来自TCGAbiolinks（直接下载自TCGA-GDC）和UCSCXena，单看临床数据...
聊UCSC xena的数据下载问题
作者：白介素2 UCSC xena数据存储中心总览 UCSC xena的数据存储仓库主要包括的数据有以下这些： ...
「r<-包」UCSCXenaTools v1.2.7
今天在推特上看到 UCSC Xena 更新了关于 GDC 的数据以及收到 B.C 君关于 UCSC Xena 支...
UCSC xena数据下载教程
UCSC xena[http://xena.ucsc.edu/#overview] 点击进去界面选择DATA S...
Xena - TCGA数据下载
1，Xena官网浏览器中输入网址 http://xena.ucsc.edu/[http://xena.ucsc.e...
利用UCSC Xena做TCGA数据库的生存曲线分析
UCSC的分析入口打开UCSC Xena网站：UCSC Xena。（网页速度可能不太稳定）点击界面的GET ST...
USCS Xena
数据下载专题 | UCSC UCSC Xena是由加州大学圣克鲁兹分校(University Of Cingifo...
单基因TCGA的Cox森林图怎么画才好看
首先调取Xena网页的TCGA数据做Cox生存分析，数据可以通过hiplot官网自主研发的ucsc-xena-sh...
TCGA-RNAseq数据重分析 — step0 数据下载
数据下载依然在UCSC Xena 数据库，下载方法参考：TCGA-甲基化数据重分析 — step0 UCSC Xe...
【r<-包|数据集|公开数据库】UCSCXenaTools
XenaR包提供了一个简单的UCSC Xena接口，可以获取一些UCSC Xena存储的信息，包括GDC、TCGA...