美文网首页R语言做生信R语言与科研
UCSC Xena和TCGAbiolinks的HNSC临床数据不

UCSC Xena和TCGAbiolinks的HNSC临床数据不

作者: Stone_Stan4d | 来源:发表于2019-01-12 16:13 被阅读37次

    我们先加载数据,数据分别来自TCGAbiolinks(直接下载自TCGA-GDC)和UCSCXena,单看临床数据。

    library(TCGAbiolinks)
    library(dplyr)
    library(DT)
    library(stringr)
    library(SummarizedExperiment)
    rm(list = ls())
    load('E:/TCGA/TCGAbiolinks/rdata/step1-HNSC-clinical-rnaseq.rdata')
    load('E:/TCGA/UCSCXena/Project/Data/HNSC/meth.HNSC.rdata')
    

    其中Xena的数据重复计算了癌旁样本。

    > dim(clinical.biolink)
    [1] 528  42
    > dim(clinical.Xena)
    [1] 604 141
    > 
    
    Xena
    Biolinks

    我们把clinical.Xena癌旁样本剔除掉:

    > clinical.Xena$sampleID
      [1] "TCGA-4P-AA8J-01" "TCGA-BA-4074-01" "TCGA-BA-4075-01" "TCGA-BA-4076-01" "TCGA-BA-4077-01" "TCGA-BA-4078-01"
      [7] "TCGA-BA-5149-01" "TCGA-BA-5151-01" "TCGA-BA-5152-01" "TCGA-BA-5153-01" "TCGA-BA-5555-01" "TCGA-BA-5556-01"
     [13] "TCGA-BA-5557-01" "TCGA-BA-5558-01" "TCGA-BA-5559-01" "TCGA-BA-6868-01" "TCGA-BA-6869-01" "TCGA-BA-6870-01"
    

    其中14-15位01的是原发肿瘤样本,06是复发癌,11是癌旁。我们做一个统计看看:

    > clinical.Xena$sampleID %>% word(sep = '-', 4) %>% table()
    .
     01  06  11 
    528   2  74 
    > ind = (clinical.Xena$sampleID %>% word(sep = '-', 4)) == '01'
    > ind %>% table()
    .
    FALSE  TRUE 
       76   528
    > clinical.Xena = clinical.Xena[ind, ]
    

    528例原发肿瘤样本,和2例复发肿瘤,74例癌旁正常样本。
    对行名进行赋值:

    > rownames(clinical.biolink) = clinical.biolink$submitter_id
    > rownames(clinical.Xena) = clinical.Xena$sampleID
    > rownames(clinical.Xena)[1:5]
    [1] "TCGA-4P-AA8J-01" "TCGA-BA-4074-01" "TCGA-BA-4075-01"
    [4] "TCGA-BA-4076-01" "TCGA-BA-4077-01"
    > rownames(clinical.biolink)[1:5]
    [1] "TCGA-4P-AA8J" "TCGA-BA-4074" "TCGA-BA-4075" "TCGA-BA-4076"
    [5] "TCGA-BA-4077"
    > rownames(clinical.biolink) %>% 
      substr(., 1, 12) -> rownames(clinical.biolink)
    > rownames(clinical.Xena)[1:5]
    [1] "TCGA-4P-AA8J" "TCGA-BA-4074" "TCGA-BA-4075" "TCGA-BA-4076"
    [5] "TCGA-BA-4077"
    

    貌似行名一致。。。。
    我们看看生存事件:

    > clinical.biolink$vital_status %>% table()
    .
    alive  dead 
      304   224 
    > clinical.Xena$OS %>% table()
    .
      0   1 
    302 223 
    

    Xena的数据死亡发生例数少1个,存活例数少2个。总数少3个(=528-302-223),仔细看是NA值:


    Xena中的OS事件有3个NA值

    我们把他们的样本名调出来, 看看Biolinks的数据是怎样的:

    > rownames(clinical.Xena)[is.na(clinical.Xena$OS)]
    [1] "TCGA-BB-4224" "TCGA-CQ-A4CA" "TCGA-H7-A6C4"
    > samples.na = rownames(clinical.Xena)[is.na(clinical.Xena$OS)]
    > clinical.biolink[samples.na, ]
    
    image.png

    我们看看官网的资料,是否与TCGAbiolinks下载的数据一致:


    TCGA-GDC的官网资料

    这个是缺失资料的更新。
    我们接着看:

    > clinical.biolink$tissue_or_organ_of_origin %>% table() %>% sort(decreasing = T)
    .
                                           Tongue, NOS 
                                                   130 
                                           Larynx, NOS 
                                                   116 
    Overlapping lesion of lip, oral cavity and pharynx 
                                                    70 
                                   Floor of mouth, NOS 
                                                    54 
                                           Tonsil, NOS 
                                                    46 
                                   Base of tongue, NOS 
                                                    24 
                                            Mouth, NOS 
                                                    23 
                                          Cheek mucosa 
                                                    19 
                                      Hypopharynx, NOS 
                                                     9 
                                       Oropharynx, NOS 
                                                     9 
                                              Gum, NOS 
                                                     8 
                                           Hard palate 
                                                     4 
                                              Lip, NOS 
                                                     3 
                               Anterior floor of mouth 
                                                     2 
                                             Lower gum 
                                                     2 
                                      Border of tongue 
                                                     1 
                                              Mandible 
                                                     1 
                                           Palate, NOS 
                                                     1 
                                          Pharynx, NOS 
                                                     1 
                          Posterior wall of oropharynx 
                                                     1 
                                       Retromolar area 
                                                     1 
                                          Supraglottis 
                                                     1 
                                             Upper Gum 
                                                     1 
                        Ventral surface of tongue, NOS 
                                                     1 
    

    Overlapping lesion of lip, oral cavity and pharynx项太长,重新命名:

    > ind = clinical.biolink$tissue_or_organ_of_origin== 
    +   'Overlapping lesion of lip, oral cavity and pharynx'
    > clinical.biolink$tissue_or_organ_of_origin[ind] = 'Oral, lip and pharynx'
    > clinical.Xena$anatomic_neoplasm_subdivision %>% table() %>% sort(decreasing = T)
    .
       Oral Tongue         Larynx    Oral Cavity Floor of mouth         Tonsil 
               133            117             73             63             45 
    Base of tongue  Buccal Mucosa Alveolar Ridge    Hypopharynx     Oropharynx 
                27             23             18             10              9 
       Hard Palate            Lip 
                 7              3
    
    > clinical.biolink$tissue_or_organ_of_origin %>% table() %>% sort(decreasing = T)
    .
                       Tongue, NOS                    Larynx, NOS 
                               130                            116 
             Oral, lip and pharynx            Floor of mouth, NOS 
                                70                             54 
                       Tonsil, NOS            Base of tongue, NOS 
                                46                             24 
                        Mouth, NOS                   Cheek mucosa 
                                23                             19 
                  Hypopharynx, NOS                Oropharynx, NOS 
                                 9                              9 
                          Gum, NOS                    Hard palate 
                                 8                              4 
                          Lip, NOS        Anterior floor of mouth 
                                 3                              2 
                         Lower gum               Border of tongue 
                                 2                              1 
                          Mandible                    Palate, NOS 
                                 1                              1 
                      Pharynx, NOS   Posterior wall of oropharynx 
                                 1                              1 
                   Retromolar area                   Supraglottis 
                                 1                              1 
                         Upper Gum Ventral surface of tongue, NOS 
                                 1                              1 
    
    > clinical.Xena$anatomic_neoplasm_subdivision %>% table() %>% sort(decreasing = T)
    .
       Oral Tongue         Larynx    Oral Cavity Floor of mouth         Tonsil 
               133            117             73             63             45 
    Base of tongue  Buccal Mucosa Alveolar Ridge    Hypopharynx     Oropharynx 
                27             23             18             10              9 
       Hard Palate            Lip 
                 7              3 
    

    我们可以清晰的看出,很多不一样的。看看具体什么不一样:

    clinical.biolink = clinical.biolink[rownames(clinical.Xena),]#确保行一致
    clinical.merge = data.frame(
      subdivision.Xena = clinical.Xena$anatomic_neoplasm_subdivision,
      subdivision.biolink = clinical.biolink$tissue_or_organ_of_origin,
      row.names = rownames(clinical.Xena))
    
    image.png

    我们以舌癌(tongue cancer)为例,找出不一致的样本:

    > tongue.Xena = grepl('[Tt]ongue', clinical.merge$subdivision.Xena)
    > tongue.biolink = grepl('[Tt]ongue', clinical.merge$subdivision.biolink)
    > tongue.Xena %>% table()
    .
    FALSE  TRUE 
      368   160 
    > tongue.biolink %>% table()
    .
    FALSE  TRUE 
      372   156 
    

    TCGAbiolink中少了4例。
    具体看看:

    > clinical.merge[tongue.Xena != tongue.biolink, ]
                 subdivision.Xena   subdivision.biolink
    TCGA-BA-5557      Oral Tongue Oral, lip and pharynx
    TCGA-BA-A4IG   Base of tongue       Oropharynx, NOS
    TCGA-BB-7872      Oral Tongue            Mouth, NOS
    TCGA-CN-6016   Floor of mouth           Tongue, NOS
    TCGA-CN-A642   Floor of mouth           Tongue, NOS
    TCGA-CQ-6218   Floor of mouth           Tongue, NOS
    TCGA-CQ-6225      Oral Tongue Oral, lip and pharynx
    TCGA-CQ-7067      Oral Tongue            Mouth, NOS
    TCGA-CQ-7072   Floor of mouth           Tongue, NOS
    TCGA-D6-A6EM      Oral Tongue            Mouth, NOS
    TCGA-MZ-A6I9   Base of tongue       Oropharynx, NOS
    TCGA-T2-A6WX      Oral Tongue   Floor of mouth, NOS
    

    528例中有12例的舌癌诊断记录是有差别的。我们看看最新官网的:

    TCGA-BA-5557
    TCGA-BA-A4IG
    TCGA-CQ-7072
    我们发现TCGAbiolinks下载的数据与官网的数据是一致的。
    再来看看数据库介绍的文档:
    UCSC-Xena

    The Cancer Genome Atlas

    The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.

    Data on this hub is derived from files download from TCGA Data Coordinating Center (DCC) in Jan 2016, and Broad Firehose(analyses 2016-01-28).

    bioconductor中的TCGAbiolinks

    TCGAbiolinks: An R/Bioconductor package for integrative analysis with GDC data
    Bioconductor version: Release (3.8)

    The aim of TCGAbiolinks is : i) facilitate the GDC open-access data retrieval, ii) prepare the data using the appropriate pre-processing strategies, iii) provide the means to carry out different standard analyses and iv) to easily reproduce earlier research results. In more detail, the package provides multiple methods for analysis (e.g., differential expression analysis, identifying differentially methylated regions) and methods for visualization (e.g., survival plots, volcano plots, starburst plots) in order to easily develop complete analysis pipelines.

    可以看出UCSC中的数据是来自2016年的TCGA Data Coordinating Center (DCC)和Broad Firehose
    而自2016年后,TCGA官网进行了多少次更新呢?

    image.png

    相关文章

      网友评论

        本文标题:UCSC Xena和TCGAbiolinks的HNSC临床数据不

        本文链接:https://www.haomeiwen.com/subject/ehtcdqtx.html