[R] The TCGAbiolinks package: data preparation -- query, dow

Author: 小贝学生信 | Published: 2021-08-19 21:32

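    TCGAbiolinks is an R package for one-stop analysis of TCGA data: it integrates the whole workflow of downloading, analyzing, and visualizing TCGA data. This series of notes follows the TCGAbiolinks documentation to relearn the TCGA data-mining workflow.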

    I. Finding the TCGA data of interest

    • GDCquery()
    GDCquery(
      project,
      data.category,
      data.type,
      workflow.type,
      legacy = FALSE,
      access,
      platform,
      file.type,
      barcode,
      data.format,
      experimental.strategy,
      sample.type
    )
    

    1. Available parameters

    1.1 By tumor type

    • project: one or more TCGA project names of interest
    • As the code below shows, there are 33 TCGA cancer types in total
    projects = TCGAbiolinks::getGDCprojects()$project_id
    TCGAs = grep("TCGA", projects, value = TRUE)
    sort(TCGAs)
    # [1] "TCGA-ACC"  "TCGA-BLCA" "TCGA-BRCA" "TCGA-CESC" "TCGA-CHOL" "TCGA-COAD"
    # [7] "TCGA-DLBC" "TCGA-ESCA" "TCGA-GBM"  "TCGA-HNSC" "TCGA-KICH" "TCGA-KIRC"
    # [13] "TCGA-KIRP" "TCGA-LAML" "TCGA-LGG"  "TCGA-LIHC" "TCGA-LUAD" "TCGA-LUSC"
    # [19] "TCGA-MESO" "TCGA-OV"   "TCGA-PAAD" "TCGA-PCPG" "TCGA-PRAD" "TCGA-READ"
    # [25] "TCGA-SARC" "TCGA-SKCM" "TCGA-STAD" "TCGA-TGCT" "TCGA-THCA" "TCGA-THYM"
    # [31] "TCGA-UCEC" "TCGA-UCS"  "TCGA-UVM" 
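    getGDCprojects() returns every GDC project, not only the TCGA ones, which is why the snippet above filters with grep(). Below is a minimal offline sketch of that filtering step. The project_ids vector is a hand-written stand-in for the real return value, and anchoring the pattern with ^TCGA- is my own tightening, not something the original code does.

```r
# Stand-in for getGDCprojects()$project_id (illustrative values only)
project_ids <- c("TCGA-BRCA", "TARGET-AML", "TCGA-ACC", "CPTAC-3", "TCGA-LUAD")

# Keep only IDs that start with "TCGA-", then sort them
tcga_ids <- sort(grep("^TCGA-", project_ids, value = TRUE))

# Strip the prefix to get the study abbreviations used in the table below
abbrev <- sub("^TCGA-", "", tcga_ids)
print(abbrev)
```

    The abbreviations obtained this way are the keys of the study-name table that follows.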
    
    Study Abbreviation | Study Name
    ACC | Adrenocortical carcinoma
    BLCA | Bladder Urothelial Carcinoma
    BRCA | Breast invasive carcinoma
    CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma
    CHOL | Cholangiocarcinoma
    COAD | Colon adenocarcinoma
    DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
    ESCA | Esophageal carcinoma
    GBM | Glioblastoma multiforme
    HNSC | Head and Neck squamous cell carcinoma
    KICH | Kidney Chromophobe
    KIRC | Kidney renal clear cell carcinoma
    KIRP | Kidney renal papillary cell carcinoma
    LAML | Acute Myeloid Leukemia
    LGG | Brain Lower Grade Glioma
    LIHC | Liver hepatocellular carcinoma
    LUAD | Lung adenocarcinoma
    LUSC | Lung squamous cell carcinoma
    MESO | Mesothelioma
    OV | Ovarian serous cystadenocarcinoma
    PAAD | Pancreatic adenocarcinoma
    PCPG | Pheochromocytoma and Paraganglioma
    PRAD | Prostate adenocarcinoma
    READ | Rectum adenocarcinoma
    SARC | Sarcoma
    SKCM | Skin Cutaneous Melanoma
    STAD | Stomach adenocarcinoma
    TGCT | Testicular Germ Cell Tumors
    THCA | Thyroid carcinoma
    THYM | Thymoma
    UCEC | Uterine Corpus Endometrial Carcinoma
    UCS | Uterine Carcinosarcoma
    UVM | Uveal Melanoma

    1.2 hg19/hg38

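    • There are two data sets, distinguished mainly by reference genome: the GDC Legacy Archive (mostly GRCh37, hg19) and the GDC harmonized database (GRCh38, hg38).
    • The legacy parameter selects between them: the default, FALSE, queries the harmonized database (hg38); TRUE queries the Legacy Archive (hg19).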
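    The two databases also use different category names and filters, as the query examples in section 2 show. A small sketch of how the arguments differ for RNA-seq gene expression: rnaseq_query_args() is a hypothetical helper of mine, not a TCGAbiolinks function, and the values are copied from the queries used later in this post.

```r
# Hypothetical helper (not part of TCGAbiolinks): collect GDCquery() arguments
# for RNA-seq gene expression under either reference genome.
rnaseq_query_args <- function(project, genome = c("hg38", "hg19")) {
  genome <- match.arg(genome)
  if (genome == "hg19") {
    # Legacy Archive: legacy = TRUE, filtered by platform and file.type
    list(project = project,
         legacy = TRUE,
         data.category = "Gene expression",
         data.type = "Gene expression quantification",
         platform = "Illumina HiSeq",
         file.type = "normalized_results")
  } else {
    # Harmonized database: legacy = FALSE, filtered by workflow.type
    list(project = project,
         legacy = FALSE,
         data.category = "Transcriptome Profiling",
         data.type = "Gene Expression Quantification",
         workflow.type = "HTSeq - Counts")
  }
}

args19 <- rnaseq_query_args("TCGA-CHOL", "hg19")
args38 <- rnaseq_query_args("TCGA-LUAD", "hg38")
str(args38)

# With TCGAbiolinks attached and network access, the list could be passed on:
# query <- do.call(GDCquery, args38)
```

    Building the argument list separately keeps the hg19/hg38 decision in one place; whether that is worth a helper depends on how many queries you write.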

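    1.3 Data type to download

    Building on the parameters above, the following parameters specify the target data type.

    • data.category = which category of data to download, e.g. omics data, clinical data, ...
    # List the data categories available for one tumor type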
    TCGAbiolinks:::getProjectSummary("TCGA-BRCA")$data_categories
    #   file_count case_count               data_category
    # 1       4679       1098            Sequencing Reads
    # 2       1183       1098                    Clinical
    # 3       6627       1098       Copy Number Variation
    # 4       5315       1098                 Biospecimen
    # 5       1234       1095             DNA Methylation
    # 6       6080       1097     Transcriptome Profiling
    # 7       8648       1044 Simple Nucleotide Variation
    
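    • data.type = a finer-grained data type within the category (optional)
    • workflow.type = the processing pipeline; the same sequencing data may have been processed by different pipelines (optional, for harmonized)
    • platform = sequencing platform (optional)
    • file.type = the specific kind of data file (optional, for legacy)
      If you do not know these values for your target data, refer to the overview below.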
    GDC harmonized database
    Data.category | Data.type | Workflow.Type | Platform
    Transcriptome Profiling | Gene Expression Quantification | HTSeq - Counts |
    Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM |
    Transcriptome Profiling | Gene Expression Quantification | HTSeq - FPKM-UQ |
    Transcriptome Profiling | Gene Expression Quantification | STAR - Counts |
    Transcriptome Profiling | Isoform Expression Quantification | - |
    Transcriptome Profiling | miRNA Expression Quantification | - |
    Transcriptome Profiling | Splice Junction Quantification | |
    Copy number variation | Copy Number Segment | |
    Copy number variation | Masked Copy Number Segment | |
    Copy number variation | Gene Level Copy Number Scores | |
    Simple Nucleotide Variation | Masked Somatic Mutation | MuSE Variant Aggregation and Masking |
    Simple Nucleotide Variation | Masked Somatic Mutation | MuTect2 Variant Aggregation and Masking |
    Simple Nucleotide Variation | Masked Somatic Mutation | SomaticSniper Variant Aggregation and Masking |
    Simple Nucleotide Variation | Masked Somatic Mutation | VarScan2 Variant Aggregation and Masking |
    Raw Sequencing Data | - | |
    Biospecimen | Slide Image | |
    Biospecimen | Biospecimen Supplement | |
    Clinical | - | |
    DNA Methylation | Methylation Beta Value | | Illumina Human Methylation 450
    DNA Methylation | Methylation Beta Value | | Illumina Human Methylation 27
    GDC Legacy Archive
    Data.category | Data.type | Platform | file.type
    Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg18.seg
    Copy number variation | - | Affymetrix SNP Array 6.0 | hg18.seg
    Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg19.seg
    Copy number variation | - | Affymetrix SNP Array 6.0 | hg19.seg
    Copy number variation | - | Illumina HiSeq | -
    Simple nucleotide variation | Simple somatic mutation | |
    Raw sequencing data | | |
    Biospecimen | | |
    Clinical | | |
    Protein expression | | MDA RPPA Core | -
    Gene expression | Gene expression quantification | Illumina HiSeq | normalized_results
    Gene expression | Gene expression quantification | Illumina HiSeq | results
    Gene expression | Gene expression quantification | HT_HG-U133A | -
    Gene expression | Gene expression quantification | AgilentG4502A_07_2 | -
    Gene expression | Gene expression quantification | AgilentG4502A_07_1 | -
    Gene expression | Gene expression quantification | HuEx-1_0-st-v2 | FIRMA.txt
    Gene expression | Gene expression quantification | | gene.txt
    Gene expression | Isoform expression quantification | - | -
    Gene expression | miRNA gene quantification | - | hg19.mirna
    Gene expression | miRNA gene quantification | | hg19.mirbase20
    Gene expression | miRNA gene quantification | | mirna
    Gene expression | Exon junction quantification | - | -
    Gene expression | Exon quantification | - | -
    Gene expression | miRNA isoform quantification | - | hg19.isoform
    Gene expression | miRNA isoform quantification | - | isoform
    DNA methylation | | Illumina Human Methylation 450 | Not used
    DNA methylation | | Illumina Human Methylation 27 | Not used
    DNA methylation | | Illumina DNA Methylation OMA003 CPI | Not used
    DNA methylation | | Illumina DNA Methylation OMA002 CPI | Not used
    DNA methylation | | Illumina Hi Seq |
    DNA methylation | Bisulfite sequence alignment | |
    DNA methylation | Methylation percentage | |
    DNA methylation | Aligned reads | |
    Raw microarray data | Raw intensities | Illumina Human Methylation 450 | idat
    Raw Microarray Data | Raw intensities | Illumina Human Methylation 27 | idat
    Structural Rearrangement | | |
    Other | | |

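    1.4 Sample barcode

    Full barcode: e.g. TCGA-G4-6317-02A-11D-2064-05. The label encodes everything from the patient of origin through sequencing and analysis. The most important parts are Participant, Sample, and Portion, which give the patient ID, the sample type, and the sequencing (analyte) type respectively.
    Patient ID: e.g. TCGA-G4-6317
    Sample ID: e.g. TCGA-G4-6317-02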
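    The barcode is just a hyphen-separated string, so its fields can be pulled apart with base R. A minimal sketch: parse_barcode() and its field names are my own for illustration, not a TCGAbiolinks function, and it assumes a full seven-part barcode like the one above.

```r
# Hypothetical helper: split a full TCGA barcode into named fields (base R only)
parse_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  list(
    patient     = paste(parts[1:3], collapse = "-"),           # project-TSS-participant
    sample      = paste(c(parts[1:3], substr(parts[4], 1, 2)),
                        collapse = "-"),                       # patient + sample code
    sample_code = substr(parts[4], 1, 2),                      # two-digit sample type
    vial        = substr(parts[4], 3, 3),                      # vial letter
    portion     = substr(parts[5], 1, 2),                      # portion number
    analyte     = substr(parts[5], 3, 3),                      # analyte letter
    plate       = parts[6],
    center      = parts[7]
  )
}

b <- parse_barcode("TCGA-G4-6317-02A-11D-2064-05")
str(b)
```

    For this barcode the patient field is TCGA-G4-6317 and the sample field is TCGA-G4-6317-02, matching the two shortened IDs listed above.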

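    • The most important field is the two-digit Sample code, which gives the sample type and is the grouping variable for downstream differential analysis. For example, 01 denotes the patient's primary tumor tissue and 11 denotes normal tissue from the patient.

    • With that in mind, the sample.type parameter can restrict a query to the sample types of interest, e.g. sample.type = "Primary Tumor".

    • Given a set of TCGA barcodes, TCGAquery_SampleTypes() extracts the samples of a target group, and TCGAquery_MatchedCoupledSampleTypes() extracts paired samples from the same patient.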

    query <- GDCquery(project = c("TCGA-BRCA"),
                      legacy = FALSE, #default(GDC harmonized database)
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts")
    dim(getResults(query))
    #[1] 1222   29
    query_info = getResults(query)
    TP = TCGAquery_SampleTypes(query_info$sample.submitter_id,"TP")
    NT = TCGAquery_SampleTypes(query_info$sample.submitter_id,"NT")
    query <- GDCquery(project = c("TCGA-BRCA"),
                      legacy = FALSE, #default(GDC harmonized database)
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      barcode = c(TP, NT))
    dim(getResults(query))
    #[1] 1215   29
    
    Pair_sample = TCGAquery_MatchedCoupledSampleTypes(query_info$sample.submitter_id,c("NT","TP"))
    query <- GDCquery(project = c("TCGA-BRCA"),
                      legacy = FALSE, #default(GDC harmonized database)
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      barcode = Pair_sample)
    dim(getResults(query))
    #[1] 229  29
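    To make the selection logic concrete, here is a base-R imitation of what the two helper functions achieve: pick tumor and normal samples by the two-digit code, then keep only patients that have both. This is a sketch of the idea, not the package's implementation, and the barcodes are invented.

```r
# Invented sample-level barcodes (not real TCGA samples)
barcodes <- c("TCGA-AA-0001-01A", "TCGA-AA-0001-11A",
              "TCGA-BB-0002-01A", "TCGA-CC-0003-11A",
              "TCGA-DD-0004-06A")

code    <- substr(barcodes, 14, 15)  # two-digit sample type code
patient <- substr(barcodes, 1, 12)   # patient ID

tp <- barcodes[code == "01"]  # primary tumor
nt <- barcodes[code == "11"]  # normal tissue

# Patients that contribute both a tumor and a normal sample
paired_patients <- intersect(patient[code == "01"], patient[code == "11"])
paired <- barcodes[patient %in% paired_patients & code %in% c("01", "11")]
print(paired)
```

    Only the first patient has both sample types here, so paired holds that patient's two barcodes.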
    

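    These are the most common criteria for querying TCGA data. A few parameters were not covered; see the function's help page. Set the parameters above flexibly according to your goal.

    2. Query examples

    2.1 Cholangiocarcinoma transcriptome data | hg19 | all samples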

    TCGAbiolinks:::getProjectSummary("TCGA-CHOL",legacy = TRUE)$data_categories
    #   file_count case_count               data_category
    # 1         30         30          Protein expression
    # 2        680         36       Copy number variation
    # 3         51         51                 Biospecimen
    # 4        444         36 Simple nucleotide variation
    # 5        450         36             Gene expression
    # 6        686         36         Raw microarray data
    # 7         45         36             DNA methylation
    # 8        193         51                    Clinical
    # 9        365         51         Raw sequencing data
    query <- GDCquery(project = "TCGA-CHOL",
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type  = "normalized_results")
    dim(getResults(query))
    #[1] 45 32
    t(getResults(query)[1,])
    #                       1                                                                                   
    # id                    "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
    # data_format           "TXT"                                                                               
    # access                "open"                                                                              
    # cases                 "TCGA-3X-AAV9-01A-72R-A41I-07"                                                      
    # file_name             "unc.edu.59012a78-0e8f-4b99-af97-0dbb1d3d0513.2538862.rsem.genes.normalized_results"
    # submitter_id          NA                                                                                  
    # data_category         "Gene expression"                                                                   
    # type                  "file"                                                                              
    # file_size             437196                                                                              
    # platform              "Illumina HiSeq"                                                                    
    # state_comment         NA                                                                                  
    # tags                  character,3                                                                         
    # updated_datetime      "2017-03-05T10:11:44.298823-06:00"                                                  
    # md5sum                "23836c9f9bdb053c567d91a67b62159d"                                                  
    # file_id               "34216957-50e3-434c-8c38-72f0f2ddcf16"                                              
    # data_type             "Gene expression quantification"                                                    
    # state                 "live"                                                                              
    # experimental_strategy "RNA-Seq"                                                                           
    # file_state            "submitted"                                                                         
    # version               "1"                                                                                 
    # data_release          "0.0 - 29.0"                                                                        
    # project               "TCGA-CHOL"                                                                         
    # center_id             "ee7a85b3-8177-5d60-a10c-51180eb9009c"                                              
    # center_center_type    "CGCC"                                                                              
    # center_code           "07"                                                                                
    # center_name           "University of North Carolina"                                                      
    # center_namespace      "unc.edu"                                                                           
    # center_short_name     "UNC"                                                                               
    # sample_type           "Primary Tumor"                                                                     
    # is_ffpe               FALSE                                                                               
    # cases.submitter_id    "TCGA-3X-AAV9"                                                                      
    # sample.submitter_id   "TCGA-3X-AAV9-01A"
    
    

    2.2 Lung adenocarcinoma transcriptome data | hg38 | primary tumor + normal tissue

    TCGAbiolinks:::getProjectSummary("TCGA-LUAD",legacy = FALSE)$data_categories
    # 4       2916        519     Transcriptome Profiling
    query <- GDCquery(project = "TCGA-LUAD",
                      legacy = FALSE,
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts")
    dim(getResults(query))
    #[1] 594  29
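    The query above does not restrict the sample type, so its results cover every sample type in the project. One way to narrow them to primary tumor plus normal tissue afterwards is to filter the results table on its sample_type column. A sketch on a mock table: the rows are invented, and only the column names cases and sample_type are taken from the getResults() output shown in section 2.1.

```r
# Mock stand-in for getResults(query); rows are invented for illustration
results <- data.frame(
  cases = c("TCGA-AA-0001-01A-11R-0000-07",
            "TCGA-AA-0001-11A-11R-0000-07",
            "TCGA-BB-0002-02A-11R-0000-07"),
  sample_type = c("Primary Tumor", "Solid Tissue Normal", "Recurrent Tumor"),
  stringsAsFactors = FALSE
)

# Keep primary tumor and normal tissue only
wanted   <- c("Primary Tumor", "Solid Tissue Normal")
selected <- results[results$sample_type %in% wanted, ]
print(table(selected$sample_type))

# selected$cases could then be passed to the barcode argument of GDCquery()
```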
    

    2.3 Breast cancer methylation data | hg19 | Illumina Human Methylation 450 platform

    TCGAbiolinks:::getProjectSummary("TCGA-BRCA",legacy = TRUE)$data_categories
    #7       1250       1097             DNA methylation
    query <- GDCquery(project = "TCGA-BRCA",
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")
    dim(getResults(query))
    #[1] 895  32
    

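    II. Downloading data for the chosen query

    • GDCdownload() is simple to use: pass it the query object from the previous step.
    • Two download methods are available, api and client. The former is faster but sometimes unstable; the latter is slower. The api method (the default) is recommended. For large downloads, set files.per.chunk = n to download in batches of n files, so that a mid-download error does not waste the work already done.
    • directory is the folder to download into; by default a GDCdata folder is created and used.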
    GDCdownload(
      query,
      token.file,
      method = "api",
      directory = "GDCdata",
      files.per.chunk = NULL
    )
    
    • Example
    query <- GDCquery(project = "TCGA-CHOL",
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type  = "normalized_results")
    GDCdownload(query, files.per.chunk = 10)
    # Downloading data for project TCGA-CHOL
    # GDCdownload will download 45 files. A total of 19.580796 MB
    # Downloading chunk 1 of 5 (10 files, size = 4.351703 MB) as Wed_Aug_18_21_52_08_2021_0.tar.gz
    # Downloading: 1.9 MB     Downloading chunk 2 of 5 (10 files, size = 4.350318 MB) as Wed_Aug_18_21_52_08_2021_1.tar.gz
    # Downloading: 1.8 MB     Downloading chunk 3 of 5 (10 files, size = 4.351067 MB) as Wed_Aug_18_21_52_08_2021_2.tar.gz
    # Downloading: 1.8 MB     Downloading chunk 4 of 5 (10 files, size = 4.353528 MB) as Wed_Aug_18_21_52_08_2021_3.tar.gz
    # Downloading: 1.9 MB     Downloading chunk 5 of 5 (5 files, size = 2.17418 MB) as Wed_Aug_18_21_52_08_2021_4.tar.gz
    # Downloading: 900 kB
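    The log shows 45 files split into five chunks, four of 10 files and one of 5. The arithmetic behind that split can be reproduced in base R; this illustrates the batching only and is not the package's internal code.

```r
# Reproduce the batching seen in the log: 45 files, 10 per chunk
n_files         <- 45
files_per_chunk <- 10

file_index <- seq_len(n_files)
# Files 1-10 go to chunk 1, 11-20 to chunk 2, and so on
chunks <- split(file_index, ceiling(file_index / files_per_chunk))

print(lengths(chunks))
```

    A smaller files.per.chunk means more requests but less to redo if one of them fails.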
    

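    III. Reading the downloaded files into the current session

    • GDCprepare() takes the query object and the directory where the data were downloaded (GDCdata by default), reads the data, and returns them as a SummarizedExperiment object.
    • Setting save = TRUE and save.filename = **** additionally saves the SummarizedExperiment object as an Rdata file after reading, for convenient reuse later (save defaults to FALSE).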
    query <- GDCquery(project = "TCGA-CHOL",
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type  = "normalized_results")
    GDCdownload(query, files.per.chunk = 10)
    data <- GDCprepare(query, save = TRUE, save.filename = "CHOL_RNAseq.rda")
    # -------------------
    #   oo Reading 45 files
    # -------------------
    #   |=================================================|100%                      Completed after 0 s 
    # -------------------
    #   oo Merging 45 files
    # -------------------
    #   Starting to add information to samples
    # => Add clinical information to samples
    # => Adding TCGA molecular information from marker papers
    # => Information will have prefix 'paper_' 
    # chol subtype information from:doi:10.1016/j.celrep.2017.02.033
    # => Saving file: CHOL_RNAseq.rda
    # => File saved
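    In a later session the saved file can be read back with load(). Since I am not certain which object name(s) the package stores inside the file, the sketch below loads into a separate environment and inspects the names first. It uses a small mock matrix and a temporary file, not the real CHOL_RNAseq.rda.

```r
# Mock object saved the same way an .rda file is written
rda_path <- file.path(tempdir(), "CHOL_RNAseq_demo.rda")
data <- matrix(1:6, nrow = 2,
               dimnames = list(c("A1BG", "A2M"), c("s1", "s2", "s3")))
save(data, file = rda_path)
rm(data)

# Load into a fresh environment so nothing in the workspace is overwritten;
# load() returns the names of the objects it restored
env <- new.env()
loaded_names <- load(rda_path, envir = env)
print(loaded_names)

restored <- env[[loaded_names[1]]]
print(dim(restored))
```

    Loading into its own environment avoids silently replacing an existing object called data.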
    
    
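    • While reading the data, GDCprepare() automatically annotates sample and gene information, although this is not yet supported for every data type.
    library(SummarizedExperiment)
    # Expression matrix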
    dim(assay(data))
    #[1] 19947    45
    assays(data)
    # List of length 1
    # names(1): normalized_count
    assay(data, "normalized_count")[1:4,1:4]
    #       TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R-A41I-07
    # A1BG                      70.9581                      29.9768                  108409.2249                    1485.0630
    # A2M                    23986.2548                    8129.6961                   98095.2358                    7119.1570
    # NAT1                      72.4007                      52.8682                     160.2275                      76.5504
    # NAT2                       8.7099                       0.0000                    1472.3868                      23.2558
    
    # Sample (clinical) information
    dim(colData(data))
    #[1]  45 205
    colData(data)[1:4,1:4]
    # DataFrame with 4 rows and 4 columns
    #                                         barcode      patient           sample shortLetterCode
    #                                         <character>  <character>      <character>     <character>
    # TCGA-3X-AAV9-01A-72R-A41I-07 TCGA-3X-AAV9-01A-72R.. TCGA-3X-AAV9 TCGA-3X-AAV9-01A              TP
    # TCGA-3X-AAVC-01A-21R-A41I-07 TCGA-3X-AAVC-01A-21R.. TCGA-3X-AAVC TCGA-3X-AAVC-01A              TP
    # TCGA-W5-AA2R-11A-11R-A41I-07 TCGA-W5-AA2R-11A-11R.. TCGA-W5-AA2R TCGA-W5-AA2R-11A              NT
    # TCGA-ZH-A8Y4-01A-11R-A41I-07 TCGA-ZH-A8Y4-01A-11R.. TCGA-ZH-A8Y4 TCGA-ZH-A8Y4-01A              TP
    
    # Different gene ID types
    dim(rowData(data))
    #[1] 19947     3
    rowData(data)[1:6,1:3]
    # DataFrame with 6 rows and 3 columns
    #                   gene_id entrezgene ensembl_gene_id
    #                   <character>  <integer>     <character>
    # A1BG                 A1BG          1 ENSG00000121410
    # A2M                   A2M          2 ENSG00000175899
    # NAT1                 NAT1          9 ENSG00000171428
    # NAT2                 NAT2         10 ENSG00000156006
    # RP11-986E7.7 RP11-986E7.7         12 ENSG00000273259
    # AADAC               AADAC         13 ENSG00000114771
    
    
    # Genomic coordinates of the genes
    rowRanges(data)
    # GRanges object with 19947 ranges and 3 metadata columns:
    #           seqnames              ranges strand |      gene_id entrezgene ensembl_gene_id
    #         <Rle>           <IRanges>  <Rle> |  <character>  <integer>     <character>
    # A1BG    chr19   58856544-58864865      - |         A1BG          1 ENSG00000121410
    # A2M    chr12     9220260-9268825      - |          A2M          2 ENSG00000175899
    # NAT1     chr8   18027986-18081198      + |         NAT1          9 ENSG00000171428
    # NAT2     chr8   18248755-18258728      + |         NAT2         10 ENSG00000156006
    # RP11-986E7.7    chr14   95058395-95090983      + | RP11-986E7.7         12 ENSG00000273259
    # ...      ...                 ...    ... .          ...        ...             ...
    # RASAL2-AS1     chr1 178060643-178063119      - |   RASAL2-AS1  100302401 ENSG00000224687
    # LINC00882     chr3 106555658-106959488      - |    LINC00882  100302640 ENSG00000242759
    # FTX     chrX   73183790-73513409      - |          FTX  100302692 ENSG00000230590
    # TICAM2     chr5 114914339-114961876      - |       TICAM2  100302736 ENSG00000243414
    # SLC25A5-AS1     chrX 118599997-118603061      - |  SLC25A5-AS1  100303728 ENSG00000224281
    # -------
    # seqinfo: 24 sequences from an unspecified genome; no seqlengths
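    Once the matrix and the sample table are extracted, grouping samples needs nothing beyond base R. A sketch with mock numbers: with the real object, expr would come from assay(data) and sample_info from colData(data). The gene and sample names below are invented, and the log2 mean difference is just one simple summary, not the analysis method of this series.

```r
# Mock expression matrix: genes in rows, samples in columns
expr <- matrix(
  c(3, 7, 15,
    0, 1, 31),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)

# Mock sample table; shortLetterCode mirrors the colData column shown above
sample_info <- data.frame(
  barcode = c("s1", "s2", "s3"),
  shortLetterCode = c("TP", "TP", "NT"),
  stringsAsFactors = FALSE
)

# The sample table must be in the same order as the matrix columns
stopifnot(identical(colnames(expr), sample_info$barcode))

log_expr <- log2(expr + 1)
tumor  <- log_expr[, sample_info$shortLetterCode == "TP", drop = FALSE]
normal <- log_expr[, sample_info$shortLetterCode == "NT", drop = FALSE]

# Per-gene difference of mean log2 expression, tumor minus normal
mean_diff <- rowMeans(tumor) - rowMeans(normal)
print(mean_diff)
```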
    
    

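    That is the whole workflow of finding, downloading, and reading data; the analysis itself can start from here.

    Supplement: patient clinical data and tumor subtypes

    1. Getting patient clinical data

    • As shown above, GDCprepare() automatically annotates samples with the patients' clinical information.
    • The clinical data for each patient can also be downloaded separately beforehand, for reference.
    Method 1: the GDCquery() pipeline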
    query <- GDCquery(project = "TCGA-BRCA", 
                      data.category = "Clinical",
                      data.type = "Clinical Supplement", 
                      data.format = "BCR Biotab")
    GDCdownload(query, files.per.chunk = 20)
    clinical.BCRtab.all <- GDCprepare(query)
    
    
    grep("clinical_", names(clinical.BCRtab.all), value = TRUE)
    # [1] "clinical_drug_brca"               "clinical_omf_v4.0_brca"          
    # [3] "clinical_follow_up_v4.0_brca"     "clinical_follow_up_v1.5_brca"    
    # [5] "clinical_follow_up_v4.0_nte_brca" "clinical_patient_brca"           
    # [7] "clinical_radiation_brca"          "clinical_nte_brca"               
    # [9] "clinical_follow_up_v2.1_brca" 
    clinical_patient_brca = as.data.frame(clinical.BCRtab.all$clinical_patient_brca)
    clinical_patient_brca[1:4,1:4]
    #                       bcr_patient_uuid bcr_patient_barcode form_completion_date                  prospective_collection
    # 1                     bcr_patient_uuid bcr_patient_barcode form_completion_date tissue_prospective_collection_indicator
    # 2                              CDE_ID:      CDE_ID:2003301              CDE_ID:                          CDE_ID:3088492
    # 3 6E7D5EC6-A469-467C-B748-237353C23416        TCGA-3C-AAAU            2014-1-13                                      NO
    # 4 55262FCB-1B01-4480-B322-36570430C917        TCGA-3C-AALI            2014-7-28                                      NO
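    In the output above, rows 1 and 2 repeat the column names and the CDE IDs, and real patients start at row 3. A sketch of dropping those rows by keeping only rows whose barcode starts with TCGA-. The mock table copies two columns from the output shown; treating the barcode prefix as the filter is my own choice.

```r
# Mock of the first rows of clinical_patient_brca (values copied from the output)
raw <- data.frame(
  bcr_patient_barcode  = c("bcr_patient_barcode", "CDE_ID:2003301",
                           "TCGA-3C-AAAU", "TCGA-3C-AALI"),
  form_completion_date = c("form_completion_date", "CDE_ID:",
                           "2014-1-13", "2014-7-28"),
  stringsAsFactors = FALSE
)

# Keep only rows that describe a patient
clean <- raw[grepl("^TCGA-", raw$bcr_patient_barcode), ]
rownames(clean) <- NULL

# With the header rows gone, the date column can be converted
clean$form_completion_date <- as.Date(clean$form_completion_date,
                                      format = "%Y-%m-%d")
print(clean)
```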
    
    Method 2: GDCquery_clinic()
    • According to the official documentation, this function downloads indexed clinical data: "a refined clinical data that is created using the XML files" (Method 1).
    • This method downloads faster and is the recommended first choice. If the information you need is missing, fall back to Method 1.
    clinical <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
    clinical[1:4,1:4]
    #   submitter_id synchronous_malignancy ajcc_pathologic_stage tumor_stage
    # 1 TCGA-E2-A14U                     No               Stage I     stage i
    # 2 TCGA-E9-A1RC                     No            Stage IIIC  stage iiic
    # 3 TCGA-D8-A1J9                     No              Stage IA    stage ia
    # 4 TCGA-E2-A14P                     No            Stage IIIC  stage iiic
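    The clinical table is keyed by patient (submitter_id), while expression data are keyed by sample barcode, so joining them means cutting the barcode down to its first 12 characters. A sketch with mock tables: the patient IDs and stages are copied from the output above, while the sample barcodes and the unmatched TCGA-ZZ-0000 patient are invented.

```r
# Mock sample table; the barcodes are invented for illustration
samples <- data.frame(
  barcode = c("TCGA-E2-A14U-01A", "TCGA-E2-A14U-11A",
              "TCGA-E9-A1RC-01A", "TCGA-ZZ-0000-01A"),
  stringsAsFactors = FALSE
)
# Patient ID = first 12 characters of the barcode
samples$submitter_id <- substr(samples$barcode, 1, 12)

# Mock clinical table with two of the columns shown above
clinical <- data.frame(
  submitter_id = c("TCGA-E2-A14U", "TCGA-E9-A1RC"),
  ajcc_pathologic_stage = c("Stage I", "Stage IIIC"),
  stringsAsFactors = FALSE
)

# Left join: keep every sample, fill NA where no clinical record exists
annotated <- merge(samples, clinical, by = "submitter_id", all.x = TRUE)
print(annotated)
```

    Using all.x = TRUE keeps samples without a clinical record visible as NA instead of dropping them silently.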
    

    2. Getting patient tumor subtypes

    • PanCancerAtlas_subtypes()
      The column "Subtype_Selected" was selected as the most prominent subtype classification (from the other columns).
    subtypes <- PanCancerAtlas_subtypes()
    dim(subtypes)
    #[1] 7734   10
    table(subtypes$cancer.type)
    # ACC  AML BLCA BRCA COAD ESCA  GBM HNSC KICH KIRC KIRP  LGG LIHC LUAD LUSC OVCA PCPG 
    # 91  187  129 1218  341  169  606  279   66  442  161  516  196  230  178  489  178 
    # PRAD READ SKCM STAD THCA UCEC  UCS 
    # 333  118  333  383  496  538   57
    head(as.data.frame(subtypes))
    #   pan.samplesID cancer.type                         Subtype_mRNA   Subtype_DNAmeth Subtype_protein Subtype_miRNA Subtype_CNA Subtype_Integrative Subtype_other      Subtype_Selected
    # 1  TCGA-OR-A5J1         ACC steroid-phenotype-high+proliferation         CIMP-high              NA       miRNA_1       Quiet                COC3           C1A         ACC.CIMP-high
    # 2  TCGA-OR-A5J2         ACC steroid-phenotype-high+proliferation          CIMP-low               1       miRNA_1       Noisy                COC3           C1A          ACC.CIMP-low
    # 3  TCGA-OR-A5J3         ACC               steroid-phenotype-high CIMP-intermediate               3       miRNA_6 Chromosomal                COC2           C1A ACC.CIMP-intermediate
    # 4  TCGA-OR-A5J4         ACC                                 <NA>         CIMP-high              NA       miRNA_6 Chromosomal                <NA>          <NA>         ACC.CIMP-high
    # 5  TCGA-OR-A5J5         ACC               steroid-phenotype-high CIMP-intermediate              NA       miRNA_2 Chromosomal                COC2           C1A ACC.CIMP-intermediate
    # 6  TCGA-OR-A5J6         ACC                steroid-phenotype-low          CIMP-low               2       miRNA_1       Noisy                COC1           C1B          ACC.CIMP-low
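    In the rows shown, Subtype_Selected has the form cancer type, a dot, then the subtype label. If only the label is wanted, it can be split off with sub(). This sketch assumes that pattern holds; I have only checked it against the ACC rows printed above, so other cancer types may need a different rule.

```r
# Values copied from the Subtype_Selected column shown above
subtype_selected <- c("ACC.CIMP-high", "ACC.CIMP-low", "ACC.CIMP-intermediate")

# Everything before the first dot is the cancer type
cancer <- sub("\\..*$", "", subtype_selected)
# Everything after the first dot is the subtype label
subtype <- sub("^[^.]*\\.", "", subtype_selected)

print(data.frame(cancer, subtype))
```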
    
    • TCGAquery_subtype()
      These subtypes will be automatically added in the summarizedExperiment object through GDCprepare. But you can also use the TCGAquery_subtype function to retrieve this information.
    brca.subtype <- TCGAquery_subtype(tumor = "brca")
    t(brca.subtype[1,])
    #                                     [,1]          
    # patient                             "TCGA-3C-AAAU"
    # Tumor.Type                          "BRCA"        
    # Included_in_previous_marker_papers  "NO"          
    # vital_status                        "Alive"       
    # days_to_birth                       "-20211"      
    # days_to_death                       "NA"          
    # days_to_last_followup               "4047"        
    # age_at_initial_pathologic_diagnosis "55"          
    # pathologic_stage                    "NA"          
    # Tumor_Grade                         "NA"          
    # BRCA_Pathology                      "NA"          
    # BRCA_Subtype_PAM50                  "LumA"        
    # MSI_status                          "NA"          
    # HPV_Status                          "NA"          
    # tobacco_smoking_history             "NA"          
    # CNV Clusters                        "C6"          
    # Mutation Clusters                   "C7"          
    # DNA.Methylation Clusters            "C1"          
    # mRNA Clusters                       "C1"          
    # miRNA Clusters                      "C3"          
    # lncRNA Clusters                     "NA"          
    # Protein Clusters                    "NA"          
    # PARADIGM Clusters                   "C5"          
    # Pan-Gyn Clusters                    "NA"
    

    GDCquery_Maf() can download mutation data. I will not cover it here and will look into it when I get the chance.

    Article link: https://www.haomeiwen.com/subject/rpigbltx.html