美文网首页TCGATCGA万花筒移动医疗
TCGA数据下载分析(1-1):RTCGA包gene获取表达值并

TCGA数据下载分析(1-1):RTCGA包gene获取表达值并

作者: Y大宽 | 来源:发表于2018-08-06 16:25 被阅读122次
    写在前面:
    • 方法:下载处理TCGA数据的R包很多,数据来源也不一样,这一部分开始对几个包分别进行使用,写出心得。
    • 结果最终想得到的是用其中两个包
    • 这部分,RTCGA包
    • 参考:作者文档这个以及生信技能树
    ---------------------------------------------------------------------------

    1 安装并加载包

    # Load the bioconductor installer. 
    ## try http:// if https:// URLs are not supported
    source("https://bioconductor.org/biocLite.R")
    biocLite("RTCGA")
    # Install the clinical and mRNA gene expression data packages
    biocLite("RTCGA.clinical") ## 14Mb
    biocLite('RTCGA.rnaseq') ##  (612.6 MB)
    biocLite("RTCGA.mRNA") ##  (85.0 MB)
    biocLite('RTCGA.mutations')  ## (103.8 MB)
    library(RTCGA.clinical) 
    library(RTCGA.mRNA)
    library(RTCGA.rnaseq)
    library(RTCGA.mutations)
    
    > library(RTCGA)
    Welcome to the RTCGA (version: 1.8.0).
    
    #查看BRCA的内容
    > checkTCGA('DataSets', 'LIHC')
       Size                                                                                                                                     Name
    1   61K                                                                                   LIHC.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
    2  932K                                                                                        LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz
    3  1.6G LIHC.Merge_methylation__humanmethylation450__jhu_usc_edu__Level_3__within_bioassay_data_set_function__data.Level_3.2016012800.0.0.tar.gz
    4  1.5M                  LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2016012800.0.0.tar.gz
    5   22M               LIHC.Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_isoform_expression__data.Level_3.2016012800.0.0.tar.gz
    6  255K                LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz
    7   78K   LIHC.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz.bak.20160128
    8   83M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__exon_expression__data.Level_3.2016012800.0.0.tar.gz
    9  8.7M                           LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2016012800.0.0.tar.gz
    10 8.0M                LIHC.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__splice_junction_expression__data.Level_3.2016012800.0.0.tar.gz
    11 102M                            LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0.tar.gz
    12  32M                 LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz
    13 281M                         LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms__data.Level_3.2016012800.0.0.tar.gz
    14  80M              LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_isoforms_normalized__data.Level_3.2016012800.0.0.tar.gz
    15 905M                   LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__exon_quantification__data.Level_3.2016012800.0.0.tar.gz
    16  70M               LIHC.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__junction_quantification__data.Level_3.2016012800.0.0.tar.gz
    17 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg.Level_3.2016012800.0.0.tar.gz
    18 5.8M                        LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
    19 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg.Level_3.2016012800.0.0.tar.gz
    20 1.5M     LIHC.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
    21 210M                                                                                LIHC.Methylation_Preprocess.Level_3.2016012800.0.0.tar.gz
    22 2.0M                                                                               LIHC.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz
    23 416M                                                                            LIHC.Mutation_Packager_Coverage.Level_3.2016012800.0.0.tar.gz
    24  25M                                                                     LIHC.Mutation_Packager_Oncotated_Calls.Level_3.2016012800.0.0.tar.gz
    25  47M                                                                 LIHC.Mutation_Packager_Oncotated_Raw_Calls.Level_3.2016012800.0.0.tar.gz
    26 3.8M                                                                           LIHC.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz
    27 765M                                                                        LIHC.Mutation_Packager_Raw_Coverage.Level_3.2016012800.0.0.tar.gz
    28 585K                                                                                 LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz
    29 275K                                                                    LIHC.RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz.bak.20160128
    30 312M                                                                                    LIHC.mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz
    31 3.4M                                                                              LIHC.miRseq_Mature_Preprocess.Level_3.2016012800.0.0.tar.gz
    32 2.8M                                                                                     LIHC.miRseq_Preprocess.Level_3.2016012800.0.0.tar.gz
    
    关于这些数据
    • mRNA是芯片数据
    • ranseq是测序数据
      具体参考这里这里
    • miRSeq is micro-RNA seq. Micro RNAs are a class of non protein coding RNAs that have regulatory functions (they typically bind to the 3`UTR of coding mRNAs and regulate that way)
    • mRNA, in this situation, refers to a cDNA microarray, where pre-designed probes on the microarray surface will bind to known target mRNAs, which may be coding or non-coding
    • mRNASeq is what most will loosely refer to as 'RNA-seq'. Most protocols will capture all classes of RNA species that have poly-adenylated (poly(A)) tails. RNAs that don't have these tails include ribsomal RNAs and many enhancer RNAs, but there are always exceptions to these rules.

    推荐(建议参考)

    miRSeq

    • illuminahiseq_mirnaseq-miR_gene_expression - normalised micro-RNAseq counts over each micro-RNA
    • illuminahiseq_mirnaseq-miR_isoform_expression - nomalised micro-RNAseq counts over each splice isoform of each micro-RNA
      mRNA
    • agilent4502a_07_3-unc_lowess_normalization_gene_level - LOWESS-normalised cDNA expression values over each gene (data from University of North Carolina)
      mRNASeq
    • illuminahiseq_rnaseq-gene_expression - normalised RNAseq counts over each gene
    • illuminahiseq_rnaseq-exon_expression - normalised RNAseq counts over each exon of each gene
    > checkTCGA('Dates')
     [1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21"
    [11] "2012-04-12" "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25"
    [21] "2012-09-13" "2012-10-04" "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16"
    [31] "2013-02-03" "2013-02-22" "2013-03-09" "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23"
    [41] "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10" "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16"
    [51] "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17" "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01"
    [61] "2015-08-21" "2015-11-01" "2016-01-28"
    

    2 下载某肿瘤中Datasets中的某类数据

    可以用前述的checkTCGA查看类型,然后针对性下载,用法如下:
    downloadTCGA(cancerTypes, dataSet = "Merge_Clinical.Level_1", destDir, date = NULL, untarFile = TRUE, removeTar = TRUE, allDataSets = FALSE)

    • dataSet在checkTCGA中查看,可以看出数据类型名字很长,所以这里只要部分(连续)匹配即可
    • destDir在当前工作目录下创建一个新文件
    setwd("E:/00TCGA/data/LIHC")
    #downloading the miRNA DATA
    dir.create('miRNA')
    downloadTCGA(cancerTypes = 'LIHC',
                 dataSet = 'miR_gene_expression',
                 destDir = 'rnaseq',
                 date = tail(checkTCGA('Dates'), 2)[1])
    
    #downloading the ranseq data
    dir.create('rnaseq')
    downloadTCGA(cancerTypes = 'LIHC',
                 dataSet = 'Level_3__gene_expression',
                 destDir = 'rnaseq',
                 date = tail(checkTCGA('Dates'), 2)[1])
    

    3 获取某基因在任意癌症中的mRNA表达数据并可视化(ggplot2,ggpubrboxplotTCGA

    3.0 获取mRNA表达数据

    获取EZH2,PTEN,HGFAC,FYN,TIGD1等5个gene在BRCA,OV中的mRNA表达数据(LIHC没有mRNA)
    expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                            extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
    
    > expr
    # A tibble: 1,305 x 7
       bcr_patient_barcode          dataset     EZH2   PTEN HGFAC    FYN  TIGD1
       <chr>                        <chr>      <dbl>  <dbl> <dbl>  <dbl>  <dbl>
     1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA -1.79   1.36  -2.31 -0.798 -0.86 
     2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA -1.57   0.428 -2.49 -0.532 -1.06 
     3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA -2.49   1.31  -2.65  0.007 -1.06 
     4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA -2.14   0.810 -2.58 -0.466 -0.714
     5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA  0.529  0.251 -2.95  0.960 -0.664
     6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA -1.44   1.31  -3.26 -0.622 -0.490
     7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -0.426 -0.237 -2.71  0.208  0.759
     8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -0.579 -1.24  -2.45  0.563  0.022
     9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA -1.16   1.21  -2.04 -0.616 -0.287
    10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.894  0.288 -2.31 -0.314  1.33 
    # ... with 1,295 more rows
    

    3.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot

    expr<-expressionsTCGA(BRCA.mRNA, OV.mRNA,LUSC.mRNA,
                            extract.cols = c("EZH2", "PTEN", "HGFAC","FYN", "TIGD1"))
    expr1<-expr[,-1]
    expr2<-gather(expr1, key ="mRNA", value="value", -dataset)
    
    ggplot(expr2, aes(y=value,
                 x=reorder(dataset, value, mean),
                 fill= dataset))+
      geom_boxplot()+
      theme_RTCGA()+
      scale_fill_brewer(palette = "Set3")+
      facet_grid(mRNA~.)+
      theme(legend.position = "top")
    

    boxplot如下


    5genes in three kinds of cancers.jpeg

    3.2 ggpubr绘制指定基因在不同肿瘤中的表达boxplot并进行统计学分析作图

    这部分参考jimmy
    查看样本量

    table(expr$dataset)
    
    > nb_samples
    BRCA.mRNA LUSC.mRNA   OV.mRNA 
          590       154       561
    

    bcr_patient_barcode这列改名,以便下一步可视化作图

    expr$dataset <- gsub(pattern = ".mRNA", replacement = "",  expr$dataset)
    expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154))
    expr
    
    > expr
    # A tibble: 1,305 x 7
       bcr_patient_barcode dataset   EZH2   PTEN HGFAC    FYN  TIGD1
       <chr>               <chr>    <dbl>  <dbl> <dbl>  <dbl>  <dbl>
     1 BRCA1               BRCA    -1.79   1.36  -2.31 -0.798 -0.86 
     2 BRCA2               BRCA    -1.57   0.428 -2.49 -0.532 -1.06 
     3 BRCA3               BRCA    -2.49   1.31  -2.65  0.007 -1.06 
     4 BRCA4               BRCA    -2.14   0.810 -2.58 -0.466 -0.714
     5 BRCA5               BRCA     0.529  0.251 -2.95  0.960 -0.664
     6 BRCA6               BRCA    -1.44   1.31  -3.26 -0.622 -0.490
     7 BRCA7               BRCA    -0.426 -0.237 -2.71  0.208  0.759
     8 BRCA8               BRCA    -0.579 -1.24  -2.45  0.563  0.022
     9 BRCA9               BRCA    -1.16   1.21  -2.04 -0.616 -0.287
    10 BRCA10              BRCA    -0.894  0.288 -2.31 -0.314  1.33
    

    绘制EZH2表达值的boxplot图

    library(ggpubr)
    ggboxplot(expr, x = "dataset", y = "EZH2",
              title = "EZH2", ylab = "Expression",
              color = "dataset", palette = "jco")
    
    ezh2.jpeg

    4 获取某基因在任意癌症中的rnaseq表达数据并可视化(ggplot2,ggpubr和boxplotTCGA)

    • ggplot2和ggpubr的用法与第3部分几乎一致,而boxplotTCGA无法直接获取mRNA表达数据,只有rnaseq,另外还有mutation等数据。

    • 不同的是,不仅需要gene symbol还要entrez的ID,如MET|4233

    4.0 获取rnaseq表达数据

    • 获取EZH2,PTEN,HGFAC,FYN等4个gene在BRCA,OV和LUSC中的rnaseq表达数据
    expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                          extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))
    
    > expr
    # A tibble: 2,071 x 6
       bcr_patient_barcode          dataset     `EZH2|2146` `PTEN|5728` `HGFAC|3083` `FYN|2534`
       <chr>                        <chr>             <dbl>       <dbl>        <dbl>      <dbl>
     1 TCGA-3C-AAAU-01A-11R-A41B-07 BRCA.rnaseq        487.       1724.        4.83       247. 
     2 TCGA-3C-AALI-01A-11R-A41B-07 BRCA.rnaseq        941.       1107.        9.79       523. 
     3 TCGA-3C-AALJ-01A-31R-A41B-07 BRCA.rnaseq        492.       1479.        7.25       814. 
     4 TCGA-3C-AALK-01A-11R-A41B-07 BRCA.rnaseq        334.       1877.        9.10       562. 
     5 TCGA-4H-AAAK-01A-12R-A41B-07 BRCA.rnaseq        297.       1740.        1.70       549. 
     6 TCGA-5L-AAT0-01A-12R-A41B-07 BRCA.rnaseq        238.       1597.        7.63       689. 
     7 TCGA-5L-AAT1-01A-12R-A41B-07 BRCA.rnaseq        258.       1374.        0.815      853. 
     8 TCGA-5T-A9QA-01A-11R-A41B-07 BRCA.rnaseq        253.       2181.        3.14        84.7
     9 TCGA-A1-A0SB-01A-11R-A144-07 BRCA.rnaseq        392.       2529.        0.901      740. 
    10 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.rnaseq        191.       1876.        0          522. 
    # ... with 2,061 more rows
    

    4.1 ggplot2绘制指定基因在不同肿瘤中的表达boxplot(rnaseq)

    #rnaseq ggplot2
    expr<-expressionsTCGA(BRCA.rnaseq, OV.rnaseq,LUSC.rnaseq,
                          extract.cols = c("EZH2|2146", "PTEN|5728", "HGFAC|3083","FYN|2534"))
    
    expr1<-expr[,-1]
    expr2<-gather(expr1, key ="rnaseq", value="value", -dataset)
    
    ggplot(expr2, aes(y=value,
                      x=reorder(dataset, value, mean),
                      fill= dataset))+
      geom_boxplot()+
      theme_RTCGA()+
      scale_fill_brewer(palette = "Set3")+
      facet_grid(rnaseq~.)+
      theme(legend.position = "top")
    
    rnaseq ggplot2.jpeg

    4.2 ggpubr

    查看样本量

    nb_samples <- table(expr$dataset)
    nb_samples
    
    > nb_samples
    BRCA.rnaseq LUSC.rnaseq   OV.rnaseq 
           1212         552         307 
    
    ggboxplot(expr, x = "dataset", y = "`PTEN|5728`",
              title = "ESR1|2099", ylab = "Expression",
              color = "dataset", palette = "jco")
    
    PTEN_5728.jpeg

    4.3 boxplotTCGA

    具体参考RTCGA的文档

    expressionsTCGA(LIHC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq,extra
                    extract.cols = "MET|4233") %>%
      rename(cohort = dataset,
      MET = `MET|4233`) %>%
      #cancer samples
      filter(substr(bcr_patient_barcode, 14, 15) == "01") -> 
      ACC_BLCA_BRCA_OV.rnaseq
    
    boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "MET")
    
    MET RNASEQ.jpeg
    boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "log1p(MET)")
    
    Rplot.jpeg

    后记-----------------------------------------

    这部分只是用RTCGA演示了如何下载数据;mRNA和rnaseq表达值的plot。

    相关文章

      网友评论

        本文标题:TCGA数据下载分析(1-1):RTCGA包gene获取表达值并

        本文链接:https://www.haomeiwen.com/subject/izvivftx.html