Over 20,000 RNA-seq samples from 3 major databases reprocessed with one uniform pipeline

Author: 因地制宜的生信达人 | Published: 2019-04-02 09:18

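    Large consortium projects have already produced a wealth of RNA-seq data, but anyone who wants to analyze several databases together has to deal with batch effects. The UCSC team therefore reprocessed these datasets with a single uniform pipeline on the Amazon cloud, at a cost of US$1.30 per sample.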
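To get a feel for the scale, the per-sample price can be multiplied out. This is a minimal sketch, assuming exactly 20,000 samples (the title only says "over 20,000", so the result is a lower bound); the constant and function names are made up for illustration.

```python
# Back-of-the-envelope compute cost, using only figures quoted in this post.
COST_PER_SAMPLE_USD = 1.30  # per-sample cost on the Amazon cloud, from the text
MIN_SAMPLES = 20_000        # assumption: the title's "over 20,000" as a lower bound


def total_cost_usd(n_samples: int, cost_per_sample: float = COST_PER_SAMPLE_USD) -> float:
    """Return the total cost in USD, rounded to cents."""
    return round(n_samples * cost_per_sample, 2)


if __name__ == "__main__":
    print(f"At least about US${total_cost_usd(MIN_SAMPLES):,.0f} in total")
```

So the whole recompute costs on the order of tens of thousands of dollars, which is what makes uniform reprocessing at this scale practical.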

    Published in Nature Biotechnology: https://doi.org/10.1038/nbt.3772

    The three databases are:

    1. The Cancer Genome Atlas (TCGA)
    2. Genotype-Tissue Expression (GTEx)
    3. Therapeutically Applicable Research To Generate Effective Treatments (TARGET)

    A web tool is also provided for querying the data, for example:

    Differential gene and isoform expression of FOXM1 transcription factor in TCGA vs. GTEx

    Data processing pipeline

    The pipeline (shown as a figure in the original post): CutAdapt was used for adapter trimming, STAR was used for alignment, and RSEM and Kallisto were used as quantifiers.

    Pipeline overview

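    If you have questions about the RNA-seq data processing workflow, go straight to my 74-hour introductory bioinformatics video series (生信技能树 video course learning path). Videos this good, and they are free!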

    Choice of reference genome

    • STAR, RSEM, and Kallisto indexes were all built with the same reference genome: HG38 (no alt analysis) with overlapping genes from the PAR locus removed (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415).
      • ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines
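The PAR exclusion can be pictured as an interval filter over annotation records. The sketch below is only an illustration: the post does not say how the UCSC team actually removed these genes, so dropping every GTF record that overlaps one of the two chrY intervals is my assumption, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# chrY PAR intervals quoted in the text (treated as 1-based, inclusive).
PAR_REGIONS: List[Tuple[str, int, int]] = [
    ("chrY", 10_000, 2_781_479),
    ("chrY", 56_887_902, 57_217_415),
]


def overlaps_par(chrom: str, start: int, end: int) -> bool:
    """True if [start, end] on chrom overlaps any PAR interval."""
    return any(chrom == c and start <= e and end >= s for c, s, e in PAR_REGIONS)


def drop_par_records(gtf_lines: Iterable[str]) -> Iterator[str]:
    """Yield GTF lines, skipping records that overlap the chrY PAR intervals.

    Assumption: overlap-based filtering; the original pipeline may differ.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            yield line  # keep header/comment and blank lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        if not overlaps_par(chrom, start, end):
            yield line
```

Usage would look like `kept = list(drop_par_records(open("gencode.v23.annotation.gtf")))`, after which the filtered lines can be written to a new file.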

    Choice of annotation files

    • RSEM: Gencode V23 comprehensive annotation (CHR)
      • http://www.gencodegenes.org/releases/23.html first row
    • Kallisto: Gencode V23 comprehensive annotation (ALL)
      • http://www.gencodegenes.org/releases/23.html second row

    Choice of software parameters

    • STAR
      • sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/star --runThreadN 32 --runMode genomeGenerate --genomeDir /data/genomeDir --genomeFastaFiles hg38.fa --sjdbGTFfile gencode.v23.annotation.gtf
    • Kallisto
      • sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta
      • Kallisto index that was used during the recompute is available here.
    • RSEM
      • sudo docker run -v $(pwd):/data --entrypoint=rsem-prepare-reference jvivian/rsem -p 4 --gtf gencode.v23.annotation.gtf hg38.fa hg38
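If you want to script the three index-building commands above rather than paste them one by one, they can be assembled as argument lists. This is a rough sketch that only reuses the images and flags quoted above; the function name `build_index_commands` is made up, and nothing is executed unless you pass a list to `subprocess.run` yourself.

```python
import os
import shlex
from typing import Dict, List


def build_index_commands(workdir: str) -> Dict[str, List[str]]:
    """Return the STAR, Kallisto and RSEM index commands as argv lists.

    workdir plays the role of $(pwd) in the original shell commands.
    """
    docker = ["sudo", "docker", "run", "-v", f"{workdir}:/data"]
    return {
        "star": docker + [
            "quay.io/ucsc_cgl/star",
            "--runThreadN", "32",
            "--runMode", "genomeGenerate",
            "--genomeDir", "/data/genomeDir",
            "--genomeFastaFiles", "hg38.fa",
            "--sjdbGTFfile", "gencode.v23.annotation.gtf",
        ],
        "kallisto": docker + [
            "quay.io/ucsc_cgl/kallisto",
            "index", "-i", "hg38.gencodeV23.transcripts.idx",
            "transcriptome_hg38_gencodev23.fasta",
        ],
        "rsem": docker + [
            "--entrypoint=rsem-prepare-reference",
            "jvivian/rsem",
            "-p", "4",
            "--gtf", "gencode.v23.annotation.gtf",
            "hg38.fa", "hg38",
        ],
    }


if __name__ == "__main__":
    # Print the commands for inspection; running them needs Docker and the input files.
    for name, argv in build_index_commands(os.getcwd()).items():
        print(f"{name}: {shlex.join(argv)}")
```

Keeping the commands as lists avoids shell-quoting problems, and `subprocess.run(argv, check=True)` would run one of them once the FASTA and GTF files are in place.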

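    As you can see, the three elements above (reference genome, annotation files, software parameters) are exactly the basic pattern I followed when writing tutorials on the 生信菜鸟团 blog five years ago.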

    Raw data

    Nature Publication Supplementary Note 7 – Data Availability

    Submitter sample ID to Xena sample ID mapping

    TCGA mapping

    GTEx mapping

    TARGET mapping

    Datasets released for download

    The combined TCGA TARGET GTEx cohort (all three databases) contains 13 datasets in total.

    cohort: TCGA TARGET GTEx

    Gene expression matrices (the sample sizes are impressive)

    • RSEM expected_count (n=19,109), UCSC Toil RNAseq Recompute
    • RSEM expected_count (DESeq2 standardized) (n=19,039), UCSC Toil RNAseq Recompute. RSEM expected_count output normalized using DESeq2
    • RSEM fpkm (n=19,131), UCSC Toil RNAseq Recompute
    • RSEM norm_count (n=19,120), UCSC Toil RNAseq Recompute. TCGA TARGET GTEx gene expression by UCSC TOIL RNA-seq recompute
    • RSEM tpm (n=19,131), UCSC Toil RNAseq Recompute
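The gene expression list offers both fpkm and tpm, which differ only by a per-sample rescaling: TPM is FPKM divided by the sample's FPKM total, times one million. Here is a minimal sketch of that relationship, assuming linear-scale values for one sample (any log transform applied by a data hub would have to be undone first); the gene names are invented.

```python
from typing import Dict


def fpkm_to_tpm(fpkm: Dict[str, float]) -> Dict[str, float]:
    """Rescale one sample's FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must sum to a positive number")
    return {gene: value * 1e6 / total for gene, value in fpkm.items()}


if __name__ == "__main__":
    # Hypothetical three-gene sample, for illustration only.
    sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
    tpm = fpkm_to_tpm(sample)
    print(tpm)
    print(sum(tpm.values()))  # TPM always sums to 1e6 within a sample
```

Because every sample's TPM sums to the same total, TPM is usually the more convenient unit when comparing samples across TCGA, GTEx and TARGET.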

    phenotype

    • TCGA GTEX main categories (n=17,221), UCSC Toil RNAseq Recompute
    • TCGA survival data (n=10,496), UCSC Toil RNAseq Recompute
    • TCGA TARGET GTEX selected phenotypes (n=19,131), UCSC Toil RNAseq Recompute
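Since the combined cohort mixes samples from all three projects, a common first step is to tag each sample with its source project. The sketch below assumes that sample IDs start with `TCGA-`, `GTEX-` or `TARGET-`; check this against the phenotype files before relying on it. The example IDs are placeholders, not real samples.

```python
from collections import Counter
from typing import Iterable

# Assumption: the project name is the sample ID prefix before the first hyphen.
PROJECT_PREFIXES = ("TCGA", "GTEX", "TARGET")


def source_project(sample_id: str) -> str:
    """Guess the source project of a sample from its ID prefix."""
    upper = sample_id.upper()
    for prefix in PROJECT_PREFIXES:
        if upper.startswith(prefix + "-"):
            return prefix
    return "unknown"


def count_by_project(sample_ids: Iterable[str]) -> Counter:
    """Count samples per source project."""
    return Counter(source_project(s) for s in sample_ids)


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    ids = ["TCGA-XX-0000-01", "GTEX-XXXX-0000-SM-XXXX", "TARGET-00-XXXXXX-01", "TCGA-YY-1111-11"]
    print(count_by_project(ids))
```

The phenotype tables listed above are the more reliable way to separate tumor and normal samples; the prefix check is just a quick sanity test.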

    somatic mutation (SNP and INDEL)

    • TCGA somatic mutations (Pan-cancer Atlas MC3 public version) (n=8,463), UCSC Toil RNAseq Recompute

    transcript expression RNAseq

    • RSEM expected_count (n=19,109), UCSC Toil RNAseq Recompute. TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
    • RSEM fpkm (n=19,129), UCSC Toil RNAseq Recompute. TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
    • RSEM isoform percentage (n=19,131), UCSC Toil RNAseq Recompute. TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
    • RSEM tpm (n=19,131), UCSC Toil RNAseq Recompute. TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
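For readers unsure what "isoform percentage" measures, here is a minimal sketch, assuming it means each transcript's share of its parent gene's total abundance, expressed as a percentage. The transcript and gene names are invented, and the function is my own illustration rather than RSEM's code.

```python
from collections import defaultdict
from typing import Dict


def isoform_percentage(
    tpm_by_transcript: Dict[str, float], gene_of: Dict[str, str]
) -> Dict[str, float]:
    """Return each transcript's percentage of its gene's summed TPM."""
    gene_total: Dict[str, float] = defaultdict(float)
    for tx, tpm in tpm_by_transcript.items():
        gene_total[gene_of[tx]] += tpm

    result = {}
    for tx, tpm in tpm_by_transcript.items():
        total = gene_total[gene_of[tx]]
        # A gene with no expression gets 0 for all of its transcripts.
        result[tx] = 100.0 * tpm / total if total > 0 else 0.0
    return result


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}
    gene_of = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
    print(isoform_percentage(tpm, gene_of))
```

This is the kind of quantity behind isoform-level comparisons such as the FOXM1 example mentioned earlier, where the question is which isoform dominates in tumor versus normal tissue.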
