美文网首页转录组高级分析
RNA 3. SCI 文章中基于T CGA 差异表达基因之 DE

RNA 3. SCI 文章中基于T CGA 差异表达基因之 DE

作者: 桓峰基因 | 来源:发表于2022-02-25 14:29 被阅读0次

    前言

    上期我们介绍了基于 limma 来做差异表达基因,那么这期来讲一下 DESeq2,那么这两款软件有什么区别吗?区别主要在于一个是计算芯片探针给出来的结果,而 DESeq2 是基于NGS 测序结果中 Read counts 来计算差异表达,根据输入数据的不同,我们对比一下做法。

    在比较高通量测序分析中,一项基本任务是分析计数数据,如 RNA-seq 中每个基因的 Read count,以获得跨实验条件的系统性变化的证据。离散性,大动态范围和异常值的存在需要一个合适的统计方法。DESeq2 是一种计数数据的差分分析方法,使用离散度和折叠变化的收缩估计来提高估计的稳定性和可解释性。这使得更多的定量分析集中在强度上,而不仅仅是差异表达的存在。下面我们就根据这篇文章的数据模式进行差异分析。

    01. 软件包安装

    安装 DESeq2 软件包,这个包需要通过 BiocManager 来安装,所以首先检测是否安装 BiocManager ,我之前安装过 DESeq2 ,所以不需要重复安装,如果使用 RStudio 安装不成功,可以通过 R 软件安装,运行如下:

    if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

    if (!require("DESeq2", quietly = TRUE))
    BiocManager::install("DESeq2")

    if (!requireNamespace("TCGAbiolinks", quietly=TRUE))
    BiocManager::install("TCGAbiolinks")

    if (!requireNamespace("EDASeq", quietly = TRUE))
    BiocManager::install("EDASeq")

    if (!requireNamespace("SummarizedExperiment", quietly = TRUE))
    BiocManager::install("SummarizedExperiment")

    if (!requireNamespace("EnhancedVolcano", quietly = TRUE))
    BiocManager::install("EnhancedVolcano")

    if (!requireNamespace("limma", quietly = TRUE))
    BiocManager::install("limma")

    library(DESeq2)
    library(TCGAbiolinks)
    library(EDASeq)
    library(SummarizedExperiment)
    library(EnhancedVolcano)
    library(limma)

    02. TCGA 数据读取

    这次我们选择 TCGA 数据的RNA-SEQ的 Reads count 数据,一般后缀为 HTSeq-counts.txt。

    我们看通过 TCGAbiolinks 这个软件包都可以获得哪些数据库的数据集,TCGA 全部数据集,还是非常全面的,如下:

    getGDCprojects()$project_id
    ##  [1] "TCGA-BRCA"             "GENIE-MSK"             "GENIE-VICC"            "GENIE-UHN"            
    ## [5] "CPTAC-2" "CMI-ASC" "BEATAML1.0-COHORT" "CGCI-BLGSP"
    ## [9] "BEATAML1.0-CRENOLANIB" "CMI-MPC" "CMI-MBC" "GENIE-GRCC"
    ## [13] "GENIE-MDA" "GENIE-JHU" "GENIE-NKI" "FM-AD"
    ## [17] "VAREPOP-APOLLO" "WCDT-MCRPC" "GENIE-DFCI" "TARGET-ALL-P3"
    ## [21] "TARGET-ALL-P2" "OHSU-CNL" "TARGET-ALL-P1" "MMRF-COMMPASS"
    ## [25] "TARGET-CCSK" "ORGANOID-PANCREATIC" "NCICCR-DLBCL" "TARGET-NBL"
    ## [29] "TARGET-OS" "TARGET-RT" "TARGET-WT" "TCGA-LAML"
    ## [33] "CGCI-HTMCP-CC" "TARGET-AML" "HCMI-CMDC" "TCGA-DLBC"
    ## [37] "TCGA-CHOL" "CTSP-DLBCL1" "TRIO-CRU" "TCGA-MESO"
    ## [41] "TCGA-ACC" "TCGA-UCS" "TCGA-KICH" "TCGA-PCPG"
    ## [45] "TCGA-ESCA" "TCGA-THYM" "TCGA-TGCT" "TCGA-UVM"
    ## [49] "TCGA-CESC" "TCGA-BLCA" "TCGA-PAAD" "TCGA-LIHC"
    ## [53] "TCGA-SKCM" "TCGA-UCEC" "TCGA-PRAD" "REBC-THYR"
    ## [57] "TCGA-THCA" "TCGA-OV" "TCGA-LGG" "TCGA-SARC"
    ## [61] "CPTAC-3" "TCGA-COAD" "TCGA-READ" "TCGA-KIRP"
    ## [65] "TCGA-GBM" "TCGA-STAD" "TCGA-LUAD" "TCGA-KIRC"
    ## [69] "TCGA-LUSC" "TCGA-HNSC"

    使用TCGAbiolinks:::getProjectSummary(project)查看project中有哪些数据类型,如查询"TCGA-COAD",有8种数据类型,case_count为病人数,file_count为对应的文件数。要下载表达谱,可以设置data.category="Transcriptome Profiling",如下:

    TCGAbiolinks:::getProjectSummary("TCGA-COAD")
    ## $file_count
    ## [1] 15701
    ##
    ## $data_categories
    ## file_count case_count data_category
    ## 1 2971 460 Copy Number Variation
    ## 2 531 461 Clinical
    ## 3 2835 461 Biospecimen
    ## 4 2493 459 Transcriptome Profiling
    ## 5 3952 433 Simple Nucleotide Variation
    ## 6 363 360 Proteome Profiling
    ## 7 556 458 DNA Methylation
    ## 8 2000 460 Sequencing Reads
    ##
    ## $case_count
    ## [1] 461
    ##
    ## $file_size
    ## [1] 2.747227e+13

    现在我们选择一个结肠癌的表达数据,比较癌和癌旁组织之间的表达差异基因,下载 TCGA-COAD,下载方式可以选择直接下载:<https://portal.gdc.cancer.gov/>{.uri} 下载 HTSeq-counts.txt 和临床数据,也可以通过 TCGAbiolinks 软件包下载。

    # 请求数据。
    query <- GDCquery(project ="TCGA-COAD",
    data.category = "Transcriptome Profiling",
    data.type ="Gene Expression Quantification" ,
    workflow.type="HTSeq - Counts")

    getResults(query, rows, cols) 根据指定行名或列名从 query 中获取结果,此处用来获得样本的 barcode,检索出519个 barcodes。从 samplesDown 中筛选出TP(实体肿瘤)样本的barcodes,如下:

    samplesDown <- getResults(query,cols=c("cases"))

    我们需要对数据进行处理,整理出需要的数据,这里我们只是比较癌和癌旁的表达差异,那么临床数据需要找到样本的对应分组。之前我们给出样本类型的简写,实体瘤样本为 Primary Solid Tumor 即为 TP, 正常样本为 Solid Tissue Normal 即为 NT。通过 TCGAquery_SampleTypes(barcode, typesample) 获得 478个 TP样本barcodes,41个NT样本barcodes。

    dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
    typesample = "TP")
    dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
    typesample = "NT")

    将样本与分组关联起来,以便有序过滤样本,并保证能够对应修改分组信息,如下:

    group<-as.data.frame(list(Sample=c(dataSmTP,dataSmNT),
    Group=c(rep("TP",length(dataSmTP)),rep("NT",length(dataSmNT)))))

    设置 barcodes 参数,筛选符合要求的 478 个肿瘤样本数据和 41 正常组织数据,根据传入 barcodes 进行数据过滤,如下:

    queryDown <- GDCquery(project = "TCGA-COAD",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "HTSeq - Counts",
    barcode = c(dataSmTP, dataSmNT))

    下载数据,默认存放位置为当前工作目录下的GDCdata文件夹中。

    1. method ;"API"或者"client","API"速度更快,但是容易下载中断;

    2. directory:下载文件的保存地址。Default: GDCdata;

    3. files.per.chunk = NULL:使用API下载大文件的时候,可以把文件分成几个小文件来下载,可以解决下载容易中断的问题。

    GDCdownload(queryDown,method = "api", 
    directory = "GDCdata",
    files.per.chunk = 10)

    03. 数据处理

    GDCprepare()将前面GDCquery()的结果准备成R语言可处理的SE(SummarizedExperiment)文件

    读取下载的数据并将其准备到R对象中,在工作目录生成(save=TRUE)COAD_case.rda文件。GDCprepare():Prepare GDC data,准备GDC数据,使其可用于R语言中进行分析

    dataPrep1 <- GDCprepare(query = queryDown, 
    save = TRUE,
    save.filename ="COAD_case.rda"
    )

    获得的结果如下:

    dataPrep1
    ## class: RangedSummarizedExperiment 
    ## dim: 56602 519
    ## metadata(1): data_release
    ## assays(1): HTSeq - Counts
    ## rownames(56602): ENSG00000000003 ENSG00000000005 ... ENSG00000281912 ENSG00000281920
    ## rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id
    ## colnames(519): TCGA-D5-6530-01A-11R-1723-07 TCGA-G4-6320-01A-11R-1723-07 ...
    ## TCGA-AA-3712-11A-01R-1723-07 TCGA-AA-3522-11A-01R-A32Z-07
    ## colData names(107): barcode patient ... paper_vascular_invasion_present
    ## paper_vital_status

    去除dataPrep1中的异常值,dataPrep1数据中含有肿瘤组织和正常组织的数据,生产热图,以及数据,函数功能描述:Array Array Intensity correlation (AAIC) and correlation boxplot to define outlier,如下:

    dataPrep2 <- TCGAanalyze_Preprocessing(object = dataPrep1,
    cor.cut = 0.6,
    datatype = "HTSeq - Counts")

    获得一个 counts 矩阵, 如下:

    dim(dataPrep2)
    ## [1] 56602   519
    dataPrep2[1:5,1:3]
    ##                 TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-4N-A93T-01A-11R-A37K-07
    ## ENSG00000000003 7280 7164
    ## ENSG00000000005 23 67
    ## ENSG00000000419 2065 2632
    ## ENSG00000000457 869 916
    ## ENSG00000000460 466 266
    ## TCGA-4T-AA8H-01A-11R-A41B-07
    ## ENSG00000000003 2927
    ## ENSG00000000005 89
    ## ENSG00000000419 848
    ## ENSG00000000457 370
    ## ENSG00000000460 214

    将预处理后的数据dataPrep2,写入新文件"COAD_dataPrep.csv"

    write.csv(dataPrep2,file = "COAD_dataPrep.csv",quote = FALSE)

    TCGAtumor_purity(barcodes, estimate, absolute, lump, ihc, cpe),使用来自5种方法的5个估计值作为阈值对 TCGA 样本进行过滤,这5个值是estimate, absolute, lump, ihc, cpe,这里设置cpe=0.6(cpe是派生的共识度量,是将所有方法的标准含量归一化后的均值纯度水平,以使它们具有相等的均值和标准差)。筛选肿瘤纯度大于等于60%的样本数据,如下:

    purityDATA <- TCGAtumor_purity(colnames(dataPrep1), 0, 0, 0, 0, 0.6)
    ## the following TCGA barcodes do not have info on tumor purity:
    ##  [1] "TCGA-A6-2672-01B-03R-2302-07" "TCGA-AZ-6601-11A-01R-1774-07" "TCGA-A6-2686-11A-01R-A32Z-07"
    ## [4] "TCGA-AZ-6603-11A-02R-1839-07" "TCGA-AZ-6605-11A-01R-1839-07" "TCGA-AA-3660-11A-01R-1723-07"
    ## [7] "TCGA-AA-3713-11A-01R-1723-07" "TCGA-AZ-6598-11A-01R-1774-07" "TCGA-A6-2680-11A-01R-A32Z-07"
    ## [10] "TCGA-AA-3655-11A-01R-1723-07" "TCGA-AA-3516-11A-01R-A32Z-07" "TCGA-F4-6704-11A-01R-1839-07"
    ## [13] "TCGA-AZ-6600-11A-01R-1774-07" "TCGA-A6-5659-11A-01R-1653-07" "TCGA-AA-3520-11A-01R-A32Z-07"
    ## [16] "TCGA-A6-2679-11A-01R-A32Z-07" "TCGA-A6-2678-11A-01R-A32Z-07" "TCGA-AZ-6599-11A-01R-1774-07"
    ## [19] "TCGA-A6-5662-11A-01R-1653-07" "TCGA-A6-2683-11A-01R-A32Z-07" "TCGA-AA-3525-11A-01R-A32Z-07"
    ## [22] "TCGA-AA-3527-11A-01R-A32Z-07" "TCGA-AA-3489-11A-01R-1839-07" "TCGA-A6-2685-11A-01R-A32Z-07"
    ## [25] "TCGA-A6-5665-11A-01R-1653-07" "TCGA-AA-3697-11A-01R-1723-07" "TCGA-AA-3496-11A-01R-1839-07"
    ## [28] "TCGA-AA-3534-11A-01R-A32Z-07" "TCGA-A6-2682-11A-01R-A32Z-07" "TCGA-AA-3663-11A-01R-1723-07"
    ## [31] "TCGA-A6-5667-11A-01R-1723-07" "TCGA-AA-3517-11A-01R-A32Z-07" "TCGA-AA-3518-11A-01R-1672-07"
    ## [34] "TCGA-A6-2675-11A-01R-1723-07" "TCGA-A6-2671-11A-01R-A32Z-07" "TCGA-AA-3662-11A-01R-1723-07"
    ## [37] "TCGA-AA-3531-11A-01R-A32Z-07" "TCGA-AA-3511-11A-01R-1839-07" "TCGA-A6-2684-11A-01R-A32Z-07"
    ## [40] "TCGA-AA-3514-11A-01R-A32Z-07" "TCGA-AA-3712-11A-01R-1723-07" "TCGA-AA-3522-11A-01R-A32Z-07"

    filtered 为被过滤的数据, pure_barcodes是我们要的肿瘤数据

    Purity.COAD<-purityDATA$pure_barcodes
    length(Purity.COAD)
    ## [1] 450
    normal.COAD<-purityDATA$filtered
    length(normal.COAD)
    ## [1] 42

    获取肿瘤纯度大于60%的450个肿瘤组织样本,42个正常组织样本,共计492个样本

    puried_data <-dataPrep2[,c(Purity.COAD,normal.COAD)]
    puried_data[1:5,1:3]
    ##                 TCGA-D5-6530-01A-11R-1723-07 TCGA-G4-6320-01A-11R-1723-07
    ## ENSG00000000003 2241 5323
    ## ENSG00000000005 2 75
    ## ENSG00000000419 1181 1168
    ## ENSG00000000457 591 616
    ## ENSG00000000460 290 302
    ## TCGA-AD-6888-01A-11R-1928-07
    ## ENSG00000000003 8152
    ## ENSG00000000005 105
    ## ENSG00000000419 5375
    ## ENSG00000000457 771
    ## ENSG00000000460 644

    基因注释,需要加载"SummarizedExperiment"包,"SummarizedExperiment container"每个由数字或其他模式的类似矩阵的对象表示。通常表示感兴趣的基因组范围,列代表样品。

    rowData(dataPrep1)
    ## DataFrame with 56602 rows and 3 columns
    ## ensembl_gene_id external_gene_name original_ensembl_gene_id
    ## <character> <character> <character>
    ## ENSG00000000003 ENSG00000000003 TSPAN6 ENSG00000000003.13
    ## ENSG00000000005 ENSG00000000005 TNMD ENSG00000000005.5
    ## ENSG00000000419 ENSG00000000419 DPM1 ENSG00000000419.11
    ## ENSG00000000457 ENSG00000000457 SCYL3 ENSG00000000457.12
    ## ENSG00000000460 ENSG00000000460 C1orf112 ENSG00000000460.15
    ## ... ... ... ...
    ## ENSG00000281904 ENSG00000281904 AC233263.6 ENSG00000281904.1
    ## ENSG00000281909 ENSG00000281909 HERC2P7 ENSG00000281909.1
    ## ENSG00000281910 ENSG00000281910 SNORA50A ENSG00000281910.1
    ## ENSG00000281912 ENSG00000281912 LINC01144 ENSG00000281912.1
    ## ENSG00000281920 ENSG00000281920 AC007389.5 ENSG00000281920.1

    传入数据 dataPrep1 必须为 SummarizedExperiment 对象

    rownames(puried_data)<-rowData(dataPrep1)$external_gene_name

    将结果写入文件"puried.COAD.cancer.csv"

    write.csv(puried_data,file = "puried.COAD.csv",quote = FALSE)

    将标准化后的数据再过滤,如下:

    dataNorm <- TCGAanalyze_Normalization(tabDF = puried_data,
    geneInfo = geneInfo,
    method = "gcContent")
    ## I Need about  348 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
    ## Step 1 of 4: newSeqExpressionSet ...
    ## Step 2 of 4: withinLaneNormalization ...
    ## Step 3 of 4: betweenLaneNormalization ...
    ## Step 4 of 4: .quantileNormalization ...

    去除掉表达量较低(count较低)的基因,得到最终的数据,如下:

    dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
    method = "quantile",
    qnt.cut = 0.25)
    str(dataFilt)
    ##  num [1:13125, 1:492] 442 0 5632 185 1326 ...
    ## - attr(*, "dimnames")=List of 2
    ## ..$ : chr [1:13125] "A1CF" "A2ML1" "A2M" "A4GALT" ...
    ## ..$ : chr [1:492] "TCGA-D5-6530-01A-11R-1723-07" "TCGA-G4-6320-01A-11R-1723-07" "TCGA-AD-6888-01A-11R-1928-07" "TCGA-CK-6747-01A-11R-1839-07" ...
    dataFilt[1:5,1:3]
    ##        TCGA-D5-6530-01A-11R-1723-07 TCGA-G4-6320-01A-11R-1723-07 TCGA-AD-6888-01A-11R-1928-07
    ## A1CF 442 420 1471
    ## A2ML1 0 0 8
    ## A2M 5632 2389 1530
    ## A4GALT 185 102 169
    ## AAAS 1326 1456 1397

    04. 差异表达基因分析

    在做差异表达时,输入文件要求必须是 counts 矩阵,那么我们将上面整理后得到的 dataFilt 做成矩阵,如下:

    exp<-as.matrix(dataFilt)
    rownames(exp)[1:100]
    ##   [1] "A1CF"     "A2ML1"    "A2M"      "A4GALT"   "AAAS"     "AACS"     "AADAC"    "AADAT"   
    ## [9] "AAGAB" "AAK1" "AAMP" "AARS2" "AARSD1" "AARS" "AASDHPPT" "AASDH"
    ## [17] "AASS" "AATF" "AATK" "ABAT" "ABCA10" "ABCA11P" "ABCA12" "ABCA13"
    ## [25] "ABCA17P" "ABCA1" "ABCA2" "ABCA3" "ABCA5" "ABCA6" "ABCA7" "ABCA8"
    ## [33] "ABCA9" "ABCB10" "ABCB1" "ABCB4" "ABCB6" "ABCB7" "ABCB8" "ABCB9"
    ## [41] "ABCC10" "ABCC13" "ABCC1" "ABCC2" "ABCC3" "ABCC4" "ABCC5" "ABCC6P1"
    ## [49] "ABCC6P2" "ABCC6" "ABCC9" "ABCD1" "ABCD3" "ABCD4" "ABCE1" "ABCF1"
    ## [57] "ABCF2" "ABCF3" "ABCG1" "ABCG2" "ABCG5" "ABHD10" "ABHD11" "ABHD12B"
    ## [65] "ABHD12" "ABHD13" "ABHD14A" "ABHD14B" "ABHD15" "ABHD2" "ABHD3" "ABHD4"
    ## [73] "ABHD5" "ABHD6" "ABHD8" "ABI1" "ABI2" "ABI3BP" "ABI3" "ABL1"
    ## [81] "ABL2" "ABLIM1" "ABLIM2" "ABLIM3" "ABO" "ABR" "ABT1" "ABTB1"
    ## [89] "ABTB2" "ACAA1" "ACAA2" "ACACA" "ACACB" "ACAD10" "ACAD8" "ACAD9"
    ## [97] "ACADM" "ACADSB" "ACADS" "ACADVL"

    数据整理并且过滤后,此时获得行为 13125 个基因,列为 492 个样本的基因表达矩阵,如下:

    dim(exp)
    ## [1] 13125   492

    对表达矩阵的相同行取平均值,利用 limma 包中函数 avereps 进行计算,其实在这些做差异分析的数据包有些也可以兼容使用,哪个方法方便实用我们就选择哪个函数,这个都无所谓,最后计算的结果,如下:

    data=avereps(exp)
    dim(data)
    ## [1] 13125   492

    DEGSeq2 这个包要求表达值必须为整数,所以我们需要把矩阵中的数值进行取整数,利用 round 函数,如下

    data=round(data,0)

    设计分组信息,起初我们根据 TP 和 NT 样本信息共检索到519个样本,由于我们上面对不符合标准的样本进行了一定的过滤,所以需要重新整理样本分组信息,如下:

    head(group)
    ##                         Sample Group
    ## 1 TCGA-D5-6530-01A-11R-1723-07 TP
    ## 2 TCGA-G4-6320-01A-11R-1723-07 TP
    ## 3 TCGA-AD-6888-01A-11R-1928-07 TP
    ## 4 TCGA-CK-6747-01A-11R-1839-07 TP
    ## 5 TCGA-AA-3975-01A-01R-1022-07 TP
    ## 6 TCGA-A6-6780-01A-11R-1839-07 TP
    table(group$Group)
    ## 
    ## NT TP
    ## 41 478

    根据我们之前整理的临床分组,癌组织478个,正常组织41个,过滤后的样本数量临床分组的样本492,癌组织样本451,正常组织样本41个,如下:

    group1=group[group$Sample %in% colnames(exp),]
    table(group1$Group)
    ## 
    ## NT TP
    ## 41 451

    DESeq2 软件包中 DESeqDataSetFromMatrix 函数要求分组设计格式,如下:

    design=as.factor(group1$Group)

    上面一系列操作都是为了达到 DESeq2 的输入文件的标准,但其实主程序非常简单,一行命令搞定所有,如下:

    dds<-DESeqDataSetFromMatrix(data,DataFrame(design),design = ~design)
    dds<-DESeq(dds,fitType = "local") ## or mean
    res<-as.data.frame(results(dds))
    head(res)
    ##            baseMean log2FoldChange      lfcSE      stat       pvalue         padj
    ## A1CF 1204.033014 -1.2743230 0.16883176 -7.547887 4.423763e-14 1.361676e-13
    ## A2ML1 7.896115 3.5771758 0.57556789 6.215037 5.131255e-10 1.225841e-09
    ## A2M 11164.726955 -1.4163123 0.14667565 -9.656083 4.632373e-22 2.233648e-21
    ## A4GALT 379.188423 -0.7515222 0.16140586 -4.656102 3.222520e-06 5.937889e-06
    ## AAAS 1537.066486 0.4436685 0.05680360 7.810569 5.693031e-15 1.843143e-14
    ## AACS 2016.414047 0.1868876 0.07884018 2.370461 1.776591e-02 2.343257e-02

    05. 绘制火山图

    差异基因完成之后,我们进行火山图的绘制,这个方法比较好用我们在讲解 limma 软件包做差异分析的时候已经使用过,我们稍微改一下既可以使用,如下:

    require(EnhancedVolcano)
    EnhancedVolcano(res,
    lab = rownames(res),
    labSize = 2,
    x = "log2FoldChange",
    y = "pvalue",
    xlab = bquote(~Log[2]~ "fold change"),
    ylab = bquote(~-Log[10]~italic(P)),
    pCutoff = 0.01,## pvalue闃堝€?
    FCcutoff = 2,## FC cutoff
    xlim = c(-5,5),
    ylim = c(0,5),
    colAlpha = 0.6,
    legendLabels =c("NS","Log2 FC"," p-value",
    " p-value & Log2 FC"),
    legendPosition = "top",
    legendLabSize = 10,
    legendIconSize = 3.0,
    pointSize = 1.5,
    title ="DESeq2 results",
    subtitle = 'Differential Expression Genes',
    )

    我们在做整套分析时需要注意使用硬件的配置,我是在window 11 利用 Rstudio 做的分析,可想而知,还是很耗内存的,看下我电脑的配置,如下:

    然后我们再看一下做这几百个样本需要的内存,当然这中间产生了很多切换变量,如果可以优化的后期我会尽量优化,如下:

    关于差异表达主流软件已经都了解的差不多了,后面会针对检测出来的表达基因做一些后续的分析,比如GO, KEGG, GSEA等功能上的分析,敬请期待!

    关注公众号,每日有更新!

    桓峰基因 生物信息分析,SCI文章撰写及生物信息基础知识学习:R语言学习,perl基础编程,linux系统命令,Python遇见更好的你 44篇原创内容 --> 公众号

    Reference:

    1. Agarwal A, Koppstein D, Rozowsky J, et al. Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays. BMC Genomics. 2010;11:383.

    2. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139‐140.

    3. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.

    4. Maurya NS, Kushwaha S, Chawade A, Mani A. Transcriptome profiling by combined machine learning and statistical R analysis identifies TMEM236 as a potential novel diagnostic biomarker for colorectal cancer. Sci Rep. 2021;11(1):14304.

    本文使用 文章同步助手 同步

    相关文章

      网友评论

        本文标题:RNA 3. SCI 文章中基于T CGA 差异表达基因之 DE

        本文链接:https://www.haomeiwen.com/subject/rdbxrrtx.html