美文网首页走进转录组转录组专题
RNA-seq分析流程二:DEseq2做不同组间差异表达分析

RNA-seq分析流程二:DEseq2做不同组间差异表达分析

作者: 纵纵纵小鸮 | 来源:发表于2022-10-21 11:05 被阅读0次

    使用DEseq2循环做多组间差异表达分析

        当有多组RNA-seq数据时,有时需要对多个组合进行差异表达分析,例如当我有CIM0/CIM7/CIM14/CIM28四组时,我需要得到每个组合间的差异表达情况,CIM7:CIM0; CIM14:CIM0; CIM14:CIM7等。使用ANOVA的方式也可以进行多组间比较,但由于ANOVA是指定同一个CK,并且不能得到具体是哪组相对于CK有差异表达,不能精准的解决我的需求,因此选择使用DEseq2循环对不同组进行差异表达分析。

    一. R脚本

        目前脚本中DEGs(差异表达基因)筛选标准为log2FoldChange>1或log2FoldChange<-1以及pvalue<0.05, qvalue<0.05,可根据需求更改阈值。

    ###

    library(GenomicFeatures)

    library(DESeq2)

    library(dplyr)

    ###load and set output file

    file_in <- "counts.txt"

    file_design <- "design.txt"

    file_compare <- "compare2.txt"

    file_deg_num = paste("data//","DE_",file_in,sep="")  ##number of DEGs in all compare

    file_final_csv  = paste("data//","DE_",file_in, "_Final_Out.csv",sep="")

    file_final_genelist = paste("data//","DEG_geneid_allcomapre.txt",sep="")

    ################################################################

    ###########get DEGs in different compare

    # counts file

    data_in = read.table(file_in, head=TRUE,row.names =1, check.names = FALSE)

    mycompare=read.table(file_compare,head=TRUE)

    mydesign=read.table(file_design, head = TRUE)

    ##filter counts

    countData = as.data.frame(data_in)

    dim(countData)

    mycounts_filter <- countData[rowSums(countData) != 0,]

    dim(mycounts_filter)

    ##check group

    head(mycompare)

    head(mydesign)

    total_num=dim(mycompare)[1]  ##get the num of groups

    tracking = 0

    gene_num_out = c()

    pvalue_cut = 0.01

    condition_name = c()

    for (index_num in c(1:total_num)){

      tracking = tracking + 1

      test_name = as.character(mycompare[index_num,1])

      group1 = as.character(mycompare[index_num, 2])

      group2 = as.character(mycompare[index_num, 3])

      sample1 = mydesign[mydesign$Group == group1,]

      sample2 = mydesign[mydesign$Group == group2,]

      allsample = rbind(sample1, sample2)

      counts_sample = as.character(allsample$counts_id)

      groupreal = as.character(allsample$Group)

      countData = mycounts_filter[, counts_sample]

      colData = as.data.frame(cbind(counts_sample, groupreal))

      names(colData) = c("sample", "condition")

      dds <- DESeqDataSetFromMatrix(countData = countData,

                                    colData = colData,

                                    design = ~ condition)

      dds <- DESeq(dds)

      resSFtreatment <- results(dds, cooksCutoff =FALSE, contrast=c("condition",group2,group1))

      out_test = as.data.frame(resSFtreatment)

      final_each = cbind( out_test$log2FoldChange,out_test$pvalue,out_test$padj)

      rownames(final_each) = resSFtreatment@rownames

      names = c('(logFC)','(pvalue)','(Qvalue)')

      final_name = paste(test_name,'_',names,sep="")

      colnames(final_each) = final_name

      if (tracking == 1){

        final_table = final_each

      }else{

        final_table = cbind(final_table, final_each)

      }

      gene_sel = out_test[((!is.na(out_test$pvalue))&(!is.na(out_test$log2FoldChange)))&out_test$pvalue < 0.05 & abs(out_test$log2FoldChange) >= 1 & out_test$padj<0.05, ]

      gene_sel<-na.omit(gene_sel)  ##delete rows contain NA

      gene_sel_up = gene_sel [gene_sel$log2FoldChange>0,]

      gene_sel_do = gene_sel [gene_sel$log2FoldChange<0,]

      file_out_up = paste("data//DEGsid//","UP_",test_name,".txt",sep="")

      file_out_do = paste("data//DEGsid//","DOWN_",test_name,".txt",sep="")

      gene_list_up = rownames(gene_sel_up)

      gene_list_do = rownames(gene_sel_do)

      all_ub_down = list(gene_list_up, gene_list_do)

      nameup <- paste(test_name,"_up",sep="")

      namedown <- paste(test_name,"_down",sep="")

      names(all_ub_down) = c(nameup, namedown)

      if (tracking == 1){

        final_genelist = all_ub_down

      }else{

        final_genelist = c(final_genelist, all_ub_down)

      }

      gene_num_up = length(gene_list_up)

      gene_num_do = length(gene_list_do)

      write.table(gene_sel_up, file= file_out_up , row.names = TRUE,col.names = TRUE)

      write.table(gene_sel_do, file= file_out_do , row.names = TRUE,col.names = TRUE)

      condition_name = c(condition_name, paste("UP_",test_name,sep=""),paste("DO_",test_name,sep=""))

      gene_num_out = c(gene_num_out,gene_num_up,gene_num_do)

    }

    final_DEGs_list<-do.call(cbind, lapply(lapply(final_genelist, unlist), `length<-`, max(lengths(final_genelist))))

    final_DEGs_list

    out_final2 = cbind(condition_name, gene_num_out)

    colnames(out_final2) = c("Tests", "DEG number")

    write.table(final_DEGs_list, file=file_final_genelist, row.names = FALSE, sep="\t", na = "") ###all DEGs list

    write.table(out_final2, file=file_deg_num, row.names = FALSE, sep="\t")    ##DEG number in different compare

    write.csv(final_table, file=file_final_csv, row.names = TRUE,quote=TRUE)  ##allgene in all compare

    二. 输入文件

    2.1.counts.txt

        DEseq2的输入counts数据需为整数;如果是RSEM结果,使用expected

    counts取整。

    2.2.design.txt

    文件共四列,格式如下:

    sample_id: 样本id;

    counts_id: counts文件中的样品id;

    Group: 样本分组;

    rep: 样本重复;

    2.3.compare.txt

    指定比较组

    Test:进行比较的组

    denominator: 对照组

    numerator: 比较组

    注:对照组在前,比较组在后,假如做CIM7相对于CIM0的差异表达基因分析,denominatorCIM0numeratorCIM7

    三.输出文件

        运行脚本前在当前目录下建立data文件夹,并在data下建立DEGsid文件夹,用于存放输出文件。脚本运行完成后会在data文件夹下输出3个文件:

    3.1.DE_counts.txt_Final_Out.csv

    输出总表,格式如下:

    第一列为基因id,包括所有基因,后几列为所有比较组的logFC,pvalue以及qvaue值。

    3.2.DE_counts.txt

    所有比较组中差异表达基因的数目。

    3.3. DEG_geneid_allcomapre.txt

        所有比较组中差异表达基因的基因id,该文件用于筛选不同组间的重叠基因。“-up”表示比较组内上调的基因,“_down”表示比较组内下调的基因。

    3.4.DEGsid

        DEGsid文件夹下生成一系列文件,为每个比较组的DEseq2的分析结果, “UP_”和“DOWN”分别表示上调和下调的基因。

    四.找不同比较组间的重叠DEGs

    得到这些输出结果后我还想找到不同组间的重叠DEGs:

    names(mydegs)获得不同比较组的名称,指定mycom1和mycom2后使用intersect()找到交集。

    相关文章

      网友评论

        本文标题:RNA-seq分析流程二:DEseq2做不同组间差异表达分析

        本文链接:https://www.haomeiwen.com/subject/khfszrtx.html