美文网首页GEO数据挖掘
maSigpro建模代码分析之make.design.matri

maSigpro建模代码分析之make.design.matri

作者: 小潤澤 | 来源:发表于2021-01-15 18:28 被阅读0次

    前言

    之前我们介绍了maSigpro的基本建模方式:分析时间序列的RNA-seq tool,这一次我们从代码的角度来分析其具体的建模原理
    我们利用自带的数据集做运算

    library(maSigPro) # load maSigPro library
    data(data.abiotic)
    data(edesign.abiotic)
    
    edesign = edesign.abiotic
    data = data.abiotic
    
    design <- make.design.matrix(edesign, degree = 2)
    
    fit <- p.vector(data, design, Q = 0.05, MT.adjust = "BH", min.obs = 20)
    
    tstep <- T.fit(fit, step.method = "backward", alfa = 0.05)
    

    我们将重点从这几段代码入手来解析其建模原理

    make.design.matrix

    那么第一步是将我们的设计矩阵做转换,首先我们的设计矩阵为这个样子:


    至于该函数有一下几个参数:

    "make.design.matrix" <-
    function (edesign, degree = 2, time.col = 1, 
    repl.col = 2, group.cols = c(3:ncol(edesign)))
    

    当满足if (dim(as.matrix(edesign))[2] > 3)
    执行:

    if (dim(as.matrix(edesign))[2] > 3) {
            dummy.cols <- group.cols[2:length(group.cols)]
            treatm.label <- paste(colnames(edesign)[dummy.cols], 
                "vs", control.label, sep = "")
            groups.label <- c(control.label, treatm.label)
            matrix.dummy <- as.matrix(edesign[, dummy.cols])
            ## Shared origin
        dummy <- NULL
        j = 0
        origen  <- min(edesign[,time.col])
        origen <-edesign[edesign[,1]==origen,]
        for (i in 1:length(dummy.cols)) {
           share <- apply(origen[,c(3,dummy.cols[i])],1,sum)
           if  (!is.element(TRUE, share>1)) {
           j=j+1
              dummy <- cbind(dummy,matrix.dummy[,i])
              colnames(dummy)[j] <- treatm.label[i]
           } 
        }
            time <- as.matrix(edesign[, time.col])
            colnames(time) <- colnames(edesign)[time.col]
            dis <- cbind(dummy, time)
            rownames(dis) <- rownames(edesign)
            groups.vector <- c(colnames(dummy), control.label)
            colnames.dis <- colnames(dis)
            dis <- cbind(dis, dis[,ncol(dis)] * matrix.dummy)
            colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], 
                "x", colnames(edesign)[dummy.cols], sep = ""))
            groups.vector <- c(groups.vector, treatm.label)
            if (degree >= 2) {
                for (i in 2:degree) {
                    colnames.dis <- colnames(dis)
                    dis <- cbind(dis, edesign[, time.col]^i, edesign[, 
                      time.col]^i * edesign[, dummy.cols])
                    colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], 
                      i, sep = ""), paste(colnames(edesign)[time.col], 
                      "", i, "x", colnames(edesign)[dummy.cols], 
                      sep = ""))
                    groups.vector <- c(groups.vector, groups.label)
                }
            }
        }
    

    满足这个条件意味着你的设计矩阵的列数要大于3,如果满足该条件则:

    #group.cols是将control和control后面的列取出来
    group.cols = c(3:ncol(edesign))
    control.label <- colnames(edesign)[group.cols][1]
    #指代设计矩阵第一列是时间
    time.col = 1
    
    dummy.cols <- group.cols[2:length(group.cols)]
    #设计列名,即不同处理条件和control比较
    treatm.label <- paste(colnames(edesign)[dummy.cols], 
                          "vs", control.label, sep = "")
    groups.label <- c(control.label, treatm.label)
    matrix.dummy <- as.matrix(edesign[, dummy.cols])
    
    ##初始化
    dummy <- NULL
    j = 0
    #取出最小的时间序列,该例子中是最小的时间是3
    origen  <- min(edesign[,time.col])
    origen <-edesign[edesign[,1]==origen,]
    
    origen
    接下来就需要对dummy循环赋值了:
    for (i in 1:length(dummy.cols)) {
      share <- apply(origen[,c(3,dummy.cols[i])],1,sum)
      if  (!is.element(TRUE, share>1)) {
        j=j+1
        dummy <- cbind(dummy,matrix.dummy[,i])
        colnames(dummy)[j] <- treatm.label[i]
      } 
    }
    

    此时设计出来的矩阵为:

    dummy
    通过该矩阵的形式我们可以看到所有的处理组(Cold,Heat,Salt)都是与Control组相比,比方说ColdvsControl则对应于列名含有Cold的组别,并用 1 来区别

    若degree=1

    #提取设计矩阵含有时间的那一列
    time <- as.matrix(edesign[, time.col])
    colnames(time) <- colnames(edesign)[time.col]
    #合并dummy和time
    dis <- cbind(dummy, time)
    rownames(dis) <- rownames(edesign)
    groups.vector <- c(colnames(dummy), control.label)
    colnames.dis <- colnames(dis)
    #dis[,ncol(dis)]实际上是提取time这一列
    dis <- cbind(dis, dis[,ncol(dis)] * matrix.dummy)
    colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], 
                                           "x", colnames(edesign)[dummy.cols], sep = ""))
    groups.vector <- c(groups.vector, treatm.label)
    
    dis[,ncol(dis)]
    matrix.dummy
    matrix.dummy是设计矩阵的后三列,代表不同的处理
    dis矩阵(degree=1)
    dis矩阵有一个特点,比方说Control_3H因为没有任何处理所以ColdvsControl,HeatvsControl和SaltvsControl均为0,但是该样本的时间点为3,所以time那一列为3,因为没有任何处理(Cold,Heat和Salt),所以后面的(Time×Cold,Time×Heat,Time×Salt)均为0;
    又例如Cold_9H,只有Cold处理,所以ColdvsControl这一项为1,其他(HeatvsControl和SaltvsControl)为0,但是该样本的时间点为9,所以time那一列为9,因为只有Cold处理,所以后面的Time×Cold为9,其他(Time×Heat,Time×Salt)为0

    若degree≥2

    接下来就要进行degree的判断了,如果这里我们假设degree=3,这里的degree代表的是多项式回归的最高次数,我们前面简单介绍了dis矩阵的构造,下面

    degree = 3
    if (degree >= 2) {
      for (i in 2:degree) {
        colnames.dis <- colnames(dis)
        dis <- cbind(dis, edesign[, time.col]^i, 
                          edesign[, time.col]^i * edesign[,dummy.cols])
        colnames(dis) <- c(colnames.dis, paste(colnames(edesign)[time.col], i, sep = ""), 
                           paste(colnames(edesign)[time.col], "", i, "x", colnames(edesign)[dummy.cols], sep = ""))
        groups.vector <- c(groups.vector, groups.label)
      }
    }
    
    dis矩阵(degree=3)左半边
    dis矩阵(degree=3)右半边

    我们可以看到,degree=3的dis矩阵和degree=1的dis矩阵相比多了Time2和Time3,例如Control_3HTime2是取了2次方的,即3^2=9,Time3是去了3次方的3^3=27
    Cold_9HTime2是取了2次方的,即9^2=81,Time3是去了3次方的9^3=729,其他的解释与degree=1的dis矩阵一样
    这么做的目的是引入与时间相关的高次多项式,更好的拟合模型

    下期预告:maSigpro建模代码分析之p.vector

    相关文章

      网友评论

        本文标题:maSigpro建模代码分析之make.design.matri

        本文链接:https://www.haomeiwen.com/subject/kouiaktx.html