美文网首页
RNA-seq nomalization

RNA-seq nomalization

作者: 大福和胜利比不上我小五毛 | 来源:发表于2019-02-21 14:17 被阅读21次

    7 Normalization

    As different libraries will be sequenced to different depths, offsets are built in the statistical model of DESeq2 to ensure that parameters are comparable.

    The term normalization is often used for that, but it should be noted that the raw read counts are not actually altered. DESeq2defines a virtual reference sample by taking the median of each gene’s values across samples and then computes size factors as the median of ratios of each sample to the reference sample.

    Generally, the ratios of the size factors should roughly match the ratios of the library sizes. Dividing each column of the count table by the corresponding size factor yields normalized count values, which can be scaled to give a counts per million interpretation.

    Thus, if all size factors are roughly equal to one, the libraries have been sequenced equally deeply.

    Count data sets typically show a (left–facing) trombone shape in average vs mean–difference plots (MA–plots), reflecting the higher variability of log ratios at lower counts. In addition, points will typically be centered around a log ratio of 0 if the normalization factors are calculated appropriately, although this is just a general guide.

    In the code below we produce density estimates of the sample counts and pairwise mean–difference plots after the computation of the size factors. The MA–plots are saved to a file in order to avoid cluttering the document.

    We first relevel the condition factor to make sure that we have the wild type samples as a base level. This will allow us to obtain the fold changes always as WT–deletion. We also rename the deletion level to “Del”.

    colData(DESeq2Table)$condition <- factor(colData(DESeq2Table)$condition, levels = c("WT", "Del"))
    colData(DESeq2Table)$condition <- factor(ifelse(is.na(colData(DESeq2Table)$condition),  "Del", "WT"), levels = c("WT", "Del"))
    

    We now estimate estimate the size factors an print them

    DESeq2Table <- estimateSizeFactors(DESeq2Table)
    sizeFactors(DESeq2Table)
    
    ##            WT-SRR1042885            WT-SRR1042886            WT-SRR1042887            WT-SRR1042888 
    ##                   0.8362                   0.6322                   1.3975                   1.3709 
    ## Del_8_17_homo-SRR1042889 Del_8_17_homo-SRR1042890 Del_8_17_homo-SRR1042891 Del_8_17_homo-SRR1042892 
    ##                   1.0086                   0.4473                   0.8602                   2.6086
    

    To assess whether the normalization has worked, we plot the densities of counts for the different samples. Since most of the genes are (heavily) affected by the experimental conditions, a succesful normalization will lead to overlapping densities.

    multidensity( counts(DESeq2Table, normalized = T)[idx.nz ,],
        xlab="mean counts", xlim=c(0, 1000))
    
    image.png

    This is indeed the case here. The same holds true for the empircal cummulative distribution functions (ECDFs) which can be thought of as integrals of the densities and give the probability of observing a certain number of counts equal to <nobr aria-hidden="true" style="transition: none 0s ease 0s; border: 0px; padding: 0px; margin: 0px; max-width: none; max-height: none; min-width: 0px; min-height: 0px; vertical-align: 0px; line-height: normal; text-decoration: none; white-space: nowrap !important;">x</nobr><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> or less given the data.

    In an ECDF plot, the estimated probility is plotted on the y–axis and the count values on the x–axis. I.e. you can read of the median and other quantiles from this plot.

    As already mentioned, if the normalization has worked, the ECDFs of the different samples should be overlapping.

     multiecdf( counts(DESeq2Table, normalized = T)[idx.nz ,],
        xlab="mean counts", xlim=c(0, 1000))
    
    image.png

    To further assess systematic differences between the samples, we can also plot pairwise mean–average plots: We plot the average of the log–transformed counts vs the fold change per gene for each of the sample pairs.

    The function combn helps us to automatically generate all sample pairs and the function MDPlot from the EDASeq package then generates the the pairwise MA plots.

    pdf("pairwiseMAs.pdf")
      MA.idx = t(combn(1:8, 2))
      for( i in  seq_along( MA.idx[,1])){ 
       MDPlot(counts(DESeq2Table, normalized = T)[idx.nz ,], 
            c(MA.idx[i,1],MA.idx[i,2]), 
        main = paste( colnames(DESeq2Table)[MA.idx[i,1]], " vs ",
         colnames(DESeq2Table)[MA.idx[i,2]] ), ylim = c(-3,3))
        }
    dev.off()
    
    ## png 
    ##   2
    

    The plots look good, if there is no systematic shift between the samples, the log–fold–change should scatter around zero, which is the case here.

    相关文章

      网友评论

          本文标题:RNA-seq nomalization

          本文链接:https://www.haomeiwen.com/subject/ggmryqtx.html