RNA-seq nomalization

作者: 大福和胜利比不上我小五毛 | 来源:发表于2019-02-21 14:17 被阅读21次

RNA-seq nomalization
RNA-seq 检测变异之 GATK 最佳实践流程
RNA-seq无比对直接定量(Kallisto - sleuth
STAR 使用手册
39个转录组分析工具，120种组合评估
无生物学重复RNA-seq分析 CORNAS: coverag
深入理解RNA-seq
生成htseq-count的input文件
还在单细胞测序？单核更出色
Day1-君君

7 Normalization

As different libraries will be sequenced to different depths, offsets are built in the statistical model of DESeq2 to ensure that parameters are comparable.

The term normalization is often used for that, but it should be noted that the raw read counts are not actually altered. DESeq2defines a virtual reference sample by taking the median of each gene’s values across samples and then computes size factors as the median of ratios of each sample to the reference sample.

Generally, the ratios of the size factors should roughly match the ratios of the library sizes. Dividing each column of the count table by the corresponding size factor yields normalized count values, which can be scaled to give a counts per million interpretation.

Thus, if all size factors are roughly equal to one, the libraries have been sequenced equally deeply.

Count data sets typically show a (left–facing) trombone shape in average vs mean–difference plots (MA–plots), reflecting the higher variability of log ratios at lower counts. In addition, points will typically be centered around a log ratio of 0 if the normalization factors are calculated appropriately, although this is just a general guide.

In the code below we produce density estimates of the sample counts and pairwise mean–difference plots after the computation of the size factors. The MA–plots are saved to a file in order to avoid cluttering the document.

We first relevel the condition factor to make sure that we have the wild type samples as a base level. This will allow us to obtain the fold changes always as WT–deletion. We also rename the deletion level to “Del”.

colData(DESeq2Table)$condition <- factor(colData(DESeq2Table)$condition, levels = c("WT", "Del"))
colData(DESeq2Table)$condition <- factor(ifelse(is.na(colData(DESeq2Table)$condition),  "Del", "WT"), levels = c("WT", "Del"))

We now estimate estimate the size factors an print them

DESeq2Table <- estimateSizeFactors(DESeq2Table)
sizeFactors(DESeq2Table)

##            WT-SRR1042885            WT-SRR1042886            WT-SRR1042887            WT-SRR1042888 
##                   0.8362                   0.6322                   1.3975                   1.3709 
## Del_8_17_homo-SRR1042889 Del_8_17_homo-SRR1042890 Del_8_17_homo-SRR1042891 Del_8_17_homo-SRR1042892 
##                   1.0086                   0.4473                   0.8602                   2.6086

To assess whether the normalization has worked, we plot the densities of counts for the different samples. Since most of the genes are (heavily) affected by the experimental conditions, a succesful normalization will lead to overlapping densities.

multidensity( counts(DESeq2Table, normalized = T)[idx.nz ,],
    xlab="mean counts", xlim=c(0, 1000))

image.png

This is indeed the case here. The same holds true for the empircal cummulative distribution functions (ECDFs) which can be thought of as integrals of the densities and give the probability of observing a certain number of counts equal to <nobr aria-hidden="true" style="transition: none 0s ease 0s; border: 0px; padding: 0px; margin: 0px; max-width: none; max-height: none; min-width: 0px; min-height: 0px; vertical-align: 0px; line-height: normal; text-decoration: none; white-space: nowrap !important;">x</nobr><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> or less given the data.

In an ECDF plot, the estimated probility is plotted on the y–axis and the count values on the x–axis. I.e. you can read of the median and other quantiles from this plot.

As already mentioned, if the normalization has worked, the ECDFs of the different samples should be overlapping.

 multiecdf( counts(DESeq2Table, normalized = T)[idx.nz ,],
    xlab="mean counts", xlim=c(0, 1000))

image.png

To further assess systematic differences between the samples, we can also plot pairwise mean–average plots: We plot the average of the log–transformed counts vs the fold change per gene for each of the sample pairs.

The function combn helps us to automatically generate all sample pairs and the function MDPlot from the EDASeq package then generates the the pairwise MA plots.

pdf("pairwiseMAs.pdf")
  MA.idx = t(combn(1:8, 2))
  for( i in  seq_along( MA.idx[,1])){ 
   MDPlot(counts(DESeq2Table, normalized = T)[idx.nz ,], 
        c(MA.idx[i,1],MA.idx[i,2]), 
    main = paste( colnames(DESeq2Table)[MA.idx[i,1]], " vs ",
     colnames(DESeq2Table)[MA.idx[i,2]] ), ylim = c(-3,3))
    }
dev.off()

## png 
##   2

The plots look good, if there is no systematic shift between the samples, the log–fold–change should scatter around zero, which is the case here.

网友评论

本文标题：RNA-seq nomalization

本文链接：https://www.haomeiwen.com/subject/ggmryqtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

RNA-seq nomalization

7 Normalization

相关文章

RNA-seq nomalization

RNA-seq 检测变异之 GATK 最佳实践流程

RNA-seq无比对直接定量(Kallisto - sleuth

STAR 使用手册

39个转录组分析工具，120种组合评估

无生物学重复RNA-seq分析 CORNAS: coverag

深入理解RNA-seq

生成htseq-count的input文件

还在单细胞测序？单核更出色

Day1-君君

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读