Two main problems in library normalization
Problem 1 Adjusting for differences in library sizes
Problem 2 Adjusting for differences in library composition
We’ll start with a small dataset to illustrate how DESeq2 scales the different samples.
The goal is to calculate a scaling factor for each sample. The scaling factor has to take read depth and library composition into account.
Step 1 Take the log of all values
Step 2 Average Each Row
One cool thing about the average of log values is that it is not easily swayed by outliers. Averages calculated with logs are called “Geometric Averages”.
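Steps 1 and 2 can be sketched in a few lines of Python. The read counts below are made up for illustration (they are not the dataset from the original walkthrough), and `safe_log` is a hypothetical helper that treats log(0) as negative infinity.

```python
import math

# Hypothetical read counts: rows are genes, columns are three samples.
counts = {
    "gene1": [0, 10, 4],
    "gene2": [2, 6, 12],
    "gene3": [33, 55, 200],
}

def safe_log(x):
    """Natural log, treating log(0) as negative infinity."""
    return math.log(x) if x > 0 else float("-inf")

# Step 1: take the log of all values.
log_counts = {gene: [safe_log(c) for c in row] for gene, row in counts.items()}

# Step 2: average each row of logs.
row_means = {gene: sum(row) / len(row) for gene, row in log_counts.items()}

# Converting a row average back from log scale gives the geometric average.
geo_means = {gene: math.exp(m) for gene, m in row_means.items()}

# Compare with the ordinary average for gene3, whose third sample is an outlier.
arith_mean_gene3 = sum(counts["gene3"]) / len(counts["gene3"])
print(arith_mean_gene3, geo_means["gene3"])
```

For gene3 the geometric average comes out well below the ordinary average of 96, because the outlier count of 200 carries less weight on the log scale.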
Step 3 Filter out Genes with Infinity
In general, this step filters out genes with zero read counts in one or more samples, because log(0) is negative infinity and that makes the row average negative infinity too.
In theory, this helps focus the scaling factors on the housekeeping genes.
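A minimal sketch of the filter, assuming the row averages are already computed. The gene names and values are hypothetical; gene1 stands in for a gene with a zero count in one sample.

```python
import math

# Hypothetical row averages of log counts; gene1 had a zero count,
# so its average is negative infinity.
row_means = {
    "gene1": float("-inf"),
    "gene2": math.log(2 * 6 * 12) / 3,
    "gene3": math.log(33 * 55 * 200) / 3,
}

# Step 3: keep only genes whose average log value is finite.
kept_genes = [gene for gene, m in row_means.items() if math.isfinite(m)]
print(kept_genes)
```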
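Step 4 Subtract the average log value from the log(counts)
Subtracting logs is the same as taking the log of a ratio, so each value is now log(read count / geometric average for that gene).
Step 5 Calculate the median of the ratios for each sample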
Step 6 Convert the medians to “normal numbers” to get the final scaling factors for each sample
The median values are logs, that is, exponents for e, so the scaling factor for each sample is e raised to that sample's median.
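The median and conversion steps can be sketched as follows, starting from genes that survived the infinity filter. The counts are hypothetical, and the log ratios are computed as each log count minus its gene's average log value. This is an illustration of the steps described here, not DESeq2's own code.

```python
import math
import statistics

# Hypothetical counts for genes that survived the infinity filter
# (rows = genes, columns = three samples).
counts = {
    "gene2": [2, 6, 12],
    "gene3": [33, 55, 200],
}
n_samples = 3

log_counts = {g: [math.log(c) for c in row] for g, row in counts.items()}
row_means = {g: sum(row) / len(row) for g, row in log_counts.items()}

# Log of the ratio (count / geometric average), computed as a difference of logs.
log_ratios = {g: [x - row_means[g] for x in row] for g, row in log_counts.items()}

# Median of the log ratios within each sample (column).
medians = [
    statistics.median(log_ratios[g][j] for g in counts)
    for j in range(n_samples)
]

# Raise e to each median to get one scaling factor per sample.
scaling_factors = [math.exp(m) for m in medians]
print(scaling_factors)
```

With these made-up counts the factors increase from the first sample to the third, matching the increasing read counts across the columns.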
Step 7 Divide the original read counts by the scaling factors
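As a small illustration of this last step, the sketch below divides every count by its sample's scaling factor. Both the counts and the scaling factors are hypothetical round numbers chosen to make the arithmetic easy to follow.

```python
# Hypothetical read counts (rows = genes, columns = samples) and
# hypothetical scaling factors, one per sample.
counts = [
    [2, 6, 12],
    [33, 55, 200],
]
scaling_factors = [0.5, 1.0, 2.0]

# Step 7: divide each original count by its sample's scaling factor.
normalized = [
    [c / f for c, f in zip(row, scaling_factors)]
    for row in counts
]
print(normalized)
```

A factor below 1 scales a shallow sample up, and a factor above 1 scales a deep sample down.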
Summary of DESeq2’s Library Size Scaling Factor
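Logs eliminate all genes that are only transcribed in one sample type (for example, liver vs. spleen), because a zero count in the other sample type turns the row average into negative infinity. They also help smooth over outlier read counts (via the Geometric Mean).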
The median further downplays genes that soak up a lot of the reads, putting more emphasis on moderately expressed genes.
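To see both effects at once, here is a rough end-to-end sketch. The function `size_factors` is a hypothetical name for the steps described in this walkthrough, not a DESeq2 function, and the two-sample dataset is invented: three genes with identical counts, one gene that soaks up reads in sample 2, and one gene transcribed only in sample 2.

```python
import math
import statistics

def size_factors(counts):
    """Median-of-ratios scaling factors; counts is a list of gene rows."""
    log_ratio_rows = []
    for row in counts:
        if min(row) <= 0:
            continue  # log(0) is -infinity, so this gene is filtered out
        logs = [math.log(c) for c in row]
        mean_log = sum(logs) / len(logs)
        log_ratio_rows.append([x - mean_log for x in logs])
    n_samples = len(counts[0])
    return [
        math.exp(statistics.median(r[j] for r in log_ratio_rows))
        for j in range(n_samples)
    ]

# Hypothetical two-sample dataset (rows = genes, columns = samples).
counts = [
    [100, 100],
    [100, 100],
    [100, 100],
    [100, 10000],  # soaks up a lot of reads in sample 2
    [0, 50],       # only transcribed in sample 2
]
factors = size_factors(counts)
totals = [sum(row[j] for row in counts) for j in range(2)]
print(factors, totals)
```

Both scaling factors come out as 1, because the median ignores the one highly expressed gene and the zero-count gene is dropped. Scaling by total reads alone would instead treat sample 2 as roughly 26 times deeper than sample 1.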