Reposted from the Harvard bioinformatics course "Introduction to Single-cell RNA-seq"
Count Normalization and Principal Component Analysis
After obtaining high-quality single cells, the next step in the single-cell RNA-seq (scRNA-seq) analysis workflow is clustering. The goal of clustering is to separate different cell types into distinct clusters of cells. To perform clustering, we first determine the genes whose expression differs most between cells, the highly variable genes (HVGs). Then we use these genes to determine which sets of correlated genes are responsible for the largest differences in expression between cells.
1. Count normalization
The first step is count normalization, which is essential for making accurate comparisons of gene expression between cells (or samples). The count of mapped reads for each gene is proportional to the expression of RNA (“interesting”) in addition to many other factors (“uninteresting”). Normalization is the process of scaling raw count values to account for the “uninteresting” factors, so that expression levels are more comparable between and/or within cells.
Each cell in scRNA-seq has a different number of reads associated with it, so to compare expression between cells accurately, it is necessary to normalize for sequencing depth.
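As a minimal sketch of depth normalization, the Python snippet below divides each count by the cell's total, multiplies by a fixed scale factor, and log-transforms the result. This particular recipe, the scale factor of 10,000, the function name `log_normalize`, and the two toy cells are all assumptions made for illustration; the text above does not specify which normalization method is used.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Depth-normalize one cell's raw counts, then log-transform.

    counts: dict mapping gene name -> raw count for a single cell.
    scale_factor: assumed constant; every cell is rescaled to this total.
    """
    total = sum(counts.values())
    if total == 0:
        # An empty cell has no depth to normalize by; return zeros.
        return {gene: 0.0 for gene in counts}
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in counts.items()
    }

# Two hypothetical cells with identical composition but 10x different depth.
shallow = {"geneA": 1, "geneB": 3, "geneC": 6}
deep = {"geneA": 10, "geneB": 30, "geneC": 60}

norm_shallow = log_normalize(shallow)
norm_deep = log_normalize(deep)

for gene in shallow:
    print(gene, round(norm_shallow[gene], 3), round(norm_deep[gene], 3))
```

The raw counts of the two cells differ tenfold, but after normalization each gene has the same value in both, which is the sense in which depth is an “uninteresting” factor.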
2. Principal Component Analysis (PCA)
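Suppose we quantify the expression of four genes in two samples (or cells) and plot each gene as a point, with its count in Sample1 on one axis and its count in Sample2 on the other. You could draw a line through the data in the direction representing the most variation, which is on the diagonal in this example. The maximum variation in the dataset is between the genes that make up the two endpoints of this line.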
We could also rotate the entire plot and view the lines representing the variation as left-to-right and up-and-down. We see most of the variation in the data is left-to-right (longer line) and the second most variation in the data is up-and-down (shorter line). You can now think of these lines as the axes that represent the variation. These axes are essentially the “Principal Components”, with PC1 representing the most variation in the data and PC2 representing the second most variation in the data.
If we had three samples/cells, we would have an extra direction in which we could have variation (3D). Therefore, if we have N samples/cells, we would have N directions of variation, or principal components (PCs). Once these PCs have been calculated, the PC that captures the largest variation in the dataset is designated PC1, the next one PC2, and so on.
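To make the idea of "directions of variation" concrete, here is a small sketch that finds PC1 and PC2 for a two-sample dataset by taking the eigenvectors of the 2x2 covariance matrix. The function name `principal_axes_2d` is hypothetical, the gene coordinates are the read counts used in the 2-sample score calculation in this section, and the sketch centers the data as PCA usually does; it is an illustration under those assumptions, not the exact procedure of any particular tool.

```python
import math

def principal_axes_2d(points):
    """Return ((var1, var2), (pc1, pc2)) for a list of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centered data.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    var1, var2 = half_trace + spread, half_trace - spread
    # Eigenvector belonging to the larger eigenvalue (direction of PC1).
    if sxy != 0:
        vx, vy = sxy, var1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)
    pc2 = (-pc1[1], pc1[0])  # perpendicular to PC1
    return (var1, var2), (pc1, pc2)

# Each point is one gene: (count in Sample1, count in Sample2).
genes = [(4, 5), (1, 4), (8, 8), (5, 7)]

variances, directions = principal_axes_2d(genes)
explained = variances[0] / sum(variances)

print("PC1 direction:", directions[0])
print("Fraction of variance along PC1:", round(explained, 3))
```

With two samples there are only two such axes; with N samples the same idea extends to an N x N covariance matrix and N components, ordered by how much variance each one captures.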
Once the PCs have been determined, each one needs to be scored. A score is calculated for every sample-PC pair as follows:
Sample1 PC1 score = (read count * influence) + ... for all genes
For our 2-sample example, the following is how the scores would be calculated:
## Sample1
PC1 score = (4 * -2) + (1 * -10) + (8 * 8) + (5 * 1) = 51
PC2 score = (4 * 0.5) + (1 * 1) + (8 * -5) + (5 * 6) = -7
## Sample2
PC1 score = (5 * -2) + (4 * -10) + (8 * 8) + (7 * 1) = 21
PC2 score = (5 * 0.5) + (4 * 1) + (8 * -5) + (7 * 6) = 8.5
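The score calculation above is a sum of products of read counts and influences, one term per gene. The sketch below reproduces it in Python using the numbers from the worked example; the dictionary layout and the helper name `pc_score` are illustrative choices, not part of the original material.

```python
# Read counts for the four genes in each sample (same gene order throughout).
counts = {
    "Sample1": [4, 1, 8, 5],
    "Sample2": [5, 4, 8, 7],
}

# Influence of each gene on each PC, taken from the worked example.
influence = {
    "PC1": [-2, -10, 8, 1],
    "PC2": [0.5, 1, -5, 6],
}

def pc_score(read_counts, gene_influence):
    """Sum of (read count * influence) over all genes."""
    return sum(c * w for c, w in zip(read_counts, gene_influence))

# One score per sample-PC pair.
scores = {
    sample: {pc: pc_score(sample_counts, weights) for pc, weights in influence.items()}
    for sample, sample_counts in counts.items()
}

for sample, by_pc in scores.items():
    print(sample, by_pc)
```

Running this gives 51 and -7 for Sample1 and 21 and 8.5 for Sample2, matching the hand calculation.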