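1. Sample composition
99 virus-inactivated serum samples, divided into four groups: healthy controls, suspected cases that turned out to be ordinary influenza, non-severe COVID-19 patients, and severe COVID-19 patients.
Clinical information table 1
Clinical information table 2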
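2. Sample processing
- 5 μL of serum was dissolved in 50 μL of lysis buffer (8 M urea in 100 mM triethylammonium bicarbonate, TEAB), followed by reduction, alkylation, two-step trypsin digestion, and TMTpro 16-plex labeling;
- Peptides were pre-fractionated into 120 fractions, then combined into 40 fractions and analyzed on a Q Exactive HF-X in DDA mode;
- Database search with Proteome Discoverer (PD): the Homo sapiens fasta database downloaded from UniProtKB on 07 Jan 2020 and the SARS-CoV-2 virus fasta downloaded from NCBI (version NC_045512.2).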
- The peptide-spectrum-match allowed 1% target false discovery rate (FDR) (strict) and 5% target FDR (relaxed). Normalization was performed against the total peptide amount.
- Quality control: The quality of proteomic data was ensured at multiple levels.
a. First, a mouse liver digest was used for instrument performance evaluation.
b. We also ran water samples (buffer A) as blanks every 4 injections to avoid carry-over.
c. Serum samples of four patient groups from both training and validation cohorts were randomly distributed in eight different batches.
d. Six samples were injected in technical replicates.
- Untargeted metabolomics: each sample was split into four aliquots for four analyses: two for analysis using two separate reverse-phase/ultra-performance liquid chromatography (RP/UPLC)-MS/MS methods with positive-ion mode electrospray ionization (ESI), one for analysis using RP/UPLC-MS/MS with negative-ion mode ESI, and one for analysis using hydrophilic interaction liquid chromatography (HILIC)/UPLC-MS/MS with negative-ion mode ESI.
- Statistical analysis
a. Fold-change threshold: Log2 fold change (log2 FC) was calculated from the group means for each pair of compared groups. Significantly changed proteins or metabolites were selected using the criteria of adjusted p value less than 0.05 and absolute log2 FC larger than 0.25.
b. t-test: A two-sided unpaired Welch's t test was performed for each pair of compared groups, and adjusted p values were calculated using the Benjamini & Hochberg correction.
c. Machine learning: In the training cohort, important features were selected as those with a mean decrease in accuracy larger than 3, using a random forest of 1,000 trees (R package randomForest, version 4.6.14) with 10-fold cross-validation, framed as a binary classification of severe versus non-severe patients on the combined differentially regulated protein and metabolite features. The random forest analysis was then repeated 100 times on the matrix containing only the selected important features, using the normalized additive predicted probability as the final predicted probability and the class with the larger probability as the predicted label. The selected important features were then used for random forest analysis on the independent validation cohort.
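The selection criteria in (a) and (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: abundances are assumed to be on a linear scale, the raw p values are assumed to come from a Welch's t test computed elsewhere (for example `scipy.stats.ttest_ind(a, b, equal_var=False)`), and all function names and numbers below are made up for the example.

```python
import math
from statistics import mean

def log2_fold_change(group_a, group_b):
    """log2 of the ratio of group means (assumes positive, linear-scale values)."""
    return math.log2(mean(group_a) / mean(group_b))

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_significant(log2fc, pvalues, fc_cutoff=0.25, alpha=0.05):
    """Indices with adjusted p < alpha and |log2 FC| > fc_cutoff."""
    adj = bh_adjust(pvalues)
    return [i for i, (fc, q) in enumerate(zip(log2fc, adj))
            if q < alpha and abs(fc) > fc_cutoff]

# Toy example with made-up numbers (not data from the paper).
fcs = [
    log2_fold_change([4.0, 4.4, 3.6], [2.0, 2.2, 1.8]),  # roughly 2-fold up
    0.10,
    -0.60,
    0.30,
]
raw_p = [0.005, 0.01, 0.03, 0.04]  # assumed Welch's t test p values
selected = select_significant(fcs, raw_p)
```

In this toy input the second feature has a small adjusted p value but is dropped because its |log2 FC| is below 0.25, which is the behavior of the combined filter.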
3. Results
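To process and analyze single-cell sequencing data effectively, especially to identify cell subtypes, the data usually need to be reduced in dimensionality first. Dimensionality reduction methods for single-cell sequencing data fall into two broad categories:
1. Dimensionality reduction: these methods project high-dimensional data into a low-dimensional space while preserving the key characteristics of the original data, so that the data can be displayed in two or three dimensions.
Commonly used dimensionality reduction methods include:
1) PCA (Principal Component Analysis), a linear dimensionality reduction method;
2) t-SNE (t-distributed stochastic neighbor embedding), a nonlinear dimensionality reduction method;
3) UMAP (uniform manifold approximation and projection) (Becht et al., 2018, Nat. Biotechnol.);
4) scvis (Ding et al., 2018, Nat. Commun.).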
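As a concrete picture of the simplest method in the list above, here is a minimal PCA sketch using NumPy's SVD. It assumes a samples-by-features matrix; the random toy matrix and the function name `pca` are illustrative only and have nothing to do with the paper's data.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X (samples x features) onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates
    explained = (S ** 2) / np.sum(S ** 2)            # variance fraction per component
    return scores, explained[:n_components]

# Toy data: 20 samples, 50 features, with the first 10 samples shifted
# in 5 features to mimic a group difference (made-up numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
X[:10, :5] += 3.0
scores, explained = pca(X, n_components=2)
```

The two columns of `scores` are what would be plotted as a 2D scatter. t-SNE and UMAP produce a similar samples-by-2 output but through nonlinear optimization rather than a linear projection.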
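2. Feature selection: this reduces the dimensionality of the data mainly by removing genes that carry little information and keeping the most informative genes.
Commonly used feature selection methods include:
1) Methods based on prior information (such as known cell subtypes). For example, SCDE can be used to identify genes differentially expressed between known cell subtypes, and clustering is then performed on those differentially expressed genes.
2) Unsupervised methods, which can be further divided into:
(i) methods based on highly variable genes (HVG);
(ii) methods based on spike-ins, such as scLVM (Buettner et al., 2015) and BASiCS (Vallejos et al., 2015);
(iii) methods based on dropout, such as M3Drop (Andrews and Hemberg, 2018).
Reference: https://www.cnblogs.com/aipufu/articles/11470334.html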
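The HVG idea in item (i) above can be illustrated with a deliberately simplified sketch: rank genes by dispersion (variance divided by mean) and keep the top few. Real HVG tools model the mean-variance trend more carefully, so treat this as an assumption-laden toy. The gene names, values, and the function `top_variable_genes` are hypothetical.

```python
from statistics import mean, variance

def top_variable_genes(expr, n_top):
    """expr maps gene name -> expression values across cells.
    Rank genes by dispersion (variance / mean) and return the n_top names."""
    dispersion = {}
    for gene, values in expr.items():
        m = mean(values)
        if m > 0:  # skip genes with no signal at all
            dispersion[gene] = variance(values) / m
    return sorted(dispersion, key=dispersion.get, reverse=True)[:n_top]

# Made-up expression values for four genes across four cells.
expr = {
    "geneA": [5, 5, 5, 5],     # constant, uninformative
    "geneB": [0, 10, 0, 10],   # strongly variable
    "geneC": [4, 6, 5, 5],     # mildly variable
    "geneD": [0, 0, 0, 0],     # not detected
}
hvgs = top_variable_genes(expr, n_top=2)
```

Downstream clustering or embedding would then use only the columns named in `hvgs`.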
- Part 1. Proteomic and metabolomic profiling of COVID-19 sera
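In total, 894 proteins and 941 metabolites were identified. The authors then check the CVs of the QC samples and the distribution of samples after UMAP dimensionality reduction.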
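For readers unfamiliar with the QC metric, the coefficient of variation (CV) of a feature is its standard deviation across replicate QC injections divided by its mean. A minimal sketch follows, with made-up intensities and hypothetical feature names; the paper's actual QC values are not reproduced here.

```python
from statistics import mean, median, stdev

def cv(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Made-up intensities for three features across three QC replicates.
qc = {
    "protein_1": [100.0, 110.0, 90.0],
    "protein_2": [50.0, 50.0, 50.0],
    "metabolite_1": [10.0, 12.0, 8.0],
}
cvs = {name: cv(vals) for name, vals in qc.items()}
median_cv = median(cvs.values())
```

A low median CV across features is the usual evidence that the measurements are reproducible.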
Figure S1
- Part 2. Identification of severe patients using machine learning
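A subset of the proteomic and metabolomic data was used as a training set for a random forest model to identify severe COVID-19 patients. This yielded 29 important variables, comprising 22 proteins and 7 metabolites. The trained model was then validated on an additional 10 samples.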
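The methods describe repeating the random forest 100 times and using a "normalized additive" probability as the final prediction. One plausible reading, sketched below, is to sum each class's predicted probability over the repeated runs, normalize so the two classes sum to 1, and assign the class with the larger value. This interpretation, the function name, and the numbers are assumptions for illustration, not the authors' implementation.

```python
def aggregate_predictions(run_probs):
    """run_probs: list of (p_severe, p_non_severe) pairs, one per repeated run.
    Sum each class across runs, normalize to sum to 1, and pick the larger."""
    total_severe = sum(p[0] for p in run_probs)
    total_non_severe = sum(p[1] for p in run_probs)
    total = total_severe + total_non_severe
    p_severe = total_severe / total
    p_non_severe = total_non_severe / total
    label = "severe" if p_severe > p_non_severe else "non-severe"
    return p_severe, p_non_severe, label

# Three hypothetical runs for one sample (the paper uses 100 runs).
runs = [(0.7, 0.3), (0.6, 0.4), (0.4, 0.6)]
p_severe, p_non_severe, label = aggregate_predictions(runs)
```

Aggregating over many runs smooths out the run-to-run randomness of individual forests, which matters when the cohort is small.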
Sample allocation for machine learning
Machine learning results and model evaluation
- Part 3. Proteomic and metabolomic changes in severe COVID-19 sera
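A total of 105 proteins and 373 metabolites differed between COVID-19 and non-COVID-19 patients, of which 93 proteins and 204 metabolites were associated with COVID-19 severity. The 93 differential proteins were mainly enriched in three pathways: activation of the complement system, macrophage function, and platelet degranulation. These pathways cover 50 of the proteins; correspondingly, 82 of the metabolites fall in the same three pathways. The rest of the paper discusses these three pathways in detail and is not covered item by item here.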
Differentially expressed proteins in different patient groups in the training cohort.
Differentially expressed metabolites in different patient groups in the training cohort.
Proteins and metabolites regulated in COVID-19 patients but not in non-COVID-19 patients.
Dysregulated proteins in COVID-19 sera.
Dysregulated metabolites in COVID-19 sera.
Key proteins and metabolites characterized in severe COVID-19 patients in a working model.
Identification of specific clusters of proteins and metabolites in COVID-19 patients. 791 proteins (A) and 941 metabolites (B) were clustered using mFuzz into 16 significant discrete clusters, respectively.
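4. Afterword
The analysis in this paper is not complicated. The overall workflow: QC (the data are trustworthy) ➡️ machine learning to separate severe from non-severe patients (grouping) ➡️ differential proteins or metabolites, especially those associated with disease severity (differential molecules) ➡️ pathway analysis to describe the main pathological features of the disease.
From the proteomics side, the fold-change cutoff chosen in this paper is not large: |log2 FC| > 0.25 (probably to account for ratio compression with 16-plex labeling; for my own 10-plex data I use a 1.2-fold cutoff). Validation with a second, independent technique would make the results more convincing. The sample size for machine learning is also small.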
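To put the two cutoffs on the same scale, the conversion between log2 FC and linear fold change is a one-liner each way. The sketch below is plain arithmetic; the function names are just for illustration.

```python
import math

def log2fc_to_fold(log2fc):
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2fc

def fold_to_log2fc(fold):
    """Convert a linear fold change to a log2 fold change."""
    return math.log2(fold)

fold_at_cutoff = log2fc_to_fold(0.25)  # about 1.19-fold
log2_at_1_2 = fold_to_log2fc(1.2)      # about 0.263
```

So a |log2 FC| cutoff of 0.25 is about 1.19-fold, very close to the 1.2-fold cutoff mentioned for 10-plex data.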