Deep learning using tumor HLA peptide mass spectrometry datasets improves neoantigen identification

Authors:

Brendan Bulik-Sullivan, Jennifer Busby […] Roman Yelensky

Corresponding author and affiliation:

Roman Yelensky

Gritstone Oncology, Inc., Emeryville, California and Cambridge, Massachusetts, USA.

Journal and publication date:

Nature Biotechnology

Published: 17 December 2018

Abstract:

Neoantigens, which are expressed on tumor cells, are one of the main targets of an effective antitumor T-cell response. Cancer immunotherapies to target neoantigens are of growing interest and are in early human trials, but methods to identify neoantigens either require invasive or difficult-to-obtain clinical specimens, require the screening of hundreds to thousands of synthetic peptides or tandem minigenes, or are only relevant to specific human leukocyte antigen (HLA) alleles. We apply deep learning to a large (N = 74 patients) HLA peptide and genomic dataset from various human tumors to create a computational model of antigen presentation for neoantigen prediction. We show that our model, named EDGE, increases the positive predictive value of HLA antigen prediction by up to ninefold. We apply EDGE to enable identification of neoantigens and neoantigen-reactive T cells using routine clinical specimens and small numbers of synthetic peptides for most common HLA alleles. EDGE could enable an improved ability to develop neoantigen-targeted immunotherapies for cancer patients.

Selected figures

Figure 3: Architecture and features of the model.

(a) The architecture of our neural network (NN), with the subcomponents of the network active in a single patient with six HLA alleles. Pr, probability. (b) The learned dependence of HLA presentation on each sequence position for peptides of lengths 8–11 for two common HLA alleles. See Supplementary Figure 3a, b, c for learned motifs for all alleles. (c) Observed (dark blue) values are the proportion of all detected peptides in the test samples found at each peptide length. Predicted (light blue) values are the sum of probabilities of all proteome peptides of length k over the total sum of probabilities of all peptides of lengths 8–11 (i.e., the expected proportion of presented peptides of each length). (d) Observed (dark blue) values are the proportion of all detected peptides in the test samples found from genes at each mRNA expression TPM level. Predicted (light blue) values are the sum of probabilities assigned to all proteome peptides at the TPM level over the total sum of probabilities of all peptides. (e) Test set prevalence of detected peptides binned by learned per-gene propensity of presentation (x axis) and RNA expression (y axis) of the source genes.
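The predicted values in panels (c) and (d) are ratios of summed probabilities: the probabilities of all peptides in one group (one peptide length, or one TPM level) divided by the sum over all peptides. The sketch below illustrates that arithmetic only. It is not the authors' code: the function name `expected_proportions`, the `(item, probability)` input layout, and the toy peptides and probabilities are all assumptions made for illustration.

```python
from collections import defaultdict


def expected_proportions(scored_items, group_of):
    """Return {group: summed probability in group / summed probability overall}.

    scored_items: iterable of (item, probability) pairs (hypothetical layout).
    group_of: function mapping an item to its group, e.g. len for peptide length.
    """
    totals = defaultdict(float)
    for item, probability in scored_items:
        totals[group_of(item)] += probability
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("probabilities sum to zero; proportions are undefined")
    return {group: value / grand_total for group, value in totals.items()}


# Toy input, invented for illustration: (peptide sequence, predicted probability).
toy_peptides = [
    ("AAAAAAAA", 0.125),    # length 8
    ("AAAAAAAAA", 0.5),     # length 9
    ("CCCCCCCCC", 0.25),    # length 9
    ("AAAAAAAAAA", 0.125),  # length 10
]

# Panel (c) analogue: group peptides by their length.
by_length = expected_proportions(toy_peptides, len)
print(by_length)
```

For the panel (d) analogue, the same function could be called with a `group_of` that returns the TPM bin of each peptide's source gene, assuming the items carry that information.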