Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules
ABSTRACT: We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.
INTRODUCTION
The goal of drug and material design is to identify novel molecules that have certain desirable properties. We view this as an optimization problem, in which we are searching for the molecules that maximize our quantitative desiderata. However, optimization in molecular space is extremely challenging, because the search space is large, discrete, and unstructured. Making and testing new compounds is costly and time-consuming, and the number of potential candidates is overwhelming. Only about 10^8 substances have ever been synthesized,1 whereas the range of potential drug-like molecules is estimated to be between 10^23 and 10^60.2 Virtual screening can be used to speed up this search.3−6 Virtual libraries containing thousands to hundreds of millions of candidates can be assayed with first-principles simulations or statistical predictions based on learned proxy models, and only the most promising leads are selected and tested experimentally.
However, even when accurate simulations are available,7 computational molecular design is limited by the search strategy used to explore chemical space. Current methods either exhaustively search through a fixed library,8,9 or use discrete local search methods such as genetic algorithms10−15 or similar discrete interpolation techniques.16−18 Although these techniques have led to useful new molecules, these approaches still face large challenges. Fixed libraries are monolithic, costly to fully explore, and require hand-crafted rules to avoid impractical chemistries. The genetic generation of compounds requires manual specification of heuristics for mutation and crossover rules. Discrete optimization methods have difficulty effectively searching large areas of chemical space because it is not possible to guide the search with gradients.
A molecular representation method that is continuous, data-driven, and easily converted into a machine-readable molecule has several advantages. First, hand-specified mutation rules are unnecessary, as new compounds can be generated automatically by modifying the vector representation and then decoding. Second, if we develop a differentiable model that maps from molecular representations to desirable properties, we can enable the use of gradient-based optimization to make larger jumps in chemical space. Gradient-based optimization can be combined with Bayesian inference methods to select compounds that are likely to be informative about the global optimum. Third, a data-driven representation can leverage large sets of unlabeled chemical compounds to automatically build an even larger implicit library, and then use the smaller set of labeled examples to build a regression model from the continuous representation to the desired properties. This lets us take advantage of large chemical databases containing millions of molecules, even when many properties are unknown for most compounds.
Recent advances in machine learning have resulted in powerful probabilistic generative models that, after being trained on real examples, are able to produce realistic synthetic samples. Such models usually also produce low-dimensional continuous representations of the data being modeled, allowing interpolation or analogical reasoning for natural images,19 text,20 speech, and music.21,22 We apply such generative models to chemical design, using a pair of deep networks trained as an autoencoder to convert molecules represented as SMILES strings into a continuous vector representation. In principle, this method of converting from a molecular representation to a continuous vector representation could be applied to any molecular representation, including chemical fingerprints,23 convolutional neural networks on graphs,24 similar graph convolutions,25 and Coulomb matrices.26 We chose to use the SMILES representation because it can be readily converted into a molecule.
Using this new continuous vector-valued representation, we experiment with the use of continuous optimization to produce novel compounds. We trained the autoencoder jointly on a property prediction task: we added a multilayer perceptron that predicts property values from the continuous representation generated by the encoder, and included the regression error in our loss function. We then examined the effects that joint training had on the latent space, and tested optimization in this latent space for new molecules that optimize our desired properties.
Representation and Autoencoder Framework. The autoencoder is composed of two deep networks: an encoder network to convert each string into a fixed-dimensional vector, and a decoder network to convert vectors back into strings (Figure 1a). The autoencoder is trained to minimize error in reproducing the original string; i.e., it attempts to learn the identity function. Key to the design of the autoencoder is the mapping of strings through an information bottleneck. This bottleneck (here, the fixed-length continuous vector) induces the network to learn a compressed representation that captures the most statistically salient information in the data. We call the vector-encoded molecule the latent representation of the molecule.
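To make the reconstruction objective concrete, here is a minimal sketch of how a SMILES string might be turned into a fixed-size numeric input and how a per-character reconstruction error could be scored. The one-hot encoding, the toy `CHARSET`, the padding length `MAX_LEN`, and the cross-entropy loss are all assumptions made for illustration; the text above does not specify them.

```python
import math

# Hypothetical toy character set; a real model would derive it from the training data.
CHARSET = [" ", "C", "N", "O", "c", "1", "(", ")", "="]
MAX_LEN = 12  # assumed fixed string length; shorter strings are right-padded with spaces


def one_hot(smiles):
    """Encode a SMILES string as a MAX_LEN x len(CHARSET) matrix of 0/1 values."""
    if len(smiles) > MAX_LEN:
        raise ValueError("string longer than MAX_LEN")
    rows = []
    for ch in smiles.ljust(MAX_LEN):
        if ch not in CHARSET:
            raise ValueError("character not in CHARSET: " + repr(ch))
        rows.append([1.0 if ch == c else 0.0 for c in CHARSET])
    return rows


def reconstruction_loss(target, probs):
    """Mean per-character cross-entropy between one-hot targets and decoder probabilities."""
    total = 0.0
    for t_row, p_row in zip(target, probs):
        total -= sum(t * math.log(max(p, 1e-12)) for t, p in zip(t_row, p_row))
    return total / len(target)


x = one_hot("CC(=O)O")  # acetic acid
# A perfect decoder puts probability 1 on every correct character, giving zero loss.
perfect = reconstruction_loss(x, x)
# A decoder that is uniformly unsure pays log(len(CHARSET)) per character.
uniform = [[1.0 / len(CHARSET)] * len(CHARSET) for _ in range(MAX_LEN)]
unsure = reconstruction_loss(x, uniform)
```

Training the encoder and decoder to drive this loss toward zero is what "learning the identity function" through the bottleneck amounts to in this sketch.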
Figure 1. (a) A diagram of the autoencoder used for molecular design, including the joint property prediction model. Starting from a discrete molecular representation, such as a SMILES string, the encoder network converts each molecule into a vector in the latent space, which is effectively a continuous molecular representation. Given a point in the latent space, the decoder network produces a corresponding SMILES string. A multilayer perceptron network estimates the value of target properties associated with each molecule. (b) Gradient-based optimization in continuous latent space. After training a surrogate model f(z) to predict the properties of molecules based on their latent representation z, we can optimize f(z) with respect to z to find new latent representations expected to have high values of desired properties. These new latent representations can then be decoded into SMILES strings, at which point their properties can be tested empirically.
ACS Central Science Research Article DOI: 10.1021/acscentsci.7b00572 ACS Cent. Sci. 2018, 4, 268−276
For unconstrained optimization in the latent space to work, points in the latent space must decode into valid SMILES strings that capture the chemical nature of the training data. Without this constraint, the latent space learned by the autoencoder may be sparse and may contain large “dead areas”, which decode to invalid SMILES strings. To help ensure that points in the latent space correspond to valid realistic molecules, we chose to use a variational autoencoder (VAE)27 framework. VAEs were developed as a principled approximate-inference method for latent-variable models, in which each datum has a corresponding, but unknown, latent representation. VAEs generalize autoencoders, adding stochasticity to the encoder which, combined with a penalty term, encourages all areas of the latent space to correspond to a valid decoding.
The intuition is that adding noise to the encoded molecules forces the decoder to learn how to decode a wider variety of latent points and find more robust representations. Variational autoencoders with recurrent neural network encoding/decoding were proposed by Bowman et al. in the context of written English sentences, and we followed their approach closely.20 To leverage the power of recent advances in sequence-to-sequence autoencoders for modeling text, we used the SMILES28 representation, a commonly used text encoding for organic molecules. We also tested InChI29 as an alternative string representation, but found it to perform substantially worse than SMILES, presumably due to a more complex syntax that includes counting and arithmetic.
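The stochastic encoder and the penalty term can be illustrated with a short sketch. It assumes the common VAE setup in which the encoder outputs a mean and log-variance per latent dimension and the penalty is the KL divergence to a standard normal prior; the text above does not spell out these details, so the function names and the example values are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0, 0.0]        # hypothetical encoder mean for one molecule
log_var = [0.0, -2.0, 0.0]   # hypothetical encoder log-variance
z = sample_latent(mu, log_var, rng)           # noisy latent point passed to the decoder
penalty = kl_to_standard_normal(mu, log_var)  # added to the reconstruction loss
```

The noise in `sample_latent` is what exposes the decoder to a neighborhood around each encoded molecule, and the KL term discourages the encoder from shrinking that neighborhood to a point or scattering molecules far from the origin.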
The character-by-character nature of the SMILES representation and the fragility of its internal syntax (opening and closing cycles and branches, allowed valences, etc.) can still result in the output of invalid molecules from the decoder, even with the variational constraint. When converting a latent representation to a molecule, the decoder model samples a string from the probability distribution over characters at each position generated by its final layer. As such, multiple SMILES strings are possible from a single latent space representation. We employed the open source cheminformatics suite RDKit30 to validate the chemical structures of output molecules and discard invalid ones. While it would be more efficient to limit the autoencoder to generate only valid strings, this postprocessing step is lightweight and allows for greater flexibility in the autoencoder to learn the structure of SMILES strings.
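The sample-then-filter procedure described above can be sketched as follows. The alphabet, the probability table, and the `balanced_parentheses` check are toy stand-ins chosen so the example runs without extra dependencies; they are not the authors' code. In practice the validity check would be delegated to a cheminformatics toolkit, for instance a test along the lines of `Chem.MolFromSmiles(s) is not None` with RDKit.

```python
import random

CHARSET = ["C", "O", "(", ")", " "]  # hypothetical toy alphabet; " " is padding


def sample_string(char_probs, rng):
    """Sample one character per position from the decoder's output distribution."""
    chars = [rng.choices(CHARSET, weights=row, k=1)[0] for row in char_probs]
    return "".join(chars).strip()


def balanced_parentheses(s):
    """Toy stand-in for a real validity check: branches must open and close properly."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(s) > 0


def decode_valid(char_probs, is_valid, n_attempts, rng):
    """Decode the same latent point repeatedly and keep the distinct valid strings."""
    samples = (sample_string(char_probs, rng) for _ in range(n_attempts))
    return sorted({s for s in samples if is_valid(s)})


# Hypothetical decoder output for one latent point (rows: positions, columns: CHARSET).
char_probs = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # position 0: always "C"
    [0.0, 0.0, 0.5, 0.5, 0.0],  # position 1: "(" or ")"
    [0.0, 1.0, 0.0, 0.0, 0.0],  # position 2: always "O"
    [0.0, 0.0, 0.5, 0.5, 0.0],  # position 3: "(" or ")"
]
candidates = decode_valid(char_probs, balanced_parentheses, 200, random.Random(0))
```

Of the four strings this table can produce, only `C(O)` passes the toy check, which mirrors how a single latent point can yield several strings of which only some survive validation.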
To enable molecular design, the chemical structures encoded in the continuous representation of the autoencoder need to be correlated with the target properties that we are seeking to optimize. Therefore, we added a model to the autoencoder that predicts the properties from the latent space representation. This autoencoder was then trained jointly on the reconstruction task and a property prediction task; an additional multilayer perceptron (MLP) was used to predict the property from the latent vector of the encoded molecule. To propose promising new candidate molecules, we can start from the latent vector of an encoded molecule and then move in the direction most likely to improve the desired attribute. The resulting new candidate vectors can then be decoded into corresponding molecules (Figure 1b).
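Moving a latent vector "in the direction most likely to improve the desired attribute" is, in its simplest form, gradient ascent on the property predictor. The sketch below assumes a toy quadratic surrogate with a known analytic gradient in place of the trained MLP; the function names, step size, and vectors are hypothetical.

```python
def surrogate(z, target):
    """Toy differentiable property model f(z): highest (zero) when z equals `target`."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def surrogate_grad(z, target):
    """Analytic gradient of the toy surrogate with respect to z."""
    return [-2.0 * (zi - ti) for zi, ti in zip(z, target)]


def gradient_ascent(z0, target, step=0.1, n_steps=50):
    """Move z uphill on f(z), starting from the latent vector of a known molecule."""
    z = list(z0)
    for _ in range(n_steps):
        g = surrogate_grad(z, target)
        z = [zi + step * gi for zi, gi in zip(z, g)]
    return z


z_start = [0.0, 0.0]   # hypothetical latent vector of an encoded molecule
optimum = [1.0, -2.0]  # hypothetical location of the best predicted property
z_new = gradient_ascent(z_start, optimum)
```

In the full pipeline, `z_new` would then be passed to the decoder, and the decoded molecules checked for validity and evaluated.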
Two autoencoder systems were trained: one with 108 000 molecules from the QM9 data set of molecules with fewer than 9 heavy atoms31 and another with 250 000 drug-like commercially available molecules extracted at random from the ZINC database.32 We performed random optimization over hyperparameters specifying the deep autoencoder architecture and training, such as the choice between a recurrent or convolutional encoder, the number of hidden layers, layer sizes, regularization, and learning rates. The latent space representations for the QM9 and ZINC data sets had 156 dimensions and 196 dimensions, respectively.
Figure 2. Representations of the sampling results from the variational autoencoder. (a) Kernel Density Estimation (KDE) of each latent dimension of the autoencoder, i.e., the distribution of encoded molecules along each dimension of our latent space representation; (b) histogram of sampled molecules for a single point in the latent space; the distances of the molecules from the original query are shown by the lines corresponding to the right axis; (c) molecules sampled near the location of ibuprofen in latent space. The values below the molecules are the distance in latent space from the decoded molecule to ibuprofen; (d) slerp interpolation between two molecules in latent space using six steps of equal distance.
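The Figure 2 caption mentions slerp (spherical linear interpolation) between two molecules in latent space. A minimal sketch of the standard slerp formula is given below, with two toy 2-dimensional vectors standing in for encoded molecules; the fallback to linear interpolation for nearly parallel or antiparallel vectors is an implementation choice made here, not something the text specifies.

```python
import math


def slerp(z0, z1, t):
    """Spherical linear interpolation between latent vectors z0 and z1 at fraction t."""
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = sum(a * b for a, b in zip(z0, z1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))  # angle between the vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Nearly parallel or antiparallel: slerp is ill-conditioned, so use a straight line.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


z_a = [1.0, 0.0]  # hypothetical latent vector of the first molecule
z_b = [0.0, 1.0]  # hypothetical latent vector of the second molecule
# Six steps of equal angular distance give seven points, endpoints included.
path = [slerp(z_a, z_b, k / 6) for k in range(7)]
```

Each point in `path` would be decoded to obtain the intermediate molecules. Unlike a straight line, the slerp path between two vectors of equal length keeps that length constant, which matters in high-dimensional Gaussian latent spaces where most of the probability mass lies far from the origin.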