Symbolic Discovery of Optimization Algorithms
13 Feb 2023
https://arxiv.org/abs/2302.06675
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le
[Google, UCLA]
https://github.com/google/automl/tree/master/lion
https://github.com/google/automl/blob/master/lion/lion_pytorch.py
Abstract: We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% zero-shot and 91.1% fine-tuning accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. The implementation of Lion is publicly available.
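For reference, below is a minimal sketch of the sign-momentum update described in the abstract: a single momentum buffer per parameter, with the sign operation giving every parameter an update of the same magnitude (which is also why Lion needs a smaller learning rate than Adam). The function name and default hyperparameters here are illustrative; the reference implementation is in lion_pytorch.py linked above.

```python
import torch

@torch.no_grad()
def lion_update(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step for a single parameter tensor (illustrative sketch)."""
    # Update direction: sign of an interpolation between the momentum and the gradient,
    # so every coordinate moves by exactly lr regardless of the gradient's scale.
    update = (beta1 * momentum + (1 - beta1) * grad).sign()
    # Decoupled weight decay, then apply the fixed-magnitude sign update.
    param.mul_(1 - lr * wd).add_(update, alpha=-lr)
    # The only optimizer state kept per parameter: the momentum.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)
```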
6 Limitations
Limitations of search: Despite the effort to reduce restrictions on the search space, it is still inspired by popular first-order optimization algorithms, leading to a bias towards similar algorithms. It also lacks the functions needed to construct advanced second-order algorithms (Anil et al., 2020; Gupta et al., 2018; Martens and Grosse, 2015). The search cost remains substantial, and algorithm simplification requires manual intervention. Further reducing the bias in the search space to discover more novel algorithms and improving the search efficiency are important future directions. The current program structure is quite simple, as we have not found a good way to use more advanced constructs such as conditionals, loops, and the definition of new functions. Exploring how to incorporate these elements could open up new possibilities.
Limitations of Lion: Despite our effort to evaluate Lion on as many tasks as possible, the evaluation is limited to the chosen tasks. On vision tasks, the performance gap between Lion and AdamW narrows when strong augmentations are used. Lion also performs only on par with AdamW on several tasks, including: (1) the Imagen text-to-image base model, (2) the perplexity of autoregressive language models trained on a large-scale internal dataset, which is arguably a more reliable metric than in-context learning benchmarks, and (3) masked language modeling on C4. These tasks share a common trait: the datasets are massive and of high quality, which reduces the differences between optimizers. Another potential limitation is the batch size. Although batch sizes are often scaled up for greater parallelism, Lion may not outperform AdamW when the batch size is small (less than 64). In addition, Lion still requires tracking the momentum in bfloat16, which can be expensive when training giant models. A potential solution is to factorize the momentum to save memory.
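To see why momentum storage still matters at scale, here is a back-of-the-envelope estimate of optimizer-state memory; the parameter count and byte sizes below are illustrative assumptions, not numbers from the paper.

```python
# Rough optimizer-state memory for a hypothetical model with 100B parameters.
n_params = 100e9

adam_fp32 = n_params * 2 * 4   # Adam: first and second moments in float32
lion_fp32 = n_params * 1 * 4   # Lion: a single momentum buffer in float32
lion_bf16 = n_params * 1 * 2   # Lion: momentum kept in bfloat16, as noted above

for name, nbytes in [("Adam (fp32 m, v)", adam_fp32),
                     ("Lion (fp32 m)", lion_fp32),
                     ("Lion (bf16 m)", lion_bf16)]:
    print(f"{name:>18}: {nbytes / 1e9:,.0f} GB")
```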