Efficiently Modeling Long Sequences with Structured State Spaces
ICLR 2022 (Outstanding Paper HM)
A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) x′(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), and showed that for appropriate choices of the state matrix A, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning A with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet; (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster; and (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
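Because the SSM above is a linear time-invariant system, after discretization it can be computed either step by step as a recurrence or, equivalently, as a convolution with a kernel built from powers of the discretized state matrix; this convolutional view is the starting point for S4's efficiency. The snippet below is a minimal NumPy sketch of that equivalence, assuming a bilinear discretization and random toy matrices A, B, C, D (hypothetical placeholders, not the HiPPO initialization or the diagonal-plus-low-rank parameterization the paper actually uses).

```python
import numpy as np

# Minimal sketch (not the authors' S4 implementation): simulate the SSM
#   x'(t) = A x(t) + B u(t),   y(t) = C x(t) + D u(t)
# in two equivalent ways after discretization: as a step-by-step recurrence,
# and as a convolution with the kernel K = (CB, CAB, CA^2B, ...).
# The bilinear discretization and the random toy matrices below are
# illustrative assumptions, not the HiPPO / low-rank setup S4 actually uses.

def discretize(A, B, step):
    """Bilinear (Tustin) discretization: returns (A_bar, B_bar)."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (step / 2.0) * A)
    return inv @ (I + (step / 2.0) * A), (step * inv) @ B

def run_recurrence(A_bar, B_bar, C, D, u):
    """Unroll the discretized SSM as an RNN over a scalar input sequence u."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar[:, 0] * u_k        # state update
        ys.append(C[0] @ x + D[0, 0] * u_k)      # output read-out
    return np.array(ys)

def conv_kernel(A_bar, B_bar, C, L):
    """Materialize K = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)."""
    v, K = B_bar[:, 0], []
    for _ in range(L):
        K.append(C[0] @ v)
        v = A_bar @ v
    return np.array(K)

if __name__ == "__main__":
    N, L = 4, 16                                          # toy state size / length
    rng = np.random.default_rng(0)
    A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))    # roughly stable toy A
    B = rng.standard_normal((N, 1))
    C = rng.standard_normal((1, N))
    D = rng.standard_normal((1, 1))
    u = rng.standard_normal(L)

    A_bar, B_bar = discretize(A, B, step=1.0)
    y_rnn = run_recurrence(A_bar, B_bar, C, D, u)
    y_cnn = np.convolve(u, conv_kernel(A_bar, B_bar, C, L))[:L] + D[0, 0] * u
    print(np.allclose(y_rnn, y_cnn))                      # True: same sequence map
```

Materializing the kernel by repeated matrix powers, as above, is exactly the step that is too expensive at long lengths; S4's contribution is computing this kernel efficiently by reducing it to a Cauchy kernel under the structured parameterization of A.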
Reader comments