Abstract
Serverless computing promises cost-efficiency and elasticity for highly productive software development. To deliver on this promise, the serverless sandbox system must address two challenges: strong isolation between function instances, and low startup latency to ensure a good user experience. While virtualization-based sandboxes provide strong isolation, the initialization of the sandbox and application incurs non-negligible startup overhead. Conventional sandbox systems fall short on low-latency startup because they are application-agnostic: they can only reduce sandbox initialization latency through hypervisor and guest-kernel customization, which is inadequate and leaves the majority of the startup overhead untouched.
This paper proposes Catalyzer, a serverless sandbox system design providing both strong isolation and extremely fast function startup. Instead of booting from scratch, Catalyzer restores a virtualization-based function instance from a well-formed checkpoint image and thereby skips initialization on the critical path (init-less). Catalyzer boosts restore performance by recovering both user-level memory state and system state on demand. We also propose a new OS primitive, sfork (sandbox fork), to further reduce startup latency by directly reusing the state of a running sandbox instance. Fundamentally, Catalyzer removes the initialization cost by reusing state, which enables general optimizations for diverse serverless functions. The evaluation shows that Catalyzer reduces startup latency by orders of magnitude, achieves <1ms latency in the best case, and significantly reduces the end-to-end latency of real-world workloads.
1 Introduction
Serverless computing, the new trending paradigm in cloud computing, liberates developers from the distraction of managing servers and is already supported by many platforms, including Amazon Lambda [2], IBM Cloud Function [1], Microsoft Azure Functions [3], and Google Cloud Functions [7]. In serverless computing, the unit of computation is a function. When a service request is received, the serverless platform allocates an ephemeral execution sandbox and instantiates a user-defined function to handle the request. This computing model shifts the responsibility of dynamically managing cloud resources to cloud providers, allowing developers to focus purely on their application logic. Moreover, cloud providers can manage their resources more efficiently.
The ephemeral execution sandboxes are typically containers [1], virtual machines [20, 44], or recently proposed lightweight virtualization designs [6, 8, 19, 35, 37, 41, 45]. However, container instances suffer from isolation issues because all instances share one kernel, which is error-prone. Virtual machines achieve better isolation but are too heavyweight for running serverless functions. Lightweight virtualization designs like Google gVisor [8] and Amazon FireCracker [6] achieve high performance, easy resource management, and strong isolation by customizing the host-guest interface, e.g., gVisor uses a process abstraction interface.
Executing serverless functions with low latency is critical for user experience [21, 24, 28, 32, 38], and is still a significant challenge for virtualization-based sandbox designs. To illustrate the severity, we conduct an end-to-end evaluation on three benchmarks, DeathStar [22], E-business microservices, and image-processing functions, and divide the latency into an "execution" part and a "boot" part (§6.4). We calculate the "Execution/Overall" ratio of the 14 tested serverless functions and present the CDF in Figure 1. For 12 of the 14 functions in gVisor, the ratio does not even reach 30%, indicating that startup dominates the overall latency. Long startup latency, especially for virtualization-based sandboxes, has become a significant challenge for serverless platforms.
Existing VM-based sandboxes [6, 8, 37] reduce startup latency through hypervisor customization, e.g., FireCracker can boot a virtual machine (micro VM) with a minimized Linux kernel in 100ms. However, none of them can reduce application initialization latency, such as JVM or Python interpreter setup time. Our studies on serverless functions (written in five programming languages) show that most of the startup latency comes from application initialization (Insight I).
This paper proposes Catalyzer, a general design that boosts startup for serverless computing. The key idea of Catalyzer is to restore an instance from a well-formed checkpoint image and thereby skip initialization on the critical path. The design is based on two additional insights. First, a serverless function in the execution stage typically accesses only a small fraction of the memory and files used in the initialization stage (Insight II), so we can recover both application state (e.g., data in memory) and system state (e.g., file handles/descriptors) on demand. Second, sandbox instances of the same function possess almost the same initialized state (Insight III), so it is possible to reuse most of the state of running sandboxes to spawn new ones. Specifically, Catalyzer adopts on-demand recovery of both user-level and system state, and proposes a new OS primitive, sfork (sandbox fork), to further reduce startup latency by directly reusing the state of a running sandbox instance. Fundamentally, Catalyzer eliminates the initialization cost by reusing state, which enables general optimizations on diverse serverless functions.
We have implemented Catalyzer based on gVisor. We measure the performance with both micro-benchmarks and real-world applications developed in five programming languages.
The results show that Catalyzer achieves <1ms startup latency for C-hello (the best case) and boots Java SPECjbb in <2ms, a 1000x speedup over baseline gVisor. We also present evaluations on server machines and share lessons learned from industrial development at Ant Financial. The main contributions of this paper are as follows:
- A detailed analysis of latency overhead on serverless computing (§2).
- A general design of Init-less booting that boosts startup of diverse serverless applications (§3 and §4).
- An implementation of Catalyzer on a state-of-the-art serverless sandbox system, Google gVisor (§5).
- An evaluation with micro-benchmarks and real-world serverless applications proving the efficiency and practicability of Catalyzer (§6).
- The experience of deploying Catalyzer on real platforms (§6.9).
2 Serverless Function Startup Breakdown
In this section, we evaluate and analyze the startup latency of serverless platforms with different sandboxes (i.e., gVisor, FireCracker, Hyper Container, and Docker) and different language runtimes. Based on this evaluation and analysis, we argue that serverless functions should be executed with an initialization-less approach.
2.1 Background
Serverless Platform. In serverless computing, the developer submits a function to the serverless platform for execution. We use the term handler function for the target function, which can be written in different languages. The handler function is compiled offline together with a wrapper, which performs initialization and invokes the handler function. Wrapped programs (consisting of the wrapper and the handler function) execute safely within sandboxes, which can be containers [5, 40] or virtual machines (VMs) [6, 8, 10]. A gateway program runs on each server as a daemon; it accepts "invoke function" requests and starts a sandbox with two arguments: a configuration file and a rootfs containing both the wrapped program and runtime libraries. The arguments follow the OCI specification [12] and are compatible with most existing serverless platforms.
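To make the wrapper/handler split concrete, the following is a minimal sketch in Go of the structure described above. The handler signature, request format, and stdin-based wire protocol are our own illustrative assumptions, not a specific platform's API:

```go
// wrapper.go: a sketch of a wrapped program. The wrapper performs
// initialization and then invokes the user-provided handler function.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Handler is the user-provided serverless function (hypothetical signature).
func Handler(req map[string]interface{}) (string, error) {
	return fmt.Sprintf("hello, %v", req["name"]), nil
}

func main() {
	// Wrapper initialization: loading libraries, setting up the language
	// runtime, reading configuration, etc. (For Go, most of this happens
	// in the runtime before main; for Java/Python it is far more costly.)

	// Read one request, invoke the handler, write the reply.
	var req map[string]interface{}
	if err := json.NewDecoder(os.Stdin).Decode(&req); err != nil {
		fmt.Fprintln(os.Stderr, "bad request:", err)
		os.Exit(1)
	}
	resp, err := Handler(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "handler error:", err)
		os.Exit(1)
	}
	fmt.Println(resp)
}
```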
gVisor Case Study. In this paper, we propose a general optimization that achieves sub-millisecond startup even for VM-based sandboxes like gVisor. In the following text, we take gVisor as the example for analysis, implementation, and evaluation. For evaluation, we use server machines (§6.1) to reflect the performance improvement in an industrial environment.
On a serverless platform, the first step of invoking a function is to prepare a sandbox. In the case of gVisor, sandbox preparation includes four operations: configuration parsing, virtualization resource allocation (e.g., VCPUs and guest memory regions), root file system mounting, and guest kernel initialization (Figure 2). A gVisor sandbox consists of two user processes: a sandbox process and an I/O process. The sandbox process sets up the virtualized resources, e.g., the extended page table (EPT), and prepares the guest kernel. The I/O process mounts the root file system according to the configuration file. Figure 2 shows that sandbox initialization takes non-negligible time (22.3ms) in gVisor. Since sandbox initialization depends on function-specific configurations, it is hard to use techniques like caching [31, 40] to reduce its overhead. The critical path of startup refers to the period from when the gateway process receives a request until the handler executes. We use the term offline for non-critical-path operations (e.g., caching).
After sandbox initialization, the sandbox runs the wrapped program specified in the configuration file. Taking Java as an example, the wrapped program first starts a JVM to initialize the Java runtime (e.g., loading class files), then executes the user-provided handler function. We define the application initialization latency as the period from when the wrapped program starts until the handler function is ready to run. As the following evaluation shows, application initialization latency dominates the total startup latency.
2.2 A Quantitative Analysis on Startup Optimizations
The design space of serverless sandboxes is shown in Figure 3.
Cache-based Optimizations. Many systems adopt caching to speed up serverless function startup [17, 39, 40]. For example, Zygote is a cache-based design for optimizing latency that has been used in Android [14] to instantiate new Java applications. SOCK [40] leverages the Zygote idea for serverless computing: by maintaining a cache of pre-warmed Python interpreters, functions can be launched with an interpreter that has already loaded the necessary libraries, achieving high startup performance. SAND [17] allows instances of the same application function to share a sandbox containing the function code and its libraries. However, caching is far from ideal for two reasons. First, a single machine is capable of running thousands of serverless functions, so caching all the functions in memory introduces high resource overhead; caching policies are also hard to determine in the real world. Second, caching does not help with tail latency, which is dominated by "cold boots" in most cases.
Optimizations on Sandbox Initialization. Besides caching, sandbox systems also optimize their initialization through customization. For example, SOCK [40] proposes a lean container, a customized container design for serverless computing, to mitigate the overhead of sandbox initialization. Compared with container-based approaches, VM-based sandboxes [6, 8, 10] provide stronger isolation but introduce more sandbox initialization cost. Researchers have proposed numerous lightweight virtualization techniques [6, 19, 26, 36, 37] to solve the performance and resource utilization issues [18, 23, 25, 29] of traditional heavyweight virtualization systems. These proposals have already stimulated significant interest in the serverless computing industry (e.g., Google's gVisor [8] and Amazon's FireCracker [6]).
Furthermore, the lightweight virtualization techniques adopt various ways to optimize startup latency: customizing guest kernels [26, 36], customizing hypervisors [19, 37], or a combination of the two [6, 8]. For instance, FireCracker [6] can boot a virtual machine (micro VM) with a minimized Linux kernel in 100ms. Although different in design and implementation, today's virtualization-based sandboxes share one common limitation: they cannot mitigate application initialization latency, such as that of the JVM or the Python interpreter.
To understand the latency overhead (including sandbox and application initialization), we evaluate the startup latency of four widely used sandboxes (i.e., gVisor, FireCracker, Hyper Container, and Docker) with different workloads and present the latency distribution in Figure 4. The evaluation uses the sandbox runtime directly and does not count the cost of container management. The settings are the same as described in §6.1.
We highlight several interesting findings from the evaluation. First, much of the latency overhead comes from application initialization. Second, compared with C (142ms startup latency in gVisor), the startup latency is much higher for high-level languages like Java and Python. The main reason is that high-level languages usually need to initialize a language runtime (e.g., the JVM) before loading application code. Third, sandbox initialization latency is stable across workloads and dominates the overhead for simple functions like Python Hello.
The evaluation shows that much of the startup latency comes from application initialization rather than the sandbox. However, none of the existing virtualization-based sandboxes can reduce the application initialization latency caused by the JVM or Python interpreters.
Checkpoint/Restore-based Optimizations. Checkpoint/restore (C/R) is a technique that saves the state of a running sandbox into a checkpoint image. The saved state includes both the application state (inside the sandbox) and the sandbox state (e.g., the hypervisor). The sandbox can then be restored from the image and continue running seamlessly. Replayable Execution [43] leverages C/R techniques to mitigate application initialization cost, but only applies to container-based systems. Compared with other C/R systems, Replayable optimizes memory loading with an on-demand approach to boost startup latency. However, our evaluation shows that virtualization-based sandboxes incur high overhead to recover system state during restore, which is overlooked by prior work.
The major benefit of C/R is that it transforms application initialization costs into sandbox restore costs (init-less). We generalize the idea as init-less booting, shown in Figure 5. First, a func-image (short for function image) is generated offline, saving the initialized state of a serverless function (offline initialization). The func-image can be stored locally or remotely, and a serverless platform first fetches the func-image. After that, the platform can reuse the state saved in the func-image to boost function startup (func-load).
Challenges. C/R techniques reuse the serialized state (mostly application state) of a process to diminish application initialization cost, but rely on re-do operations to recover system state (i.e., in-kernel state such as opened files). A re-do operation recovers the state of a checkpointed instance and is necessary for correctness and compatibility. For example, a C/R system will re-do open() operations to re-open files that were open in the checkpointed process. However, re-do operations introduce performance overhead, especially for virtualization-based sandboxes.
To analyze the performance impact, we implement a C/R-based init-less booting system on gVisor, called gVisor-restore, using the gVisor-provided checkpoint and restore mechanism [4]. We add a new syscall in gVisor to trap at the entry point of serverless functions. We use the term func-entry point for the entry point of a serverless function, which is either specified by developers or placed at the default location: the point right before the wrapped program invokes the handler function. The syscall is invoked by the func-entry point annotation and blocks until the checkpoint operation begins.
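The following Go sketch shows how such an annotation might look from inside the wrapped program. The syscall number SYS_FUNC_ENTRY is hypothetical (the paper does not specify one), and the helper functions are placeholders:

```go
// funcentry.go: a sketch of the func-entry point annotation.
package main

import "syscall"

const SYS_FUNC_ENTRY = 500 // hypothetical number for the new gVisor syscall

// funcEntryPoint marks the boundary between initialization and request
// handling. Inside the sandbox it traps into the guest kernel and blocks
// until the checkpoint is taken; after restore it simply returns.
func funcEntryPoint() {
	syscall.Syscall(SYS_FUNC_ENTRY, 0, 0, 0)
}

func main() {
	initializeRuntimeAndLibraries() // expensive, happens before checkpoint
	funcEntryPoint()                // checkpoint is taken here (offline)
	serveRequests()                 // only this part is on the critical path
}

func initializeRuntimeAndLibraries() { /* JVM/interpreter-style setup */ }
func serveRequests()                 { /* invoke the handler function */ }
```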
We evaluate the startup latency of gVisor-restore with different applications and compare it against unmodified gVisor. We use the sandbox runtime directly (i.e., runsc for gVisor) to exclude container management costs. As the result (Figure 6) shows, gVisor-restore successfully eliminates the application initialization overhead and achieves a 2x-5x speedup over gVisor. However, the startup latency is still high (400ms for the Java SPECjbb application and >100ms in other cases). Figure 2 shows that gVisor-restore spends 135.9ms on guest kernel recovery, classified into "Recover Kernel" and "Reconnect I/O" in the figure. "Recover Kernel" means recovering non-I/O system state, e.g., thread information, while I/O reconnection recovers I/O system state, e.g., re-opening a "supposedly opened" file. For reusable state ("App memory" in the figure), the gVisor C/R mechanism compresses the saved data to reduce storage overhead, and must decompress, deserialize, and load the data into memory on the restore critical path, costing 128.8ms for the SPECjbb application. During the restore process in the SPECjbb case, gVisor recovers more than 37,838 objects (e.g., threads/tasks, mounts, sessionLists, timers, etc.) in the guest kernel and loads 200MB of memory data.
Prior container-based C/R systems [43] have exploited on-demand paging to boost application state recovery, but they still recover all system state on the critical path.
2.3 Overview
Our evaluation and analysis motivate us to propose Catalyzer, an init-less booting design for virtualization-based sandboxes, equipped with novel techniques to overcome the high latency of the restore process.
As shown in Figure 7, Catalyzer defines three kinds of booting: cold boot, warm boot, and fork boot. Cold boot means that the platform must create a sandbox instance from the func-image through restore. Warm boot means there are running instances of the requested function; thus, Catalyzer can speed up the restore by sharing the in-memory state of running instances. Fork boot needs a dedicated sandbox template, a sandbox containing the initialized state, to skip initialization. Fork boot is a hot-boot mechanism [11, 40]: the platform knows a function may be invoked soon and prepares the running environment for it. The significant contribution is that fork boot scales to booting any number of instances from a single template, whereas prior hot-boot designs can only serve a limited number of instances (depending on the cache size).
Catalyzer adopts a hybrid approach that combines C/R-based init-less booting with a new OS primitive to implement cold, warm, and fork boot. Since a serverless function in the execution stage typically accesses only a small fraction of the memory and files used in the initialization stage, Catalyzer introduces on-demand restore for cold and warm boot to optimize the recovery of both application and system state (§3). In addition, Catalyzer proposes a new OS primitive, sfork (sandbox fork), to reduce startup latency in fork boot by directly reusing the state of a template sandbox (§4). Fork boot achieves faster startup than warm boot but introduces more memory overhead; thus, fork boot is more suitable for frequently invoked (hot) functions.
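A platform's choice among the three boot paths follows directly from this description. The Go sketch below makes the policy explicit; the types and lookup functions are illustrative, not Catalyzer's actual API:

```go
// bootpath.go: a sketch of boot-path selection among Catalyzer's three
// kinds of booting. All names here are hypothetical.
package catalyzer

type Sandbox struct{ /* ... */ }

func Boot(function string) *Sandbox {
	if tmpl := lookupTemplate(function); tmpl != nil {
		// Fork boot: a template sandbox exists for this (hot) function;
		// sfork it and reuse the initialized state directly.
		return tmpl.sfork()
	}
	if running := lookupRunning(function); running != nil {
		// Warm boot: share the in-memory state (Base-EPT) of a running
		// instance and recover the rest on demand.
		return restoreShared(running)
	}
	// Cold boot: restore from the func-image via on-demand restore.
	return restoreFromImage(funcImagePath(function))
}

func lookupTemplate(f string) *Sandbox      { return nil /* template cache */ }
func lookupRunning(f string) *Sandbox       { return nil /* instance table */ }
func (s *Sandbox) sfork() *Sandbox          { return &Sandbox{} }
func restoreShared(s *Sandbox) *Sandbox     { return &Sandbox{} }
func restoreFromImage(path string) *Sandbox { return &Sandbox{} }
func funcImagePath(f string) string         { return "/images/" + f + ".img" }
```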
3 On-demand Restore
The performance overhead of restore comes from two parts. First, the application and system state need to be uncompressed, deserialized (metadata only), and loaded into memory. Second, re-do operations are necessary to recover system state, including multi-threaded contexts, the virtualization sandbox, and I/O connections.
As shown in Figure 8-a, Catalyzer accelerates restore by splitting the process into three parts: offline preparation, critical-path restore, and on-demand recovery. The preparation work, like uncompression and deserialization, is mostly performed offline in the checkpoint stage. The loading of application state and the recovery of I/O-related system state are delayed via on-demand paging and on-demand I/O reconnection. Thus, Catalyzer performs only minimal work on the critical path, i.e., recovering non-I/O system state.
Specifically, Catalyzer proposes four techniques. First, overlay memory is a new memory abstraction that allows Catalyzer to directly map a func-image into memory, boosting application state loading (for cold boot). Sandboxes running the same function can share a “base memory mapping”, further omitting file mapping cost (for warm boot). Second, separated state recovery decouples deserialization from system state recovery on the critical path. Third, on-demand I/O reconnection delays I/O state recovery. Last, virtualization sandbox Zygote provides generalized virtualization sandboxes that are function-independent and can be used to reduce sandbox construction overhead.
Figure 8. On-demand restore. (a) Compared with prior approaches, Catalyzer leverages offline preparation and on-demand recovery to eliminate most work on the critical path. (b) Overlay memory allows a func-image to be mapped directly into memory to construct the Base-EPT, which can also be shared among instances through copy-on-write. (c) The operation flow shows how a gVisor sandbox is instantiated with on-demand restore.
3.1 Overlay Memory
Overlay memory is a design for on-demand application state loading through copy-on-write of file-based mmap. As shown in Figure 8-b, the design allows a "base memory mapping" to be shared among sandboxes running the same function, relying on memory copy-on-write to ensure privacy.
Overlay memory uses a well-formed func-image for direct mapping, which contains uncompressed and page-aligned application state. During a cold boot, Catalyzer loads application state by directly mapping the func-image into memory (map-file operation). Catalyzer maintains two layered EPTs for each sandbox. The upper one is called Private-EPT, and the lower one is Base-EPT. Private-EPT is private to each sandbox, while Base-EPT is shared and read-only. During a warm boot, Catalyzer directly maps the Base-EPT for the new sandbox with the share-mapping operation. The main benefit comes from the avoidance of costly file loading.
The platform constructs the hardware EPT by merging entries from the Private-EPT and the Base-EPT, i.e., using an entry of the Private-EPT if it is valid, and the corresponding entry of the Base-EPT otherwise. The construction is efficient and triggered by hardware. The Base-EPT is read-only and thus can be inherited by new sandboxes through mmap, while the Private-EPT is established using copy-on-write when an EPT violation occurs on the Base-EPT.
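A host-level analog of this mechanism can be expressed with an ordinary private file mapping, assuming a page-aligned, uncompressed func-image. The sketch below uses MAP_PRIVATE so that reads are served from the shared page cache (the "Base" layer) and writes fault in sandbox-private copies (the "Private" layer); Catalyzer's actual design applies the same idea at the EPT level:

```go
// overlay_mmap.go: a host-level sketch of overlay memory.
package catalyzer

import (
	"os"
	"syscall"
)

// mapFuncImage maps the application-state region of a func-image into
// memory without reading or deserializing it up front.
func mapFuncImage(path string, size int) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	// PROT_READ|PROT_WRITE with MAP_PRIVATE: reads share the page cache
	// across sandboxes; writes trigger copy-on-write into private pages.
	return syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_PRIVATE)
}
```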
3.2 Separated State Recovery
C/R relies on metadata of system state (represented by objects in the sandbox) for re-do operations; the metadata is serialized before being saved into checkpoint images and deserialized during restore. The system state includes all guest OS internal state, e.g., the thread list and timers. However, this process is non-trivial for sandboxes implemented in high-level languages (e.g., Golang for gVisor), as the language abstraction hides the layout of state data. Even with the help of serialization tools such as Protobuf [16], metadata objects have to be processed one by one during recovery, which causes huge overhead when the number of objects is large (e.g., 37,838 objects are recovered for the SPECjbb application in gVisor-restore, consuming >50ms).
Catalyzer proposes separated state recovery to overcome this challenge by decoupling deserialization from state recovery. During offline preparation, Catalyzer saves partially deserialized metadata objects into func-images. Specifically, Catalyzer first re-organizes the discrete in-memory objects into continuous memory, so they can be mapped back into memory through an mmap operation instead of one-by-one deserialization. Then, Catalyzer zeroes the pointers in objects with placeholders and records all (pointer) reference relationships in a relation table, which maps offsets of pointer fields to offsets of pointer targets. The metadata objects and the relation table together constitute the partially deserialized objects. "Partially" means that Catalyzer still needs to deserialize pointers at runtime using the relation table.
With the func-image, Catalyzer accomplishes state recovery in two stages: loading the partially deserialized objects from the func-image (stage-1), then reconstructing the object relationships (e.g., pointer relations) and recovering system state in parallel (stage-2). First, the objects as well as the saved relation table are mapped into the sandbox's memory with overlay memory. Second, the object reference relationships are re-established by replacing all placeholders with real pointers through the relation table, and non-I/O system state is established on the critical path. Since each update is independent, this stage can be carried out in parallel. The design does not depend on a specific memory layout, which improves portability: a func-image can run on different machines.
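The stage-2 fix-up can be pictured as a pass over the relation table. The Go sketch below assumes the metadata objects have been re-organized into one contiguous buffer; the RelationTable format is illustrative, not gVisor's actual layout, and a real implementation would batch entries per worker rather than spawn one goroutine each:

```go
// relation_table.go: a sketch of stage-2 pointer fix-up.
package catalyzer

import (
	"sync"
	"unsafe"
)

// RelationTable maps the offset of each pointer field (within the mapped
// object region) to the offset of the object it points to.
type RelationTable map[uintptr]uintptr

// fixupPointers replaces zeroed placeholders with real pointers. Each
// entry is independent, so the updates can run in parallel.
func fixupPointers(region []byte, table RelationTable) {
	base := uintptr(unsafe.Pointer(&region[0]))
	var wg sync.WaitGroup
	for fieldOff, targetOff := range table {
		wg.Add(1)
		go func(fieldOff, targetOff uintptr) {
			defer wg.Done()
			// Write the absolute address of the target object into the
			// pointer field; storing offsets in the image keeps it
			// position-independent and thus portable across machines.
			*(*uintptr)(unsafe.Pointer(base + fieldOff)) = base + targetOff
		}(fieldOff, targetOff)
	}
	wg.Wait()
}
```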
3.3 On-demand I/O Reconnection
The numerous I/O operations performed during restore (e.g., opening files) add high latency to the critical path. Inspired by our insight that much of the I/O-related state (e.g., files) will not be used after restore, Catalyzer adopts an on-demand I/O reconnection design. For example, a previously opened file "/home/user/hello.txt" may be accessed only by specific requests. Such unused I/O connections cannot be eliminated even by choosing a better checkpoint point, because serverless functions usually run with a language runtime (e.g., the JVM) and third-party libraries, and developers have no idea whether these will access some rarely used connections.
Thus, we re-establish connections lazily: a connection is re-established only when it is used. To achieve this, I/O reconnection is performed asynchronously with respect to the restore critical path, and the sandbox guest kernel maintains the I/O connection status, i.e., a file descriptor is passed to functions but tagged as not re-opened yet in the guest kernel.
We observe that for a specific function, the I/O connections used immediately after booting are mostly deterministic. Thus, we introduce an I/O cache mechanism to further mitigate the latency of I/O reconnection. The I/O connection operations performed during a cold boot are saved in the cache, which Catalyzer uses to guide a sandbox (in warm boot) to establish these connections on the critical path. Specifically, the cache stores file paths and the operations on each path, so Catalyzer can use this information as a hint to re-connect these I/O connections first. For I/O connections missed in the cache (i.e., the non-deterministic connections), Catalyzer falls back to the on-demand strategy.
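The lazy re-establishment can be captured by a descriptor wrapper that performs the real open on first use. The Go sketch below is a user-level illustration with hypothetical names; in Catalyzer the equivalent bookkeeping lives in the guest kernel, and a warm boot would pre-realize the paths recorded in the I/O cache:

```go
// lazy_fd.go: a sketch of on-demand I/O reconnection. The application gets
// a usable handle immediately; the file is re-opened on first access.
package catalyzer

import (
	"os"
	"sync"
)

// lazyFile stands in for a restored-but-not-yet-reconnected file.
type lazyFile struct {
	path string
	once sync.Once
	f    *os.File
	err  error
}

// realize re-opens the underlying file exactly once, off the critical path.
func (l *lazyFile) realize() (*os.File, error) {
	l.once.Do(func() { l.f, l.err = os.Open(l.path) })
	return l.f, l.err
}

func (l *lazyFile) ReadAt(p []byte, off int64) (int, error) {
	f, err := l.realize()
	if err != nil {
		return 0, err
	}
	return f.ReadAt(p, off)
}
```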
3.4 Virtualization Sandbox Zygote
On the restore critical path, a sandbox is constructed before application state loading and system state recovery. The challenge of reducing sandbox construction latency lies in two factors: first, sandbox construction depends on function-specific information (e.g., the path of the rootfs), so techniques like caching do not directly help; second, a sandbox is tightly coupled with system resources that are not directly reusable (e.g., namespaces and hardware virtualization resources).
Catalyzer proposes a Virtualization Sandbox Zygote design that separates the function-dependent configuration from a general sandbox (Sandbox Zygote) and leverages a cache of Zygotes to mitigate sandbox construction overhead. A Zygote is a generalized virtualization sandbox used to generate a function-specific sandbox during restore. As described in Figure 2, a sandbox is constructed from a configuration file and a rootfs. Catalyzer introduces a base configuration and a base rootfs, which factor out the function-specific details. Catalyzer caches a Zygote by parsing the base configuration file, allocating virtualization resources (e.g., VCPUs), and mounting the base rootfs. Upon function invocation, Catalyzer specializes a sandbox from a Zygote by importing function-specific binaries/libraries and appending the function-specific configuration to the Zygote. Virtualization Zygotes are used in both cold boot and warm boot in Catalyzer.
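The split between offline Zygote preparation and on-invocation specialization might look like the following Go sketch. All types, the pool size, and the helper functions are our own illustrative assumptions:

```go
// zygote.go: a sketch of the Sandbox Zygote design. Generalized sandboxes
// are prepared offline from a base configuration and base rootfs; one is
// specialized per invocation.
package catalyzer

type Zygote struct {
	vcpus      int    // virtualization resources allocated up front
	baseRootfs string // base rootfs, already mounted
}

type FuncConfig struct {
	Name   string
	Rootfs string // function-specific binaries/libraries
}

// zygotePool is refilled off the critical path (offline).
var zygotePool = make(chan *Zygote, 16)

// specialize turns a generic Zygote into a function-specific sandbox.
func specialize(cfg FuncConfig) *Zygote {
	z := <-zygotePool // base config parsed, VCPUs allocated, base rootfs mounted
	mountFunctionRootfs(z, cfg.Rootfs) // import function-specific binaries
	applyConfig(z, cfg)                // append function-specific configuration
	return z
}

func mountFunctionRootfs(z *Zygote, rootfs string) { /* bind-mount onto base */ }
func applyConfig(z *Zygote, cfg FuncConfig)        { /* merge into base config */ }
```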
3.5 Putting All Together
The three elements (overlay memory, separated state, and I/O connections) are all included in the func-image. The workflow of cold boot and warm boot is shown in Figure 8-c. First, the function-specific configuration and its func-image (in the "App rootFS") are passed to a Zygote to specialize a sandbox. Second, the function-specific rootfs indicated by the configuration is mounted for the sandbox. Then, the sandbox recovers system state using separated state recovery. After that, Catalyzer maps the Base-EPT's memory into the gVisor process as read-only for warm boot, using copy-on-write to preserve the privacy of the sandbox's memory. For cold boot, Catalyzer first establishes the Base-EPT by mapping the func-image into memory. At last, the guest kernel asynchronously recovers I/O connections, assisted by the I/O cache for warm boot.
4 sfork: Sandbox fork
Based on our Insight III, Catalyzer proposes a new OS primitive, sfork (sandbox fork), to further reduce startup latency by directly reusing the state of a running "template sandbox". A template sandbox is a special sandbox for a specific function that holds no information about user requests; thus, it can be used to instantiate sandboxes that serve requests. The basic workflow is shown in Figure 9-a. First, a template sandbox is generated through template initialization, containing clean system state at the func-entry point; then, when a request for the function arrives, the template sandbox sforks itself to reuse the initialized state directly. The state here includes both user state (application and runtime) and guest kernel state.
**Challenges.** An intuitive choice is to use the traditional fork to implement sfork. However, it is challenging to keep system state consistent using fork alone. First, most OS kernels (e.g., Linux) only support single-threaded fork, which means multi-threading information is lost after forking. Second, fork is not suitable for sandbox creation, as a child process inherits its parent's shared memory mappings, file descriptors, and other system state that should not be shared between sandboxes. Third, fork clones all user state in memory, some of which may depend on system state that has changed. For example, consider a common case where the template sandbox issues the getpid syscall and uses the return value in a variable during initialization: the PID changes in the forked sandbox, but the variable does not, leading to undefined behavior.
The clone syscall provides more flexibility through its many options, but is still not sufficient. One major limitation is the handling of shared memory (mapped with the MAP_SHARED flag): if a child sandbox inherits the shared memory, it violates the isolation between parent and child sandboxes; if not, the semantics of MAP_SHARED are changed.
Template Initialization. To overcome these challenges, Catalyzer relies on user-space handling of most inconsistent state and introduces only minimal kernel modifications. We classify syscalls into three groups: denied, handled, and allowed. The allowed and handled syscalls are listed in Table 1. The handled syscalls require user-space logic to explicitly fix related system state after sfork for consistency. For example, clone creates a new thread context for a sandbox, and the multi-threaded contexts should be recovered after sfork (Challenge-1). The denied syscalls are removed from the sandbox since they may lead to non-deterministic system state modification. We illustrate how Catalyzer keeps the multi-threaded contexts and reuses inherited file descriptors (Challenge-2) after sfork with two novel techniques: transient single-thread and stateless overlay rootFS. The only kernel modification is a new CoW flag for shared memory mapping. We take advantage of Linux container technologies (USER and PID namespaces) to keep system state such as user IDs and process IDs consistent after sfork (Challenge-3).
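The classification could be encoded as a simple policy table consulted during template initialization. The Go sketch below is illustrative only; the concrete per-syscall assignment is given by Table 1 in the paper, not by this example:

```go
// syscall_policy.go: a sketch of the denied/handled/allowed classification.
package catalyzer

import "syscall"

type policy int

const (
	allowed policy = iota // safe to keep with no extra work after sfork
	handled               // needs user-space fix-up after sfork
	denied                // removed: would cause non-deterministic state
)

// syscallPolicy: example entries only (linux/amd64 syscall numbers).
var syscallPolicy = map[int]policy{
	syscall.SYS_CLONE: handled, // re-create multi-threaded contexts after sfork
	syscall.SYS_OPEN:  handled, // descriptors handled via stateless overlay rootFS
	syscall.SYS_READ:  allowed,
	syscall.SYS_WRITE: allowed,
	// denied syscalls are rejected during template initialization.
}
```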
4.1 Multi-threading Fork
Sandboxes implemented in Golang (e.g., gVisor) are naturally multi-threaded, because the Golang runtime uses multiple threads for garbage collection and other background work. Specifically, threads in Golang can be classified into three categories: runtime threads, scheduling threads, and blocking threads (Figure 9-b). The runtime threads provide runtime functionalities like garbage collection and preemption; they are long-running and transparent to developers. The scheduling threads (M-threads in Figure 9-b) implement Golang's co-routine mechanism (i.e., goroutines). When a goroutine switches to the blocked state (e.g., executing a blocking system call like accept), the Golang runtime dedicates an OS thread to it.
Catalyzer proposes a transient single-thread mechanism to support multi-threaded sandbox fork. With this mechanism, a multi-threaded program can temporarily merge all its threads into a single thread (the transient single-thread), which is expanded back to multiple threads after sfork. The process is shown in Figure 9-b. First, we modify the Golang runtime in Catalyzer to support stoppable runtime threads: when the runtime threads are notified to enter the transient single-thread state, they save their thread contexts in memory and terminate. Then, the number of scheduling threads is configured to one through the Golang runtime. In addition, we add a time-out to all blocking threads; when the time-out triggers, a thread checks whether it should terminate to enter the transient single-thread state. Finally, the Golang program keeps only the m0 thread in the transient single-thread state, and expands to multiple threads again after sfork. Our modification is only used for template sandbox generation and does not affect program behavior after sfork.
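The time-out added to blocking threads can be sketched at user level as a bounded wait that periodically checks a quiesce flag. This is a simplified analog of the runtime modification (the quiesce signal, time-out value, and helper are illustrative):

```go
// quiesce.go: a sketch of the time-out check added to blocking threads
// for the transient single-thread mechanism.
package catalyzer

import (
	"sync/atomic"
	"time"
)

var quiesce atomic.Bool // set when the sandbox should collapse to one thread

// blockingWorker replaces an indefinitely blocking wait with a bounded one,
// so the thread can notice the quiesce request, save its context, and exit;
// it is re-created after sfork.
func blockingWorker(work <-chan func()) {
	for {
		select {
		case job := <-work:
			job()
		case <-time.After(100 * time.Millisecond): // time-out added by Catalyzer
			if quiesce.Load() {
				saveThreadContext() // persist enough state to respawn later
				return              // terminate: only m0 remains
			}
		}
	}
}

func saveThreadContext() { /* record thread context in memory */ }
```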
4.2 Stateless Overlay RootFS
A sforked sandbox inherits the file descriptors and file systems of the template sandbox, which must be handled after sfork. Inspired by existing overlayFS designs [13] and the ephemeral nature of serverless functions [27, 28], Catalyzer employs a stateless overlay rootFS technique to achieve zero-cost handling of file descriptors and the rootFS. The idea is to keep all modifications to the rootFS in memory, where they are automatically cloned during sfork using copy-on-write (Figure 9-c).
Specifically, each sandbox uses two layers of file systems. The upper layer is the in-memory overlayFS, which is private to a sandbox and allows both read and write operations. The overlayFS is backed by a (per-function) FS server that manages the real rootFS. A sandbox cannot directly access persistent storage for security reasons; thus, it relies on the (read-only) file descriptors received from the FS server to access the rootFS. During sfork, besides the cloned overlayFS, the file descriptors owned by the template sandbox remain valid in the child sandbox, since they are read-only and do not violate the isolation guarantee.
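The two-layer lookup can be sketched as follows in Go. The in-memory upper layer and the map of read-only descriptors are illustrative stand-ins for the overlayFS and the FS server protocol:

```go
// overlay_rootfs.go: a sketch of the two-layer rootFS lookup. Writes land
// in the private in-memory layer (cloned for free by sfork's copy-on-write);
// reads fall through to read-only descriptors granted by the FS server.
package catalyzer

import "os"

type overlayFS struct {
	upper map[string][]byte   // in-memory, sandbox-private, read-write
	lower map[string]*os.File // read-only fds received from the FS server
}

func (fs *overlayFS) Read(path string) ([]byte, error) {
	if data, ok := fs.upper[path]; ok {
		return data, nil // modified in this sandbox: serve from memory
	}
	f, ok := fs.lower[path]
	if !ok {
		return nil, os.ErrNotExist
	}
	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	data := make([]byte, st.Size())
	_, err = f.ReadAt(data, 0) // shared, read-only: safe across sfork
	return data, err
}

// Write never touches the real rootFS; it only populates the upper layer.
func (fs *overlayFS) Write(path string, data []byte) {
	fs.upper[path] = data
}
```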
Our industrial development experience shows that persistent storage is still required by serverless functions in some cases, e.g., writing logs. Catalyzer allows the FS server to grant file descriptors for log files (with read/write permission) to sandboxes. Overall, the majority of files are sforked with low latency, and only a small number of persistent files are copied for functionality.
4.3 Language Runtime Template for Cold Boot
Although on-demand restore provides promising cold boot performance, it relies on a well-formed func-image containing uncompressed data (a larger image size). Thus, we propose another choice for cold boot: using sfork with a language runtime template, a template sandbox shared by functions written in the same language. A language runtime template initializes the environment of the wrapped program (e.g., the JVM for Java) and loads the real function to serve requests on demand. Such a sandbox is instantiated differently for different languages, e.g., loading libraries in C or loading class files in Java. For instance, a single Java runtime template is sufficient to boost our internal functions, as most of them are written in Java.
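For a Go-hosted function, the on-demand loading step might resemble the sketch below, which uses the standard plugin package to load the handler after the template is sforked. The waitForSfork hook, plugin path, and symbol name are hypothetical; a Java template would instead load class files into an already-warm JVM:

```go
// runtime_template.go: a sketch of a language runtime template.
package main

import "plugin"

func main() {
	// Template initialization: everything function-independent happens
	// here, once, before any request arrives.
	waitForSfork() // hypothetical: block at the template's entry point

	// After sfork, the child learns which function it serves and loads it.
	p, err := plugin.Open("/functions/handler.so") // function-specific code
	if err != nil {
		panic(err)
	}
	sym, err := p.Lookup("Handler")
	if err != nil {
		panic(err)
	}
	handler := sym.(func([]byte) []byte)
	serve(handler)
}

func waitForSfork()               { /* provided by the platform */ }
func serve(h func([]byte) []byte) { /* request loop */ }
```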