ABSTRACT
The microservice architecture is a popular software engineering approach for building flexible, large-scale online services. Serverless functions, or function as a service (FaaS), provide a simple programming model of stateless functions which are a natural substrate for implementing the stateless RPC handlers of microservices, as an alternative to containerized RPC servers. However, current serverless platforms have millisecond-scale runtime overheads, making them unable to meet the strict sub-millisecond latency targets required by existing interactive microservices.
- Sets the scene: FaaS for microservices.
- Concisely states the problem and its cause: FaaS cannot meet the strict sub-millisecond latency targets of interactive microservices because it has millisecond-scale runtime overheads.
- "Runtime overheads" is left unspecified here; later we learn it refers to communication overhead, not startup overhead.
We present Nightcore, a serverless function runtime with microsecond-scale overheads that provides container-based isolation between functions. Nightcore’s design carefully considers various factors having microsecond-scale overheads, including scheduling of function requests, communication primitives, threading models for I/O, and concurrent function executions. Nightcore currently supports serverless functions written in C/C++, Go, Node.js, and Python. Our evaluation shows that when running latency-sensitive interactive microservices, Nightcore achieves 1.36×–2.93× higher throughput and up to 69% reduction in tail latency.
- The previous paragraph located the latency in the runtime, so the solution here is to design a new runtime.
- A runtime indeed covers many things; they say they carefully considered each factor with microsecond-scale overheads, then name four aspects.
- Finally they list the supported languages and the performance improvements.
- The work is somewhat scattered, optimizing many different places; lumping it all together and calling it a runtime is one way to frame it.
1 INTRODUCTION
The microservice architecture is a popular software engineering approach for building large-scale online services. It has been widely adopted by large companies such as Amazon, Netflix, LinkedIn, Uber, and Airbnb [1, 4, 5, 42, 47]. The microservice architecture enables a large system to evolve more rapidly because each microservice is developed, tested and deployed independently [36, 49]. Microservices communicate with each other via pre-defined APIs, mostly using remote procedure calls (RPC) [70]. Hence, the dominant design pattern for microservices is that each microservice is an RPC server and they are deployed on top of a container orchestration platform such as Kubernetes [29, 54, 70].
Serverless cloud computing enables a new way of building microservice-based applications [10, 18, 44, 52], having the benefit of greatly reduced operational complexity (§2). Serverless functions, or function as a service (FaaS), provide a simple programming model of stateless functions. These functions provide a natural substrate for implementing stateless RPC handlers in microservices, as an alternative to containerized RPC servers. However, readily available FaaS systems have invocation latency overheads ranging from a few to tens of milliseconds [14, 55, 84] (see Table 1), making them a poor choice for latency-sensitive interactive microservices, where RPC handlers only run for hundreds of microseconds to a few milliseconds [70, 83, 100, 101] (see Figure 1). The microservice architecture also implies a high invocation rate for FaaS systems, creating a performance challenge. Taking Figure 1 as an example, one request that uploads a new social media post results in 15 stateless RPCs (blue boxes in the figure). Our experiments on this workload show that 100K RPCs per second is a realistic performance goal, achievable under non-serverless deployment using five 8-vCPU RPC server VMs. For a FaaS system to efficiently support interactive microservices, it should achieve at least two performance goals that are not accomplished by existing FaaS systems: (1) invocation latency overheads are well within 100 μs; (2) the invocation rate must scale to 100K/s with a low CPU usage.
Figure 1: RPC graph of uploading new posts in a microservice-based SocialNetwork application [70]. This graph omits stateful services for data caching and data storage.

| FaaS systems | 50th | 99th | 99.9th |
|---|---|---|---|
| AWS Lambda | 10.4 ms | 25.8 ms | 59.9 ms |
| OpenFaaS [37] | 1.09 ms | 3.66 ms | 5.54 ms |
| Nightcore (external) | 285 μs | 536 μs | 855 μs |
| Nightcore (internal) | 39 μs | 107 μs | 154 μs |

Table 1: Invocation latencies of a warm nop function.
- Serverless is a good fit for the microservice architecture; the benefit is reduced operational complexity, a point expanded in Section 2.
- Then it points out that FaaS invocation latency overhead is large, making it unsuitable for latency-sensitive interactive microservices.
- Then it gives a comparison table of invocation latencies, plus the running time of each RPC service in one test scenario. This comparison makes a very strong case that invocation latency really is an intolerable overhead.
- Finally it states the design goals: invocation latency and invocation rate.
Some previous studies [62, 98] reduced FaaS runtime overheads to microsecond-scale by leveraging software-based fault isolation (SFI), which weakens isolation guarantees between different functions. We prefer the stronger isolation provided by containers because that is the standard set by containerized RPC servers and provided by popular FaaS systems such as Apache OpenWhisk [50] and OpenFaaS [37]. But achieving our performance goals while providing the isolation of containers is a technical challenge.
- [98] here is Faasm; this passage criticizes Faasm for giving up container-level isolation, arguing that containers are the standard set by containerized RPC servers and that popular FaaS platforms provide container-based isolation.
- Paper writing really is all rhetoric; by that standard, Lambda even uses lightweight VMs. Besides, Faasm's overhead is mainly startup cost, and with containers you can never bring startup overhead down, so its target scenario has to be a specific one.
- Finally it says that achieving the performance goals while preserving container isolation is a major technical challenge. Of course; container startup can never be reduced to microseconds.
We present Nightcore, a serverless function runtime designed and engineered to combine high performance with container-based isolation. Any microsecond-or-greater-scale performance overheads can prevent Nightcore from reaching its performance goal, motivating a "hunt for the killer microseconds" [60] in the regime of FaaS systems.
In other words, Nightcore must eliminate all millisecond-scale overheads, and do so in a container-based environment.
Existing FaaS systems like OpenFaaS [37] and Apache OpenWhisk [50] share a generic high-level design: all function requests are received by a frontend (mostly an API gateway), and then forwarded to independent backends where function code executes. The frontend and backends mostly execute on separate servers for fault tolerance, which requires invocation latencies that include at least one network round trip. Although data center networking performance is improving, round-trip times (RTTs) between two VMs in the same AWS region range from 101 μs to 237 μs [25]. Nightcore is motivated by noticing the prevalence of internal function calls made during function execution (see Figure 1). An internal function call is generated by the execution of a microservice, not generated by a client (in which case it would be an external function call, received by the gateway). What we call internal function calls have been called "chained function calls" in previous work [98]. Nightcore schedules internal function calls on the same backend server that made the call, eliminating a trip through the gateway and lowering latency (§3.2).
- In this serverless architecture, requests are dispatched and forwarded through the gateway, whether they are external requests from clients or internal requests (chained calls) from the microservices themselves.
- SAND raised this issue before: internal requests kept on the same server need not go through the gateway. The paper then uses AWS inter-VM RTTs to quantify this portion of the latency.
- It then observes that in microservices, internal calls are the majority; Nightcore keeps internal calls within the same backend server instead of routing them through the gateway, eliminating this latency.
- The third point, that internal calls are the majority, is the important one.
Nightcore’s support for internal function calls makes most communication local, which means its inter-process communications (IPC) must be efficient. Popular, feature-rich RPC libraries like gRPC work for IPC (over Unix sockets), but gRPC’s protocol adds overheads of ~10 μs [60], motivating Nightcore to design its own message channels for IPC (§3.1). Nightcore’s message channels are built on top of OS pipes and transmit fixed-size 1KB messages because previous studies [83, 93] show that 1KB is sufficient for most microservice RPCs. Our measurements show Nightcore’s message channels deliver messages in 3.4 μs, while gRPC over Unix sockets takes 13 μs for sending 1KB RPC payloads.
Replacing Unix-socket RPC with pipe-based IPC is indeed one of the optimizations for internal calls.
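To get a feel for the kind of measurement behind these numbers, here is a minimal sketch that times round trips of fixed-size 1KB messages over a pair of OS pipes, the primitive Nightcore's channels are built on. It is Unix-only and written in Python, so the absolute numbers will be far larger than the paper's 3.4 μs C++ figure; only the method is illustrative.

```python
import os
import time

MSG = b"x" * 1024  # fixed-size 1KB message, as in Nightcore's channels

r1, w1 = os.pipe()  # parent -> child
r2, w2 = os.pipe()  # child -> parent

if os.fork() == 0:  # child: echo every message back
    os.close(w1); os.close(r2)
    while True:
        data = os.read(r1, 1024)  # 1KB writes are atomic (below PIPE_BUF)
        if not data:              # EOF: parent closed its write end
            os._exit(0)
        os.write(w2, data)

os.close(r1); os.close(w2)
N = 10_000
start = time.perf_counter_ns()
for _ in range(N):
    os.write(w1, MSG)
    os.read(r2, 1024)
end = time.perf_counter_ns()
print(f"avg pipe round trip: {(end - start) / N / 1000:.1f} us")
os.close(w1)
os.wait()  # reap the child
```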
Previous work has shown microsecond-scale latencies in Linux’s thread scheduler [60, 92, 100], leading data plane OSes [61, 77, 87, 91, 92, 94] to build their own schedulers for lower latency. Nightcore relies on Linux’s scheduler because building an efficient, time-sharing scheduler for microsecond-scale tasks is an ongoing research topic [63, 77, 84, 91, 96]. To support an invocation rate of ≥100K/s, Nightcore’s engine (§4.1) uses event-driven concurrency [23, 105], allowing it to handle many concurrent I/O events with a small number of OS threads. Our measurements show that 4 OS threads can handle an invocation rate of 100K/s. Furthermore, I/O threads in Nightcore’s engine can wake function worker threads (where function code is executed) via message channels, which ensures the engine’s dispatch suffers only a single wake-up delay from Linux’s scheduler.
Two issues come up here:
- One is Linux thread scheduling: it has microsecond-scale overheads, but this paper does not optimize there and simply uses Linux's scheduler.
- The other is concurrency, the same thing I have been studying; the "OS threads" here presumably follow an epoll-style model.
Existing FaaS systems do not provide concurrency management to applications. However, stage-based microservices create internal load variations even under a stable external request rate [73, 105]. Previous studies [38, 73, 104, 105] indicate overuse of concurrency for bursty loads can lead to worse overall performance. Nightcore, unlike existing FaaS systems, actively manages concurrency providing dynamically computed targets for concurrent function executions that adjust with input load (§3.3). Nightcore’s managed concurrency flattens CPU utilization (see Figure 4) such that overall performance and efficiency are improved, as well as being robust under varying request rates (§5.2).
This is rather abstract; it is not clear what concretely they intend to do.
- 3.1 replaces socket-based RPC with pipe-based inter-process communication.
- 3.2 handles internal requests, cutting gateway overhead.
- 3.3 is active concurrency management; unclear what it is for.
- 4.1 covers the I/O threads.
We evaluate the Nightcore prototype on four interactive microservices, each with a custom workload. Three are from DeathStarBench [70] and one is from Google Cloud [29]. These workloads are originally implemented in RPC servers, and we port them to Nightcore, as well as OpenFaaS [37] for comparison. The evaluation shows that only by carefully finding and eliminating microsecond-scale latencies can Nightcore use serverless functions to efficiently implement latency-sensitive microservices.
This paper makes the following contributions.
- Nightcore is a FaaS runtime optimized for microsecond-scale microservices. It achieves invocation latency overheads under 100 μs and efficiently supports invocation rates of 100K/s with low CPU usage. Nightcore is publicly available at GitHub ut-osa/nightcore.
- Nightcore’s design uses diverse techniques to eliminate microsecond-scale overheads, including a fast path for internal function calls, low-latency message channels for IPC, efficient threading for I/O, and function executions with dynamically computed concurrency hints (§3).
- With containerized RPC servers as the baseline, Nightcore achieves 1.36×–2.93× higher throughput and up to 69% reduction in tail latency, while OpenFaaS only achieves 29%–38% of baseline throughput and increases tail latency by up to 3.4× (§5).
2 BACKGROUND
Latency-Sensitive Interactive Microservices. Online services must scale to high concurrency, with response times small enough (a few tens of milliseconds) to deliver an interactive experience [58, 66, 106]. Interactive online services, once built with monolithic architectures, are undergoing a shift to microservice architectures [1, 4, 5, 42, 47], where a large application is built by connecting loosely coupled, single-purpose microservices. On the one hand, microservice architectures provide software engineering benefits such as modularity and agility as the scale and complexity of the application grow [36, 49]. On the other hand, staged designs for online services inherently provide better scalability and reliability, as shown in pioneering works like SEDA [105]. However, while the interactive nature of online services implies an end-to-end service-level objective (SLO) of a few tens of milliseconds, individual microservices face stricter latency SLOs, at the sub-millisecond scale for leaf microservices [100, 110].
Microservice architectures are more complex to operate compared to monolithic architectures [22, 35, 36], and the complexity grows with the number of microservices. Although microservices are designed to be loosely coupled, their failures are usually highly interdependent. For example, one overloaded service in the system can easily trigger failures of other services, eventually causing cascading failures [3]. Overload control for microservices is difficult because microservices call each other on data-dependent execution paths, creating dynamics that cannot be predicted or controlled from the runtime [38, 48, 88, 111]. Microservices are often comprised of services written in different programming languages and frameworks, further complicating their operational problems. By leveraging fully managed cloud services (e.g., Amazon’s DynamoDB [6], ElasticCache [7], S3 [19], Fargate [12], and Lambda [15]), responsibilities for scalability and availability (as well as operational complexity) are mostly shifted to cloud providers, motivating serverless microservices [20, 33, 41, 43–45, 52, 53].
First paragraph of the background: interactive microservices are highly latency-sensitive. Note the last sentence especially: end-to-end latency can be tens of milliseconds, but each individual microservice must be sub-millisecond, because a chain of microservice calls must all complete before the final result returns.

The second paragraph mainly says:
- Implementing microservices is very complex.
- The complex parts can be handed off to the cloud provider.
- Hence, serverless microservices.
Serverless Microservices. Simplifying the development and management of online services is the largest benefit of building microservices on serverless infrastructure. For example, scaling the service is automatically handled by the serverless runtime, deploying a new version of code is a push-button operation, and monitoring is integrated with the platform (e.g., CloudWatch [2] on AWS). Amazon promotes serverless microservices with the slogan "no server is easier to manage than no server" [44]. However, current FaaS systems have high runtime overheads (Table 1) that cannot always meet the strict latency requirement imposed by interactive microservices. Nightcore fills this performance gap.
Nightcore focuses on mid-tier services implementing stateless business logic in microservice-based online applications. These mid-tier microservices bridge the user-facing frontend and the data storage and fit naturally in the programming model of serverless functions. Online data-intensive (OLDI) microservices [100] represent another category of microservices, where the mid-tier service fans out requests to leaf microservices for parallel data processing. Microservices in OLDI applications are mostly stateful and memory intensive, and therefore are not a good fit for serverless functions. We leave serverless support of OLDI microservices as future work.
The programming model of serverless functions expects function invocations to be short-lived, which seems to contradict the assumption of service-oriented architectures which expect services to be long-running. However, FaaS systems like AWS Lambda allow clients to maintain long-lived connections to their API gateways [8], making a serverless function "service-like". Moreover, because AWS Lambda re-uses execution contexts for multiple function invocations [13], users’ code in serverless functions can also cache reusable resources (e.g., database connections) between invocations for better performance [17].
- First paragraph: the advantages of serverless, but there is a performance problem, failing the latency requirements raised in Section 1, and Nightcore will close this gap.
- It does not apply to online data-intensive microservices, because those are stateful; that is left as future work.
- One point never raised: cold start. Cold start exists because serverless function executions are short-lived, whereas microservices are long-running, and there is a huge problem lurking here. They want serverless's managed nature and long-running behavior at the same time, which is hard in Lambda, given that Lambda even imposes a timeout.

How to put it: this lacks persuasive force. It borrows AWS's serverless-computing pitch, that serverless is easy to manage and develop and thus suits ever more complex microservice architectures, yet Lambda is not long-running; the paper rationalizes this briefly, then drops the assumption to sidestep the cold-start problem and implements its own policy on other open-source stacks.
Optimizing FaaS Runtime Overheads. Reducing start-up latencies, especially cold-start latencies, is a major research focus for FaaS runtime overheads [57, 64, 67, 89, 90, 98]. Nightcore assumes sufficient resources have been provisioned and relevant function containers are in warm states, which can be achieved on AWS Lambda by using provisioned concurrency (AWS Lambda strongly recommends provisioned concurrency for latency-critical functions [40]). As techniques for optimizing cold-start latencies [89, 90] become mainstream, they can be applied to Nightcore.
Invocation latency overheads of FaaS systems are largely overlooked, as recent studies on serverless computing focus on data-intensive workloads such as big data analysis [75, 95], video analytics [59, 69], code compilation [68], and machine learning [65, 98], where function execution times range from hundreds of milliseconds to a few seconds. However, a few studies [62, 84] point out that the millisecond-scale invocation overheads of current FaaS systems make them a poor substrate for microservices with microsecond-scale latency targets. For serverless computing to be successful in new problem domains [71, 76, 84], it must address microsecond-scale overheads.
1. Cold start is not discussed.
2. The focus is not data-intensive applications.
3. The focus is invocation latency, driving that overhead down to the microsecond scale.

The paper keeps using Lambda to make its points: AWS's deep pool of backing resources, gateway latency, long-lived gateway connections, instance reuse, provisioned instances.
But so far there is no analysis of SAND, which also addresses chained calls.
3 DESIGN
Nightcore is designed to run serverless functions with sub-millisecond-scale execution times, and to efficiently process internal function calls, which are generated during the execution of a serverless function (not by an external client). Nightcore exposes a serverless function interface that is similar to AWS Lambda: users provide stateless function handlers written in supported programming languages. The only addition to this simple interface is that Nightcore’s runtime library provides APIs for fast internal function invocations.
3.1 System Architecture
Figure 2 depicts Nightcore’s design which mirrors the design of other FaaS systems starting with the separation of frontend and backend. Nightcore’s frontend is an API gateway for serving external function requests and other management requests (e.g., to register new functions), while the backend consists of several independent worker servers. This separation eases the availability and scalability of Nightcore, by making the frontend API gateway fault-tolerant and horizontally scaling backend worker servers. Each worker server runs a Nightcore engine process and function containers, where each function container has one registered serverless function, and each function has only one container on each worker server. Nightcore’s engine directly manages function containers and communicates with worker threads within containers.
Figure 2: Architecture of Nightcore (§3.1)
Internal Function Calls. Nightcore optimizes internal function calls locally on the same worker server, without going through the API gateway. Figure 2 depicts this fast path in Nightcore’s runtime library, which executes inside a function container. By optimizing the locality of dependent function calls, Nightcore brings performance close to a monolithic design. At the same time, different microservices remain logically independent and they execute on different worker servers, ensuring there is no single point of failure. Moreover, Nightcore preserves the engineering and deployment benefits of microservices such as diverse programming languages and software stacks.
Nightcore’s performance optimization for internal function calls assumes that an individual worker server is capable of running most function containers from a single microservice-based application. We believe this is justified because we measure small working sets for stateless microservices. For example, when running SocialNetwork [70] at its saturation throughput, the 11 stateless microservice containers consume only 432 MB of memory, while the host VM is provisioned with 16 GB. As current datacenter servers have growing numbers of CPUs and increasing memory sizes (e.g., AWS EC2 VMs have up to 96 vCPUs and 192 GB of memory), a single server can support the execution of thousands of containers [98, 109]. When it is not possible to schedule function containers on the same worker server, Nightcore falls back to scheduling internal function calls on different worker servers through the gateway.
Gateway. Nightcore’s gateway (Figure 2 ①) performs load balancing across worker servers for incoming function requests and forwards requests to Nightcore’s engine on worker servers. The gateway also uses external storage (e.g., Amazon’s S3) for saving function metadata and it periodically monitors resource utilization on all worker servers, to know when it should increase capacity by launching new servers.
Engine. The engine process (Figure 2 ②) is the most critical component of Nightcore for achieving microsecond-scale invocation latencies because it invokes functions on each worker server. Nightcore’s engine responds to function requests from both the gateway and from the runtime library within function containers. It creates low-latency message channels to communicate with function workers and launchers inside function containers (§4.1). Nightcore’s engine is event-driven (Figure 5), allowing it to manage hundreds of message channels using a small number of OS threads. Nightcore’s engine maintains two important data structures: (1) per-function dispatching queues for dispatching function requests to function worker threads (Figure 2 ③); (2) per-request tracing logs for tracking the life cycle of all inflight function invocations, used for computing the proper concurrency level for function execution (Figure 2 ④).
This engine is, frankly, a web server that uses pipes to communicate with the container processes; the "event-driven" design is essentially the epoll mechanism.
The most distinctive part is the concurrency-control unit, which presumably resizes the thread pool according to the request load.
Function Containers. Function containers (Figure 2 ⑤) provide isolated environments for executing user-provided function code. Inside the function container, there is a launcher process, and one or more worker processes depending on the programming language implementation (see §4.2 for details). Worker threads within worker processes receive function requests from Nightcore’s engine and execute user-provided function code. Worker processes also contain a Nightcore runtime library, exposing APIs for user-provided function code. The runtime library includes APIs for fast internal function calls without going through the gateway. Nightcore’s internal function calls directly contact the dispatcher to enqueue the calls that are executed on the same worker server without having to involve the gateway.
Nightcore has different implementations of worker processes for each supported programming language. The notion of "worker threads" is particularly malleable because different programming languages have different threading models. Furthermore, Nightcore’s engine does not distinguish worker threads from worker processes, as it maintains communication channels with each individual worker thread. For clarity of exposition, we assume the simplest case in this Section (which holds for the C/C++ implementation), where "worker threads" are OS threads (details for other languages in §4.2).
Isolation in Nightcore. Nightcore provides container-level isolation between different functions but does not guarantee isolation between different invocations of the same function. We believe this is a reasonable trade-off for microservices, as creating a clean isolated execution environment within tens of microseconds is too challenging for current systems. When using RPC servers to implement microservices, different RPC calls of the same service can be concurrently processed within the same process, so Nightcore’s isolation guarantee is as strong as containerized RPC servers.
Previous FaaS systems all have different trade-offs between isolation and performance. OpenFaaS [51] allows concurrent invocations within the same function worker process, which is the same as Nightcore. AWS Lambda [13] does not allow concurrent invocations in the same container/MicroVM but allows execution environments to be re-used by subsequent invocations. SAND [57] has two levels of isolation: different applications are isolated by containers, but concurrent invocations within the same application are only isolated by processes. Faasm [98] leverages the software-based fault isolation provided by WebAssembly, allowing a new execution environment to be created within hundreds of microseconds, but it relies on language-level isolation which is weaker than container-based isolation.
Message Channels. Nightcore’s message channels are designed for low-latency message passing between its engine and other components, and carry fixed-size 1KB messages. The first 64 bytes of a message are the header, which contains the message type and other metadata, while the remaining 960 bytes are message payloads. There are three types of messages relevant to function invocations:
- Dispatch, used by the engine for dispatching function requests to worker threads (④ in Figure 3).
- Completion, used by function worker threads for sending outputs back to the engine (⑥ in Figure 3), as well as by the engine for sending outputs of internal function calls (⑦ in Figure 3).
- Invoke, used by Nightcore’s runtime library for initiating internal function calls (② in Figure 3).
When payload buffers are not large enough for function inputs or outputs, Nightcore creates extra shared memory buffers for exchanging data. In our experiments, these overflow buffers are needed for less than 1% of the messages for most workloads, though HipsterShop needs them for 9.7% of messages. When overflow buffers are required, they fit within 5KB 99.9% of the time. Previous work [83] has shown that 1KB is sufficient for more than 97% of microservice RPCs.
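A minimal sketch of this wire format follows. Only the 64-byte header / 960-byte payload split and the three message types come from the text; the exact header field layout here is an assumption.

```python
import struct

# Assumed little-endian header layout: type (1B), padding, request id,
# timestamp, payload size, then padding to exactly 64 bytes.
HEADER_FMT = "<B15xQQQ24x"
HEADER_SIZE = struct.calcsize(HEADER_FMT)    # 64
PAYLOAD_SIZE = 1024 - HEADER_SIZE            # 960 bytes of inline payload

MSG_DISPATCH, MSG_COMPLETION, MSG_INVOKE = 1, 2, 3

def encode(msg_type: int, req_id: int, ts_ns: int, payload: bytes) -> bytes:
    # Larger payloads would go to a shared memory overflow buffer instead.
    assert len(payload) <= PAYLOAD_SIZE
    header = struct.pack(HEADER_FMT, msg_type, req_id, ts_ns, len(payload))
    return header + payload.ljust(PAYLOAD_SIZE, b"\0")  # always exactly 1KB
```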
3.2 Processing Function Requests
Figure 3 shows an example with both an external and internal function call. Suppose the code of Fn𝑥 includes an invocation of Fn𝑦. In this case, Fn𝑦 is invoked via Nightcore’s runtime API (①). Then, Nightcore’s runtime library generates a unique ID (denoted by 𝑟𝑒𝑞𝑦) for the new invocation and sends an internal function call request to Nightcore’s engine (②). On receiving the request, the engine records 𝑟𝑒𝑞𝑦’s receive timestamp (also ②). Next, the engine places 𝑟𝑒𝑞𝑦 in the dispatching queue of Fn𝑦 (③). Once there is an idle worker thread for Fn𝑦 and the concurrency level of Fn𝑦 allows, the engine will dispatch 𝑟𝑒𝑞𝑦 to it, and record 𝑟𝑒𝑞𝑦’s dispatch timestamp in its tracing log (④). The selected worker thread executes Fn𝑦’s code (⑤) and sends the output back to the engine (⑥). On receiving the output, the engine records 𝑟𝑒𝑞𝑦’s completion timestamp (also ⑥), and directs the function output back to Fn𝑥’s worker (⑦). Finally, execution flow returns back to user-provided Fn𝑥 code (⑧).
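The per-request tracing log mentioned in §3.1 records the three timestamps named in these steps. A hypothetical entry might look like the sketch below; the field and method names are mine, and the derived quantities feed the moving averages used in §3.3.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceEntry:
    request_id: int
    recv_ts: float = field(default_factory=time.monotonic)  # step 2
    dispatch_ts: Optional[float] = None                      # step 4
    complete_ts: Optional[float] = None                      # step 6

    def queueing_delay(self) -> float:
        # Time spent waiting in the per-function dispatching queue.
        return self.dispatch_ts - self.recv_ts

    def processing_time(self) -> float:
        # Feeds the t_k moving average used for concurrency hints.
        return self.complete_ts - self.dispatch_ts
```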
3.3 Managing Concurrency for Function Executions (𝜏𝑘)
Nightcore maintains a pool of worker threads in function containers for concurrently executing functions, but deciding the size of thread pools can be a hard problem. One obvious approach is to always create new worker threads when needed, thereby maximizing the concurrency for function executions. However, this approach is problematic for microservice-based applications, where one function often calls many others. Maximizing the concurrency of function invocations with high fanout can have a domino effect that overloads a server. The problem is compounded when function execution time is short. In such cases, overload happens too quickly for a runtime system to notice it and respond appropriately.
To address the problem, Nightcore adaptively manages the number of concurrent function executions, to achieve the highest useful concurrency level while preventing instantaneous server overload. Following Little’s law, the ideal concurrency can be estimated as the product of the average request rate and the average processing time. For a registered function Fn𝑘, Nightcore’s engine maintains exponential moving averages of its invocation rate (denoted by 𝜆𝑘) and function execution time (denoted by 𝑡𝑘). Both are computed from request tracing logs. Nightcore uses their product 𝜆𝑘 · 𝑡𝑘 as the concurrency hint (denoted by 𝜏𝑘) for function Fn𝑘.
When receiving an invocation request of Fn𝑘, the engine will only dispatch the request if there are fewer than 𝜏𝑘 concurrent executions of Fn𝑘. Otherwise, the request will be queued, waiting for other function executions to finish. In other words, the engine ensures the maximum concurrency of Fn𝑘 is 𝜏𝑘 at any moment. Note that Nightcore’s approach is adaptive because 𝜏𝑘 is computed from two exponential moving averages (𝜆𝑘 and 𝑡𝑘 ), that change over time as new function requests are received and executed. To realize the desired concurrency level, Nightcore must also maintain a worker thread pool with at least 𝜏𝑘 threads. However, the dynamic nature of 𝜏𝑘 makes it change rapidly (see Figure 6), and frequent creation and termination of threads are not performant. To modulate the dynamic values of 𝜏𝑘, Nightcore allows more than 𝜏𝑘 threads to exist in the pool, but only uses 𝜏𝑘 of them. It terminates extra threads when there are more than 2𝜏𝑘 threads.
Nightcore’s managed concurrency is fully automatic, without any knowledge or hints from users. The concurrency hint (𝜏𝑘) changes frequently at the scale of microseconds, to adapt to load variation from microsecond-scale microservices (§5.2). Figure 4 demonstrates the importance of managing concurrency levels instead of maximizing them. Even when running at a fixed input rate, CPU utilization varies quite a bit for both OpenFaaS and Nightcore when the runtime maximizes the concurrency. On the other hand, managing concurrency with hints has a dramatic "flatten-the-curve" benefit for CPU utilization.
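A sketch of the dispatch gate this section describes: requests for a function are dispatched only while fewer than 𝜏𝑘 executions are in flight, otherwise they wait in the per-function queue. Here 𝜏𝑘 is just a field updated elsewhere (its computation is shown in §4.1); the class and callback names are assumptions.

```python
import collections

class FunctionDispatcher:
    def __init__(self):
        self.queue = collections.deque()  # per-function dispatching queue
        self.inflight = 0
        self.tau = 1                      # concurrency hint, updated from moving averages

    def on_request(self, req, dispatch):
        self.queue.append(req)
        self._drain(dispatch)

    def on_completion(self, dispatch):
        self.inflight -= 1
        self._drain(dispatch)             # a completion may unblock a queued request

    def _drain(self, dispatch):
        # Never allow more than tau concurrent executions of this function.
        while self.queue and self.inflight < self.tau:
            self.inflight += 1
            dispatch(self.queue.popleft())  # hand off to an idle worker thread
```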
I can't be bothered to annotate the implementation part in detail; honestly, I don't think much of this paper.
4 IMPLEMENTATION
Nightcore’s API gateway and engine consist of 8,874 lines of C++. Function workers are supported in C/C++, Go, Node.js, and Python, requiring 1,588 lines of C++, 893 lines of Go, 57 lines of JavaScript, and 137 lines of Python.
Nightcore’s engine (its most performance-critical component) is implemented in C++. Garbage collection can have a significant impact on latency-sensitive services [107] and short-lived routines [27, 28]. Both OpenFaaS [37] and Apache OpenWhisk [50] are implemented with garbage-collected languages (Go and Scala, respectively), but Nightcore eschews garbage collection in keeping with its theme of addressing microsecond-scale latencies.
4.1 Nightcore’s Engine
Figure 5 shows the event-driven design of Nightcore’s engine as it responds to I/O events from the gateway and message channels. Each I/O thread maintains a fixed number of persistent TCP connections to the gateway for receiving function requests and sending back responses, while message channels are assigned to I/O threads with a round-robin policy. Individual I/O threads can only read from and write to their own TCP connections and message channels. Shared data structures including dispatching queues and tracing logs are protected by mutexes, as they can be accessed by different I/O threads.
Event-Driven IO Threads. Nightcore’s engine adopts libuv [32], which is built on top of the epoll system call, to implement its event-driven design. libuv provides APIs for watching events on file descriptors and registering handlers for those events. Each IO thread of the engine runs a libuv event loop, which polls for file descriptor events and executes registered handlers.
I once considered using libuv myself; it's the library underlying Node.js, implementing asynchronous I/O with callbacks.
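For readers unfamiliar with the pattern, the sketch below shows an analogous event loop using Python's selectors module (epoll on Linux) in place of libuv: register file descriptors with handlers, poll for readiness, run the handlers. It is an illustration of the design, not Nightcore's engine.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, on_message)

def on_message(conn):
    data = conn.recv(1024)  # one fixed-size message per read
    if not data:
        sel.unregister(conn)
        conn.close()
        return
    conn.sendall(data)      # placeholder handler: echo the message back

server = socket.create_server(("127.0.0.1", 8080))
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:                 # the event loop: poll, then run registered handlers
    for key, _ in sel.select():
        key.data(key.fileobj)
```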
Message Channels. Nightcore’s message channels are implemented with two Linux pipes in opposite directions to form a full-duplex connection. Meanwhile, shared memory buffers are used when inline payload buffers are not large enough for function inputs or outputs (§3.1). Although shared memory allows fast IPC at memory speed, it lacks an efficient mechanism to notify the consumer thread when data is available. Nightcore’s use of pipes and shared memory gets the best of both worlds. It allows the consumer to be eventually notified through a blocking read on the pipe, and at the same time, it provides the low latency and high throughput of shared memory when transferring large message payloads.
As the engine and function workers are isolated in different containers, Nightcore mounts a shared tmpfs directory between their containers, to aid the setup of pipes and shared memory buffers. Nightcore creates named pipes in the shared tmpfs, allowing function workers to connect. Shared memory buffers are implemented by creating files in the shared tmpfs, which are mmaped with the MAP_SHARED flag by both the engine and function workers. Docker by itself supports sharing IPC namespaces between containers [31], but the setup is difficult for Docker’s cluster mode. Nightcore’s approach is functionally identical to IPC namespaces, as Linux’s System V shared memory is internally implemented by tmpfs [46].
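A sketch of this setup, assuming a hypothetical shared tmpfs path and file naming; both containers would mount the same directory and run equivalent code.

```python
import mmap
import os

SHARED_DIR = "/dev/shm/nightcore"  # assumed tmpfs mount shared by both containers
os.makedirs(SHARED_DIR, exist_ok=True)

# Named pipe: created by the engine, opened by a function worker to connect.
fifo_path = os.path.join(SHARED_DIR, "worker_0.fifo")
if not os.path.exists(fifo_path):
    os.mkfifo(fifo_path)

# Shared memory overflow buffer: a plain file in tmpfs, mmap'ed with
# MAP_SHARED by both the engine and the function worker.
buf_path = os.path.join(SHARED_DIR, "req_42.buf")  # hypothetical naming
size = 8192
with open(buf_path, "wb+") as f:
    f.truncate(size)
    buf = mmap.mmap(f.fileno(), size, flags=mmap.MAP_SHARED)
    buf[:5] = b"hello"  # visible to the other container mapping the same file
```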
Communications between Function Worker Threads. Individual worker threads within function containers connect to Nightcore’s engine with a message channel for receiving new function requests and sending responses (④ and ⑥ in Figure 3). A worker thread can be either busy (executing function code) or idle. During the execution of function code, the worker thread’s message channel is also used by Nightcore’s runtime library for internal function calls (② and ⑦ in Figure 3). When a worker thread finishes executing function code, it sends a response message with the function output to the engine and enters the idle state. An idle worker thread is put to sleep by the operating system, but the engine can wake it by writing a function request message to its message channel. The engine tracks the busy/idle state of each worker so there is no queuing at worker threads: the engine only dispatches requests to idle workers.
Mailbox. The design of Nightcore’s engine only allows individual I/O threads to write data to message channels assigned to it (shown as violet arrows in Figure 5). In certain cases, however, an I/O thread needs to communicate with a thread that does not share a message channel. Nightcore routes these requests using per-thread mailboxes. When an I/O thread drops a message in the mailbox of another thread, uv_async_send (using eventfd [24] internally) is called to notify the event loop of the owner thread.
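A rough Python approximation of such a mailbox, using os.eventfd (Linux, Python 3.10+) where libuv uses uv_async_send; the class shape is an assumption.

```python
import collections
import os
import threading

class Mailbox:
    def __init__(self):
        self.efd = os.eventfd(0)       # pollable by the owner thread's event loop
        self.messages = collections.deque()
        self.lock = threading.Lock()

    def put(self, msg):
        # Called from any I/O thread that does not own the target channel.
        with self.lock:
            self.messages.append(msg)
        os.eventfd_write(self.efd, 1)  # wake the owner's event loop

    def drain(self):
        # Called by the owner thread when its loop sees efd readable.
        os.eventfd_read(self.efd)      # reset the counter
        with self.lock:
            msgs = list(self.messages)
            self.messages.clear()
        return msgs
```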
**Computing Concurrency Hints (𝜏𝑘).** To properly regulate the amount of concurrent function executions, Nightcore’s engine maintains two exponential moving averages 𝜆𝑘 (invocation rate) and 𝑡𝑘 (processing time) for each function Fn𝑘 (§3.3). Samples of invocation rates are computed as 1/(interval between consecutive Fn𝑘 invocations), while processing times are computed as intervals between dispatch and completion timestamps, excluding queueing delays (the interval between receive and dispatch timestamps) from sub-invocations. Nightcore uses a coefficient 𝛼 = 10^−3 for computing exponential moving averages.
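Turning this paragraph into code, a sketch of the bookkeeping behind 𝜏𝑘; only α = 10^−3, the sampling rules, and Little's law come from the text, while the names are mine.

```python
ALPHA = 1e-3  # coefficient for both exponential moving averages

class ConcurrencyHint:
    def __init__(self):
        self.rate = 0.0        # lambda_k: invocations per second (EMA)
        self.proc_time = 0.0   # t_k: seconds per invocation (EMA)
        self.last_recv = None

    def on_receive(self, recv_ts: float):
        # Rate sample: 1 / (interval between consecutive invocations).
        if self.last_recv is not None:
            sample = 1.0 / (recv_ts - self.last_recv)
            self.rate += ALPHA * (sample - self.rate)
        self.last_recv = recv_ts

    def on_complete(self, dispatch_ts: float, complete_ts: float,
                    subcall_queueing: float = 0.0):
        # Processing-time sample excludes queueing delays of sub-invocations.
        sample = (complete_ts - dispatch_ts) - subcall_queueing
        self.proc_time += ALPHA * (sample - self.proc_time)

    @property
    def tau(self) -> float:
        return self.rate * self.proc_time  # Little's law: tau_k = lambda_k * t_k
```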
4.2 Function Workers
Nightcore executes user-provided function code in its function worker processes (§3.1). As different programming languages have different abstractions for threading and I/O, Nightcore has different function worker implementations for them.
Nightcore’s implementation of function workers also includes a runtime library for fast internal function calls. Nightcore’s runtime library exposes a simple API output := nc_fn_call(fn_name, input) to user-provided function code for internal function calls. Furthermore, Nightcore’s runtime library provides Apache Thrift [9] and gRPC [30] wrappers for its function call API, easing porting of existing Thrift-based and gRPC-based microservices to Nightcore.
C/C++. Nightcore’s C/C++ function workers create OS threads for executing users’ code, loaded as dynamically linked libraries. These OS threads map to "worker threads" in Nightcore’s design (§3.1 and Figure 2). To simplify the implementation, each C/C++ function worker process only runs one worker thread, and the launcher will fork more worker processes when the engine asks for more worker threads.
Go. In Go function workers, worker threads map to goroutines, the user-level threads provided by Go’s runtime, and the launcher only forks one Go worker process. Users’ code is compiled together with Nightcore’s Go worker implementation, as Go’s runtime does not support dynamic loading of arbitrary Go code. Go’s runtime allows dynamically setting the maximum number of OS threads for running goroutines (via runtime.GOMAXPROCS), and Nightcore’s implementation sets it to ⌈worker goroutines / 8⌉.
Node.js and Python. Node.js follows an event-driven design where all I/O is asynchronous without depending on multi-threading, while Python is the same when using the asyncio [11] library for I/O. In both cases, Nightcore implements its message channel protocol within their event loops. As there are no parallel threads inside Node.js and Python function workers, launching a new "worker thread" simply means creating a message channel, while the engine’s notion of "worker threads" becomes event-based concurrency [23]. Also, nc_fn_call is an asynchronous API in Node.js and Python workers, rather than being synchronous as in C/C++ and Go workers. For Node.js and Python functions, the launcher only forks one worker process.
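To make the API concrete, here is a hypothetical Python handler using the asynchronous nc_fn_call described above; the import path and the service names are invented for illustration.

```python
# Assumed import path for Nightcore's Python runtime library.
from nightcore import nc_fn_call

async def handler(input: bytes) -> bytes:
    # Internal function calls go straight to the engine over the worker's
    # message channel, bypassing the API gateway entirely.
    user = await nc_fn_call("UserService", input)
    timeline = await nc_fn_call("TimelineService", user)
    return timeline
```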