Spike Demand
Microbenchmarks are great at measuring performance "in the small"; for example, measuring the performance of individual methods. But good results do not necessarily translate into macro-scale performance. Real world access patterns and demand loads often run into deeper, systemic, architectural design issues that cannot be discerned at the micro level.
HikariCP has over 1 million users, so from time to time we are approached with challenges encountered "in the wild". Recently, one such challenge led to a deeper investigation: Spike Demand.
The Challenge
The user has an environment where connection creation is expensive, on the order of 150ms; and yet queries typically execute in ~2ms. Long connection setup times can be the result of various factors, alone or in combination: DNS resolution times, encrypted connections with strong encryption (2048/4096 bit), external authentication, database server load, etc.
Generally speaking, for the best performance in response to spike demands, HikariCP recommends a fixed-size pool.
Unfortunately, the user's application is also in an environment where many other applications are connected to the same database, and therefore dynamically-sized pools are desirable -- where idle applications are allowed to give up some of their connections. The user is running the application with HikariCP configured as minimumIdle=5.
In this environment, the application has periods of quiet, as well as sudden spikes of requests, and periods of sustained activity. The combination of high connection setup times, a dynamically-sized pool requirement, and spike demands is just about the worst case scenario for a connection pool.
The questions ultimately were these:
- If the pool is sitting idle with 5 connections, and is suddenly hit with 50 requests, what should happen?
- Given that each new connection is going to take 150ms to establish, and given that each request can ultimately be satisfied in ~2ms, shouldn't even a single one of the idle connections be able to handle all of the requests in ~100ms anyway?
- So, why is the pool size growing [so much]?
We thought these were interesting questions, and HikariCP was indeed creating more connections than we expected...
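The arithmetic behind the second question can be checked directly. The sketch below is only a back-of-the-envelope model: the class and method names are illustrative (not part of HikariCP), and it assumes queries run back-to-back on each connection with no scheduling or hand-off overhead.

```java
// Back-of-the-envelope model; names are illustrative, not part of HikariCP.
final class SpikeMath {
    /** Milliseconds for `connections` connections to drain `requests` queries of `queryMs` each. */
    static long drainMillis(int requests, int connections, long queryMs) {
        long rounds = (requests + connections - 1) / connections; // ceiling division
        return rounds * queryMs;
    }

    public static void main(String[] args) {
        long connectMs = 150; // connection establishment time from the challenge
        long queryMs = 2;     // typical query time from the challenge
        System.out.println("1 connection:  " + drainMillis(50, 1, queryMs) + " ms");
        System.out.println("5 connections: " + drainMillis(50, 5, queryMs) + " ms");
        System.out.println("A new connection is usable only after " + connectMs + " ms");
    }
}
```

Under these assumptions a single connection clears all 50 requests in 100 ms, and the five idle connections clear them in 20 ms; both are shorter than the 150 ms needed before any newly requested connection becomes usable.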
3, 2, 1 ... Go!
In order to explore these questions, we built a simulation and started measuring. The simulation harness code is here.
The constraints are simple:
- Connection establishment takes 150ms.
- Query execution takes 2ms.
- The maximum pool size is 50.
- The minimum idle connections is 5.
And the simulation is fairly simple:
- Everything is quiet, and then ... Boom! ... 50 threads, at once, wanting a connection and to execute a query.
- Take measurements every 250μs (microseconds).
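For readers who want a feel for the shape of such a spike, here is a minimal stand-in written with only the Java standard library. It is not the harness referenced above and does not use HikariCP: a `Semaphore` plays the role of five already-established connections (the pool never grows in this sketch), `Thread.sleep` stands in for the 2ms query, and all names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the spike; NOT the actual simulation harness.
final class SpikeSketch {
    /** Releases `threads` workers at once against `idle` connections; returns the number of completed queries. */
    static int run(int threads, int idle, long queryMs) {
        Semaphore connections = new Semaphore(idle);      // models the idle connections
        CountDownLatch startGate = new CountDownLatch(1); // holds every worker until the "Boom!"
        CountDownLatch done = new CountDownLatch(threads);
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                try {
                    startGate.await();
                    connections.acquire();                // wait for a connection
                    try {
                        Thread.sleep(queryMs);            // simulated query
                        completed.incrementAndGet();
                    } finally {
                        connections.release();            // return the connection
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        startGate.countDown();                            // every worker asks for a connection at once
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int completed = run(50, 5, 2);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(completed + " queries completed in ~" + elapsedMs + " ms");
    }
}
```

The start gate is what makes this a spike rather than a ramp: no worker requests a connection until all of them have been created and are waiting.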
Results
After running HikariCP through the simulation, tweaking the code (ultimately a one-line change), and satisfying ourselves that the behavior is as we would wish, we ran a few other pools through the simulation.
The code was run as follows:
bash$ ./spiketest.sh 150 <pool> 50
Where 150 is the connection establishment time, <pool> is one of [hikari, dbcp2, vibur, tomcat, c3p0], and 50 is the number of threads/requests. Note that c3p0 was dropped from the analysis here, as its run time was ~120x that of HikariCP.
HikariCP (v2.6.0) raw data
Apache DBCP (v2.1.1) raw data
Apache Tomcat (v8.0.24) raw data
Vibur DBCP (v16.1) raw data
Apache DBCP vs HikariCP
If you did not look closely at the graphs above, here is a complete side-by-side comparison.
Apache DBCP is on top; HikariCP is on the bottom.
Commentary
We'll start by saying that we are not going to comment on the implementation specifics of the other pools, but you may be able to draw inferences by our comments regarding HikariCP.
Looking at the HikariCP graph, we couldn't have wished for a better profile; it's about as close to perfect efficiency as we could expect. It is interesting, though not surprising, that the other pool profiles are so similar to each other. Even though arrived at via different implementations, they are the result of a conventional or obvious approach to pool design.
HikariCP's profile in this case, and the reason for the difference observed between other pools, is the result of our Prime Directive:
💡 User threads should only ever block on the pool itself.
Consider this hypothetical scenario: There is a pool with five connections in-use, and zero idle (available) connections. Then, a new thread comes in requesting a connection.
"How does the prime directive apply in this case?" We'll answer with a question of our own:
If the thread is directed to create a new connection, and that connection takes 150ms to establish, what happens if one of the five in-use connections is returned to the pool?
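To make the question concrete, compare how long the new thread waits under the two choices. This is an illustrative model only, with hypothetical names: it assumes the 150ms establishment time from the challenge, and that one of the in-use connections is returned after some `returnMs`.

```java
// Illustrative comparison; names and the model itself are assumptions, not HikariCP code.
final class PrimeDirectiveSketch {
    /** Wait when the calling thread itself is committed to creating the connection. */
    static long waitWhenCallerCreates(long connectMs) {
        return connectMs; // the thread is busy connecting and cannot use a returned connection
    }

    /** Wait when the thread blocks on the pool while a connection is created in the background. */
    static long waitWhenBlockingOnPool(long connectMs, long returnMs) {
        return Math.min(connectMs, returnMs); // whichever connection becomes available first
    }

    public static void main(String[] args) {
        long connectMs = 150; // connection establishment time
        long returnMs = 2;    // an in-use connection finishes its query and is returned
        System.out.println("caller creates: " + waitWhenCallerCreates(connectMs) + " ms");
        System.out.println("blocks on pool: " + waitWhenBlockingOnPool(connectMs, returnMs) + " ms");
    }
}
```

In this model, a thread that blocks on the pool is served as soon as any connection is available, whether it was returned or newly created, while a thread that creates its own connection pays the full establishment time regardless.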
Both Apache DBCP2 and Vibur ended the run with 45 connections, Apache Tomcat (inexplicably) with 40 connections, while HikariCP ended the run with 5 (technically six, see below). This has major and measurable effects for real world deployments. That is 35-40 additional connections that are not available to other applications, and 35-40 additional threads and associated memory structures in the database.
We know what you are thinking, "What if the load had been sustained?" The answer is: HikariCP also would have ramped up.
In point of fact, as soon as the pool hit zero available connections, right around 800μs into the run, HikariCP began requesting connections to be added to the pool asynchronously. If the metrics had continued to be collected past the end of the spike -- out beyond 150ms -- you would observe that an additional connection is indeed added to the pool. But only one, because HikariCP employs elision logic; at that point HikariCP would also realize that there is actually no more pending demand, and the remaining connection acquisitions would be elided.
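The "technically six" outcome can be reproduced with a small deterministic model. The sketch below is not HikariCP's implementation: it assumes a single background creator that works through queued add requests one at a time, checks for pending demand just before starting each one, and it ignores the fact that a newly added connection would itself help drain the queue. All names are hypothetical.

```java
// Deterministic model of the elision idea; an assumption-laden sketch, not HikariCP's code.
final class ElisionModel {
    /** Final pool size after a spike of `requests` against `idle` connections. */
    static int finalPoolSize(int idle, int requests, long queryMs, long connectMs) {
        long rounds = (requests + idle - 1) / idle;     // ceiling division
        long drainedAt = rounds * queryMs;              // time at which no thread is waiting any more
        int queuedAdds = Math.max(0, requests - idle);  // one add request per thread that found the pool empty
        int size = idle;
        long now = 0;                                   // simulated clock of the single background creator
        for (int i = 0; i < queuedAdds; i++) {
            boolean pendingDemand = now < drainedAt;
            if (!pendingDemand) {
                continue;                               // elide: nobody is waiting for this connection
            }
            now += connectMs;                           // the expensive connection setup
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println("final pool size: " + finalPoolSize(5, 50, 2, 150));
    }
}
```

With the simulation's parameters the first add request starts while threads are still waiting, so it runs to completion at 150ms; by then the spike has long since drained (at about 20ms in this model), and the remaining 44 requests are elided, leaving six connections.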
Epilog
This scenario represents only one of many access patterns. HikariCP will continue to research and innovate when presented with challenging problems encountered in real world deployments. As always, thank you for your patronage.