现象
接口频繁499, 域名报警逐渐增多.
排查
- 查看线上日志:调用服务B接口,超时非常多,最长响应时间几千ms.
- 找到服务B负责人进行查看,因为没有采用相同trace方式,只能给出大概时间范围,但是响应时间并未有超过100ms的.考虑是网络原因.
- 找OP查看网络是否出现问题,自己查看ping延迟不高.等待OP回复.
- OP说网络没问题,但是暂时没其他思路,没考虑应用内部导致.
- 偶然间查看调用服务B接口的配置,发现两个参数
minConnection
与maxConnection
后产生新思路,因为引用的是服务B的sdk,内部是不是采用了连接池,当时配置数量较小,难道是流量上来之后获取连接导致了阻塞? - 找到服务B的同学确认内部实现,经确认后并未采用连接池,这两个参数没作用,是保留参数.
- 又没什么其他思路了,考虑可能应用内部问题,这块补充说明下在开始是看了gc log的发现一次minor gc时间在50-100ms所以排除是gc导致的.
- 实在没其他猜测,gclog再看一遍吧!把gc log拿下来,使用gceasy.io进行解析.
-
分析后发现产生了,几次full gc,我们使用的是G1 gc,产生full gc是比较严重的现象.
fullgc.png - 找到发生full gc之前逐行查看,可以看到堆内存基本上已经满了,对于G1 gc来说已经是非常不健康的状态了.
2020-03-20T11:08:24.499+0800: 1433856.095: [SoftReference, 0 refs, 0.0031353 secs]2020-03-20T11:08:24.502+0800: 1433856.098: [WeakReference, 0 refs, 0.0010686 secs]2020-03-20T11:08:24.503+0800: 1433856.099: [FinalReference, 0 refs, 0.0012377 secs]2020-03-20T11:08:24.504+0800: 1433856.101: [PhantomReference, 0 refs, 0 refs, 0.0023456 secs]2020-03-20T11:08:24.507+0800: 1433856.103: [JNI Weak Reference, 0.0000457 secs], 0.0284242 secs]
[Parallel Time: 15.4 ms, GC Workers: 18]
[GC Worker Start (ms): Min: 1433856077.7, Avg: 1433856077.9, Max: 1433856078.1, Diff: 0.3]
[Ext Root Scanning (ms): Min: 7.9, Avg: 8.8, Max: 15.2, Diff: 7.3, Sum: 157.8]
[Update RS (ms): Min: 0.0, Avg: 0.1, Max: 1.0, Diff: 1.0, Sum: 1.1]
[Processed Buffers: Min: 0, Avg: 0.1, Max: 1, Diff: 1, Sum: 2]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Object Copy (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 3.0]
[Termination (ms): Min: 0.0, Avg: 6.0, Max: 6.8, Diff: 6.8, Sum: 107.2]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
[GC Worker Total (ms): Min: 14.9, Avg: 15.1, Max: 15.2, Diff: 0.3, Sum: 271.2]
[GC Worker End (ms): Min: 1433856093.0, Avg: 1433856093.0, Max: 1433856093.0, Diff: 0.0]
[Code Root Fixup: 0.2 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.2 ms]
[Other: 12.6 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 9.7 ms]
[Ref Enq: 0.2 ms]
[Redirty Cards: 0.3 ms]
[Humongous Register: 0.4 ms]
[Humongous Reclaim: 0.1 ms]
[Free CSet: 0.0 ms]
[Eden: 0.0B(204.0M)->0.0B(204.0M) Survivors: 0.0B->0.0B Heap: 4010.2M(4096.0M)->4010.2M(4096.0M)]
# 之前指关注了时间30ms还不错,但是没关注内存的分布情况 因为设置目标暂停时间最大为200ms,为了实现这个暂停时间eden区域已经减少了非常多
[Times: user=0.20 sys=0.00, real=0.03 secs]
2020-03-20T11:08:24.510+0800: 1433856.106: Total time for which application threads were stopped: 0.0454852 seconds, Stopping threads took: 0.0002360 seconds
# 尝试扩展堆大小 但是已经是最大了不能在扩展了
1433856.121: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 24 bytes]
1433856.121: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 2097152 bytes, attempted expansion amount: 2097152 bytes]
1433856.121: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded]
# 发生full gc原因是Allocation Failure分配失败.
2020-03-20T11:08:24.525+0800: 1433856.122: [Full GC (Allocation Failure) 2020-03-20T11:08:25.194+0800: 1433856.790: [SoftReference, 4841 refs, 0.0011466 secs]2020-03-20T11:08:25.195+0800: 1433856.791: [WeakReference, 136779 refs, 0.0211710 secs]2020-03-20T11:08:25.216+0800: 1433856.813: [FinalReference, 2718 refs, 0.0040877 secs]2020-03-20T11:08:25.220+0800: 1433856.817: [PhantomReference, 0 refs, 1140 refs, 0.0003886 secs]2020-03-20T11:08:25.221+0800: 1433856.817: [JNI Weak Reference, 0.0000759 secs] 4010M->788M(4096M), 2.6614008 secs]
- 继续向上寻找蛛丝马迹
2020-03-20T11:08:22.214+0800: 1433853.810: [SoftReference, 0 refs, 0.0065309 secs]2020-03-20T11:08:22.221+0800: 1433853.817: [WeakReference, 0 refs, 0.0045773 secs]2020-03-20T11:08:22.225+0800: 1433853.822: [FinalReference, 0 refs, 0.0010537 secs]2020-03-20T11:08:22.226+0800: 1433853.823: [PhantomReference, 0 refs, 0 refs, 0.0019795 secs]2020-03-20T11:08:22.228+0800: 1433853.825: [JNI Weak Reference, 0.0000575 secs] 1433854.362: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 3837788160 bytes, allocation request: 0 bytes, threshold: 2147483600 bytes (50.00 %), source: end of GC]
(to-space exhausted), 1.3927102 secs]
# 比较要命的一句话,代表已经没有region给我们copy对象了
[Parallel Time: 837.7 ms, GC Workers: 18]
[GC Worker Start (ms): Min: 1433852972.1, Avg: 1433852972.2, Max: 1433852972.3, Diff: 0.2]
[Ext Root Scanning (ms): Min: 4.9, Avg: 6.1, Max: 13.7, Diff: 8.8, Sum: 110.1]
[Update RS (ms): Min: 0.9, Avg: 13.2, Max: 20.0, Diff: 19.2, Sum: 237.1]
[Processed Buffers: Min: 2, Avg: 88.1, Max: 159, Diff: 157, Sum: 1586]
[Scan RS (ms): Min: 1.7, Avg: 3.3, Max: 3.8, Diff: 2.1, Sum: 60.3]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Object Copy (ms): Min: 799.5, Avg: 814.1, Max: 827.7, Diff: 28.1, Sum: 14653.6]
[Termination (ms): Min: 0.0, Avg: 0.6, Max: 0.9, Diff: 0.9, Sum: 10.5]
[Termination Attempts: Min: 1, Avg: 1.2, Max: 3, Diff: 2, Sum: 22]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.5]
[GC Worker Total (ms): Min: 837.2, Avg: 837.4, Max: 837.6, Diff: 0.3, Sum: 15073.2]
[GC Worker End (ms): Min: 1433853809.5, Avg: 1433853809.6, Max: 1433853809.7, Diff: 0.2]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.7 ms]
[Other: 554.2 ms]
[Evacuation Failure: 533.0 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 14.8 ms]
[Ref Enq: 0.5 ms]
[Redirty Cards: 0.8 ms]
[Humongous Register: 0.2 ms]
[Humongous Reclaim: 0.6 ms]
[Free CSet: 0.7 ms]
[Eden: 2312.0M(2316.0M)->0.0B(204.0M) Survivors: 38.0M->0.0B Heap: 3941.0M(4096.0M)->3610.3M(4096.0M)]
[Times: user=7.13 sys=0.00, real=1.39 secs]
1433854.363: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 3839885312 bytes, allocation request: 0 bytes, threshold: 2147483600 bytes (50.00 %), source: STW humongous allocation]
For G1 GC, an evacuation failure is very expensive -
- For successfully copied objects, G1 needs to update the references and the regions have to be tenured.
- For unsuccessfully copied objects, G1 will self-forward them and tenure the regions in place.
- 继续向上寻找蛛丝马迹
2020-03-20T11:08:27.709+0800: 1433859.305: [SoftReference, 0 refs, 0.0019565 secs]2020-03-20T11:08:27.711+0800: 1433859.307: [WeakReference, 13 refs, 0.0009893 secs]2020-03-20T11:08:27.712+0800: 1433859.308: [FinalReference, 102 refs, 0.0014151 secs]2020-03-20T11:08:27.713+0800: 1433859.309: [PhantomReference, 0 refs, 64 refs, 0.0021785 secs]2020-03-20T11:08:27.715+0800: 1433859.312: [JNI Weak Reference, 0.0000454 secs] 1433859.315: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 68.78 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
, 0.0483084 secs]
# GC占用时间已经到68%这么高的地步了,证明应用执行时间很短了,此时吞吐量是严重下降的.
总结
基于以上现象,考虑到几种方式解决:
- 增加堆内存大小, 给GC“喘息”的时间;
- 减小
XX:InitiatingHeapOccupancyPercent
参数,我们设置的是50,可以减小来提前进行concurrent marking cycle
,提前mix gc
; - 增加处理GC的线程数
XX:ConcGCThreads
,加快处理GC速度,减少时间; - 避免发生
to-space exhausted
问题,增加预留空间XX:G1ReservePercent
- 因为发现问题时正好时线上摘掉了一台机器,那么整体来看是单台机器流量变大导致的对象分配速度大于回收的速度,最终导致了full gc,那么我们采用了最简单粗暴的方式扩容,问题消失;
网友评论