接口频繁超时定位

作者: williamlee | 来源:发表于2020-03-24 22:42 被阅读0次

接口频繁超时定位
appium执行报错:socket hang up
记一次cpu负载高的问题定位
Spring Boot 接口频繁超时，Alibaba 开源 Ar
接口超时
Spring Cloud OpenFeign 重试造成插入多条数
Trail：分布式追踪
Linux系统CPU的性能监控及调优
探究 dubbo接口超时时间的赋值逻辑，包含Consumer，P
dubbo频繁调用超时问题

现象

接口频繁499, 域名报警逐渐增多.

排查

查看线上日志:调用服务B接口,超时非常多,最长响应时间几千ms.
找到服务B负责人进行查看,因为没有采用相同trace方式,只能给出大概时间范围,但是响应时间并未有超过100ms的.考虑是网络原因.
找OP查看网络是否出现问题,自己查看ping延迟不高.等待OP回复.
OP说网络没问题,但是暂时没其他思路,没考虑应用内部导致.
偶然间查看调用服务B接口的配置,发现两个参数minConnection与maxConnection后产生新思路,因为引用的是服务B的sdk,内部是不是采用了连接池,当时配置数量较小,难道是流量上来之后获取连接导致了阻塞?
找到服务B的同学确认内部实现,经确认后并未采用连接池,这两个参数没作用,是保留参数.
又没什么其他思路了,考虑可能应用内部问题,这块补充说明下在开始是看了gc log的发现一次minor gc时间在50-100ms所以排除是gc导致的.
实在没其他猜测,gclog再看一遍吧!把gc log拿下来,使用gceasy.io进行解析.
分析后发现产生了,几次full gc,我们使用的是G1 gc,产生full gc是比较严重的现象.

fullgc.png
找到发生full gc之前逐行查看,可以看到堆内存基本上已经满了,对于G1 gc来说已经是非常不健康的状态了.

2020-03-20T11:08:24.499+0800: 1433856.095: [SoftReference, 0 refs, 0.0031353 secs]2020-03-20T11:08:24.502+0800: 1433856.098: [WeakReference, 0 refs, 0.0010686 secs]2020-03-20T11:08:24.503+0800: 1433856.099: [FinalReference, 0 refs, 0.0012377 secs]2020-03-20T11:08:24.504+0800: 1433856.101: [PhantomReference, 0 refs, 0 refs, 0.0023456 secs]2020-03-20T11:08:24.507+0800: 1433856.103: [JNI Weak Reference, 0.0000457 secs], 0.0284242 secs]
   [Parallel Time: 15.4 ms, GC Workers: 18]
      [GC Worker Start (ms): Min: 1433856077.7, Avg: 1433856077.9, Max: 1433856078.1, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 7.9, Avg: 8.8, Max: 15.2, Diff: 7.3, Sum: 157.8]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 1.0, Diff: 1.0, Sum: 1.1]
         [Processed Buffers: Min: 0, Avg: 0.1, Max: 1, Diff: 1, Sum: 2]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 3.0]
      [Termination (ms): Min: 0.0, Avg: 6.0, Max: 6.8, Diff: 6.8, Sum: 107.2]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [GC Worker Total (ms): Min: 14.9, Avg: 15.1, Max: 15.2, Diff: 0.3, Sum: 271.2]
      [GC Worker End (ms): Min: 1433856093.0, Avg: 1433856093.0, Max: 1433856093.0, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 12.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 9.7 ms]
      [Ref Enq: 0.2 ms]
      [Redirty Cards: 0.3 ms]
      [Humongous Register: 0.4 ms]
      [Humongous Reclaim: 0.1 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(204.0M)->0.0B(204.0M) Survivors: 0.0B->0.0B Heap: 4010.2M(4096.0M)->4010.2M(4096.0M)]
# 之前指关注了时间30ms还不错,但是没关注内存的分布情况 因为设置目标暂停时间最大为200ms,为了实现这个暂停时间eden区域已经减少了非常多
 [Times: user=0.20 sys=0.00, real=0.03 secs] 
2020-03-20T11:08:24.510+0800: 1433856.106: Total time for which application threads were stopped: 0.0454852 seconds, Stopping threads took: 0.0002360 seconds
# 尝试扩展堆大小 但是已经是最大了不能在扩展了
 1433856.121: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 24 bytes]
 1433856.121: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 2097152 bytes, attempted expansion amount: 2097152 bytes]
 1433856.121: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded]
# 发生full gc原因是Allocation Failure分配失败.
2020-03-20T11:08:24.525+0800: 1433856.122: [Full GC (Allocation Failure) 2020-03-20T11:08:25.194+0800: 1433856.790: [SoftReference, 4841 refs, 0.0011466 secs]2020-03-20T11:08:25.195+0800: 1433856.791: [WeakReference, 136779 refs, 0.0211710 secs]2020-03-20T11:08:25.216+0800: 1433856.813: [FinalReference, 2718 refs, 0.0040877 secs]2020-03-20T11:08:25.220+0800: 1433856.817: [PhantomReference, 0 refs, 1140 refs, 0.0003886 secs]2020-03-20T11:08:25.221+0800: 1433856.817: [JNI Weak Reference, 0.0000759 secs] 4010M->788M(4096M), 2.6614008 secs]

继续向上寻找蛛丝马迹

2020-03-20T11:08:22.214+0800: 1433853.810: [SoftReference, 0 refs, 0.0065309 secs]2020-03-20T11:08:22.221+0800: 1433853.817: [WeakReference, 0 refs, 0.0045773 secs]2020-03-20T11:08:22.225+0800: 1433853.822: [FinalReference, 0 refs, 0.0010537 secs]2020-03-20T11:08:22.226+0800: 1433853.823: [PhantomReference, 0 refs, 0 refs, 0.0019795 secs]2020-03-20T11:08:22.228+0800: 1433853.825: [JNI Weak Reference, 0.0000575 secs] 1433854.362: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 3837788160 bytes, allocation request: 0 bytes, threshold: 2147483600 bytes (50.00 %), source: end of GC]
 (to-space exhausted), 1.3927102 secs]
# 比较要命的一句话,代表已经没有region给我们copy对象了
   [Parallel Time: 837.7 ms, GC Workers: 18]
      [GC Worker Start (ms): Min: 1433852972.1, Avg: 1433852972.2, Max: 1433852972.3, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 4.9, Avg: 6.1, Max: 13.7, Diff: 8.8, Sum: 110.1]
      [Update RS (ms): Min: 0.9, Avg: 13.2, Max: 20.0, Diff: 19.2, Sum: 237.1]
         [Processed Buffers: Min: 2, Avg: 88.1, Max: 159, Diff: 157, Sum: 1586]
      [Scan RS (ms): Min: 1.7, Avg: 3.3, Max: 3.8, Diff: 2.1, Sum: 60.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 799.5, Avg: 814.1, Max: 827.7, Diff: 28.1, Sum: 14653.6]
      [Termination (ms): Min: 0.0, Avg: 0.6, Max: 0.9, Diff: 0.9, Sum: 10.5]
         [Termination Attempts: Min: 1, Avg: 1.2, Max: 3, Diff: 2, Sum: 22]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.5]
      [GC Worker Total (ms): Min: 837.2, Avg: 837.4, Max: 837.6, Diff: 0.3, Sum: 15073.2]
      [GC Worker End (ms): Min: 1433853809.5, Avg: 1433853809.6, Max: 1433853809.7, Diff: 0.2]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.7 ms]
   [Other: 554.2 ms]
      [Evacuation Failure: 533.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 14.8 ms]
      [Ref Enq: 0.5 ms]
      [Redirty Cards: 0.8 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.6 ms]
      [Free CSet: 0.7 ms]
   [Eden: 2312.0M(2316.0M)->0.0B(204.0M) Survivors: 38.0M->0.0B Heap: 3941.0M(4096.0M)->3610.3M(4096.0M)]
 [Times: user=7.13 sys=0.00, real=1.39 secs] 
 1433854.363: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 3839885312 bytes, allocation request: 0 bytes, threshold: 2147483600 bytes (50.00 %), source: STW humongous allocation]

For G1 GC, an evacuation failure is very expensive -

For successfully copied objects, G1 needs to update the references and the regions have to be tenured.

For unsuccessfully copied objects, G1 will self-forward them and tenure the regions in place.

继续向上寻找蛛丝马迹

2020-03-20T11:08:27.709+0800: 1433859.305: [SoftReference, 0 refs, 0.0019565 secs]2020-03-20T11:08:27.711+0800: 1433859.307: [WeakReference, 13 refs, 0.0009893 secs]2020-03-20T11:08:27.712+0800: 1433859.308: [FinalReference, 102 refs, 0.0014151 secs]2020-03-20T11:08:27.713+0800: 1433859.309: [PhantomReference, 0 refs, 64 refs, 0.0021785 secs]2020-03-20T11:08:27.715+0800: 1433859.312: [JNI Weak Reference, 0.0000454 secs] 1433859.315: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 68.78 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
, 0.0483084 secs]
# GC占用时间已经到68%这么高的地步了,证明应用执行时间很短了,此时吞吐量是严重下降的.

总结

基于以上现象,考虑到几种方式解决:

增加堆内存大小, 给GC“喘息”的时间;
减小XX:InitiatingHeapOccupancyPercent参数,我们设置的是50,可以减小来提前进行concurrent marking cycle,提前mix gc;
增加处理GC的线程数 XX:ConcGCThreads,加快处理GC速度,减少时间;
避免发生to-space exhausted问题,增加预留空间XX:G1ReservePercent
因为发现问题时正好时线上摘掉了一台机器,那么整体来看是单台机器流量变大导致的对象分配速度大于回收的速度,最终导致了full gc,那么我们采用了最简单粗暴的方式扩容,问题消失;

接口频繁超时定位
现象接口频繁499, 域名报警逐渐增多. 排查查看线上日志:调用服务B接口,超时非常多,最长响应时间几千ms....
appium执行报错:socket hang up
运行appium报错：造成这个原因的情况很多，常见的有元素定位过程中超时，还有网络请求接口超时，导致socket...
记一次cpu负载高的问题定位
背景生产服务出现请求慢，接口超时，服务器cpu占用高，需要排查定位问题。初步定位 1、根据反馈，初步定位为cp...
Spring Boot 接口频繁超时，Alibaba 开源 Ar
公司有个渠道系统，专门对接三方渠道使用，没有什么业务逻辑，主要是转换报文和参数校验之类的工作，起着一个承上启下的作...
接口超时
当设计的业务流程或者功能需要调用其他接口实现请求与响应的时候，可能由于网络等原因导致的接口超时导致业务中断或者功能...
Spring Cloud OpenFeign 重试造成插入多条数
问题描述我们在调试接口时，接口很容易超时，当然线上环境因为网络抖动、接口响应慢等，也造成接口超时，强大的feig...
Trail：分布式追踪
在又拍云，即使是应用层服务也依赖到其他服务，而那些服务又依赖到了更多服务。当一个接口超时时，定位接口的性能瓶颈...
Linux系统CPU的性能监控及调优
性能优化是一个老生常谈的话题，典型的性能问题如页面响应慢、接口超时，服务器负载高、并发数低，数据库频繁死锁等。尤其...
探究 dubbo接口超时时间的赋值逻辑，包含Consumer，P
首先，我们都知道 dubbo 调用，consumer 可以配置接口超时时间，provider 也可以配置超时时间，...
dubbo频繁调用超时问题
今天项目中web调用service虽然成功，但报如下问题 Caused by: com.alibaba.dubbo...