I. Background
Recently, ES search timeout exceptions were reported and the server's CPU spiked (to around 130%), so an investigation was started.
II. Investigation
1. The project team had set the API request timeout to 500 ms, and the problem was not reliably reproducible, so we needed to capture logs that were as detailed and analyzable as possible. Two measures were taken:
1) In the low-level query code of es_server_api (an internal API that wraps ES), logging of the elapsed time and the query statement was added. After running for a while, quite a few queries did indeed take more than 500 ms, and some even exceeded 1 s.
2) The ES slow-query log was enabled via the ES head plugin to record details of the underlying searches. The settings are as follows; they were verified on ES 6.2.4 (some of the URLs and setting names found online are wrong):
PUT http://xx.xxx.xxx.xxxx:9200/myindex/_settings
{
"index.search.slowlog.threshold.query.warn":"5s",
"index.search.slowlog.threshold.query.info":"2s",
"index.search.slowlog.threshold.query.debug":"1s",
"index.search.slowlog.threshold.query.trace":"400ms",
"index.search.slowlog.threshold.fetch.warn":"1s",
"index.search.slowlog.threshold.fetch.info":"800ms",
"index.search.slowlog.threshold.fetch.debug":"500ms",
"index.search.slowlog.threshold.fetch.trace":"200ms"
}
Since the ES search type is query_then_fetch, thresholds were set for both the query and fetch phases. Once enabled, slow-query log entries did appear, and the latencies matched the timings printed by the project's own logging.
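For reference, the same thresholds can also be applied with a small script instead of the head plugin. A minimal sketch using Python and the requests library (the host and the index name myindex are placeholders):

import requests

ES_URL = "http://localhost:9200"   # placeholder; use the real node address
INDEX = "myindex"                  # placeholder index name

# Slow-log thresholds for the query and fetch phases (dynamic index settings, ES 6.x)
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.trace": "400ms",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.trace": "200ms",
    # ...remaining thresholds as in the JSON above
}

resp = requests.put(f"{ES_URL}/{INDEX}/_settings", json=slowlog_settings)
resp.raise_for_status()

# Verify that the settings took effect
print(requests.get(f"{ES_URL}/{INDEX}/_settings").json())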
2. The recorded query statements were replayed through the ES head plugin; sometimes they came back fast and sometimes slow, consistent with what was seen on the server. To capture more detail, "profile": true was added to the query body, which makes ES return the execution steps and timing of each query component. For the meaning of each field see: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/_profiling_queries.html
Even so, nothing suspicious was found here.
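A minimal sketch of replaying a captured query with profiling enabled (the host, index name, and the match query are placeholders; in practice the recorded slow queries were used):

import requests

ES_URL = "http://localhost:9200"   # placeholder host
INDEX = "myindex"                  # placeholder index

body = {
    "profile": True,                            # ask ES for a per-shard execution breakdown
    "query": {"match": {"title": "example"}},   # placeholder; paste the recorded slow query here
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()
print("took(ms):", resp["took"])

# Per shard, per query component: type, Lucene description, and time in nanoseconds
for shard in resp["profile"]["shards"]:
    for search in shard["searches"]:
        for q in search["query"]:
            print(shard["id"], q["type"], q["description"], q["time_in_nanos"])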
3. Examining the ES gc.log showed that a Full GC (Allocation Failure) was being triggered almost every second: the heap was too small and almost nothing could be reclaimed, so GC ran continuously. No wonder the CPU was so high; most of the time was being spent on GC. The ES JVM configuration was then checked and turned out to be:
# Heap configuration
-Xms1g
-Xmx1g
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly (makes the JVM use only the configured threshold instead of adjusting it on its own)
This showed that the configured heap was too small. The heap on one node was raised to 3 GB; after restarting it, no more slow-query entries were captured and the latencies in our logs dropped sharply, mostly to under 10 ms. The other node was then restarted with the same change and also returned to normal.
gc.log analysis:
----step1-----
2019-08-21T09:20:23.852+0800: 26063583.125: [GC (CMS Initial Mark) [1 CMS-initial-mark: 707833K(707840K)] 734480K(1014528K), 0.0141329 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2019-08-21T09:20:23.866+0800: 26063583.139: Total time for which application threads were stopped: 0.0153949 seconds, Stopping threads took: 0.0001766 seconds
----step2-----
2019-08-21T09:20:23.866+0800: 26063583.139: [CMS-concurrent-mark-start]
2019-08-21T09:20:24.006+0800: 26063583.279: [CMS-concurrent-mark: 0.140/0.140 secs] [Times: user=0.28 sys=0.00, real=0.14 secs]
----step3-----
2019-08-21T09:20:24.006+0800: 26063583.280: [CMS-concurrent-preclean-start]
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-preclean: 0.092/0.092 secs] [Times: user=0.11 sys=0.00, real=0.09 secs]
----step4-----
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-abortable-preclean-start]
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
----step5-----
2019-08-21T09:20:24.099+0800: 26063583.372: [GC (CMS Final Remark) [YG occupancy: 151902 K (306688 K)]2019-08-21T09:20:24.099+0800: 26063583.372: [Rescan (parallel) , 0.0238185 secs]2019-08-21T09:20:24.123+0800: 26063583.396: [weak refs processing, 0.0003479 secs]2019-08-21T09:20:24.123+0800: 26063583.397: [class unloading, 0.0332278 secs]2019-08-21T09:20:24.157+0800: 26063583.430: [scrub symbol table, 0.0102765 secs]2019-08-21T09:20:24.167+0800: 26063583.440: [scrub string table, 0.0013097 secs][1 CMS-remark: 707833K(707840K)] 859736K(1014528K), 0.0691496 secs] [Times: user=0.14 sys=0.00, real=0.07 secs]
2019-08-21T09:20:24.168+0800: 26063583.442: Total time for which application threads were stopped: 0.0703531 seconds, Stopping threads took: 0.0001407 seconds
----step6-----
2019-08-21T09:20:24.169+0800: 26063583.442: [CMS-concurrent-sweep-start]
2019-08-21T09:20:24.206+0800: 26063583.480: [CMS-concurrent-sweep: 0.038/0.038 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]
----step7-----
2019-08-21T09:20:24.207+0800: 26063583.480: [CMS-concurrent-reset-start]
2019-08-21T09:20:24.208+0800: 26063583.481: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2019-08-21T09:20:24.945+0800: 26063584.218: [GC (Allocation Failure) 2019-08-21T09:20:24.945+0800: 26063584.218: [ParNew: 306407K->306407K(306688K), 0.0000269 secs]2019-08-21T09:20:24.945+0800: 26063584.218: [CMS: 707833K->707833K(707840K), 0.3487631 secs] 1014241K->729888K(1014528K), [Metaspace: 70690K->70690K(1116160K)], 0.3489630 secs] [Times: user=0.35 sys=0.00, real=0.35 secs]
2019-08-21T09:20:25.294+0800: 26063584.567: Total time for which application threads were stopped: 0.3502331 seconds, Stopping threads took: 0.0000746 seconds
Analysis: a CMS collection consists of four major phases (seven sub-steps), of which initial mark and remark are stop-the-world and deserve the most attention. From the log above, the pauses of these two phases add up to 0.0153949 s + 0.0703531 s, about 85 ms, yet the last entry shows that a Full GC (Allocation Failure) still followed. According to that entry, the total heap is 1014528K; 1014241K was in use before collection and 729888K afterwards, i.e. even after collection about 72% of the heap is still occupied. Combined with the ES JVM parameter -XX:CMSInitiatingOccupancyFraction=75, this means that although memory is reclaimed, occupancy quickly climbs back above 75%, so GC is triggered again and again.
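To make those numbers easy to reproduce, here is a minimal sketch (assuming the gc.log excerpt above is saved locally as gc.log) that sums the safepoint pauses and computes the post-collection heap occupancy:

import re

# Sum the pause lines printed by -XX:+PrintGCApplicationStoppedTime;
# in the excerpt above the first two correspond to CMS initial mark and remark,
# the third to the Full GC (Allocation Failure).
pause_re = re.compile(r"Total time for which application threads were stopped: ([\d.]+) seconds")

total_stw = 0.0
with open("gc.log") as f:
    for line in f:
        m = pause_re.search(line)
        if m:
            total_stw += float(m.group(1))

print(f"total stop-the-world time in excerpt: {total_stw * 1000:.1f} ms")

# Heap occupancy after the Full GC: 1014241K -> 729888K out of a 1014528K heap
after_kb, total_kb = 729888, 1014528
print(f"occupancy after collection: {after_kb / total_kb:.0%}")   # ~72%, already close to the 75% CMS trigger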
III. Fix
Increase the memory available to ES. Edit es/config/jvm.options and raise Xms and Xmx from the original default of 1 GB to 3 GB. (The official recommendation is to keep this value below 50% of physical memory and below 32 GB; see the official documentation for details.) Then restart ES. The CPU came back down, no new slow queries appeared, and most queries returned data within 10 ms.
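The heap section of jvm.options after the change looks like this (only these two lines were modified; the CMS settings above were left unchanged):

# Heap configuration
-Xms3g
-Xmx3g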
IV. References
1. GC log analysis method: https://blog.csdn.net/z69183787/article/details/86526648
2. Profile API reference: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/_profiling_queries.html