I. Background
Recently, ES search timeout exceptions were reported and the server's CPU spiked (to around 130%), so an investigation was started.
II. Investigation
1. The project team had set the API request timeout to 500 ms, and the problem was not reliably reproducible, so we needed to capture logs that were as detailed and analyzable as possible. Two measures were taken:
1) In the low-level query code of es_server_api (an internal API that wraps ES), logging of the elapsed time and the query statement was added. After running for a while, quite a few queries did indeed take more than 500 ms, and some even exceeded 1 s.
2) The ES slow-query log was enabled via the ES head plugin to record details of the underlying searches. The settings are as follows; they were verified on ES 6.2.4 (some of the URLs and setting names found online are wrong):
PUT http://xx.xxx.xxx.xxxx:9200/myindex/_settings
{
"index.search.slowlog.threshold.query.warn":"5s",
"index.search.slowlog.threshold.query.info":"2s",
"index.search.slowlog.threshold.query.debug":"1s",
"index.search.slowlog.threshold.query.trace":"400ms",
"index.search.slowlog.threshold.fetch.warn":"1s",
"index.search.slowlog.threshold.fetch.info":"800ms",
"index.search.slowlog.threshold.fetch.debug":"500ms",
"index.search.slowlog.threshold.fetch.trace":"200ms"
}
Since the ES search type is query_then_fetch, thresholds were set for both the query and fetch phases. Once enabled, slow-query log entries did appear, and the latencies matched the timings printed by the project's own logging.
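For reference, the same thresholds can also be applied with a small script instead of the head plugin. A minimal sketch using Python and the requests library (the host and the index name myindex are placeholders):

import requests

ES_URL = "http://localhost:9200"   # placeholder; use the real node address
INDEX = "myindex"                  # placeholder index name

# Slow-log thresholds for the query and fetch phases (dynamic index settings, ES 6.x)
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.trace": "400ms",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.trace": "200ms",
    # ...remaining thresholds as in the JSON above
}

resp = requests.put(f"{ES_URL}/{INDEX}/_settings", json=slowlog_settings)
resp.raise_for_status()

# Verify that the settings took effect
print(requests.get(f"{ES_URL}/{INDEX}/_settings").json())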
2. The recorded query statements were replayed through the ES head plugin; sometimes they came back fast and sometimes slow, consistent with what was seen on the server. To capture more detail, "profile": true was added to the query body, which makes ES return the execution steps and timing of each query component. For the meaning of each field see: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/_profiling_queries.html
Even so, nothing suspicious was found here.
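A minimal sketch of replaying a captured query with profiling enabled (the host, index name, and the match query are placeholders; in practice the recorded slow queries were used):

import requests

ES_URL = "http://localhost:9200"   # placeholder host
INDEX = "myindex"                  # placeholder index

body = {
    "profile": True,                            # ask ES for a per-shard execution breakdown
    "query": {"match": {"title": "example"}},   # placeholder; paste the recorded slow query here
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()
print("took(ms):", resp["took"])

# Per shard, per query component: type, Lucene description, and time in nanoseconds
for shard in resp["profile"]["shards"]:
    for search in shard["searches"]:
        for q in search["query"]:
            print(shard["id"], q["type"], q["description"], q["time_in_nanos"])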
3. Examining the ES gc.log showed that a Full GC (Allocation Failure) was being triggered almost every second: the heap was too small and almost nothing could be reclaimed, so GC ran continuously. No wonder the CPU was so high; most of the time was being spent on GC. The ES JVM configuration was then checked and turned out to be:
# Heap configuration
-Xms1g
-Xmx1g
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly (makes the JVM use only the configured threshold instead of adjusting it on its own)
This showed that the configured heap was too small. The heap on one node was raised to 3 GB; after restarting it, no more slow-query entries were captured and the latencies in our logs dropped sharply, mostly to under 10 ms. The other node was then restarted with the same change and also returned to normal.
gc.log analysis:
----step1-----
2019-08-21T09:20:23.852+0800: 26063583.125: [GC (CMS Initial Mark) [1 CMS-initial-mark: 707833K(707840K)] 734480K(1014528K), 0.0141329 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2019-08-21T09:20:23.866+0800: 26063583.139: Total time for which application threads were stopped: 0.0153949 seconds, Stopping threads took: 0.0001766 seconds
----step2-----
2019-08-21T09:20:23.866+0800: 26063583.139: [CMS-concurrent-mark-start]
2019-08-21T09:20:24.006+0800: 26063583.279: [CMS-concurrent-mark: 0.140/0.140 secs] [Times: user=0.28 sys=0.00, real=0.14 secs]
----step3-----
2019-08-21T09:20:24.006+0800: 26063583.280: [CMS-concurrent-preclean-start]
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-preclean: 0.092/0.092 secs] [Times: user=0.11 sys=0.00, real=0.09 secs]
----step4-----
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-abortable-preclean-start]
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
----step5-----
2019-08-21T09:20:24.099+0800: 26063583.372: [GC (CMS Final Remark) [YG occupancy: 151902 K (306688 K)]2019-08-21T09:20:24.099+0800: 26063583.372: [Rescan (parallel) , 0.0238185 secs]2019-08-21T09:20:24.123+0800: 26063583.396: [weak refs processing, 0.0003479 secs]2019-08-21T09:20:24.123+0800: 26063583.397: [class unloading, 0.0332278 secs]2019-08-21T09:20:24.157+0800: 26063583.430: [scrub symbol table, 0.0102765 secs]2019-08-21T09:20:24.167+0800: 26063583.440: [scrub string table, 0.0013097 secs][1 CMS-remark: 707833K(707840K)] 859736K(1014528K), 0.0691496 secs] [Times: user=0.14 sys=0.00, real=0.07 secs]
2019-08-21T09:20:24.168+0800: 26063583.442: Total time for which application threads were stopped: 0.0703531 seconds, Stopping threads took: 0.0001407 seconds
----step6-----
2019-08-21T09:20:24.169+0800: 26063583.442: [CMS-concurrent-sweep-start]
2019-08-21T09:20:24.206+0800: 26063583.480: [CMS-concurrent-sweep: 0.038/0.038 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]
----step7-----
2019-08-21T09:20:24.207+0800: 26063583.480: [CMS-concurrent-reset-start]
2019-08-21T09:20:24.208+0800: 26063583.481: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2019-08-21T09:20:24.945+0800: 26063584.218: [GC (Allocation Failure) 2019-08-21T09:20:24.945+0800: 26063584.218: [ParNew: 306407K->306407K(306688K), 0.0000269 secs]2019-08-21T09:20:24.945+0800: 26063584.218: [CMS: 707833K->707833K(707840K), 0.3487631 secs] 1014241K->729888K(1014528K), [Metaspace: 70690K->70690K(1116160K)], 0.3489630 secs] [Times: user=0.35 sys=0.00, real=0.35 secs]
2019-08-21T09:20:25.294+0800: 26063584.567: Total time for which application threads were stopped: 0.3502331 seconds, Stopping threads took: 0.0000746 seconds
Analysis: a CMS collection consists of four major phases (seven sub-steps), of which initial mark and remark are stop-the-world and deserve the most attention. From the log above, the pauses of these two phases add up to 0.0153949 s + 0.0703531 s, about 85 ms, yet the last entry shows that a Full GC (Allocation Failure) still followed. According to that entry, the total heap is 1014528K; 1014241K was in use before collection and 729888K afterwards, i.e. even after collection about 72% of the heap is still occupied. Combined with the ES JVM parameter -XX:CMSInitiatingOccupancyFraction=75, this means that although memory is reclaimed, occupancy quickly climbs back above 75%, so GC is triggered again and again.
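To make those numbers easy to reproduce, here is a minimal sketch (assuming the gc.log excerpt above is saved locally as gc.log) that sums the safepoint pauses and computes the post-collection heap occupancy:

import re

# Sum the pause lines printed by -XX:+PrintGCApplicationStoppedTime;
# in the excerpt above the first two correspond to CMS initial mark and remark,
# the third to the Full GC (Allocation Failure).
pause_re = re.compile(r"Total time for which application threads were stopped: ([\d.]+) seconds")

total_stw = 0.0
with open("gc.log") as f:
    for line in f:
        m = pause_re.search(line)
        if m:
            total_stw += float(m.group(1))

print(f"total stop-the-world time in excerpt: {total_stw * 1000:.1f} ms")

# Heap occupancy after the Full GC: 1014241K -> 729888K out of a 1014528K heap
after_kb, total_kb = 729888, 1014528
print(f"occupancy after collection: {after_kb / total_kb:.0%}")   # ~72%, already close to the 75% CMS trigger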
III. Fix
Increase the memory available to ES. Edit es/config/jvm.options and raise Xms and Xmx from the original default of 1 GB to 3 GB. (The official recommendation is to keep this value below 50% of physical memory and below 32 GB; see the official documentation for details.) Then restart ES. The CPU came back down, no new slow queries appeared, and most queries returned data within 10 ms.
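The heap section of jvm.options after the change looks like this (only these two lines were modified; the CMS settings above were left unchanged):

# Heap configuration
-Xms3g
-Xmx3g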
IV. References
1. GC log analysis method: https://blog.csdn.net/z69183787/article/details/86526648
2. Profile API reference: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/_profiling_queries.html