在Graylog 3.0.2上发现了每隔几天就异常自动重启的现象,经过排查是OOM。
但是只出现在预生产环境,生产/集成/测试环境都没有。
并且在自动重启(OOM)之前,graylog的日志吞吐量非常低,每秒5、6条。
Pod内报错如下:
2020-03-13 07:25:04,698 WARN [ProxiedResource] - Unable to call [http://10.233.110.59:9000/api/system/metrics/multiple](http://10.233.110.59:9000/api/system/metrics/multiple) on node <051f6d15-ac92-4fb3-af82-e21a67a12d2d> -
{}
java.net.ConnectException: Failed to connect to /10.233.110.59:9000
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:247) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:165) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:257) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:78) ~[graylog.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_212]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_212]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_212]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_212]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_212]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_212]
at okhttp3.internal.platform.Platform.connectSocket(Platform.java:129) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:245) ~[graylog.jar:?]
在Grafana上观察,预生产环境的Graylog的Pod,总是在开始运行之后不久就会把内存吃满,
然后持续到自动重启(OOM)。
对比其他环境的Graylog版本,发现3.1.0版本的Graylog没出现这个问题。
起初认为是版本低了有BUG,后来才发现,预生产环境给Graylog分配的heapSize是6G,跟Pod的Memory Limit一样。
而其他环境都是6G Memory Limit --- 3G heapSize。
然后把预生产的heapSize调整为3G之后,恢复正常。
image.png
Details:
https://community.graylog.org/t/graylog-server-extremely-slow-and-taking-whole-ram-capacity/14490
网友评论