Since the problem occurred in a production environment, everything below was reproduced with my own local code for data-masking purposes. It only roughly recreates the situation at the time, but the troubleshooting approach is exactly the same.
Background
The top command showed the CPU fully loaded.
Mem: 16333644k total, 9472968k used, 6860676k free, 165616k buffers
Swap: 0k total, 0k used, 0k free, 6665292k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40433 root 20 0 7588m 112m 11m S 900.7 0.7 47:53.80 java
1552 root 20 0 121m 13m 8524 S 0.7 0.1 14:37.75 AliYunDun
3581 root 20 0 9750m 2.0g 13m S 0.7 12.9 298:30.20 java
1 root 20 0 19360 1612 1308 S 0.0 0.0 0:00.81 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.14 migration/0
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32022 root 20 0 24.5g 4.6g 36716 R 99.9 1.2 18:58.02 java
32028 root 20 0 24.5g 4.6g 36716 R 99.9 1.2 18:57.85 java
32030 root 20 0 24.5g 4.6g 36716 R 99.9 1.2 18:57.76 java
32033 root 20 0 24.5g 4.6g 36716 R 99.9 1.2 18:57.84 java
32037 root 20 0 24.5g 4.6g 36716 R 99.9 1.2 18:57.73 java
32020 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:58.59 java
32021 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.88 java
32024 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.69 java
32025 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.62 java
32026 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:58.32 java
32027 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.84 java
32029 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.79 java
32031 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.60 java
32032 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:58.02 java
32036 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.62 java
32038 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:58.01 java
32039 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.55 java
32044 root 20 0 24.5g 4.6g 36716 R 99.7 1.2 18:57.88 java
Analysis
Causes of a fully loaded CPU
- Full GC
  - Insufficient old-generation space (or an extremely small old generation)
  - Insufficient method area (metaspace) space
  - Explicit calls to System.gc()
  - The average size of objects promoted to the old generation by a minor GC exceeds the old generation's available memory
  - When copying from Eden and Survivor S0 (from) to S1 (to), an object larger than the available to-space is moved straight to the old generation, and the old generation's free memory is smaller than that object (this is one of the moments objects enter the old generation)
- CPU-intensive work in the application layer
  - The business layer creates and consumes a large number of threads
  - An infinite loop in business code leading to frequent Full GC
  - Time-consuming computation in the business layer
- Others
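As a quick reference, the triage flow used in the sections below boils down to a handful of commands (a sketch; 7622, 7452 and 1d1c are just the example values from this article):
top                            find the Java process with the highest CPU usage
top -Hp 7622                   find the busiest threads inside that process
printf "%x\n" 7452             convert the decimal thread ID to hex
jstack 7622 | grep -10 1d1c    locate that thread in a thread dump
jstat -gcutil 7622 1000 10     check whether GC is what is burning the CPU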
Troubleshooting process
Issue 1: Incorrect startup parameters
Step 1: Find the process with high CPU usage
top
Mem: 16333644k total, 9472968k used, 6860676k free, 165616k buffers
Swap: 0k total, 0k used, 0k free, 6665292k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7622 root 20 0 7588m 112m 11m S 300.7 0.7 47:53.80 java
1552 root 20 0 121m 13m 8524 S 0.7 0.1 14:37.75 AliYunDun
3581 root 20 0 9750m 2.0g 13m S 0.7 12.9 298:30.20 java
1 root 20 0 19360 1612 1308 S 0.0 0.0 0:00.81 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.14 migration/0
Step 2: Find the high-CPU threads inside that process
top -Hp 7622
top - 17:43:15 up 5 days, 7:31, 1 user, load average: 0.99, 0.97, 0.91
Tasks: 32 total, 1 running, 31 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.7%us, 8.9%sy, 0.0%ni, 87.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16333644k total, 9592504k used, 6741140k free, 165700k buffers
Swap: 0k total, 0k used, 0k free, 6781620k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7452 root 20 0 7588m 112m 11m R 99.9 0.7 50:47.43 java
7623 root 20 0 7588m 112m 11m S 0.3 0.7 0:02.08 java
7638 root 20 0 7588m 112m 11m S 0.0 0.7 0:00.00 java
7639 root 20 0 7588m 112m 11m S 0.0 0.7 0:00.23 java
7627 root 20 0 7588m 112m 11m S 0.0 0.7 0:02.09 java
7625 root 20 0 7588m 112m 11m S 0.0 0.7 0:02.12 java
7632 root 20 0 7588m 112m 11m S 0.0 0.7 0:02.07 java
Alternatively, the per-thread CPU usage can also be listed with ps:
root@server:/usr/local/application/jdk/jdk1.8.0_144# ps H -eo user,pid,ppid,tid,time,%cpu,cmd | grep java | sort -k6 -nr |head -10
root 7622 4578 7452 00:00:02 90.9 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7623 00:00:02 90.8 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7638 00:00:02 80.0 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7636 00:00:02 17.8 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7639 00:00:01 8.5 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7627 00:00:00 3.5 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7628 00:00:00 3.4 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7625 00:00:00 3.4 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7632 00:00:00 3.2 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
root 7622 4578 7630 00:00:00 3.2 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
Step 3: Convert the thread ID to hex
printf "%x\n" 7452
1d1c
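If convenient, the conversion and the lookup in the next step can be combined into a single command (a sketch using the example PID and thread ID from above):
jstack 7622 | grep -10 "$(printf '%x' 7452)"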
Step 4: Locate the CPU-consuming thread
[hh_mpp_cd@sjsy-hh305-hx13w ~]$ jps
40433 DBService-1.2.7.jar
20584 Server
37417 QuorumPeerMain
13705 jar
30892 Jps
[hh_mpp_cd@sjsy-hh305-hx13w ~]$ jstack 7622|grep -10 1d1c
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1415)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked <0x000000079df35438> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
"http-nio-9372-exec-31" #5185 daemon prio=5 os_prio=0 tid=0x000000079df35438 nid=0x1d1c runnable [0x00007f86f9ad7000]
java.lang.Thread.State: RUNNABLE //for a thread that is burning CPU, the state should be RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
--
"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f879c526000 nid=0x9e15 in Object.wait() [0x00007f8732dec000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
- locked <0x00000006c00b4170> (a java.lang.ref.Reference$Lock)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
"VM Thread" os_prio=0 tid=0x00007f879c01f000 nid=0x1d1c runnable //VM Thread表示垃圾回收的线程
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f879c01f000 nid=0x1d1c runnable //1d1c 这个占用cpu高的线程为GC线程
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f879c021000 nid=0x9df4 runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f879c022800 nid=0x9df5 runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f879c024800 nid=0x9df6 runnable
"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007f879c026800 nid=0x9df7 runnable
"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f879c028000 nid=0x9df8 runnable
Checking several of the high-CPU thread IDs showed that they were all GC threads.
At this point we can conclude that GC is what is consuming the CPU.
Now use jstat to collect GC statistics:
[hh_mpp_cd@sjsy-hh305-hx13w ~]$ jstat -gcutil 7622 1000 10
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 0.00 0.00 75.07 59.09 59.60 3259 0.919 6517 7.715 8.635
0.00 0.00 0.00 0.08 59.09 59.60 3306 0.930 6611 7.822 8.752
0.00 0.00 0.00 0.08 59.09 59.60 3351 0.943 6701 7.924 8.867
0.00 0.00 0.00 0.08 59.09 59.60 3397 0.955 6793 8.029 8.984
You can see that the FGC count climbs very quickly. Whatever the underlying cause, frequent Full GC means the old generation keeps hitting its capacity threshold, so the normal reaction would be to:
- Find where large objects are coming from
- Look for a memory leak that keeps objects from being collected
However, it is also worth going back first and checking the startup parameters. When the CPU is maxed out, the points to check are:
- How the heap is sized
- Which GC algorithm is in use: CMS concurrent collection is very CPU-intensive
Check the startup parameters
jinfo -flags 16458
Non-default VM flags: -XX:CICompilerCount=15 -XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=4294967296
-XX:MaxMetaspaceSize=4294967296 -XX:MaxNewSize=1431306240 -XX:MetaspaceSize=536870912 -XX:MinHeapDeltaBytes=524288
-XX:NewSize=715653120 -XX:OldSize=1431830528 -XX:+UseCompressedClassPointers
-XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseParallelGC
Command line: -Xmn4096m -Xmx4096m -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=4096
We find that the young generation was allocated the entire heap (-Xmn4096m with -Xmx4096m).
The old generation is left with only 0.5 MB.
That is what caused the frequent Full GCs. So next time, before digging deeper, check how the startup parameters are configured and make sure they are not simply wrong.
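For reference, a saner configuration would give the young generation only a fraction of the heap, roughly along these lines (the values and jar name are illustrative, not the flags actually used in production):
java -Xms4096m -Xmx4096m -Xmn1024m -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -jar app.jar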
Issue 2
After fixing the startup parameters and running for a while, the CPU was fully loaded again.
Step 1: Locate the cause of the CPU load
The GC statistics again show frequent Full GC:
root@server:/usr/local/application/jvm# jstat -gc 7998 1000 10
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
1024.0 1024.0 0.0 0.0 1024.0 1024.0 7168.0 7050.2 31104.0 28789.7 4224.0 3726.6 165 0.567 336 7.753 8.320
1024.0 1024.0 0.0 0.0 1024.0 1024.0 7168.0 7048.1 31104.0 28838.2 4224.0 3730.8 165 0.567 384 8.713 9.280
1024.0 1024.0 0.0 0.0 1024.0 1011.2 7168.0 6972.9 31104.0 28869.7 4224.0 3740.3 165 0.567 429 9.678 10.246
The GC log shows the GC cause: Ergonomics (collections triggered by the JVM's adaptive sizing):
2021-06-20 18:58:31.491 WARN 8153 --- [ main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'management.info-org.springframework.boot.actuate.autoconfigure.info.InfoContributorProperties': Initialization of bean failed; nested exception is java.lang.OutOfMemoryError: GC overhead limit exceeded
[Full GC (Ergonomics) [PSYoungGen: 2048K->1979K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9099K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0242829 secs] [Times: user=0.09 sys=0.01, real=0.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1964K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9084K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0217299 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1969K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9089K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0204122 secs] [Times: user=0.06 sys=0.01, real=0.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1955K(2560K)] [ParOldGen: 7120K->7119K(7168K)] 9168K->9075K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0211435 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2021-06-20 18:58:31.636 INFO 8153 --- [ main] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'
[Full GC (Ergonomics) [PSYoungGen: 2048K->1887K(2560K)] [ParOldGen: 7119K->7119K(7168K)] 9167K->9006K(9728K), [Metaspace: 31871K->31871K(1079296K)], 0.0228763 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
[Full GC (Ergonomics) [PSYoungGen: 2048K->1784K(2560K)] [ParOldGen: 7119K->7042K(7168K)] 9167K->8827K(9728K), [Metaspace: 31894K->31894K(1079296K)], 0.0564103 secs] [Times: user=0.27 sys=0.05, real=0.06 secs]
Heap
PSYoungGen total 2560K, used 2027K [0x00000000ffd00000, 0x0000000100000000, 0x0000000100000000)
eden space 2048K, 99% used [0x00000000ffd00000,0x00000000ffefaec8,0x00000000fff00000)
from space 512K, 0% used [0x00000000fff00000,0x00000000fff00000,0x00000000fff80000)
to space 512K, 0% used [0x00000000fff80000,0x00000000fff80000,0x0000000100000000)
ParOldGen total 7168K, used 7042K [0x00000000ff600000, 0x00000000ffd00000, 0x00000000ffd00000)
object space 7168K, 98% used [0x00000000ff600000,0x00000000ffce0ab0,0x00000000ffd00000)
Metaspace used 31912K, capacity 34276K, committed 34560K, reserved 1079296K
class space used 4152K, capacity 4584K, committed 4608K, reserved 1048576K
[Full GC (Ergonomics) [PSYoungGen: 2048K->1785K(2560K)] [ParOldGen: 7042K->7027K(7168K)] 9090K->8812K(9728K), [Metaspace: 31912K->31912K(1079296K)], 0.0434741 secs] [Times: user=0.20 sys=0.00, real=0.04 secs]
At this point the cause of the high CPU usage is clear: frequent Full GC.
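For reference, GC cause lines like the [Full GC (Ergonomics) ...] entries above are printed because the JVM was started with -XX:+PrintGCDetails; to keep them in a file for later analysis, the process can be started with something like this (a sketch; the log path and jar name are illustrative):
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/usr/local/application/jvm/gc.log -jar app.jar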
Step 2: Analyze thread behavior and memory usage
Use jstack to take thread snapshots and look at what individual threads are actually doing. Be sure to sample several threads and see which behavior consumes the most CPU. Things like deadlocks and infinite loops are fairly easy to spot in a thread snapshot, but sometimes the snapshot alone does not reveal the problem; in that case you need heap information to work out which classes are occupying the most memory.
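A simple way to collect several snapshots for comparison is to dump the stack a few times in a row (a sketch; the interval and output path are arbitrary, 7823 is the example PID used below):
for i in 1 2 3; do jstack 7823 > /tmp/jstack_$i.txt; sleep 5; done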
jmap -histo:live 7823
Because the CPU is fully loaded, jmap is very slow to print the live objects, so truncate or sort the output:
jmap -histo pid | head -n 10          show only the top 10 entries
jmap -histo pid | sort -k 2 -g -r     sort by instance count, descending
jmap -histo pid | sort -k 3 -g -r     sort by bytes used, descending
root@server:/usr/local/application/jdk/jdk1.8.0_144# jmap -histo 7823|head -n 10
num #instances #bytes class name
----------------------------------------------
1: 2154 9671400 [B
2: 30968 2989952 [C
3: 7113 782216 java.lang.Class
4: 30680 736320 java.lang.String
5: 21712 694784 java.util.concurrent.ConcurrentHashMap$Node
6: 3360 584440 [I
7: 7129 341112 [Ljava.lang.Object;
If the output above lets you locate the large objects directly, just go and find them. If it does not, as in this case, you need an analysis tool.
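Before reaching for a profiler, one lightweight trick (my own suggestion, not part of the original troubleshooting) is to save the histogram twice a little while apart and diff the two; classes whose instance counts keep growing are leak candidates:
jmap -histo 7823 | head -n 30 > /tmp/histo_1.txt
sleep 60
jmap -histo 7823 | head -n 30 > /tmp/histo_2.txt
diff /tmp/histo_1.txt /tmp/histo_2.txt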
Step 3: Analyze the dump file with a profiling tool
At this point you can export a heap dump and inspect the specific references to the suspect objects with a tool such as JProfiler.
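The dump can be generated with jmap, roughly like this (the output path is illustrative; note that -dump:live triggers a Full GC before dumping):
jmap -dump:live,format=b,file=/usr/local/application/jvm/heap.hprof 7823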
(JProfiler screenshots omitted)
You can see which class occupies the most memory: TestDto accounts for 49%, which could indeed cause the Full GCs, so the next step is to find where those objects are referenced. But this alone does not prove anything. The dump file only tells you which classes occupy the most memory; trying to pin down the problem from this analysis alone can be misleading, so always cross-check it against the GC statistics and thread snapshots taken at the time.
Additional note
While troubleshooting, and again while reproducing the issue locally, I noticed one thing: once a Spring Boot application has started, even an OutOfMemoryError will not kill the process. So even if the process is still alive, it may already have suffered a memory overflow or a memory leak.
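If you want the process to fail fast instead of limping along after an OOM, flags along these lines can help (a sketch; -XX:+ExitOnOutOfMemoryError requires JDK 8u92 or later, and the jar name is just a placeholder):
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -XX:+ExitOnOutOfMemoryError -jar app.jar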