
Troubleshooting 100% CPU Usage in Production

Author: 乙腾 | Published 2021-06-20 19:44

    Because the problem occurred in production, everything below has been reproduced with my own local code for data-masking reasons. It only roughly recreates the original situation, but the troubleshooting approach is exactly the same.

    Background

    The top command showed the CPU fully saturated:

    Mem:  16333644k total,  9472968k used,  6860676k free,   165616k buffers
    Swap:        0k total,        0k used,        0k free,  6665292k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND     
    40433 root      20   0 7588m 112m  11m S 900.7  0.7  47:53.80 java       
     1552 root      20   0  121m  13m 8524 S  0.7  0.1  14:37.75 AliYunDun   
     3581 root      20   0 9750m 2.0g  13m S  0.7 12.9 298:30.20 java        
        1 root      20   0 19360 1612 1308 S  0.0  0.0   0:00.81 init        
        2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd    
        3 root      RT   0     0    0    0 S  0.0  0.0   0:00.14 migration/0 
    
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
    32022 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:58.02 java
    32028 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.85 java
    32030 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.76 java
    32033 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.84 java
    32037 root  20   0   24.5g   4.6g  36716 R 99.9  1.2  18:57.73 java
    32020 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.59 java
    32021 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.88 java
    32024 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.69 java
    32025 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.62 java
    32026 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.32 java
    32027 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.84 java
    32029 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.79 java
    32031 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.60 java
    32032 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.02 java
    32036 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.62 java
    32038 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:58.01 java
    32039 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.55 java
    32044 root  20   0   24.5g   4.6g  36716 R 99.7  1.2  18:57.88 java
    

    Analysis

    Causes of CPU saturation

    • Full GC
      • The old generation is out of space (or the old generation is extremely small)
      • The method area (metaspace) is out of space
      • Explicit calls to System.gc()
      • The average size of objects promoted to the old generation by a Minor GC exceeds the old generation's free memory
      • While copying from Eden and Survivor S0 (from) to S1 (to), an object larger than the free To space is moved straight to the old generation, and the old generation's free memory is smaller than that object (one of the promotion paths into the old generation)
    • CPU-heavy work in the application layer
      • The business layer creates and burns through a large number of threads
      • An infinite loop in business code ends up causing frequent full GC
      • The business layer runs expensive computations
      • Other causes
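
    Whichever of these causes turns out to apply, it helps to have a GC log on disk when you start digging. Below is a minimal sketch of JDK 8 GC-logging flags, reusing the simulated app from this article; the log path is an assumption:

    # write a detailed, timestamped GC log for the simulated app
    java -Xms20m -Xmx20m \
         -Xloggc:/usr/local/application/jvm/gc.log \
         -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
         -jar prometheus-0.0.1-SNAPSHOT.jar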

    Troubleshooting

    Issue 1: incorrect startup parameters

    Step 1: find the process with high CPU usage

    top
    
    Mem:  16333644k total,  9472968k used,  6860676k free,   165616k buffers
    Swap:        0k total,        0k used,        0k free,  6665292k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND     
     7622 root      20   0 7588m 112m  11m S 300.7  0.7  47:53.80 java       
     1552 root      20   0  121m  13m 8524 S  0.7  0.1  14:37.75 AliYunDun   
     3581 root      20   0 9750m 2.0g  13m S  0.7 12.9 298:30.20 java        
        1 root      20   0 19360 1612 1308 S  0.0  0.0   0:00.81 init        
        2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd    
        3 root      RT   0     0    0    0 S  0.0  0.0   0:00.14 migration/0 
    

    Step 2: find the hot threads inside that process

    top -Hp 7622
    
    top - 17:43:15 up 5 days,  7:31,  1 user,  load average: 0.99, 0.97, 0.91
    Tasks:  32 total,   1 running,  31 sleeping,   0 stopped,   0 zombie
    Cpu(s):  3.7%us,  8.9%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  16333644k total,  9592504k used,  6741140k free,   165700k buffers
    Swap:        0k total,        0k used,        0k free,  6781620k cached
    
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    7452 root      20   0 7588m 112m  11m R 99.9  0.7  50:47.43 java
    7623 root      20   0 7588m 112m  11m S  0.3  0.7   0:02.08 java
    7638 root      20   0 7588m 112m  11m S  0.0  0.7   0:00.00 java
    7639 root      20   0 7588m 112m  11m S  0.0  0.7   0:00.23 java
    7627 root      20   0 7588m 112m  11m S  0.0  0.7   0:02.09 java
    7625 root      20   0 7588m 112m  11m S  0.0  0.7   0:02.12 java
    7632 root      20   0 7588m 112m  11m S  0.0  0.7   0:02.07 java
    
    
    root@server:/usr/local/application/jdk/jdk1.8.0_144# ps H -eo user,pid,ppid,tid,time,%cpu,cmd | grep java | sort -k6 -nr |head -10
    root        7622    4578    7452 00:00:02 90.9 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7623 00:00:02 90.8 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7638 00:00:02 80.0 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7636 00:00:02 17.8 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7639 00:00:01  8.5 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7627 00:00:00  3.5 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7628 00:00:00  3.4 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7625 00:00:00  3.4 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7632 00:00:00  3.2 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    root        7622    4578    7630 00:00:00  3.2 java -Xms20m -Xmx20m -XX:+PrintGCDetails -XX:+PrintCommandLineFlags -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ -jar prometheus-0.0.1-SNAPSHOT.jar
    
    

    Step 3: convert the thread ID to hexadecimal

    printf "%x\n" 7452          
    1d1c
    

    Step 4: locate the CPU-hogging thread in a thread dump

    [hh_mpp_cd@sjsy-hh305-hx13w ~]$ jps
    40433 DBService-1.2.7.jar
    20584 Server
    37417 QuorumPeerMain
    13705 jar
    30892 Jps
    [hh_mpp_cd@sjsy-hh305-hx13w ~]$ jstack 7622|grep -10 1d1c
            at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490)
            at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
            at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
            at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
            at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
            at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
            at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
            at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834)
            at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1415)
            at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
            - locked <0x000000079df35438> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
            at java.lang.Thread.run(Thread.java:748)
    
    "http-nio-9372-exec-31" #5185 daemon prio=5 os_prio=0 tid=0x000000079df35438 nid=0x1d1c runnable [0x00007f86f9ad7000]
       java.lang.Thread.State: RUNNABLE                   // high CPU usage; a hot thread should be in the RUNNABLE state
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    --
    "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f879c526000 nid=0x9e15 in Object.wait() [0x00007f8732dec000]
       java.lang.Thread.State: WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            at java.lang.Object.wait(Object.java:502)
            at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
            - locked <0x00000006c00b4170> (a java.lang.ref.Reference$Lock)
            at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
    
    "VM Thread" os_prio=0 tid=0x00007f879c01f000 nid=0x1d1c runnable   //VM Thread表示垃圾回收的线程
    
    "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f879c01f000 nid=0x1d1c runnable   //1d1c 这个占用cpu高的线程为GC线程
    
    "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f879c021000 nid=0x9df4 runnable
    
    "GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f879c022800 nid=0x9df5 runnable
    
    "GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f879c024800 nid=0x9df6 runnable
    
    "GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007f879c026800 nid=0x9df7 runnable
    
    "GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f879c028000 nid=0x9df8 runnable
    

    Checking several of the hot thread IDs showed that they were all GC threads,
    so at this point we can conclude that GC is what is burning the CPU.
    Next, gather GC statistics with jstat:

    [hh_mpp_cd@sjsy-hh305-hx13w ~]$ jstat -gcutil 7622 1000 10
      S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
      0.00   0.00   0.00  75.07  59.09  59.60   3259    0.919  6517    7.715    8.635
      0.00   0.00   0.00   0.08  59.09  59.60   3306    0.930  6611    7.822    8.752
      0.00   0.00   0.00   0.08  59.09  59.60   3351    0.943  6701    7.924    8.867
      0.00   0.00   0.00   0.08  59.09  59.60   3397    0.955  6793    8.029    8.984
    

    The FGC count is climbing fast. Whatever the underlying cause, frequent full GC means the old generation keeps hitting its occupancy threshold, so the normal next steps would be:

    • Find out where the large objects are being created
    • Find out whether a memory leak is keeping objects from being reclaimed
    • But before that, go back and look at the startup parameters. When the CPU is saturated, check:
      • how much heap has been allocated
      • which GC algorithm is in use: CMS concurrent collection, for instance, is very CPU-hungry

    Check the startup parameters:
    jinfo -flags 16458
    Non-default VM flags: -XX:CICompilerCount=15 -XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=4294967296
        -XX:MaxMetaspaceSize=4294967296 -XX:MaxNewSize=1431306240 -XX:MetaspaceSize=536870912 -XX:MinHeapDeltaBytes=524288
        -XX:NewSize=715653120 -XX:OldSize=1431830528 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops
        -XX:+UseFastUnorderedTimeStamps -XX:+UseParallelGC
    Command line:  -Xmn4096m -Xmx4096m -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=4096
    

    The young generation has been sized to the entire heap (-Xmn4096m with -Xmx4096m).


    [Screenshot: the resulting heap layout]

    The old generation ends up with only about 0.5 MB, which is exactly what causes the constant full GCs. So next time, before digging deeper, first look at how the startup parameters are configured and make sure they are not simply wrong.
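
    For reference, here is a sketch of a more sensible configuration for this 4 GB heap, assuming roughly a quarter of it goes to the young generation; the exact split depends on the workload, and the metaspace cap is an assumption:

    java -Xms4096m -Xmx4096m -Xmn1024m \
         -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=1024m \
         -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/usr/local/application/jvm/ \
         -jar prometheus-0.0.1-SNAPSHOT.jar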

    Issue 2

    After correcting the startup parameters and running for a while, the CPU was saturated again.

    Step 1: confirm why the CPU is saturated

    The GC statistics again show frequent full GC:

    root@server:/usr/local/application/jvm# jstat -gc 7998 1000 10
     S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
    1024.0 1024.0  0.0    0.0    1024.0   1024.0    7168.0     7050.2   31104.0 28789.7 4224.0 3726.6    165    0.567  336     7.753    8.320
    1024.0 1024.0  0.0    0.0    1024.0   1024.0    7168.0     7048.1   31104.0 28838.2 4224.0 3730.8    165    0.567  384     8.713    9.280
    1024.0 1024.0  0.0    0.0    1024.0   1011.2    7168.0     6972.9   31104.0 28869.7 4224.0 3740.3    165    0.567  429     9.678   10.246
    

    The GC log (from -XX:+PrintGCDetails) shows the GC cause: Ergonomics, i.e. collections triggered by the JVM's adaptive sizing policy:

    
    2021-06-20 18:58:31.491  WARN 8153 --- [           main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'management.info-org.springframework.boot.actuate.autoconfigure.info.InfoContributorProperties': Initialization of bean failed; nested exception is java.lang.OutOfMemoryError: GC overhead limit exceeded
    [Full GC (Ergonomics) [PSYoungGen: 2048K->1979K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9099K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0242829 secs] [Times: user=0.09 sys=0.01, real=0.02 secs]
    [Full GC (Ergonomics) [PSYoungGen: 2048K->1964K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9084K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0217299 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
    [Full GC (Ergonomics) [PSYoungGen: 2048K->1969K(2560K)] [ParOldGen: 7120K->7120K(7168K)] 9168K->9089K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0204122 secs] [Times: user=0.06 sys=0.01, real=0.02 secs]
    [Full GC (Ergonomics) [PSYoungGen: 2048K->1955K(2560K)] [ParOldGen: 7120K->7119K(7168K)] 9168K->9075K(9728K), [Metaspace: 31869K->31869K(1079296K)], 0.0211435 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
    2021-06-20 18:58:31.636  INFO 8153 --- [           main] o.s.s.concurrent.ThreadPoolTaskExecutor  : Shutting down ExecutorService 'applicationTaskExecutor'
    [Full GC (Ergonomics) [PSYoungGen: 2048K->1887K(2560K)] [ParOldGen: 7119K->7119K(7168K)] 9167K->9006K(9728K), [Metaspace: 31871K->31871K(1079296K)], 0.0228763 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
    [Full GC (Ergonomics) [PSYoungGen: 2048K->1784K(2560K)] [ParOldGen: 7119K->7042K(7168K)] 9167K->8827K(9728K), [Metaspace: 31894K->31894K(1079296K)], 0.0564103 secs] [Times: user=0.27 sys=0.05, real=0.06 secs]
    Heap
     PSYoungGen      total 2560K, used 2027K [0x00000000ffd00000, 0x0000000100000000, 0x0000000100000000)
      eden space 2048K, 99% used [0x00000000ffd00000,0x00000000ffefaec8,0x00000000fff00000)
      from space 512K, 0% used [0x00000000fff00000,0x00000000fff00000,0x00000000fff80000)
      to   space 512K, 0% used [0x00000000fff80000,0x00000000fff80000,0x0000000100000000)
     ParOldGen       total 7168K, used 7042K [0x00000000ff600000, 0x00000000ffd00000, 0x00000000ffd00000)
      object space 7168K, 98% used [0x00000000ff600000,0x00000000ffce0ab0,0x00000000ffd00000)
     Metaspace       used 31912K, capacity 34276K, committed 34560K, reserved 1079296K
      class space    used 4152K, capacity 4584K, committed 4608K, reserved 1048576K
    [Full GC (Ergonomics) [PSYoungGen: 2048K->1785K(2560K)] [ParOldGen: 7042K->7027K(7168K)] 9090K->8812K(9728K), [Metaspace: 31912K->31912K(1079296K)], 0.0434741 secs] [Times: user=0.20 sys=0.00, real=0.04 secs]
    
    

    At this point the cause of the high CPU usage is clear: frequent full GC.

    Step 2: analyze thread behavior and memory usage

    Take thread snapshots with jstack to see what the threads are actually doing. Be sure to sample several of the hottest threads and compare what they are busy with; deadlocks and infinite loops are usually easy to spot this way. Sometimes, though, the thread snapshots alone do not reveal the problem, and you have to look at the heap to see which classes are taking up the most memory.
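
    A sketch of capturing a few snapshots several seconds apart before falling back to the heap histogram below; the PID and output path are placeholders:

    # three thread snapshots, 5 seconds apart, for comparing which stacks stay hot
    for i in 1 2 3; do
        jstack 7823 > /tmp/jstack.$i.txt
        sleep 5
    done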

    jmap -histo:live 7823
    

    Because the CPU is saturated, jmap is slow to walk the live objects (-histo:live also forces a full GC first), so trim the output:
    • jmap -histo pid | head -n 10: show only the top 10 entries
    • jmap -histo pid | sort -k 2 -g -r: sort by instance count, descending
    • jmap -histo pid | sort -k 3 -g -r: sort by bytes used, descending

    root@server:/usr/local/application/jdk/jdk1.8.0_144# jmap -histo 7823|head -n 10
    
     num     #instances         #bytes  class name
    ----------------------------------------------
       1:          2154        9671400  [B
       2:         30968        2989952  [C
       3:          7113         782216  java.lang.Class
       4:         30680         736320  java.lang.String
       5:         21712         694784  java.util.concurrent.ConcurrentHashMap$Node
       6:          3360         584440  [I
       7:          7129         341112  [Ljava.lang.Object;
    

    If the histogram above is enough to pinpoint the large objects, go straight to them. If it is not, as in this case, you need to bring in an analysis tool.

    Step 3: analyze the dump file with an analysis tool

    Dump the heap and open the dump file in a tool such as JProfiler to see exactly where the suspect objects are referenced from.
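
    A sketch of producing that dump file; the output path is an assumption, and note that :live forces a full GC before dumping:

    jmap -dump:live,format=b,file=/usr/local/application/jvm/heap.hprof 7823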


    [Screenshots: JProfiler views of the heap dump]

    You can see which class occupies the most memory: TestDto accounts for 49% of the heap, which could certainly drive full GC, so the next step is to trace who references those objects. By itself, though, this does not prove anything; the dump only tells you which class uses the most memory, and conclusions drawn from the dump alone can be off the mark. Always cross-check against the GC statistics and thread snapshots captured at the same time.

    A final note

    Both during the real incident and while reproducing it locally, I noticed that once a Spring Boot application is up, even an OOM does not necessarily kill the process. So the fact that the process is still alive does not mean it has not already suffered an out-of-memory error or a memory leak.
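
    If you would rather have the JVM fail fast in that situation, JDK 8u92 and later can terminate the process on OOM; a sketch combining this with the heap-dump flags already used in this article:

    # exit immediately on OutOfMemoryError, after writing a heap dump
    java -XX:+ExitOnOutOfMemoryError \
         -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/application/jvm/ \
         -jar prometheus-0.0.1-SNAPSHOT.jar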
