Notes on Analyzing a Production Crash

Author: 天地一蜉蝣_6e86 | Published: 2020-03-03 14:43

    After an upgrade, one of our applications kept crashing. A Java process can crash for several reasons:

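    1. A problem in the Java program itself: an OOM (OutOfMemoryError) brings the process down.
      Troubleshooting steps:
        1. Check whether the JVM was started with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=*/java.hprof;
        2. Check whether a dump file was produced at the path given by HeapDumpPath;
        3. If a dump file exists, analyze it with a tool such as jhat or VisualVM.
    2. A JVM error: a bug in the JVM or the JDK itself causes the crash.
      When the JVM hits a fatal error it writes an error file, hs_err_pid<pid>.log (called hs_err_pid.log below), which contains the key information about what caused the crash. Analyzing this file usually makes it possible to locate the root cause and fix it. By default the file is written to the process's working directory; the JVM option -XX:ErrorFile sets a different path.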
    3. The process was killed by the operating system's oom-killer.
      Check the system log to confirm whether the Java process was killed by the OS: sudo grep --color "java" /var/log/messages
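
    The first two checks both start from the JVM's command line. The following is a minimal sketch of pulling the relevant options out of it. The helper name `jvm_diagnostic_flags` and the sample command line are hypothetical; only the three option names come from the text above.

```python
import shlex

# Option names taken from the checklist above; everything else is illustrative.
BOOLEAN_FLAG = "HeapDumpOnOutOfMemoryError"
VALUE_FLAGS = ("HeapDumpPath", "ErrorFile")


def jvm_diagnostic_flags(command_line):
    """Return the crash-diagnostic -XX options found in a JVM command line."""
    found = {}
    for token in shlex.split(command_line):
        if not token.startswith("-XX:"):
            continue
        option = token[len("-XX:"):]
        # Boolean options look like -XX:+Name (enabled) or -XX:-Name (disabled).
        if option[:1] in ("+", "-") and option[1:] == BOOLEAN_FLAG:
            found[BOOLEAN_FLAG] = option[0] == "+"
        # Valued options look like -XX:Name=value.
        name, sep, value = option.partition("=")
        if sep and name in VALUE_FLAGS:
            found[name] = value
    return found


# Hypothetical command line, for example copied from `ps -ef` output.
sample = (
    "java -Xmx2g -XX:+HeapDumpOnOutOfMemoryError "
    "-XX:HeapDumpPath=/tmp/java.hprof -XX:ErrorFile=/tmp/hs_err_pid%p.log "
    "-jar app.jar"
)
flags = jvm_diagnostic_flags(sample)
print(flags)
```

    An empty result means neither a heap dump nor a custom error-file path was configured, so any hs_err file would be in the working directory.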
    In our production environment there were a large number of hs_err_pid.log files, for example:
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007f847f9bf641, pid=36367, tid=0x00007f844b3f5700
    #
    # JRE version: OpenJDK Runtime Environment (Zulu 8.38.0.13-CA-linux64) (8.0_212-b04) (build 1.8.0_212-b04)
    # Java VM: OpenJDK 64-Bit Server VM (25.212-b04 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libc.so.6+0x16f641]  __strlen_sse2_pminub+0x11
    #
    # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # If you would like to submit a bug report, please visit:
    #   http://www.azulsystems.com/support/
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #
    
    ---------------  T H R E A D  ---------------
    
    Current thread (0x00007f834c001000):  JavaThread "AgentMonitor-42" [_thread_in_native, id=36538, stack(0x00007f844b2f5000,0x00007f844b3f6000)]
    
    siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
    
    Registers:
    RAX=0x0000000000000000, RBX=0x0000000000000016, RCX=0x0000000000000000, RDX=0x0000000000000000
    RSP=0x00007f844b3f16b8, RBP=0x00007f845033a330, RSI=0x00007f844b3f1b90, RDI=0x33261d74e6c20600
    R8 =0x00007f844b3f1720, R9 =0x00007f847f89d27d, R10=0x0000000000000002, R11=0x00007f847f9d3df4
    R12=0x0000000000000041, R13=0x00007f845033a518, R14=0x00007f84804607e0, R15=0x0000000000000004
    RIP=0x00007f847f9bf641, EFLAGS=0x0000000000010283, CSGSFS=0x0000000000000033, ERR=0x0000000000000000
      TRAPNO=0x000000000000000d
    
    Top of Stack: (sp=0x00007f844b3f16b8)
    0x00007f844b3f16b8:   00007f848025b6bd 00007f8340567120
    0x00007f844b3f16c8:   00007f844b3f1f90 00007f844b3f1f90
    0x00007f844b3f16d8:   00007f848025dffb 00007f844b3f1720
    ...
    0x00007f844b3f1898:   00007f845184aba8 00007f834c001000
    0x00007f844b3f18a8:   00007f834c001000 00007f844b3f1920 
    
    Instructions: (pc=0x00007f847f9bf641)
    0x00007f847f9bf621:   c0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48
    0x00007f847f9bf631:   31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 1d
    0x00007f847f9bf641:   f3 0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 0f 85
    0x00007f847f9bf651:   4e 02 00 00 48 89 f8 48 83 e0 f0 eb 24 48 89 f8 
    
    Register to memory mapping:
    
    RAX=0x0000000000000000 is an unknown value
    RBX=0x0000000000000016 is an unknown value
    RCX=0x0000000000000000 is an unknown value
    RDX=0x0000000000000000 is an unknown value
    RSP=0x00007f844b3f16b8 is pointing into the stack for thread: 0x00007f834c001000
    RBP=0x00007f845033a330 is pointing into the stack for thread: 0x00007f838c01a000
    RSI=0x00007f844b3f1b90 is pointing into the stack for thread: 0x00007f834c001000
    RDI=0x33261d74e6c20600 is an unknown value
    R8 =0x00007f844b3f1720 is pointing into the stack for thread: 0x00007f834c001000
    R9 =0x00007f847f89d27d: _IO_vfprintf+0x4ccd in /lib64/libc.so.6 at 0x00007f847f850000
    R10=0x0000000000000002 is an unknown value
    R11=0x00007f847f9d3df4: <offset 0x183df4> in /lib64/libc.so.6 at 0x00007f847f850000
    R12=0x0000000000000041 is an unknown value
    R13=0x00007f845033a518 is pointing into the stack for thread: 0x00007f838c01a000
    R14=0x00007f84804607e0: snoopy_inputdatastorage_data+0 in /usr/lib64/libsnoopy.so at 0x00007f8480255000
    R15=0x0000000000000004 is an unknown value
    

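    The file contains the following kinds of key information:
    - Header
    - Information about the thread that crashed
    - Information about all threads
    - Safepoint and lock information
    - Heap information
    - Native code cache
    - Compilation events
    - GC history
    - JVM memory map
    - JVM startup arguments
    - Host system information
    For a detailed walkthrough, see: https://my.oschina.net/xionghui/blog/498785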
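
    The header alone already identifies the signal, the process, and the native frame that faulted. Below is a minimal sketch of extracting those fields. The regular expressions are assumptions based only on the lines shown in the excerpt above, not a complete parser for the hs_err format. It also checks that the offset in the problematic frame equals the program counter minus the load address of libc reported in the register mapping.

```python
import re

# Lines copied from the hs_err_pid.log excerpt above.
HS_ERR_EXCERPT = """\
#  SIGSEGV (0xb) at pc=0x00007f847f9bf641, pid=36367, tid=0x00007f844b3f5700
# Problematic frame:
# C  [libc.so.6+0x16f641]  __strlen_sse2_pminub+0x11
R9 =0x00007f847f89d27d: _IO_vfprintf+0x4ccd in /lib64/libc.so.6 at 0x00007f847f850000
"""

SIGNAL_RE = re.compile(
    r"^#\s+(SIG\w+) \((0x[0-9a-f]+)\) at pc=(0x[0-9a-f]+), pid=(\d+), tid=(0x[0-9a-f]+)",
    re.M,
)
FRAME_RE = re.compile(r"^# ([A-Za-z])\s+\[(.+?)\+(0x[0-9a-f]+)\]\s+(\S+)", re.M)
LIB_BASE_RE = re.compile(r" in (\S+) at (0x[0-9a-f]+)$", re.M)


def parse_header(text):
    """Extract the signal line and the problematic frame from hs_err text."""
    sig = SIGNAL_RE.search(text)
    frame = FRAME_RE.search(text)
    return {
        "signal": sig.group(1),
        "pc": int(sig.group(3), 16),
        "pid": int(sig.group(4)),
        "frame_type": frame.group(1),   # "C" means native code
        "library": frame.group(2),
        "offset": int(frame.group(3), 16),
        "symbol": frame.group(4),
    }


info = parse_header(HS_ERR_EXCERPT)
lib_path, lib_base = LIB_BASE_RE.search(HS_ERR_EXCERPT).groups()

# pc minus the library's load address should equal the frame offset.
computed_offset = info["pc"] - int(lib_base, 16)
print(info["signal"], info["library"], info["symbol"])
print(hex(computed_offset), computed_offset == info["offset"])
```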
    The stack section shows:

    Stack: [0x00007f844b2f5000,0x00007f844b3f6000],  sp=0x00007f844b3f16b8,  free space=1009k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    

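    The problematic frame is marked C (C=native code), which means the Java process crashed while executing native code. strace can be used to trace its system calls (strace: https://blog.csdn.net/rigete/article/details/50055783).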

    28100 22:03:35.247322 [00007f1a9305b710] open(0x7f1a95f56110, O_RDONLY) = -1 ENOENT (No such file or directory)
    28100 22:03:35.247354 [00007f1a9305b710] open(0x7f1a95f571a0, O_RDONLY) = -1 ENOENT (No such file or directory)
    28100 22:03:35.247383 [00007f1a9305b710] open(0x7f1a95f561a0, O_RDONLY) = -1 ENOENT (No such file or directory)
    28100 22:03:35.247414 [00007f1a9305b710] open(0x7f1a95f57120, O_RDONLY) = -1 ENOENT (No such file or directory)
    28100 22:03:35.247444 [00007f1a9305b710] open(0x7f1a95f57230, O_RDONLY) = -1 ENOENT (No such file or directory)
    28100 22:03:35.247473 [00007f1a9305b710] open(0x7f1a95f56220, O_RDONLY) = -1 ENOENT (No such file or directory)
    28100 22:03:35.247508 [00007f1a9305b9b0] write(2, 0x7ffc86a540b8, 4) = -1 EPIPE (Broken pipe)
    28100 22:03:35.247540 [00007f1a9305b9b0] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=28100, si_uid=1001} ---
    28100 22:03:35.247622 [????????????????] +++  +++killed by SIGPIPE
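
    With a longer trace it helps to reduce the output to failed calls and delivered signals. The following is a minimal sketch that assumes the line layout shown in the excerpt above (PID, timestamp, bracketed instruction pointer); the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Lines copied from the strace excerpt above (leading indentation removed).
STRACE_EXCERPT = """\
28100 22:03:35.247444 [00007f1a9305b710] open(0x7f1a95f57230, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247473 [00007f1a9305b710] open(0x7f1a95f56220, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247508 [00007f1a9305b9b0] write(2, 0x7ffc86a540b8, 4) = -1 EPIPE (Broken pipe)
28100 22:03:35.247540 [00007f1a9305b9b0] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=28100, si_uid=1001} ---
"""

# PID, timestamp, and instruction pointer precede every event.
PREFIX = r"^\d+ \S+ \[[0-9a-f?]+\] "
FAILED_CALL_RE = re.compile(PREFIX + r"(\w+)\(.*\) = -1 (E[A-Z0-9]+) ")
SIGNAL_RE = re.compile(PREFIX + r"--- (SIG[A-Z0-9]+) ")


def summarize(text):
    """Count failed syscalls by (name, errno) and list delivered signals."""
    failures = Counter()
    signals = []
    for raw in text.splitlines():
        line = raw.strip()
        match = FAILED_CALL_RE.match(line)
        if match:
            failures[(match.group(1), match.group(2))] += 1
            continue
        match = SIGNAL_RE.match(line)
        if match:
            signals.append(match.group(1))
    return failures, signals


failures, signals = summarize(STRACE_EXCERPT)
for (syscall, errno_name), count in failures.items():
    print(syscall, errno_name, count)
print(signals)
```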
    

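    The strace output adds little to the SIGSEGV information in hs_err_pid.log (SIGSEGV indicates an invalid memory access, such as touching unmapped memory or writing to read-only memory). Since neither gave much to go on, I googled the entries in the "Register to memory mapping" section and found:
    https://stackoverflow.com/questions/44922588/hadoop-nodemanager-killed-by-sigsegv
    The error information there is identical to ours, so I tried stopping snoopy, and that resolved the crashes.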
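
    The step that found the culprit, reading the "Register to memory mapping" section for shared libraries, can be sketched as follows. The regular expression is an assumption based on the lines in the excerpt above, and the function name `libraries_by_register` is hypothetical. Any library other than libc or the JVM's own is a reasonable candidate to search for.

```python
import re
from collections import defaultdict

# Lines copied from the "Register to memory mapping" excerpt above.
REGISTER_MAPPING_EXCERPT = """\
RDI=0x33261d74e6c20600 is an unknown value
R8 =0x00007f844b3f1720 is pointing into the stack for thread: 0x00007f834c001000
R9 =0x00007f847f89d27d: _IO_vfprintf+0x4ccd in /lib64/libc.so.6 at 0x00007f847f850000
R11=0x00007f847f9d3df4: <offset 0x183df4> in /lib64/libc.so.6 at 0x00007f847f850000
R14=0x00007f84804607e0: snoopy_inputdatastorage_data+0 in /usr/lib64/libsnoopy.so at 0x00007f8480255000
"""

# register = value: symbol in /path/to/lib at load-address
MAPPED_RE = re.compile(r"^(\w+)\s*=(0x[0-9a-f]+): (.+) in (\S+) at (0x[0-9a-f]+)$")


def libraries_by_register(text):
    """Group registers that point into a shared library by library path."""
    libs = defaultdict(list)
    for raw in text.splitlines():
        match = MAPPED_RE.match(raw.strip())
        if match:
            register, _value, symbol, path, _base = match.groups()
            libs[path].append((register, symbol))
    return dict(libs)


libs = libraries_by_register(REGISTER_MAPPING_EXCERPT)
for path, hits in libs.items():
    print(path, hits)

# Libraries other than libc are the interesting ones to look up.
suspects = [path for path in libs if "libc.so" not in path]
print(suspects)
```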
