美文网首页
服务启动时出现大量 Hystrix timed-out 第一次排

服务启动时出现大量 Hystrix timed-out 第一次排

作者: 蓝笔头 | 来源:发表于2021-07-21 12:50 被阅读0次

    异常日志

    com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String)  timed-out and no fallback available.
    com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String)  short-circuited 
    

    有人说是要开启 Ribbon 饥饿加载,看了下服务配置,已经开启了。

    ribbon.eager-load.enabled=true
    ribbon.eager-load.clients=user-service
    

    排查过程

    因为服务启动时添加了 JMX 参数,所以可以通过 Java VisualVM 连接到 jmx 端口。

    -Dcom.sun.management.jmxremote.port=12345 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
    

    通过线程监控可以发现存在大量的线程(1000 多个)在短时间(50-24 = 16s)内被创建。

    大量线程创建开始时间节点 大量线程创建结束时间节点
    meikai@build-1:~/test$ while true; do pidstat -wt -p $(pgrep -f ^/usr/bin/java.*athena-web) 1 1 | awk -f stat.awk; done
    2021-07-21 11:30:01 cswch:6782.32 nvcswch:252.90
    2021-07-21 11:30:02 cswch:3625.42 nvcswch:35.28
    2021-07-21 11:30:03 cswch:1984.16 nvcswch:229.38
    2021-07-21 11:30:04 cswch:2103.82 nvcswch:19.60
    2021-07-21 11:30:05 cswch:2064.64 nvcswch:11.76
    2021-07-21 11:30:07 cswch:13307.72 nvcswch:315.58
    2021-07-21 11:30:08 cswch:3503.60 nvcswch:73.06
    2021-07-21 11:30:09 cswch:1713.68 nvcswch:19.60
    2021-07-21 11:30:10 cswch:1101.86 nvcswch:1.96
    2021-07-21 11:30:11 cswch:1419.46 nvcswch:0.00
    2021-07-21 11:30:12 cswch:2515.64 nvcswch:9.80
    2021-07-21 11:30:13 cswch:3562.66 nvcswch:21.56
    2021-07-21 11:30:14 cswch:4803.94 nvcswch:237.60
    2021-07-21 11:30:15 cswch:2090.08 nvcswch:9.80
    2021-07-21 11:30:16 cswch:4843.42 nvcswch:223.74
    2021-07-21 11:30:17 cswch:2543.02 nvcswch:11.76
    2021-07-21 11:30:18 cswch:1915.60 nvcswch:3.92
    2021-07-21 11:30:19 cswch:2853.28 nvcswch:277.22
    2021-07-21 11:30:20 cswch:2060.68 nvcswch:9.80
    2021-07-21 11:30:22 cswch:2435.06 nvcswch:156.82
    2021-07-21 11:30:23 cswch:3158.64 nvcswch:5.88
    2021-07-21 11:30:24 cswch:4964.60 nvcswch:7.84
    2021-07-21 11:30:25 cswch:36033.14 nvcswch:356.78
    2021-07-21 11:30:26 cswch:38675.18 nvcswch:192.16
    2021-07-21 11:30:27 cswch:27278.76 nvcswch:592.28
    2021-07-21 11:30:28 cswch:28185.20 nvcswch:1245.52
    2021-07-21 11:30:30 cswch:27249.28 nvcswch:300.46
    2021-07-21 11:30:31 cswch:27757.72 nvcswch:235.06
    2021-07-21 11:30:32 cswch:35811.64 nvcswch:845.80
    2021-07-21 11:30:33 cswch:45824.74 nvcswch:4289.28
    2021-07-21 11:30:35 cswch:33469.40 nvcswch:921.34
    2021-07-21 11:30:36 cswch:33097.62 nvcswch:794.70
    2021-07-21 11:30:37 cswch:35508.16 nvcswch:407.34
    2021-07-21 11:30:39 cswch:31193.58 nvcswch:244.96
    2021-07-21 11:30:40 cswch:35123.04 nvcswch:4207.68
    2021-07-21 11:30:41 cswch:34204.14 nvcswch:14115.80
    2021-07-21 11:30:43 cswch:33899.98 nvcswch:549.66
    2021-07-21 11:30:44 cswch:38476.98 nvcswch:899.94
    2021-07-21 11:30:45 cswch:33318.62 nvcswch:264.74
    2021-07-21 11:30:47 cswch:32771.18 nvcswch:1146.34
    2021-07-21 11:30:48 cswch:31128.38 nvcswch:949.74
    2021-07-21 11:30:49 cswch:25169.86 nvcswch:240.10
    2021-07-21 11:30:51 cswch:23371.16 nvcswch:264.12
    2021-07-21 11:30:52 cswch:24892.78 nvcswch:98.06
    2021-07-21 11:30:53 cswch:23915.64 nvcswch:181.92
    2021-07-21 11:30:55 cswch:28137.40 nvcswch:406.04
    2021-07-21 11:30:56 cswch:24536.02 nvcswch:86.48
    2021-07-21 11:30:57 cswch:28038.82 nvcswch:284.46
    2021-07-21 11:30:59 cswch:23471.14 nvcswch:53.52
    2021-07-21 11:31:00 cswch:24460.46 nvcswch:203.50
    2021-07-21 11:31:01 cswch:23419.50 nvcswch:60.64
    2021-07-21 11:31:03 cswch:26726.46 nvcswch:169.56
    2021-07-21 11:31:04 cswch:24444.54 nvcswch:187.92
    2021-07-21 11:31:05 cswch:23241.24 nvcswch:30.60
    2021-07-21 11:31:07 cswch:27613.88 nvcswch:74.90
    2021-07-21 11:31:08 cswch:26154.86 nvcswch:173.02
    2021-07-21 11:31:09 cswch:25369.32 nvcswch:42.82
    2021-07-21 11:31:11 cswch:24101.26 nvcswch:55.28
    2021-07-21 11:31:12 cswch:26191.50 nvcswch:69.14
    ^C
    

    我们重点关注 (11:30:24 ~ 2021-07-21 11:30:51)的输出:

    2021-07-21 11:30:24 cswch:4964.60 nvcswch:7.84
    2021-07-21 11:30:25 cswch:36033.14 nvcswch:356.78
    2021-07-21 11:30:26 cswch:38675.18 nvcswch:192.16
    2021-07-21 11:30:27 cswch:27278.76 nvcswch:592.28
    2021-07-21 11:30:28 cswch:28185.20 nvcswch:1245.52
    2021-07-21 11:30:30 cswch:27249.28 nvcswch:300.46
    2021-07-21 11:30:31 cswch:27757.72 nvcswch:235.06
    2021-07-21 11:30:32 cswch:35811.64 nvcswch:845.80
    2021-07-21 11:30:33 cswch:45824.74 nvcswch:4289.28
    2021-07-21 11:30:35 cswch:33469.40 nvcswch:921.34
    2021-07-21 11:30:36 cswch:33097.62 nvcswch:794.70
    2021-07-21 11:30:37 cswch:35508.16 nvcswch:407.34
    2021-07-21 11:30:39 cswch:31193.58 nvcswch:244.96
    2021-07-21 11:30:40 cswch:35123.04 nvcswch:4207.68
    2021-07-21 11:30:41 cswch:34204.14 nvcswch:14115.80
    2021-07-21 11:30:43 cswch:33899.98 nvcswch:549.66
    2021-07-21 11:30:44 cswch:38476.98 nvcswch:899.94
    2021-07-21 11:30:45 cswch:33318.62 nvcswch:264.74
    2021-07-21 11:30:47 cswch:32771.18 nvcswch:1146.34
    2021-07-21 11:30:48 cswch:31128.38 nvcswch:949.74
    2021-07-21 11:30:49 cswch:25169.86 nvcswch:240.10
    2021-07-21 11:30:51 cswch:23371.16 nvcswch:264.12
    

    可以发现,每秒非自愿上下文切换数(在下一节中有描述)比前后都高非常多,最多的达到了 14115.80 次,平时基本都是在 100 次内,偶尔有一两个大于 100 的。

    ** 因此,可以断定是因为服务瞬时创建大量线程,使线程上下文切换太频繁,线程无法连续执行,从而导致出现大量 timed-out。**

    所用到的命令

    Linux 进度调度测试 一文中的测试代码为例。

    (1)获取进程的 pid

    $ pgrep -f TestThread
    242241
    

    通过 pgrep -f 进程名称 获取进程的 pid。

    可以和 ps -ef 命令中输出的 pid 对比一下。

    $ ps -ef | grep TestThread
    meikai   242241 247192  0 12:28 pts/38   00:00:00 java TestThread
    meikai   242359 147324  0 12:28 pts/19   00:00:00 grep TestThread
    

    (2)获取进程的上下文切换信息

    
    meikai@dev:~/test$ pidstat -wt -p $(pgrep -f TestThread) 1 1
    Linux 4.19.0-17-amd64 (dev.wumii.net)   07/21/2021      _x86_64_        (8 CPU)
    
    12:31:59 PM   UID      TGID       TID   cswch/s nvcswch/s  Command
    12:32:00 PM 19200733    242241         -      0.00      0.00  java
    12:32:00 PM 19200733         -    242241      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242242      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242243      1.00      0.00  |__java
    12:32:00 PM 19200733         -    242244      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242245      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242247      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242248      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242249      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242250      0.00      0.00  |__java
    12:32:00 PM 19200733         -    242251     20.00      0.00  |__java
    12:32:00 PM 19200733         -    242416      4.00      5.00  |__java
    12:32:00 PM 19200733         -    242417      4.00      1.00  |__java
    12:32:00 PM 19200733         -    242418      4.00      1.00  |__java
    12:32:00 PM 19200733         -    242419      3.00      7.00  |__java
    12:32:00 PM 19200733         -    242420      5.00      0.00  |__java
    12:32:00 PM 19200733         -    242421      4.00      9.00  |__java
    12:32:00 PM 19200733         -    242422      4.00      0.00  |__java
    12:32:00 PM 19200733         -    242423      4.00      0.00  |__java
    12:32:00 PM 19200733         -    242424      4.00      2.00  |__java
    12:32:00 PM 19200733         -    242425      4.00      0.00  |__java
    
    Average:      UID      TGID       TID   cswch/s nvcswch/s  Command
    ...
    

    参数简介:

    • TID:线程 ID
    • cswch/s:每秒自愿上下文切换数。
      • 进程运行时由于系统资源不足,如 IO,内存等原因不得不进行切换。
    • 自愿的切换的多了,表明在等待资源,如 IO
    • nvcswch/s:每秒非自愿上下文切换数。
      • 比如时间片用完,系统调度让其他任务运行,或者竞争 CPU 的时候也会发生。
      • 非自愿的切换的多了,表明 CPU 被任务竞争

    (3)统计进程中所有线程的上下文切换数量

    自定义 awk 脚本,脚本名为 stat.awk

    #!/bin/bin/awk -f
    
    #运行前
    BEGIN {
        cswch = 0
        nvcswch = 0
    }
    #运行中
    {
        if(NF == 0){
            next;
        }
        if ($1 == "Average:") {
            exit;
        }
        
        
        # 注意:所有用作算术运算符进行操作,操作数自动转为数值,所有非数值都变为0
        cswch+=$(NF-2) # cswch 是倒数第3列
        nvcswch+=$(NF-1) # nvcswch 是倒数第2列
        
        # 调试语句
        # print
        # printf "cswch:%.2f nvcswch:%.2f\n", cswch, nvcswch
    }
    #运行后
    END {
        time = strftime("%Y-%m-%d %H:%M:%S", systime()) # 获取当前时间
        printf "%s cswch:%.2f nvcswch:%.2f\n", time, cswch, nvcswch
    }
    

    统计进程 TestThread 的上下文切换总数:

    $ pidstat -wt -p $(pgrep -f TestThread) 1 1 | awk -f stat.awk
    2021-07-21 12:39:23 cswch:120.00 nvcswch:94.00
    

    (4)循环执行命令

    先测试一下是否正常。

    $ while true; do date; sleep 1; done
    Wed 21 Jul 2021 12:39:42 PM CST
    Wed 21 Jul 2021 12:39:43 PM CST
    

    实际使用:

    $ while true; do pidstat -wt -p $(pgrep -f TestThread) 1 1 | awk -f stat.awk; done
    2021-07-21 12:41:53 cswch:114.00 nvcswch:44.00
    2021-07-21 12:41:55 cswch:116.00 nvcswch:78.00
    2021-07-21 12:41:56 cswch:120.00 nvcswch:50.00
    2021-07-21 12:41:57 cswch:122.76 nvcswch:87.12
    2021-07-21 12:41:58 cswch:112.86 nvcswch:73.26
    2021-07-21 12:41:59 cswch:114.00 nvcswch:60.00
    

    参考

    相关文章

      网友评论

          本文标题:服务启动时出现大量 Hystrix timed-out 第一次排

          本文链接:https://www.haomeiwen.com/subject/bwxjmltx.html