美文网首页
xtrabackup:记录一次xtrabackup造成OOM问题

xtrabackup:记录一次xtrabackup造成OOM问题

作者: 真之棒2016 | 来源:发表于2020-12-04 17:39 被阅读0次

    author:sufei

    版本:xtrabackup 2.4.12

    说明:生产使用备份工具xtrabackup造成OOM,本文作为一个记录,并对相关显现进行分享说明。


    一、现象说明

    ​ 生产环境中,在使用xtrabaxkup进行备份时,触发了死锁,这个死锁的原因这里就不多说了,有很多相关分析,可以看截图如下:


    image.png

    ​ xtrabaxkup是卡在ftwrl中的commit lock等待中。

    ​ 本来这是一个很正常的现象,但是这次在出现的hang时,但是xtrabackup进程会一直在申请内存,最后把系统内存给吃光了,产生OOM,最后触发xtrabackup进程被kill。下面是相关的系统日志:

    xxx-crm-db-043 telegraf: 2020/09/24 09:17:38 authenticated: id=176476686980599134, timeout=40000
    xxx-crm-db-043 telegraf: 2020/09/24 09:17:38 re-submitting `0` credentials after reconnect
    xxx-crm-db-043 systemd-logind: Failed to save session data /run/systemd/sessions/13683961: No space left on device
    xxx-crm-db-043 systemd-logind: Failed to save user data /run/systemd/users/1006: No space left on device
    xxx-crm-db-043 systemd-logind: Removed session 13683961.
    xxx-crm-db-043 systemd-logind: Failed to save user data /run/systemd/users/1006: No space left on device
    xxx-crm-db-043 kernel: kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
    xxx-crm-db-043 kernel: kthreadd cpuset=/ mems_allowed=0
    xxx-crm-db-043 kernel: CPU: 2 PID: 2 Comm: kthreadd Not tainted 3.10.0-693.el7.x86_64 #1
    xxx-crm-db-043 kernel: Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 1.09 01/31/2019
    xxx-crm-db-043 kernel: ffff880127360fd0 00000000ec0267f4 ffff880127383a20 ffffffff816a3d91
    xxx-crm-db-043 kernel: ffff880127383ab0 ffffffff8169f186 ffffffff810e939c ffff880127133210
    xxx-crm-db-043 kernel: ffff880127133228 0000000000000202 0000000000000202 ffff880127383aa0
    xxx-crm-db-043 kernel: Call Trace:
    xxx-crm-db-043 kernel: [<ffffffff816a3d91>] dump_stack+0x19/0x1b
    xxx-crm-db-043 kernel: [<ffffffff8169f186>] dump_header+0x90/0x229
    xxx-crm-db-043 kernel: [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0
    xxx-crm-db-043 kernel: [<ffffffff8113d36f>] ? delayacct_end+0x8f/0xb0
    xxx-crm-db-043 kernel: [<ffffffff81186394>] oom_kill_process+0x254/0x3d0
    xxx-crm-db-043 kernel: [<ffffffff81185e3d>] ? oom_unkillable_task+0xcd/0x120
    xxx-crm-db-043 kernel: [<ffffffff81185ee6>] ? find_lock_task_mm+0x56/0xc0
    xxx-crm-db-043 kernel: [<ffffffff81186bd6>] out_of_memory+0x4b6/0x4f0
    xxx-crm-db-043 kernel: [<ffffffff8169fc8a>] __alloc_pages_slowpath+0x5d6/0x724
    xxx-crm-db-043 kernel: [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
    xxx-crm-db-043 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
    xxx-crm-db-043 kernel: [<ffffffff8108510d>] copy_process+0x1dd/0x1970
    xxx-crm-db-043 kernel: [<ffffffff81029557>] ? __switch_to+0xd7/0x510
    xxx-crm-db-043 kernel: [<ffffffff81086a51>] do_fork+0x91/0x320
    xxx-crm-db-043 kernel: [<ffffffff816a8f8d>] ? __schedule+0x39d/0x8b0
    xxx-crm-db-043 kernel: [<ffffffff81086d06>] kernel_thread+0x26/0x30
    xxx-crm-db-043 kernel: [<ffffffff810b1341>] kthreadd+0x2c1/0x300
    xxx-crm-db-043 kernel: [<ffffffff810b1080>] ? kthread_create_on_cpu+0x60/0x60
    xxx-crm-db-043 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
    xxx-crm-db-043 kernel: [<ffffffff810b1080>] ? kthread_create_on_cpu+0x60/0x60
    xxx-crm-db-043 kernel: Mem-Info:
    xxx-crm-db-043 kernel: active_anon:62418042 inactive_anon:1310674 isolated_anon:0#012 active_file:144 inactive_file:214 isolated_file:10#012 unevictable:0 dirty:16 writeback:4 unstable:0#012 slab_reclaimable:104852 slab_unreclaimable:67281#012 mapped:3721 shmem:709624 pagetables:188395 bounce:0#012 free:1283802 free_pcp:321 free_cma:0
    xxx-crm-db-043 kernel: Node 0 DMA free:15896kB min:244kB low:304kB high:364kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:4kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    xxx-crm-db-043 kernel: lowmem_reserve[]: 0 1379 257354 257354
    xxx-crm-db-043 kernel: Node 0 DMA32 free:1045880kB min:21988kB low:27484kB high:32980kB active_anon:154856kB inactive_anon:161936kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):40kB present:1675200kB managed:1414856kB mlocked:0kB dirty:56kB writeback:0kB mapped:0kB shmem:8kB slab_reclaimable:15908kB slab_unreclaimable:2400kB kernel_stack:384kB pagetables:1616kB unstable:0kB bounce:0kB free_pcp:1268kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:16 all_unreclaimable? yes
    xxx-crm-db-043 kernel: lowmem_reserve[]: 0 0 255974 255974
    xxx-crm-db-043 kernel: Node 0 Normal free:4073432kB min:4073760kB low:5092200kB high:6110640kB active_anon:249517312kB inactive_anon:5080760kB active_file:916kB inactive_file:1884kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:266338304kB managed:262117464kB mlocked:0kB dirty:8kB writeback:16kB mapped:15012kB shmem:2838488kB slab_reclaimable:403500kB slab_unreclaimable:266720kB kernel_stack:35728kB pagetables:751964kB unstable:0kB bounce:0kB free_pcp:16kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:80302 all_unreclaimable? yes
    xxx-crm-db-043 kernel: lowmem_reserve[]: 0 0 0 0
    xxx-crm-db-043 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
    xxx-crm-db-043 kernel: Node 0 DMA32: 358*4kB (UEM) 340*8kB (UEM) 202*16kB (UEM) 571*32kB (UEM) 437*64kB (UEM) 1014*128kB (UEM) 602*256kB (UEM) 402*512kB (UEM) 260*1024kB (UEM) 48*2048kB (UEM) 34*4096kB (UM) = 1047160kB
    xxx-crm-db-043 kernel: Node 0 Normal: 8290*4kB (UEM) 3627*8kB (UEM) 1813*16kB (UEM) 7945*32kB (UEM) 4505*64kB (UEM) 2935*128kB (UEM) 1497*256kB (UEM) 533*512kB (UEM) 244*1024kB (EM) 877*2048kB (EM) 90*4096kB (M) = 4080144kB
    xxx-crm-db-043 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
    xxx-crm-db-043 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    xxx-crm-db-043 kernel: 718080 total pagecache pages
    xxx-crm-db-043 kernel: 7571 pages in swap cache
    xxx-crm-db-043 kernel: Swap cache stats: add 43936925, delete 43929354, find 255985525/257994660
    xxx-crm-db-043 kernel: Free swap  = 0kB
    xxx-crm-db-043 kernel: Total swap = 67129340kB
    xxx-crm-db-043 kernel: 67007372 pages RAM
    xxx-crm-db-043 kernel: 0 pages HighMem/MovableOnly
    xxx-crm-db-043 kernel: 1120317 pages reserved
    xxx-crm-db-043 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
    xxx-crm-db-043 kernel: [  770]     0   770    27716      528      58       62             0 systemd-journal
    xxx-crm-db-043 kernel: [  796]     0   796    10959        2      23      135         -1000 systemd-udevd
    xxx-crm-db-043 kernel: [ 1052]     0  1052    13863       31      28       82         -1000 auditd
    xxx-crm-db-043 kernel: [ 1085]     0  1085     5640      244      16       67             0 irqbalance
    xxx-crm-db-043 kernel: [ 1087]   999  1087   133565      104      56     1831             0 polkitd
    xxx-crm-db-043 kernel: [ 1090]     0  1090     6163       50      16      150             0 systemd-logind
    xxx-crm-db-043 kernel: [ 1092]    81  1092     6134       76      16       46          -900 dbus-daemon
    xxx-crm-db-043 kernel: [ 1103]     0  1103    31567       27      21      141             0 crond
    xxx-crm-db-043 kernel: [ 1137]     0  1137    27511        1      10       29             0 agetty
    xxx-crm-db-043 kernel: [ 1547]     0  1547   140597       82      93     3141             0 tuned
    xxx-crm-db-043 kernel: [ 1554]     0  1554     3878        7      12       37             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 1555]     0  1555     3878      592      12       25             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 1556]     0  1556     3878       61      12       18             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 1557]     0  1557     3878        6      11       38             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 1597]     0  1597    10892        6      23      129             0 ctcss-execd
    xxx-crm-db-043 kernel: [ 1609]  1000  1609    28750       56      26      134             0 ctcss-agentd
    xxx-crm-db-043 kernel: [ 1617]     0  1617    10913       21      24      121             0 ctcss-logcollec
    xxx-crm-db-043 kernel: [ 1629]     0  1629    10661       18      23      383             0 ctcss-syscheckd
    xxx-crm-db-043 kernel: [ 4092]     0  4092     3878        6      11       36             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 4094]     0  4094     3878      594      12       27             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 4095]     0  4095     3878       32      10       26             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 4128]     0  4128     3878        6      13       36             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 4129]     0  4129     3878      592      14       26             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 4130]     0  4130     3878       32      12       26             0 ctcm_agentd
    xxx-crm-db-043 kernel: [ 3249]     0  3249    11038        4      23      136             0 keepalived
    xxx-crm-db-043 kernel: [ 3250]     0  3250    11038       19      23      144             0 keepalived
    xxx-crm-db-043 kernel: [  664]     0   664    28848        1      12      102             0 bash
    xxx-crm-db-043 kernel: [  668]     0   668  1328640    72048     438    16566             0 filebeat
    xxx-crm-db-043 kernel: [ 2433]     0  2433  1305720    74938     490    15311             0 filebeat
    xxx-crm-db-043 kernel: [187163]  1005 187163    20434       10      40      193             0 zabbix_agentd
    xxx-crm-db-043 kernel: [187164]  1005 187164    20434      585      42      182             0 zabbix_agentd
    xxx-crm-db-043 kernel: [187165]  1005 187165    20434       16      40      204             0 zabbix_agentd
    xxx-crm-db-043 kernel: [187166]  1005 187166    20434       17      40      204             0 zabbix_agentd
    xxx-crm-db-043 kernel: [187167]  1005 187167    20434       16      40      204             0 zabbix_agentd
    xxx-crm-db-043 kernel: [187168]  1005 187168    20434       60      40      181             0 zabbix_agentd
    xxx-crm-db-043 kernel: [ 7329]  1008  7329    37234        2      29       88             0 ps
    xxx-crm-db-043 kernel: [130090]     0 130090    26499       34      54      211         -1000 sshd
    xxx-crm-db-043 kernel: [33976]  1008 33976    37234        0      29       90             0 ps
    xxx-crm-db-043 kernel: [112227]  1008 112227    37234        2      29       88             0 ps
    xxx-crm-db-043 kernel: [126338]  1008 126338    37234        2      29       88             0 ps
    xxx-crm-db-043 kernel: [179285]  1008 179285    37234        0      28       90             0 ps
    xxx-crm-db-043 kernel: [38321]  1008 38321    37234        2      29       88             0 ps
    xxx-crm-db-043 kernel: [117415]  1008 117415    37234        0      30       89             0 ps
    xxx-crm-db-043 kernel: [212839]  1003 212839    28315        2      13       87             0 mysqld_safe
    xxx-crm-db-043 kernel: [214370]  1003 214370 24301176  9279471   44191  8368726             0 mysqld
    xxx-crm-db-043 kernel: [190668]  1003 190668    28315        2      12       86             0 mysqld_safe
    xxx-crm-db-043 kernel: [192347]  1003 192347 19795602 10438845   37189  6971474             0 mysqld
    xxx-crm-db-043 kernel: [130629]     0 130629   684679    18330     178     4169             0 filebeat
    xxx-crm-db-043 kernel: [107361]     0 107361   804691     2447     139     2890             0 dkron
    xxx-crm-db-043 kernel: [107368]     0 107368   133765      125      37     1648             0 dkron-executor-
    xxx-crm-db-043 kernel: [107383]     0 107383   119408       92      36     1148             0 dkron-executor-
    xxx-crm-db-043 kernel: [107396]     0 107396   135824       91      39     2187             0 dkron-executor-
    xxx-crm-db-043 kernel: [107417]     0 107417   102684      106      35      642             0 dkron-executor-
    xxx-crm-db-043 kernel: [77160]    38 77160     6417       35      19      111             0 ntpd
    xxx-crm-db-043 kernel: [ 9166]     0  9166   185146     1052     244      344             0 rsyslogd
    xxx-crm-db-043 kernel: [141268]  1009 141268     5023        6      13       41             0 zabbix_agentd
    xxx-crm-db-043 kernel: [141269]  1009 141269     5023      596      14       32             0 zabbix_agentd
    xxx-crm-db-043 kernel: [141270]  1009 141270     5023       50      13       23             0 zabbix_agentd
    xxx-crm-db-043 kernel: [141271]  1009 141271     5023       50      13       24             0 zabbix_agentd
    xxx-crm-db-043 kernel: [141272]  1009 141272     5023       49      13       26             0 zabbix_agentd
    xxx-crm-db-043 kernel: [141273]  1009 141273     5023       28      12       30             0 zabbix_agentd
    xxx-crm-db-043 kernel: [ 9724]  1008  9724    37234        2      30       88             0 ps
    xxx-crm-db-043 kernel: [ 9901]  1008  9901   177200     2555      92     3360             0 python
    xxx-crm-db-043 kernel: [225017]     0 225017   147896    12602     119    10106             0 java
    xxx-crm-db-043 kernel: [ 4265]     0  4265   142776      730      97     9462             0 java
    xxx-crm-db-043 kernel: [121182]  1004 121182  3530288   319081    1408   285577             0 java
    xxx-crm-db-043 kernel: [121671]  1004 121671  3497006   134740    1395   455433             0 java
    xxx-crm-db-043 kernel: [220928]  1004 220928  2327527   157952     641    17454             0 java
    xxx-crm-db-043 kernel: [222409]  1004 222409  2327527   145042     650    21089             0 java
    xxx-crm-db-043 kernel: [120981]  1007 120981   897218     6506     156       18             0 telegraf
    xxx-crm-db-043 kernel: [14208]  1003 14208    14969      182      31        0             0 mysqlreport
    xxx-crm-db-043 kernel: [14209]  1003 14209   202664      429      57        1             0 mysqlreport
    xxx-crm-db-043 kernel: [164882]     0 164882    37079      325      73        0             0 sshd
    xxx-crm-db-043 kernel: [164884]  1006 164884    37079      323      72        0             0 sshd
    xxx-crm-db-043 kernel: [164885]  1006 164885    28936      180      13        0             0 bash
    xxx-crm-db-043 kernel: [165192]     0 165192    37079      324      75        0             0 sshd
    xxx-crm-db-043 kernel: [165194]  1006 165194    37079      322      73        0             0 sshd
    xxx-crm-db-043 kernel: [165195]  1006 165195    28912      180      15        0             0 bash
    xxx-crm-db-043 kernel: [155589]     0 155589    37078      324      75        0             0 sshd
    xxx-crm-db-043 kernel: [155591]     0 155591    37078      323      77        0             0 sshd
    xxx-crm-db-043 kernel: [155592]  1003 155592    37078      330      73        0             0 sshd
    xxx-crm-db-043 kernel: [155594]  1003 155594    37078      323      75        0             0 sshd
    xxx-crm-db-043 kernel: [155595]  1003 155595    12777      133      29        0             0 sftp-server
    xxx-crm-db-043 kernel: [155606]  1003 155606    28283       61      13        0             0 bash
    xxx-crm-db-043 kernel: [155617]  1003 155617 50329587 42324156   97694   248269             0 innobackupex
    xxx-crm-db-043 kernel: [155618]  1003 155618     1607       28       9        0             0 sshpass
    xxx-crm-db-043 kernel: [155619]  1003 155619    44504      685      42        0             0 ssh
    xxx-crm-db-043 kernel: [159336]     0 159336    37079      324      75        0             0 sshd
    xxx-crm-db-043 kernel: [159338]  1006 159338    37079      333      73        0             0 sshd
    xxx-crm-db-043 kernel: [159340]  1006 159340    28904      144      16        0             0 bash
    xxx-crm-db-043 kernel: [110630]     0 110630    37079      325      75        0             0 sshd
    xxx-crm-db-043 kernel: [110633]  1003 110633    37079      335      73        0             0 sshd
    xxx-crm-db-043 kernel: [110634]  1003 110634    28847      100      14        0             0 bash
    xxx-crm-db-043 kernel: [111511]  1003 111511    33443      171      22        0             0 mysql
    xxx-crm-db-043 kernel: [ 4741]     0  4741    37079      324      77        0             0 sshd
    xxx-crm-db-043 kernel: [ 4752]  1003  4752    37079      338      73        0             0 sshd
    xxx-crm-db-043 kernel: [ 4753]  1003  4753    28847      103      14        0             0 bash
    xxx-crm-db-043 kernel: [24250]     0 24250    37079      324      76        0             0 sshd
    xxx-crm-db-043 kernel: [24252]  1003 24252    37079      332      74        0             0 sshd
    xxx-crm-db-043 kernel: [24253]  1003 24253    28847      104      15        0             0 bash
    xxx-crm-db-043 kernel: [31919]     0 31919    37079      324      73        0             0 sshd
    xxx-crm-db-043 kernel: [31921]  1004 31921    37079      329      71        0             0 sshd
    xxx-crm-db-043 kernel: [31922]  1004 31922    28847       92      13        0             0 bash
    xxx-crm-db-043 kernel: [32836]     0 32836    48893      172      53        0             0 sudo
    xxx-crm-db-043 kernel: [32837]     0 32837    47397      145      48        0             0 su
    xxx-crm-db-043 kernel: [32838]     0 32838    28847       99      14        0             0 bash
    xxx-crm-db-043 kernel: [33133]     0 33133    47397      144      52        0             0 su
    xxx-crm-db-043 kernel: [33134]  1003 33134    28848       99      12        0             0 bash
    xxx-crm-db-043 kernel: [35883]  1003 35883    33443      161      21        0             0 mysql
    xxx-crm-db-043 kernel: Out of memory: Kill process 155617 (innobackupex) score 516 or sacrifice child
    xxx-crm-db-043 kernel: Killed process 155617 (innobackupex) total-vm:201318348kB, anon-rss:169296624kB, file-rss:0kB, shmem-rss:0kB
    xxx-crm-db-043 telegraf: 2020/09/24 09:17:54 recv loop terminated: err=EOF
    xxx-crm-db-043 telegraf: 2020/09/24 09:17:54 send loop terminated: err=<nil>
    xxx-crm-db-043 systemd-logind: New session 13705934 of user db.
    xxx-crm-db-043 systemd: Started Session 13705934 of user db.
    
    

    ​ 可以看到,在被kill之前,xtrabackup已经使用了200多G内存了。这就不是正常应该出现的情况了。并且在监控平台可以看到:

    image.jpeg

    ​ 可以看到基本是在死锁发生之后,xtrabackup一直在申请内存,而没有释放,造成了内存泄漏。

    ​ 而对应的xtrabackup日志则是:

    ## 备份参数
     start to run innobackup,cmd is xtrabackup --defaults-file=/app/db/mysql/mysql_8804/etc/mysql8804.cnf --backup --ftwrl-wait-threshold=10 --ftwrl-wait-query-type=all --ftwrl-wait-timeout=20 --kill-long-query-type=all --kill-long-queries-timeout=60 --stream=xbstream --parallel=4 --compress-threads=4 --user=agent --password='*****' --socket=/tmp/mysql_8804.sock --target-dir=/tmp | sshpass -p'passwd' ssh -p22 -oPreferredAuthentications=password -o StrictHostKeyChecking=no  -o GSSAPIAuthentication=no db@134.84.46.9 'xbstream -x -C /app/db/mysql-bak/set_6798804/134.84.46.6_8804/backup_0000000079/2020-11-12_00-00-10'
    
    ## 相关日志
    
    201112 03:08:21 >> log scanned up to (1895593797155)
    201112 03:08:22 >> log scanned up to (1895593857713)
    201112 03:08:23 [01]        ...done
    201112 03:08:23 Waiting 20 seconds for queries running longer than 10 seconds to finish
    201112 03:08:23 Executing FLUSH TABLES WITH READ LOCK...
    201112 03:08:23 Kill query timeout 60 seconds.
    201112 03:08:23 >> log scanned up to (1895595105480)
    201112 03:08:24 >> log scanned up to (1895595108000)
    201112 03:08:25 >> log scanned up to (1895595121941)
    201112 03:08:26 >> log scanned up to (1895595135882)
    ……
    201112 05:39:49 >> log scanned up to (1895595179845)
    201112 05:39:50 >> log scanned up to (1895595179845)
    201112 05:39:51 >> log scanned up to (1895595179845)
    
    [2020/11/12 05:40:12 CST] [EROR] (db/pkg/backuprecovery/backup.(*nationalBackupManager).dataInnobackup:72) [set_2348801]: backup failed
    [2020/11/12 05:40:12 CST] [EROR] (db/pkg/backuprecovery/backup.(*basicBackupManager).logErrorAndReport:161) Xtrabackup run backup failed!, can't find completed OK! in last half ret
    
    

    二、分析

    2.1 xtrabackup备份逻辑

    ​ 这一部分就不过多说明,简单需要了解的就是:xtrabackup进行mysql备份时,首先会开启一个辅助线程,进行redo log拷贝;主线程则首先并行的进行ibd文件拷贝;然后执行ftwrl得到一致性位点;最后进行frm等非事务文件的拷贝。主线程通知redo log拷贝线程停止工作,并释放锁退出。

    ​ 从上面的日志可以看出:

    • xtrabackup已经完成了ibd文件的拷贝,正在进行ftwrl,发生死锁
    • 此时只有主线程,redolog复制线程;
    • 可以看到我们在xtrabackup中开启了kill-long-queries-timeout选项,所以,在执行ftwrl时,会开启一个辅助线程,从Kill query timeout 60 seconds.日志也可以看出。

    可以看到,发生内存泄漏只可能发生在上述三个线程中,而主线程此时应该阻塞在ftwrl中,而redo log复制线程由于死锁,没有redolog,应该也不会存在内存申请的过程。那可以发生泄漏就只有kill-long-queries-timeout辅助线程了,下面是相关源码分析(基于2.4.12)。

    2.1 源码分析

    ​ 首先,我们查看执行ftwrl附近相关源代码

        msg_ts("Executing FLUSH TABLES WITH READ LOCK...\n"); // 打印正在执行ftwrl日志
    
        if (opt_kill_long_queries_timeout) {  // 如果开启了kill-long-queries-timeout选项,则开启辅助线程,内部通过os_event进行唤醒退出
            start_query_killer();
        }
    
        ……
        xb_mysql_query(connection, "FLUSH TABLES WITH READ LOCK", false);  // 执行ftwrl
    
        if (opt_kill_long_queries_timeout) {    // 执行完ftwrl后,通过os_event通知辅助线程退出
            stop_query_killer();
        }
    
        tables_locked = true;
    
    

    ​ 而复杂线程的执行代码如下:

    kill_query_thread(
    /*===============*/
        void *arg __attribute__((unused)))
    {
        MYSQL   *mysql;
        time_t  start_time;
    
        start_time = time(NULL);
    
        os_event_set(kill_query_thread_started);
    
        msg_ts("Kill query timeout %d seconds.\n",  // 打印日志
               opt_kill_long_queries_timeout);
    
        while (time(NULL) - start_time <
                    (time_t)opt_kill_long_queries_timeout) {
            if (os_event_wait_time(kill_query_thread_stop, 1000) !=
                OS_SYNC_TIME_EXCEEDED) {
                goto stop_thread;
            }
        }
    
        if ((mysql = xb_mysql_connect()) == NULL) {
            msg("Error: kill query thread failed\n");
            goto stop_thread;
        }
    
        while (true) {   // 循环执行kill操作,kill超时的语句,直到遇到kill_query_thread_stop被唤醒为止
            kill_long_queries(mysql, time(NULL) - start_time);
            if (os_event_wait_time(kill_query_thread_stop, 1000) !=
                OS_SYNC_TIME_EXCEEDED) {
                break;
            }
        }
    
        mysql_close(mysql);
    
    stop_thread:
        msg_ts("Kill query thread stopped\n");
    
        os_event_set(kill_query_thread_stopped);
    
        os_thread_exit();
        OS_THREAD_DUMMY_RETURN;
    }
    
    

    ​ 从上述逻辑看,并没有什么问题,但是当我们查看kill_long_queries函数时,发现了内存泄漏,由于在不断的while循环之中,最终导致OOM,下面时kill_long_queries函数具体实现

    static
    void
    kill_long_queries(MYSQL *connection, uint timeout)
    {
        MYSQL_RES *result;
        MYSQL_ROW row;
        bool all_queries;
        char kill_stmt[100];
    
        result = xb_mysql_query(connection, "SHOW FULL PROCESSLIST", true);  // 查看mysql现有线程执行情况,结果保存在result中
    
        all_queries = (opt_kill_long_query_type == QUERY_TYPE_ALL);
        while ((row = mysql_fetch_row(result)) != NULL) {
            const char  *info       = row[7];
            int     duration    = atoi(row[5]);
            char        *id     = row[0];
    
            if (info != NULL &&            // 只有info不为null,才能被kill
                duration >= (int)timeout &&
                ((all_queries && is_query(info)) ||
                    is_select_query(info))) {
                msg_ts("Killing query %s (duration %d sec): %s\n",
                       id, duration, info);
                ut_snprintf(kill_stmt, sizeof(kill_stmt),
                        "KILL %s", id);
                xb_mysql_query(connection, kill_stmt, false, false);
            }
        }
    }
    
    

    上述处理过程存在两个问题:

    • info != NULL判断条件,我们在死锁界面可以看到,所有线程的info都为null,所以kill_long_queries并不能解决死锁问题,造成kill_long_queries会不断执行;
    • xb_mysql_query返回结果result没有释放,造成内存泄漏。这一点是坚决不被允许的,在mysql客户端开发手册中明确指出
    image.jpeg

    三、规避与修复

    ​ 指导了上述问题原因,相应的应对措施也就可以对症下药了。

    相关文章

      网友评论

          本文标题:xtrabackup:记录一次xtrabackup造成OOM问题

          本文链接:https://www.haomeiwen.com/subject/lquiwktx.html