美文网首页
MySQL 5.7复制的一个小bug-XA事务

MySQL 5.7复制的一个小bug-XA事务

作者: 小卢二 | 来源:发表于2017-07-21 19:26 被阅读0次

    线上一个5.7从库复制中断:

    mysql>show slave status\G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
                      Master_Host: 192.168.0.1
                      Master_User: repl
                      Master_Port: 3306
                    Connect_Retry: 60
                  Master_Log_File: mysql-bin.000006
              Read_Master_Log_Pos: 49461560
                   Relay_Log_File: db-music1-relay-bin.000002
                    Relay_Log_Pos: 379466837
            Relay_Master_Log_File: mysql-bin.000001
                 Slave_IO_Running: Yes
                Slave_SQL_Running: No
                  Replicate_Do_DB: 
              Replicate_Ignore_DB: 
               Replicate_Do_Table: 
           Replicate_Ignore_Table: 
          Replicate_Wild_Do_Table: 
      Replicate_Wild_Ignore_Table: 
                       Last_Errno: 1399
                       Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 379466624
                  Relay_Log_Space: 2575974028
                  Until_Condition: None
                   Until_Log_File: 
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File: 
               Master_SSL_CA_Path: 
                  Master_SSL_Cert: 
                Master_SSL_Cipher: 
                   Master_SSL_Key: 
            Seconds_Behind_Master: NULL
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error: 
                   Last_SQL_Errno: 1399
                   Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
      Replicate_Ignore_Server_Ids: 
                 Master_Server_Id: 244331
                      Master_UUID: ce522aa3-6398-11e7-945f-246e96729568
                 Master_Info_File: mysql.slave_master_info
                        SQL_Delay: 0
              SQL_Remaining_Delay: NULL
          Slave_SQL_Running_State: 
               Master_Retry_Count: 86400
                      Master_Bind: 
          Last_IO_Error_Timestamp: 
         Last_SQL_Error_Timestamp: 170708 13:22:50
                   Master_SSL_Crl: 
               Master_SSL_Crlpath: 
               Retrieved_Gtid_Set: ce522aa3-6398-11e7-945f-246e96729568:1,
    d199707a-5287-11e7-8f35-246e96729f10:1215032565-1218510979
                Executed_Gtid_Set: 30472597-527e-11e7-a93f-246e96729ec0:1-47,
    d199707a-5287-11e7-8f35-246e96729f10:1-1215544882,
    d23b9cfd-5287-11e7-a307-246e96729ff0:1-9
                    Auto_Position: 1
             Replicate_Rewrite_DB: 
                     Channel_Name: 
               Master_TLS_Version: 
    1 row in set (0.00 sec)
    

    查询具体报错:

    mysql>select * from performance_schema.replication_applier_status_by_worker \G
    *************************** 1. row ***************************
             CHANNEL_NAME: 
                WORKER_ID: 3
                THREAD_ID: NULL
            SERVICE_STATE: OFF
    LAST_SEEN_TRANSACTION: d199707a-5287-11e7-8f35-246e96729f10:1215544883
        LAST_ERROR_NUMBER: 1399
       LAST_ERROR_MESSAGE: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123; Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  IDLE state' on query. Default database: ''. Query: 'ROLLBACK'
     LAST_ERROR_TIMESTAMP: 2017-07-08 13:22:50
    1 rows in set (0.00 sec)
    

    第一感觉很奇怪,为什么会rollback失败呢?于是根据gtid去对应的主库binlog去看了下,并没有任何rollback语句:

    # at 379611076
    #170707 20:28:38 server id 61481  end_log_pos 379611141 CRC32 0x65240fde        GTID    last_committed=512318   sequence_number=512319
    SET @@SESSION.GTID_NEXT= 'd199707a-5287-11e7-8f35-246e96729f10:1215544883'/*!*/;
    # at 379611141
    #170707 20:28:38 server id 61481  end_log_pos 379611269 CRC32 0x19e0ec66        Query   thread_id=758257        exec_time=59329 error_code=0
    SET TIMESTAMP=1499430518/*!*/;
    SET @@session.sql_mode=524288/*!*/;
    XA START X'3230375f313533353431363835313835373138305f',X'353034',39187
    /*!*/;
    # at 379611269
    #170707 20:28:38 server id 61481  end_log_pos 379611392 CRC32 0xd37354d4        Table_map: `XXXX`.`XXXX` mapped to number 123
    # at 379611392
    #170707 20:28:38 server id 61481  end_log_pos 379611910 CRC32 0xec28e0a5        Update_rows: table id 123 flags: STMT_END_F
    ### UPDATE  XXXX.XXXX
    ### WHERE
    ###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
    ###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @5=1499148021048 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @7=114 /* INT meta=0 nullable=1 is_null=0 */
    ### SET
    ###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */```
    ###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @5=1499430518207 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @7=115 /* INT meta=0 nullable=1 is_null=0 */
    # at 379611910```
    #170707 20:28:38 server id 61481  end_log_pos 379612036 CRC32 0x8f8d4a35        Query   thread_id=758257        exec_time=59329 error_code=0
    SET TIMESTAMP=1499430518/*!*/;
    SET @@session.sql_mode=1436549152/*!*/;
    XA END X'3230375f313533353431363835313835373138305f',X'353034',39187
    /*!*/;
    # at 379612036
    #170707 20:28:38 server id 61481  end_log_pos 379612096 CRC32 0xd4966095        XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187
    XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187
    /*!*/;
    # at 379614541
    #170707 20:28:38 server id 61481  end_log_pos 379614687 CRC32 0x5273d888        Query   thread_id=758257        exec_time=59329 error_code=0
    SET TIMESTAMP=1499430518/*!*/;
    SET @@session.sql_mode=1436549152/*!*/;
    XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187
    /*!*/;
    

    看下本地的relay log,找到这个事务的gtid

    # at 379611289
    #170707 20:28:38 server id 61481  end_log_pos 379611141 CRC32 0x65240fde        GTID    last_committed=512318   sequence_number=512319
    SET @@SESSION.GTID_NEXT= 'd199707a-5287-11e7-8f35-246e96729f10:1215544883'/*!*/;
    # at 379611354
    #170707 20:28:38 server id 61481  end_log_pos 379611269 CRC32 0x19e0ec66        Query   thread_id=758257        exec_time=59329 error_code=0
    SET TIMESTAMP=1499430518/*!*/;
    SET @@session.sql_mode=524288/*!*/;
    XA START X'3230375f313533353431363835313835373138305f',X'353034',39187
    /*!*/;
    # at 379611482
    #170707 20:28:38 server id 61481  end_log_pos 379611392 CRC32 0xd37354d4        Table_map: XXXX.XXXX mapped to number 123
    # at 379611605
    #170707 20:28:38 server id 61481  end_log_pos 379611910 CRC32 0xec28e0a5        Update_rows: table id 123 flags: STMT_END_F
    
    BINLOG '
    dn5fWRMp8AAAewAAAABpoBYAAHsAAAAAAAEADGRiX3BsYXlsaXN0MQAOTXVzaWNfUGxheWxpc3QA
    KQgPCAgICAMDAwgPAwMIAw8IAQMPCAEPAQMIAQEDCAEBDwMIAQQDAwMIDf0C/QL9AgAMdwGAAQT6
    ////vwHUVHPT
    dn5fWRgp8AAABgIAAAZroBYAAHsAAAAAAAEAKf///////////////wCECAAA/tDzPBwAAAAAGQDm
    jL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAADhtLwxdAQAAAAAAAAAAAABy
    AAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAA3gv6G10BAAABAAAAgQEAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAHAAAAAAAAAAAAAAAAAAAAAAADhtLwxdAQAAAAAgAGIyYTcxOWE4NjdmZjNkNmJlZWI5
    YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAAAACECAAA/tDz
    PBwAAAAAGQDmjL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAAL/9BR1dAQAA
    AAAAAAAAAABzAAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAAv/0FHV0BAAABAAAAgQEAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAHEAAAAAAAAAAAAAAAAAAAAAAL/9BR1dAQAAAAAgAGIyYTcxOWE4Njdm
    ZjNkNmJlZWI5YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAA
    AKXgKOw=
    '/*!*/;
    ### UPDATE  XXXX.XXXX
    ### WHERE
    ###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
    ###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @5=1499148021048 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @7=114 /* INT meta=0 nullable=1 is_null=0 */
    ### SET
    ###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
    ###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @5=1499430518207 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
    ###   @7=115 /* INT meta=0 nullable=1 is_null=0 */
    # at 379612123
    #170707 20:28:38 server id 61481  end_log_pos 379612036 CRC32 0x8f8d4a35        Query   thread_id=758257        exec_time=59329 error_code=0
    SET TIMESTAMP=1499430518/*!*/;
    SET @@session.sql_mode=1436549152/*!*/;
    XA END X'3230375f313533353431363835313835373138305f',X'353034',39187
    /*!*/;
    # at 379612249
    #170708 13:18:53 server id 254331  end_log_pos 379612306 CRC32 0xf2c69619       Rotate to db-music1-relay-bin.000003  pos: 4
    ROLLBACK /* added by mysqlbinlog */ /*!*/;
    SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
    DELIMITER ;
    # End of log file
    /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
    /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
    

    到这里,这个relay log日志文件结束了。很显然问题也找到了,就是执行

    ROLLBACK /* added by mysqlbinlog */ /*!*/;
    

    出错了。

    首先 我们看到 这个rollback是MySQL自己加上去的,那么为什么要加呢?
    mysql为了保证一个事务只在一个binlog里,所以当Binlog或者relay log发生截断时,最后一个事务要么commit,要么rollback,如果rollback,那么下一个binlog或者relay log会把这个事务重做一遍,保证这个事务不会丢。
    由于xa事务无法直接rollback,而需要xa rollback ‘XXX’,所以复制就停了。
    怎么修复?是不是直接跳过这个rollback就行了?
    我们来试一下,跳过这个rollback:

    STOP SLAVE;
    SET GTID_NEXT="d199707a-5287-11e7-8f35-246e96729f10:1215544883";
    BEGIN; COMMIT;
    SET GTID_NEXT="AUTOMATIC";
    START SLAVE;
    

    这里GTID_NEXT值不能用show slave status的里executed值,得用具体报错的停止的gtid
    但是,show slave status看到,还是有报错:

    Error 'XAER_NOTA: Unknown XID' on query. Default database: 'XXXX'. Query: 'XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187'
    

    为什么又报这个事务commit找不到XID呢?
    之前说过,在relaylog截断的时候,如果事务没有commit,会自动在最后加rollback,在下一个relay log开始的时候重新做一次这个事务,按理说我们跳过这个rollback,在下个relaylog会被重做,为啥会在commit的时候找不到xid呢?
    我们看到我们跳过的gtid原来就是重做这个事务到PREPARE阶段的gtid,原来,rollback是没有gtid的,所以我们实际上就是把这个事务到PREPARE阶段的gtid给跳过了,commit的时候肯定会找不到xid,接着怎么修复?

    • 手动把这个事务执行一遍,手动commit。
      • 指定 ‘/!/;’为结束符,每个binlog开头都有
        BINLOG ' 8HFvWQ+8fQQAdwAAAHsAAAAAAAQANS43LjE4LTE1LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA AUvdRu4= '/*!*/;
      • 设置gtid_next模式
        SET @@SESSION.GTID_NEXT= 'AUTOMATIC’ ;
      • 开始事务
        XA START X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/;
      • 事务内容
        BINLOG ' dn5fWRMp8AAAewAAAABpoBYAAHsAAAAAAAEADGRiX3BsYXlsaXN0MQAOTXVzaWNfUGxheWxpc3QA KQgPCAgICAMDAwgPAwMIAw8IAQMPCAEPAQMIAQEDCAEBDwMIAQQDAwMIDf0C/QL9AgAMdwGAAQT6 ////vwHUVHPT dn5fWRgp8AAABgIAAAZroBYAAHsAAAAAAAEAKf///////////////wCECAAA/tDzPBwAAAAAGQDm jL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAADhtLwxdAQAAAAAAAAAAAABy AAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAA3gv6G10BAAABAAAAgQEAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAHAAAAAAAAAAAAAAAAAAAAAAADhtLwxdAQAAAAAgAGIyYTcxOWE4NjdmZjNkNmJlZWI5 YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAAAACECAAA/tDz PBwAAAAAGQDmjL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAAL/9BR1dAQAA AAAAAAAAAABzAAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAAv/0FHV0BAAABAAAAgQEAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAHEAAAAAAAAAAAAAAAAAAAAAAL/9BR1dAQAAAAAgAGIyYTcxOWE4Njdm ZjNkNmJlZWI5YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAA AKXgKOw= '/*!*/;
      • 结束 & prepare事务
        XA END X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/; XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/;
      • 提交事务
        XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/;
    • 跳过commit这个gtid
    • start slave
      这样,才算是完整的修复了这个主从复制,并且数据保持一致。

    为什么这是5.7的一个bug呢?5.7之前的版本因为relaylog被截断并不会出现这个bug。
    5.7对xa事务的binlog记录方式做了修改,把 xa start,xa end,xa prepare放到一个event里,xa commit又是另外一个event。而在之前的MySQL版本中,整个xa事务从start到commit都是在一个event中,所以其他版本并没有问题。
    最后一个问题:5.7为啥要把xa事务拆成两个event?简单的讲是为了数据安全性。
    5.5或者5.6假设下面一个场景:

    • 主库执行 xa事务直到prepare
    • kill主库。
    • 拉起主库,再xa commit停机前的事务
      主库提交成功,修改的数据也是没问题的,再去从库,会发现,从库的数据并没有变化。WHY?

    MySQL在某个分布式事务prepare成功后宕机,宕机前操作该事务的连接并没有断开(如果在宕机前断开连接,事务会被MySQL自动回滚),这个时候已经prepare的事务并不会被回滚,所以在MySQL重新启动后,引擎层通过recover机制能恢复该事务。当然该事务的Binlog已经在宕机过程中被丢失,这个时候,如果去提交,则会造成主从数据的不一致, 即提交没有记录Binlog,从上丢失该条数据。

    正因为5.7之前版本的xa事务存在这个bug,5.7后做了修复。从XA START到XA PREPARE之间的操作都被记录到了Master的Binlog中,然后通过复制关系传到了Slave上。也就是说5.7开始,MySQL对于XA事务,在prepare的时候就完成了写Binlog的操作,通过新增一种叫 XA_prepare_log_event 的event类型来实现。

    其实 MySQL5.7在xa事务上远不止这个bug,后面再来慢慢总结。

    相关文章

      网友评论

          本文标题:MySQL 5.7复制的一个小bug-XA事务

          本文链接:https://www.haomeiwen.com/subject/kumekxtx.html