美文网首页
Mysql InnoDB Cluster集群部署中的问题汇集

Mysql InnoDB Cluster集群部署中的问题汇集

作者: fymit | 来源:发表于2020-03-03 17:44 被阅读0次
    1. 请保证所有的集群机器在一个子网内,网络必须要通, 不然会失败;
    2. 统一使用hostname进行配置;请更改每台机器的hosts文件;
    3. 报错
    ERROR:
    Group Replication join failed.
    ERROR: Error joining instance to cluster: 'host-192-168-1-101:3306' - Query failed. MySQL Error (3092): The server is not configureroperly to be an active member of the group. Please see more details on error log.. Query: START group_replication
     
    通过如下方式进行处理
    mysql> install plugin group_replication soname 'group_replication.so'; ##安装插件
    mysql> set global group_replication_allow_local_disjoint_gtids_join=ON;
    mysql> START GROUP_REPLICATION;
    mysql> select * from performance_schema.replication_group_members;
    
    1. 报错
    Dba.getCluster: This function is not available through a session to a standalone instance (RuntimeError)
    说明集群中的主节点已经不在该机器上,查询后更改机器重试一下即可;
    
    1. 报错
    Dba.getCluster: Dba.getCluster: Unable to get cluster. The instance 'host-192-168-1-101:3306'
    may belong to a different ReplicaSet as the one registered in the Metadata since the value of 'group_replication_group_name'
    does not match the one registered in the ReplicaSet's Metadata: possible split-brain scenario. Please connect to another member of the ReplicaSet to get the Cluster. (RuntimeError)
     
    最致命的错误,master/slave的数据不一致所致,没办法,只能重新来
    mysql-js>dba.dropMetadataSchema();
    

    6) 请保证集群中的数据库表都存在主键,不然会挂掉;

    1. 安装集群监控,保证集群中机器挂掉的时候及时启动,不然所有节点宕机的时候就是灾难到来之时!!! 到时哭都来不及;
    2. 如何重置Innodb cluster集群环境
    主节点:
    mysql-js>dba.dropMetadataSchema();   登录mysql-shell清空集群
     
    mysql> stop group_replication;
    mysql> reset master;               (清空日志,确保和从库的表没有冲突奥,)
    mysql> reset slave;
      
    其他节点(主要清理和主库的主从信息, 确保主库和从库的表没有冲突奥)
    mysql> stop group_replication;
    mysql> reset master;
    mysql> reset slave
    
    1. 主机名和 /etc/hosts中名字不一致
    出现报错:
    [Repl] Slave I/O for channel 'group_replication_recovery': error connecting to master 'mysql_innodb_cluster_r0430970923@mysql3:3306' - retry-time: 60 retries: 1, Error_code: MY-002005
    [ERROR] [MY-011582] [Repl] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
    [ERROR] [MY-011583] [Repl] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Slave I/O for channel group_replication_recovery.'
    
    1. 主库的日志应用卡在某个位置无法应用到从库
    出现报错:
    [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000007' position 151
    [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error executing row event: 'Unknown database 'mysql_innodb_cluster_metadata'', Error_code: MY-001049
      
    重建master:
    mysql> stop group_replication;
    mysql> reset master; 
    
    1. 报错
    [ERROR] Slave SQL for channel 'group_replication_recovery': Could not execute Write_rows event on table mysql_innodb_cluster_metadata.instances;
    Cannot add or update a child row: a foreign key constraint fails (mysql_innodb_cluster_metadata.instances, CONSTRAINT instances_ibfk_1 FOREIGN KEY (host_id) REFERENCES hosts (host_id)),
    Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log binlog.000001, end_log_pos 3059, Error_code: 1452
      
    解决方式:清空表mysql_innodb_cluster_metadata.hosts; 重新建立集群
    
    1. 报错
    This member has more executed transactions than those present in the group
     
    解决方式:
    mysql> stop group_replication;
    mysql> reset master;
    
    1. 用户操作系统资源的限制
    [Warning] Buffered warning: Changed limits: max_open_files: 1024 (requested 5000)
    [Warning] Buffered warning: Changed limits: table_open_cache: 431 (requested 2000)
      
    解决方式:
    # vim /etc/security/limits.conf      #  添加下面内容
    mysql soft nproc 2047
    mysql hard nproc 16384
    mysql soft nofile 1024
    mysql hard nofile 65535
    
    1. 报错
    dba.rebootClusterFromCompleteOutage: The active session instance isn't the most updated in comparison with the ONLINE instances of the
    Cluster's metadata.
      
    在集群没有起来时某些机器的数据表发生变动,导致数据不一致;
    解决方式:
    所有MySQL机器通过reset master命令清空binlogs
    mysql> reset master;
    mysql> show master logs;
    然后再运行Dba.rebootClusterFromCompleteOutage重启集群。
    
    1. service mysql restart 无法重启mysql,mysql stuck,并一直输出日志'[Note] Plugin group_replication reported: '[GCS] cli_err 2''
    解决方式:唯一停止MySQL的命令为:
    #pkill -9 mysqld
    
    1. 如何将Multi-Primary改为Single-Primary?
    a) 解散原来的集群:mysql-js> cluster.dissolve({force: true})
    b) 每台主机MySQL修改如下配置:
    mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
    mysql> set global group_replication_single_primary_mode=ON;
    c) 重新创建集群:
    mysql-js> var cluster = dba.createCluster('mysqlCluster');
    mysql-js> cluster.addInstance('chianyu@svr2:3306');
    mysql-js> cluster.addInstance('chianyu@svr3:3306');
    
    1. 组复制的限制
    - 事物锁缺失问题:
    - 组复制建议,事物隔离级别,read commit
    - 序列化隔离级别:多主模式不支持
    - 并发DDL和DML: 多主模式下,不支持 一边对一个表进行DDL,另一边进行更新,这样对于DDL在其他实例上操作有未检出的风险
    - 外键级联约束:多主模式下,多级外键依赖对引起多级操作, 因此可能导致未知冲突,建议打开 group_replication_enforce_update_everywhere_checks=ON
    - 大事物,超过5秒未提交,会导致组通信失败,
    - 多主模式下:select * for update 会导致 死锁。因为这个锁并非全组共享。
    - 部分复制不支持:组复制下,设置部分复制,会过滤事物,导致组事物不一致。
    - Mysql 8.0.11 group_replication_enforce_update_everywhere_checks=ON 多主模式下不支持。
    - 停止复制的情况下,某个节点执行命令后再启动,会因为本地有私有事物,无法加入集群。需要全局 reset master 重新开始集群复制。
    
    1. 多实例环境不要用 3306端口
    多实例环境下,某个实例采用了默认的3306端口,会导致经常性的误操作。
    一台主机最多部署10个实例
      
    比如:
    cluster节点A服务器启用三个端口实例: 3310, 3320, 3330,
    cluster节点B服务器启用三个端口实例: 3310, 3320, 3330
    cluster节点C服务器启用三个端口实例: 3310, 3320, 3330
    实例数据目录分别为: /data/mysql3310,  /data/mysql3320, /data/mysql3330
      
    管理节点D服务启动三个端口route端口实例: 3310, 3320, 3330
    管理节点E服务启动三个端口route端口实例: 3310, 3320, 3330
    实例数据目录分别为: /data/router3310, /data/router3320, /data/router3330
    
    1. 脑裂场景
      当集群中有部分节点出现UNREACHABLE状态,此时集群无法做出决策,会出现以下局面,此时只剩下一个活跃节点,此节点只能提供查询,无法写入,执行写入操作会hang住。
    mysql-js> cl.status();
    {
        "clusterName": "myCluster", 
        "defaultReplicaSet": {
            "name": "default", 
            "primary": "node-3:3305", 
            "status": "NO_QUORUM", 
            "statusText": "Cluster has no quorum as visible from 'node-1:3305' and cannot process write transactions. 2 members are not active", 
            "topology": {
                "node-1:3305": {
                    "address": "node-1:3305", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "ONLINE"
                }, 
                "node-2:3305": {
                    "address": "node-2:3305", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "(MISSING)"
                }, 
                "node-3:3305": {
                    "address": "node-3:3305", 
                    "mode": "R/W", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "UNREACHABLE"
                }
            }
        }
    }
    

    修复这种状态,需要执行forceQuorumUsingPartitionOf指定当前活跃节点(如果是多个则选择primary node),此时活跃节点可以提供读写操作,然后将其他节点加入此集群。

    mysql-js> cluster.forceQuorumUsingPartitionOf('root@node-1:3305')
    Restoring replicaset 'default' from loss of quorum, by using the partition composed of [nt-metra-1:3305]
    
    Please provide the password for 'root@nt-metra-1:3305': 
    Restoring the InnoDB cluster ...
    
    The InnoDB cluster was successfully restored using the partition from the instance 'root@nt-metra-1:3305'.
    
    WARNING: To avoid a split-brain scenario, ensure that all other members of the replicaset are removed or joined back to the group that was restored.
    
    mysql-js> cluster.status();
    {
        "clusterName": "myCluster", 
        "defaultReplicaSet": {
            "name": "default", 
            "primary": "nt-metra-1:3305", 
            "status": "OK_NO_TOLERANCE", 
            "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active", 
            "topology": {
                "nt-metra-1:3305": {
                    "address": "nt-metra-1:3305", 
                    "mode": "R/W", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "ONLINE"
                }, 
                "nt-metra-2:3305": {
                    "address": "nt-metra-2:3305", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "(MISSING)"
                }, 
                "nt-metra-3:3305": {
                    "address": "nt-metra-3:3305", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "(MISSING)"
                }
            }
        }
    }
    
    # 根据情况使用
    cluster.rejoinInstance("root@node-2:3305")
    #或者,删除错误节点,重新添加:
    cluster.removeInstance('root@node-2:3305');
    cluster.addInstance('root@node-2:3305');
    

    相关文章

      网友评论

          本文标题:Mysql InnoDB Cluster集群部署中的问题汇集

          本文链接:https://www.haomeiwen.com/subject/gtbhkhtx.html