美文网首页
Mysql InnoDB Cluster集群部署中的问题汇集

Mysql InnoDB Cluster集群部署中的问题汇集

作者: fymit | 来源:发表于2020-03-03 17:44 被阅读0次
  1. 请保证所有的集群机器在一个子网内,网络必须要通, 不然会失败;
  2. 统一使用hostname进行配置;请更改每台机器的hosts文件;
  3. 报错
ERROR:
Group Replication join failed.
ERROR: Error joining instance to cluster: 'host-192-168-1-101:3306' - Query failed. MySQL Error (3092): The server is not configureroperly to be an active member of the group. Please see more details on error log.. Query: START group_replication
 
通过如下方式进行处理
mysql> install plugin group_replication soname 'group_replication.so'; ##安装插件
mysql> set global group_replication_allow_local_disjoint_gtids_join=ON;
mysql> START GROUP_REPLICATION;
mysql> select * from performance_schema.replication_group_members;
  1. 报错
Dba.getCluster: This function is not available through a session to a standalone instance (RuntimeError)
说明集群中的主节点已经不在该机器上,查询后更改机器重试一下即可;
  1. 报错
Dba.getCluster: Dba.getCluster: Unable to get cluster. The instance 'host-192-168-1-101:3306'
may belong to a different ReplicaSet as the one registered in the Metadata since the value of 'group_replication_group_name'
does not match the one registered in the ReplicaSet's Metadata: possible split-brain scenario. Please connect to another member of the ReplicaSet to get the Cluster. (RuntimeError)
 
最致命的错误,master/slave的数据不一致所致,没办法,只能重新来
mysql-js>dba.dropMetadataSchema();

6) 请保证集群中的数据库表都存在主键,不然会挂掉;

  1. 安装集群监控,保证集群中机器挂掉的时候及时启动,不然所有节点宕机的时候就是灾难到来之时!!! 到时哭都来不及;
  2. 如何重置Innodb cluster集群环境
主节点:
mysql-js>dba.dropMetadataSchema();   登录mysql-shell清空集群
 
mysql> stop group_replication;
mysql> reset master;               (清空日志,确保和从库的表没有冲突奥,)
mysql> reset slave;
  
其他节点(主要清理和主库的主从信息, 确保主库和从库的表没有冲突奥)
mysql> stop group_replication;
mysql> reset master;
mysql> reset slave
  1. 主机名和 /etc/hosts中名字不一致
出现报错:
[Repl] Slave I/O for channel 'group_replication_recovery': error connecting to master 'mysql_innodb_cluster_r0430970923@mysql3:3306' - retry-time: 60 retries: 1, Error_code: MY-002005
[ERROR] [MY-011582] [Repl] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
[ERROR] [MY-011583] [Repl] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Slave I/O for channel group_replication_recovery.'
  1. 主库的日志应用卡在某个位置无法应用到从库
出现报错:
[ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000007' position 151
[ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error executing row event: 'Unknown database 'mysql_innodb_cluster_metadata'', Error_code: MY-001049
  
重建master:
mysql> stop group_replication;
mysql> reset master; 
  1. 报错
[ERROR] Slave SQL for channel 'group_replication_recovery': Could not execute Write_rows event on table mysql_innodb_cluster_metadata.instances;
Cannot add or update a child row: a foreign key constraint fails (mysql_innodb_cluster_metadata.instances, CONSTRAINT instances_ibfk_1 FOREIGN KEY (host_id) REFERENCES hosts (host_id)),
Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log binlog.000001, end_log_pos 3059, Error_code: 1452
  
解决方式:清空表mysql_innodb_cluster_metadata.hosts; 重新建立集群
  1. 报错
This member has more executed transactions than those present in the group
 
解决方式:
mysql> stop group_replication;
mysql> reset master;
  1. 用户操作系统资源的限制
[Warning] Buffered warning: Changed limits: max_open_files: 1024 (requested 5000)
[Warning] Buffered warning: Changed limits: table_open_cache: 431 (requested 2000)
  
解决方式:
# vim /etc/security/limits.conf      #  添加下面内容
mysql soft nproc 2047
mysql hard nproc 16384
mysql soft nofile 1024
mysql hard nofile 65535
  1. 报错
dba.rebootClusterFromCompleteOutage: The active session instance isn't the most updated in comparison with the ONLINE instances of the
Cluster's metadata.
  
在集群没有起来时某些机器的数据表发生变动,导致数据不一致;
解决方式:
所有MySQL机器通过reset master命令清空binlogs
mysql> reset master;
mysql> show master logs;
然后再运行Dba.rebootClusterFromCompleteOutage重启集群。
  1. service mysql restart 无法重启mysql,mysql stuck,并一直输出日志'[Note] Plugin group_replication reported: '[GCS] cli_err 2''
解决方式:唯一停止MySQL的命令为:
#pkill -9 mysqld
  1. 如何将Multi-Primary改为Single-Primary?
a) 解散原来的集群:mysql-js> cluster.dissolve({force: true})
b) 每台主机MySQL修改如下配置:
mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
mysql> set global group_replication_single_primary_mode=ON;
c) 重新创建集群:
mysql-js> var cluster = dba.createCluster('mysqlCluster');
mysql-js> cluster.addInstance('chianyu@svr2:3306');
mysql-js> cluster.addInstance('chianyu@svr3:3306');
  1. 组复制的限制
- 事物锁缺失问题:
- 组复制建议,事物隔离级别,read commit
- 序列化隔离级别:多主模式不支持
- 并发DDL和DML: 多主模式下,不支持 一边对一个表进行DDL,另一边进行更新,这样对于DDL在其他实例上操作有未检出的风险
- 外键级联约束:多主模式下,多级外键依赖对引起多级操作, 因此可能导致未知冲突,建议打开 group_replication_enforce_update_everywhere_checks=ON
- 大事物,超过5秒未提交,会导致组通信失败,
- 多主模式下:select * for update 会导致 死锁。因为这个锁并非全组共享。
- 部分复制不支持:组复制下,设置部分复制,会过滤事物,导致组事物不一致。
- Mysql 8.0.11 group_replication_enforce_update_everywhere_checks=ON 多主模式下不支持。
- 停止复制的情况下,某个节点执行命令后再启动,会因为本地有私有事物,无法加入集群。需要全局 reset master 重新开始集群复制。
  1. 多实例环境不要用 3306端口
多实例环境下,某个实例采用了默认的3306端口,会导致经常性的误操作。
一台主机最多部署10个实例
  
比如:
cluster节点A服务器启用三个端口实例: 3310, 3320, 3330,
cluster节点B服务器启用三个端口实例: 3310, 3320, 3330
cluster节点C服务器启用三个端口实例: 3310, 3320, 3330
实例数据目录分别为: /data/mysql3310,  /data/mysql3320, /data/mysql3330
  
管理节点D服务启动三个端口route端口实例: 3310, 3320, 3330
管理节点E服务启动三个端口route端口实例: 3310, 3320, 3330
实例数据目录分别为: /data/router3310, /data/router3320, /data/router3330
  1. 脑裂场景
    当集群中有部分节点出现UNREACHABLE状态,此时集群无法做出决策,会出现以下局面,此时只剩下一个活跃节点,此节点只能提供查询,无法写入,执行写入操作会hang住。
mysql-js> cl.status();
{
    "clusterName": "myCluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "node-3:3305", 
        "status": "NO_QUORUM", 
        "statusText": "Cluster has no quorum as visible from 'node-1:3305' and cannot process write transactions. 2 members are not active", 
        "topology": {
            "node-1:3305": {
                "address": "node-1:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "node-2:3305": {
                "address": "node-2:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "node-3:3305": {
                "address": "node-3:3305", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "UNREACHABLE"
            }
        }
    }
}

修复这种状态,需要执行forceQuorumUsingPartitionOf指定当前活跃节点(如果是多个则选择primary node),此时活跃节点可以提供读写操作,然后将其他节点加入此集群。

mysql-js> cluster.forceQuorumUsingPartitionOf('root@node-1:3305')
Restoring replicaset 'default' from loss of quorum, by using the partition composed of [nt-metra-1:3305]

Please provide the password for 'root@nt-metra-1:3305': 
Restoring the InnoDB cluster ...

The InnoDB cluster was successfully restored using the partition from the instance 'root@nt-metra-1:3305'.

WARNING: To avoid a split-brain scenario, ensure that all other members of the replicaset are removed or joined back to the group that was restored.

mysql-js> cluster.status();
{
    "clusterName": "myCluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "nt-metra-1:3305", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active", 
        "topology": {
            "nt-metra-1:3305": {
                "address": "nt-metra-1:3305", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "nt-metra-2:3305": {
                "address": "nt-metra-2:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "nt-metra-3:3305": {
                "address": "nt-metra-3:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }
        }
    }
}

# 根据情况使用
cluster.rejoinInstance("root@node-2:3305")
#或者,删除错误节点,重新添加:
cluster.removeInstance('root@node-2:3305');
cluster.addInstance('root@node-2:3305');

相关文章

网友评论

      本文标题:Mysql InnoDB Cluster集群部署中的问题汇集

      本文链接:https://www.haomeiwen.com/subject/gtbhkhtx.html