Mysql InnoDB Cluster集群部署中的问题汇集

作者: fymit | 来源:发表于2020-03-03 17:44 被阅读0次

docker环境，搭建Mysql Innodb Cluster集
Kubernetes MySQL Operator的数据reco
根据官方文档搭建MySQL Innodb Cluster
集群部署
mysql proxy,replication分布式集群高可用方
MySQL集群
搭建MySQL innodb cluster集群模式
ElasticSearch基本介绍（一）
Mysql Cluster搭建
mysql innodb cluster（生产环境部署）

请保证所有的集群机器在一个子网内，网络必须要通, 不然会失败；
统一使用hostname进行配置；请更改每台机器的hosts文件；
报错

ERROR:
Group Replication join failed.
ERROR: Error joining instance to cluster: 'host-192-168-1-101:3306' - Query failed. MySQL Error (3092): The server is not configureroperly to be an active member of the group. Please see more details on error log.. Query: START group_replication
 
通过如下方式进行处理
mysql> install plugin group_replication soname 'group_replication.so'; ##安装插件
mysql> set global group_replication_allow_local_disjoint_gtids_join=ON;
mysql> START GROUP_REPLICATION;
mysql> select * from performance_schema.replication_group_members;

报错

Dba.getCluster: This function is not available through a session to a standalone instance (RuntimeError)
说明集群中的主节点已经不在该机器上，查询后更改机器重试一下即可；

报错

Dba.getCluster: Dba.getCluster: Unable to get cluster. The instance 'host-192-168-1-101:3306'
may belong to a different ReplicaSet as the one registered in the Metadata since the value of 'group_replication_group_name'
does not match the one registered in the ReplicaSet's Metadata: possible split-brain scenario. Please connect to another member of the ReplicaSet to get the Cluster. (RuntimeError)
 
最致命的错误，master/slave的数据不一致所致，没办法，只能重新来
mysql-js>dba.dropMetadataSchema();

6) 请保证集群中的数据库表都存在主键，不然会挂掉；

安装集群监控，保证集群中机器挂掉的时候及时启动，不然所有节点宕机的时候就是灾难到来之时!!! 到时哭都来不及；
如何重置Innodb cluster集群环境

主节点：
mysql-js>dba.dropMetadataSchema();   登录mysql-shell清空集群
 
mysql> stop group_replication;
mysql> reset master;               （清空日志，确保和从库的表没有冲突奥，）
mysql> reset slave;
  
其他节点(主要清理和主库的主从信息， 确保主库和从库的表没有冲突奥)
mysql> stop group_replication;
mysql> reset master;
mysql> reset slave

主机名和 /etc/hosts中名字不一致

出现报错:
[Repl] Slave I/O for channel 'group_replication_recovery': error connecting to master 'mysql_innodb_cluster_r0430970923@mysql3:3306' - retry-time: 60 retries: 1, Error_code: MY-002005
[ERROR] [MY-011582] [Repl] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
[ERROR] [MY-011583] [Repl] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Slave I/O for channel group_replication_recovery.'

主库的日志应用卡在某个位置无法应用到从库

出现报错:
[ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000007' position 151
[ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error executing row event: 'Unknown database 'mysql_innodb_cluster_metadata'', Error_code: MY-001049
  
重建master：
mysql> stop group_replication;
mysql> reset master;

报错

[ERROR] Slave SQL for channel 'group_replication_recovery': Could not execute Write_rows event on table mysql_innodb_cluster_metadata.instances;
Cannot add or update a child row: a foreign key constraint fails (mysql_innodb_cluster_metadata.instances, CONSTRAINT instances_ibfk_1 FOREIGN KEY (host_id) REFERENCES hosts (host_id)),
Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log binlog.000001, end_log_pos 3059, Error_code: 1452
  
解决方式：清空表mysql_innodb_cluster_metadata.hosts; 重新建立集群

报错

This member has more executed transactions than those present in the group
 
解决方式：
mysql> stop group_replication;
mysql> reset master;

用户操作系统资源的限制

[Warning] Buffered warning: Changed limits: max_open_files: 1024 (requested 5000)
[Warning] Buffered warning: Changed limits: table_open_cache: 431 (requested 2000)
  
解决方式：
# vim /etc/security/limits.conf      #  添加下面内容
mysql soft nproc 2047
mysql hard nproc 16384
mysql soft nofile 1024
mysql hard nofile 65535

报错

dba.rebootClusterFromCompleteOutage: The active session instance isn't the most updated in comparison with the ONLINE instances of the
Cluster's metadata.
  
在集群没有起来时某些机器的数据表发生变动，导致数据不一致；
解决方式：
所有MySQL机器通过reset master命令清空binlogs
mysql> reset master;
mysql> show master logs;
然后再运行Dba.rebootClusterFromCompleteOutage重启集群。

service mysql restart 无法重启mysql，mysql stuck，并一直输出日志'[Note] Plugin group_replication reported: '[GCS] cli_err 2''

解决方式：唯一停止MySQL的命令为：
#pkill -9 mysqld

如何将Multi-Primary改为Single-Primary？

a) 解散原来的集群：mysql-js> cluster.dissolve({force: true})
b) 每台主机MySQL修改如下配置：
mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
mysql> set global group_replication_single_primary_mode=ON;
c) 重新创建集群：
mysql-js> var cluster = dba.createCluster('mysqlCluster');
mysql-js> cluster.addInstance('chianyu@svr2:3306');
mysql-js> cluster.addInstance('chianyu@svr3:3306');

组复制的限制

- 事物锁缺失问题：
- 组复制建议，事物隔离级别，read commit
- 序列化隔离级别：多主模式不支持
- 并发DDL和DML: 多主模式下，不支持 一边对一个表进行DDL,另一边进行更新，这样对于DDL在其他实例上操作有未检出的风险
- 外键级联约束：多主模式下，多级外键依赖对引起多级操作， 因此可能导致未知冲突，建议打开 group_replication_enforce_update_everywhere_checks=ON
- 大事物，超过5秒未提交，会导致组通信失败，
- 多主模式下：select * for update 会导致 死锁。因为这个锁并非全组共享。
- 部分复制不支持：组复制下，设置部分复制，会过滤事物，导致组事物不一致。
- Mysql 8.0.11 group_replication_enforce_update_everywhere_checks=ON 多主模式下不支持。
- 停止复制的情况下，某个节点执行命令后再启动，会因为本地有私有事物，无法加入集群。需要全局 reset master 重新开始集群复制。

多实例环境不要用 3306端口

多实例环境下，某个实例采用了默认的3306端口，会导致经常性的误操作。
一台主机最多部署10个实例
  
比如:
cluster节点A服务器启用三个端口实例: 3310, 3320, 3330,
cluster节点B服务器启用三个端口实例: 3310, 3320, 3330
cluster节点C服务器启用三个端口实例: 3310, 3320, 3330
实例数据目录分别为: /data/mysql3310,  /data/mysql3320, /data/mysql3330
  
管理节点D服务启动三个端口route端口实例: 3310, 3320, 3330
管理节点E服务启动三个端口route端口实例: 3310, 3320, 3330
实例数据目录分别为: /data/router3310, /data/router3320, /data/router3330

脑裂场景
当集群中有部分节点出现UNREACHABLE状态，此时集群无法做出决策，会出现以下局面，此时只剩下一个活跃节点，此节点只能提供查询，无法写入，执行写入操作会hang住。

mysql-js> cl.status();
{
    "clusterName": "myCluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "node-3:3305", 
        "status": "NO_QUORUM", 
        "statusText": "Cluster has no quorum as visible from 'node-1:3305' and cannot process write transactions. 2 members are not active", 
        "topology": {
            "node-1:3305": {
                "address": "node-1:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "node-2:3305": {
                "address": "node-2:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "node-3:3305": {
                "address": "node-3:3305", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "UNREACHABLE"
            }
        }
    }
}

修复这种状态，需要执行forceQuorumUsingPartitionOf指定当前活跃节点(如果是多个则选择primary node)，此时活跃节点可以提供读写操作，然后将其他节点加入此集群。

mysql-js> cluster.forceQuorumUsingPartitionOf('root@node-1:3305')
Restoring replicaset 'default' from loss of quorum, by using the partition composed of [nt-metra-1:3305]

Please provide the password for 'root@nt-metra-1:3305': 
Restoring the InnoDB cluster ...

The InnoDB cluster was successfully restored using the partition from the instance 'root@nt-metra-1:3305'.

WARNING: To avoid a split-brain scenario, ensure that all other members of the replicaset are removed or joined back to the group that was restored.

mysql-js> cluster.status();
{
    "clusterName": "myCluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "nt-metra-1:3305", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active", 
        "topology": {
            "nt-metra-1:3305": {
                "address": "nt-metra-1:3305", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE"
            }, 
            "nt-metra-2:3305": {
                "address": "nt-metra-2:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }, 
            "nt-metra-3:3305": {
                "address": "nt-metra-3:3305", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)"
            }
        }
    }
}

# 根据情况使用
cluster.rejoinInstance("root@node-2:3305")
#或者，删除错误节点，重新添加：
cluster.removeInstance('root@node-2:3305');
cluster.addInstance('root@node-2:3305');

docker环境，搭建Mysql Innodb Cluster集
本文分享如何在docker环境，搭建Mysql Innodb Cluster集群。 MySQL InnoDB Cl...
Kubernetes MySQL Operator的数据reco
如果MySQL使用operator进行部署时，数据会保存在pv中。【MySql InnoDB Cluster的搭建...
根据官方文档搭建MySQL Innodb Cluster
MySQL Innodb Cluster搭建本次测试使用三台机器搭建mysql innodb cluster p...
集群部署
MariaDB Galera Cluster 集群部署
mysql proxy,replication分布式集群高可用方
分布式集群做MySQL集群，例如：利用mysql cluster ，mysql proxy，mysql repl...
MySQL集群
MyAQL常用的集群: mysql cluster,nbd,MHA,Mariadb Galera cluster,...
搭建MySQL innodb cluster集群模式
机器信息机器角色192.168.124.132master192.168.124.130slave192.168...
ElasticSearch基本介绍（一）
引言引自：Elasticsearch集群部署一、ES中的基本概念 cluster 代表一个集群，集群中有多个节...
Mysql Cluster搭建
Mysql Cluster简介 MySQL Cluster 是一种技术，该技术允许在无共享的系统中部署“内存中”数...
mysql innodb cluster（生产环境部署）
https://dev.mysql.com/doc/refman/5.7/en/mysql-innodb-clus...