Checking cluster status
JS > var cluster = dba.getCluster()
JS > cluster.status()
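The object returned by cluster.status() can be long, so it sometimes helps to reduce it to the overall state plus the members that are not ONLINE. The following is a minimal sketch in plain JavaScript. It assumes the usual layout of the status object (defaultReplicaSet.status and defaultReplicaSet.topology); the helper name summarizeStatus, the cluster name, and the sample data are made up for illustration and were not captured from a real cluster.

```javascript
// Reduce a cluster.status() result to the overall state and the
// members whose status is anything other than ONLINE.
// Assumption: the object has the defaultReplicaSet.{status, topology} layout.
function summarizeStatus(status) {
  var rs = status.defaultReplicaSet;
  var unhealthy = [];
  Object.keys(rs.topology).forEach(function (name) {
    var member = rs.topology[name];
    if (member.status !== "ONLINE") {
      unhealthy.push(name + ": " + member.status);
    }
  });
  return { overall: rs.status, unhealthy: unhealthy };
}

// Hand-written sample in the shape of cluster.status() (hypothetical values).
var sample = {
  clusterName: "myCluster",
  defaultReplicaSet: {
    status: "OK_NO_TOLERANCE",
    topology: {
      "172.16.22.1:3306": { mode: "R/W", status: "ONLINE" },
      "172.16.22.2:3306": { mode: "R/O", status: "ONLINE" },
      "172.16.22.3:3306": { mode: "R/O", status: "(MISSING)" }
    }
  }
};

var summary = summarizeStatus(sample);
```

In a real session the argument would be the result of cluster.status() instead of the sample object.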
Cluster recovery operations
1. Group Replication has stopped on one node of the MGR cluster. How do I rejoin it to the cluster?
Command:
JS > cluster.rejoinInstance("root@172.16.22.3:3306")
Explanation: this sets a number of parameters, persists them to mysqld-auto.cnf, and then starts Group Replication (the following is an excerpt from the general log):
2020-03-27T03:33:06.039327-00:00 72 Query show GLOBAL variables where `variable_name` in ('persisted_globals_load')
2020-03-27T03:33:06.044153-00:00 72 Query SET PERSIST `super_read_only` = 'ON'
2020-03-27T03:33:06.044901-00:00 72 Query SET PERSIST `group_replication_group_name` = 'f657ae0e-1111-11e8-8bd0-0242ac222222'
2020-03-27T03:33:06.045400-00:00 72 Query SET PERSIST `group_replication_single_primary_mode` = 'ON'
2020-03-27T03:33:06.045897-00:00 72 Query SET PERSIST `group_replication_enforce_update_everywhere_checks` = 'OFF'
2020-03-27T03:33:06.046393-00:00 72 Query SET PERSIST `group_replication_recovery_get_public_key` = 'ON'
2020-03-27T03:33:06.046854-00:00 72 Query SET PERSIST `group_replication_recovery_use_ssl` = 'OFF'
2020-03-27T03:33:06.047299-00:00 72 Query SET PERSIST `group_replication_ssl_mode` = 'DISABLED'
2020-03-27T03:33:06.047767-00:00 72 Query SET PERSIST `group_replication_local_address` = '172.16.22.3:33061'
2020-03-27T03:33:06.051837-00:00 72 Query SET PERSIST `auto_increment_increment` = 1
2020-03-27T03:33:06.052260-00:00 72 Query SET PERSIST `auto_increment_offset` = 2
2020-03-27T03:33:06.052715-00:00 72 Query START GROUP_REPLICATION
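2. One node of the MGR cluster failed and its data was rebuilt. How do I rejoin it to the cluster?
Because some settings have changed at this point (for example, server_uuid is different), rejoinInstance fails. Remove the node from the cluster first, then add it again: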
JS > cluster.removeInstance("root@172.16.22.3:3306", {force: true})
JS > cluster.rescan()
JS > cluster.addInstance("root@172.16.22.3:3306")
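3. A majority of the nodes in the MGR cluster are down and the cluster has lost quorum. How do I recover?
For example, in a 3-node MGR cluster where 2 nodes are down:
- Writes on the remaining node hang;
- The cluster topology cannot be changed: for example, even after a failed node is repaired, it cannot join the cluster with start group_replication;
- The read-write and read-only ports of mysqlrouter cannot be connected to.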
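The quorum arithmetic behind this scenario can be written out as a small sketch. A group needs a majority of its members, floor(n / 2) + 1, to be reachable. The function names below are hypothetical, and the calculation applies to members that fail unexpectedly (a member that leaves gracefully shrinks the group size instead).

```javascript
// Number of reachable members required for a group of the given size.
function majority(groupSize) {
  return Math.floor(groupSize / 2) + 1;
}

// True when enough members are still reachable to form a majority.
function hasQuorum(groupSize, reachable) {
  return reachable >= majority(groupSize);
}

var needed = majority(3);              // 2 members are required in a 3-node group
var oneNodeLeft = hasQuorum(3, 1);     // false: the case described above
var twoNodesLeft = hasQuorum(3, 2);    // true: losing a single node keeps quorum
```

This is why a 3-node cluster tolerates one unexpected failure but not two.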
On the surviving node, use cluster.forceQuorumUsingPartitionOf() to restore the cluster:
JS > cluster.forceQuorumUsingPartitionOf("root@172.16.22.1:3306")
Then use cluster.rejoinInstance() to rejoin the other nodes to the cluster:
JS > cluster.rejoinInstance("root@172.16.22.2:3306")
JS > cluster.rejoinInstance("root@172.16.22.3:3306")
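4. The MGR cluster was shut down completely (complete outage). How do I recover?
A simple example: every node left the cluster through stop group_replication, or the mysqld process on every node was shut down cleanly. Now that all the MySQL processes have been restarted, start group_replication cannot bring the cluster back up.
Log in to any one of the nodes with MySQL Shell: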
mysqlsh root@172.16.22.1:3306 --log-level=DEBUG3
Run dba.rebootClusterFromCompleteOutage():
JS > dba.rebootClusterFromCompleteOutage()
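When MySQL Shell runs in interactive mode, it checks which other instances belong to the cluster and asks whether each discovered instance should be rejoined to the cluster that is about to be rebooted: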
The instance '172.16.22.2:3306' was part of the cluster configuration.
Would you like to rejoin it to the cluster? [y/N]: y
The instance '172.16.22.3:3306' was part of the cluster configuration.
Would you like to rejoin it to the cluster? [y/N]: y
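It then checks which of these instances has the most recent data, judged by their GTID sets. If the instance you are currently connected to is not the most up to date, it reports an error. In this case the message names 172.16.22.2:3306 as the most up-to-date instance: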
Dba.rebootClusterFromCompleteOutage: The active session instance isn't the most updated
in comparison with the ONLINE instances of the Cluster's metadata.
Please use the most up to date instance: '172.16.22.2:3306'. (RuntimeError)
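To make the idea of "most up to date by GTID set" concrete, here is a simplified sketch that counts the transactions in each GTID set and picks the instance with the most. This is an illustration only and is not how MySQL Shell implements the check: counting is a shortcut that assumes one set contains the others, whereas a rigorous comparison tests set containment. The function names and the GTID values are made up, and tagged GTIDs are not handled.

```javascript
// Count the transactions in a GTID set such as "uuid:1-100:105,uuid2:1-7".
function countTransactions(gtidSet) {
  var total = 0;
  gtidSet.split(",").forEach(function (part) {
    var fields = part.trim().split(":");
    // fields[0] is the source UUID; every following field is an interval.
    fields.slice(1).forEach(function (interval) {
      var bounds = interval.split("-").map(Number);
      var first = bounds[0];
      var last = bounds.length > 1 ? bounds[1] : first;
      total += last - first + 1;
    });
  });
  return total;
}

// Return the address whose GTID set holds the most transactions.
function pickMostRecent(executedByInstance) {
  var best = null;
  var bestCount = -1;
  Object.keys(executedByInstance).forEach(function (address) {
    var count = countTransactions(executedByInstance[address]);
    if (count > bestCount) {
      best = address;
      bestCount = count;
    }
  });
  return best;
}

// Hypothetical gtid_executed values for the three instances.
var executed = {
  "172.16.22.1:3306": "f657ae0e-1111-11e8-8bd0-0242ac222222:1-100",
  "172.16.22.2:3306": "f657ae0e-1111-11e8-8bd0-0242ac222222:1-120",
  "172.16.22.3:3306": "f657ae0e-1111-11e8-8bd0-0242ac222222:1-90:95"
};

var mostRecent = pickMostRecent(executed);
```

With these sample values the second instance holds the largest set, which matches the instance named in the error message above.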
Next, connect to the instance named in the message and reboot the cluster:
JS > \connect "root@172.16.22.2:3306"
JS > dba.rebootClusterFromCompleteOutage()
The general log on 172.16.22.2:3306 shows that the reboot works by bootstrapping the group again on 172.16.22.2:3306:
SET GLOBAL `super_read_only` = 'OFF';
SET GLOBAL `group_replication_bootstrap_group` = 'ON';
START GROUP_REPLICATION;
SET GLOBAL `group_replication_bootstrap_group` = 'OFF';
SET GLOBAL read_only= 0;
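5. Dissolving the cluster
Dissolving the cluster removes all cluster metadata and configuration from every node in the cluster and stops Group Replication. It does not delete data.
The commands are: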
JS > var cluster = dba.getCluster()
JS > cluster.dissolve()
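Notes:
- Run it while connected to the primary node;
- Only nodes that are in a healthy state within the cluster can be reconfigured. If any node is in an abnormal state (for example, Group Replication has been stopped or the instance is down), it is strongly recommended to rejoin it to the cluster before dissolving.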