Redis Lesson 22: Cluster Failover

Author: 小超_8b2f | Published: 2019-05-31 21:28

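    I. Failure Detection

    Nodes detect failures by exchanging ping/pong messages, so no Sentinel is required. Ping/pong messages carry not only node slot information (see the earlier lessons) but also master/replica state and node failure state.

    1. Subjective offline

    Definition: a single node considers another node unavailable. This is only that one node's "opinion".

    [Figure: subjective offline flow]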
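
The subjective offline check can be sketched as a simple timeout comparison. This is a simplified illustration, not Redis's actual implementation: the function name is hypothetical, and it assumes the deciding value is cluster-node-timeout (15000 ms by default in redis.conf).

```python
CLUSTER_NODE_TIMEOUT_MS = 15_000  # assumed: the redis.conf default for cluster-node-timeout


def is_subjectively_down(last_pong_ms: int, now_ms: int,
                         node_timeout_ms: int = CLUSTER_NODE_TIMEOUT_MS) -> bool:
    """Hypothetical helper: True when no pong has arrived within node_timeout_ms."""
    return now_ms - last_pong_ms > node_timeout_ms


now = 100_000
print(is_subjectively_down(90_000, now))  # last pong 10 s ago
print(is_subjectively_down(80_000, now))  # last pong 20 s ago
```

Only the second call exceeds the timeout, so only that peer would be flagged by this node.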
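    2. Objective offline

    Definition: more than half of the slot-holding masters have marked a node as subjectively offline.

    [Figure: objective offline logic flow; attempting objective offline]

    Once a node is marked objectively offline:
    • Notify every node in the cluster to mark the failed node as objectively offline
    • Notify the failed node's replicas to trigger the failover process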
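
The "more than half" rule can be sketched as a quorum check. The function names below are hypothetical, and the sketch leaves out details such as failure reports expiring over time.

```python
def needed_quorum(slot_masters: int) -> int:
    """Smallest number of reports that is more than half of the slot-holding masters."""
    return slot_masters // 2 + 1


def is_objectively_down(failure_reports: int, slot_masters: int) -> bool:
    """Hypothetical helper: True when enough slot-holding masters report the node as down."""
    return failure_reports >= needed_quorum(slot_masters)


# A three-master cluster like the one in the drill needs two reports.
print(needed_quorum(3))
print(is_objectively_down(1, 3), is_objectively_down(2, 3))
```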

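    II. Failure Recovery

    1. Eligibility check
    • Each replica checks how long it has been disconnected from the failed master
    • A replica disconnected for longer than (cluster-node-timeout * cluster-slave-validity-factor) is disqualified
    • cluster-slave-validity-factor defaults to 10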
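
The eligibility rule above can be sketched as follows. The function name is hypothetical, and the defaults assume cluster-node-timeout is 15000 ms and cluster-slave-validity-factor is 10. A factor of 0 is treated as "always eligible", which is how redis.conf documents that setting.

```python
def is_eligible(disconnect_ms: int,
                node_timeout_ms: int = 15_000,
                validity_factor: int = 10) -> bool:
    """Hypothetical helper: may this replica take part in the failover?"""
    if validity_factor == 0:
        return True  # a factor of 0 disables the check
    return disconnect_ms <= node_timeout_ms * validity_factor


# With the defaults the limit is 15 s * 10 = 150 s.
print(is_eligible(100_000))  # disconnected for 100 s
print(is_eligible(200_000))  # disconnected for 200 s
```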
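    2. Preparing the election time
    [Figure: preparing the election time]

      The replica with the larger replication offset gets a shorter delay before its election time, which favors data consistency. A larger offset means the replica holds more of the master's data and is the better candidate for the new master, so it is given a shorter delay, reaches its election time first, starts its election first, and is therefore more likely to collect the votes.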
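
A minimal sketch of this delay, assuming the formula given in the Redis Cluster specification (500 ms fixed + a random 0 to 500 ms + rank * 1000 ms). The function names are hypothetical. A replica's rank is the number of sibling replicas with a larger offset, so the most up-to-date replica has rank 0.

```python
import random
from typing import Dict, Optional


def replica_ranks(offsets: Dict[str, int]) -> Dict[str, int]:
    """Rank = how many replicas have a strictly larger replication offset."""
    return {name: sum(1 for other in offsets.values() if other > off)
            for name, off in offsets.items()}


def election_delay_ms(rank: int, rng: Optional[random.Random] = None) -> int:
    """Delay before a replica starts its election (assumed formula, see lead-in)."""
    rng = rng or random.Random()
    return 500 + rng.randrange(500) + rank * 1000


# Hypothetical offsets for two replicas of the same failed master.
ranks = replica_ranks({"replica-a": 249926, "replica-b": 249000})
for name, rank in ranks.items():
    print(name, rank, election_delay_ms(rank))
```

Under this formula a rank 0 replica always waits less than one second, which matches the "delayed for 925 milliseconds (rank #0, offset 249926)" line in the 7005 log later in this lesson.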

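    3. Election voting
    [Figure: election voting]

    4. Replacing the master

    1) The winning replica stops replicating and becomes a master (the equivalent of slaveof no one)
    2) It calls clusterDelSlot to revoke the slots the failed master was responsible for, then clusterAddSlot to assign those slots to itself
    3) It broadcasts its own pong message to the cluster, announcing that it has replaced the failed master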
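
The slot takeover in the replacement steps can be illustrated with a toy slot table. This is only a sketch: `promote_replica` and the dictionary layout are hypothetical, and the real clusterDelSlot and clusterAddSlot are internal C functions in Redis, not something callable from Python. The slot ranges mirror the drill in the next section.

```python
from typing import Dict, List


def promote_replica(slot_owner: Dict[int, str],
                    failed_master: str, new_master: str) -> List[int]:
    """Hypothetical helper: move every slot owned by failed_master to new_master."""
    moved = [slot for slot, owner in slot_owner.items() if owner == failed_master]
    for slot in moved:
        del slot_owner[slot]           # plays the role of clusterDelSlot
        slot_owner[slot] = new_master  # plays the role of clusterAddSlot
    return moved


# Toy slot table: three masters covering all 16384 slots.
slot_owner = {s: "7000" for s in range(0, 5461)}
slot_owner.update({s: "7001" for s in range(5461, 10923)})
slot_owner.update({s: "7002" for s in range(10923, 16384)})

moved = promote_replica(slot_owner, failed_master="7002", new_master="7005")
print(len(moved), slot_owner[10923], slot_owner[16383])
```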

    III. Hands-on Failover Drill

    [Figure: failover drill walkthrough]

    1) Kill a master node

    # Query cluster node information
    $ redis-cli --cluster info localhost:7000
    localhost:7000 (1ac9fbbf...) -> 2 keys | 5461 slots | 1 slaves.
    127.0.0.1:7001 (a3c0d3b4...) -> 2 keys | 5462 slots | 1 slaves.
    127.0.0.1:7002 (a89a427b...) -> 1 keys | 5461 slots | 1 slaves. # the master about to be killed
    
    # Look up the node's process ID
    $ redis-cli -p 7002 info Server | grep process_id
    process_id:4386
    # After the kill, a client program querying in a loop throws exceptions, then recovers on its own after a while
    $ kill 4386
    
    $ redis-cli --cluster info localhost:7000
    Could not connect to Redis at 127.0.0.1:7002: Connection refused
    localhost:7000 (1ac9fbbf...) -> 2 keys | 5461 slots | 1 slaves.
    127.0.0.1:7005 (09792d31...) -> 1 keys | 5461 slots | 0 slaves. # the new master
    127.0.0.1:7001 (a3c0d3b4...) -> 2 keys | 5462 slots | 1 slaves.
    
    $ redis-cli -p 7000 cluster slots
    1) 1) (integer) 10923
       2) (integer) 16383
       3) 1) "127.0.0.1"
          2) (integer) 7005
          3) "09792d31e728ad714a5a90bc7639f277d817fb4e"
    2) 1) (integer) 5461
       2) (integer) 10922
       3) 1) "127.0.0.1"
          2) (integer) 7001
          3) "a3c0d3b42da023dc402faf439d4f93a1cb44d402"
       4) 1) "127.0.0.1"
          2) (integer) 7004
          3) "5a4f085dee8400093f45ce2cfa42cbd206167f73"
    3) 1) (integer) 0
       2) (integer) 5460
       3) 1) "127.0.0.1"
          2) (integer) 7000
          3) "1ac9fbbfe11362e151204132e3d110b18139a1d9"
       4) 1) "127.0.0.1"
          2) (integer) 7003
          3) "2d19dda2a8a790d5636a664fe3ed54aa3dd7677c"
    
    2) Log of the newly promoted master: redis-cluster-7005.log (formerly the replica of the killed master)
    4394:S 31 May 2019 12:09:53.401 # Connection with master lost.
    4394:S 31 May 2019 12:09:53.404 * Caching the disconnected master state.
    4394:S 31 May 2019 12:09:53.971 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:09:53.972 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:09:53.973 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:09:54.987 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:09:54.988 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:09:54.989 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:09:56.000 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:09:56.001 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:09:56.002 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:09:57.010 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:09:57.011 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:09:57.012 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:09:58.025 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:09:58.026 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:09:58.027 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:09:59.038 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:09:59.039 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:09:59.040 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:00.051 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:00.051 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:00.053 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:01.063 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:01.064 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:01.065 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:02.076 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:02.077 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:02.078 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:03.089 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:03.090 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:03.091 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:04.099 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:04.100 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:04.101 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:05.111 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:05.111 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:05.112 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:06.121 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:06.121 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:06.122 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:07.135 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:07.136 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:07.137 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:08.149 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:08.149 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:08.150 # Error condition on socket for SYNC: Connection refused
    4394:S 31 May 2019 12:10:09.157 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:09.158 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:09.159 # Error condition on socket for SYNC: Connection refused
    # FAIL message received from 7001: 7002 has been marked objectively offline
    4394:S 31 May 2019 12:10:09.532 * FAIL message received from a3c0d3b42da023dc402faf439d4f93a1cb44d402 about a89a427b5fe8b2b0ef07ac8c6252dc3c8efa1f77
    4394:S 31 May 2019 12:10:09.565 # Start of election delayed for 925 milliseconds (rank #0, offset 249926).
    4394:S 31 May 2019 12:10:10.173 * Connecting to MASTER 127.0.0.1:7002
    4394:S 31 May 2019 12:10:10.173 * MASTER <-> REPLICA sync started
    4394:S 31 May 2019 12:10:10.174 # Error condition on socket for SYNC: Connection refused
    # Start a new election
    4394:S 31 May 2019 12:10:10.578 # Starting a failover election for epoch 13.
    # Election won: this node is the new master
    4394:S 31 May 2019 12:10:10.591 # Failover election won: I'm the new master.
    4394:S 31 May 2019 12:10:10.591 # configEpoch set to 13 after successful failover
    4394:M 31 May 2019 12:10:10.592 # Setting secondary replication ID to 27803313625ab7581c806b2a8343d1aff567354b, valid up to offset: 249927. New replication ID is 7083e19600c686aece101102f81bede77a55e6dc
    4394:M 31 May 2019 12:10:10.593 * Discarding previously cached master state.
    

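    Failure recovery time = subjective offline time + objective offline time + election time

      In this drill that comes to a little under 20 seconds. If you cannot tolerate a delay of that length, you can lower cluster-node-timeout. That parameter also controls how often messages are exchanged between nodes, so lowering it may increase bandwidth usage. In practice its value is chosen by weighing these factors against your actual workload.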
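
As a worked check of the "under 20 seconds" figure, the sketch below takes three timestamps from the 7005 log above: when the connection to the master was lost, when the FAIL message arrived, and when the election was won. From the replica's point of view the log does not separate subjective from objective offline time, so both are lumped into one "detection" interval here.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps copied from redis-cluster-7005.log (31 May 2019).
master_lost = datetime.strptime("12:09:53.401", FMT)    # Connection with master lost
fail_received = datetime.strptime("12:10:09.532", FMT)  # FAIL message received
election_won = datetime.strptime("12:10:10.591", FMT)   # Failover election won

detection_s = (fail_received - master_lost).total_seconds()
election_s = (election_won - fail_received).total_seconds()
total_s = (election_won - master_lost).total_seconds()

print(f"detection: {detection_s:.3f} s")  # 16.131 s
print(f"election:  {election_s:.3f} s")   # 1.059 s
print(f"total:     {total_s:.3f} s")      # 17.190 s
```

Detection dominates the total, which is why cluster-node-timeout is the main knob for shortening recovery.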

    3) Restart the killed master
    $ redis-server ../etc/cluster/redis-7002.conf 
    
    # The killed 7002 has become a replica of 7005
    $ redis-cli -p 7000 cluster slots
    1) 1) (integer) 10923
       2) (integer) 16383
       3) 1) "127.0.0.1"
          2) (integer) 7005
          3) "09792d31e728ad714a5a90bc7639f277d817fb4e"
       4) 1) "127.0.0.1"
          2) (integer) 7002
          3) "a89a427b5fe8b2b0ef07ac8c6252dc3c8efa1f77"
    2) 1) (integer) 5461
       2) (integer) 10922
       3) 1) "127.0.0.1"
          2) (integer) 7001
          3) "a3c0d3b42da023dc402faf439d4f93a1cb44d402"
       4) 1) "127.0.0.1"
          2) (integer) 7004
          3) "5a4f085dee8400093f45ce2cfa42cbd206167f73"
    3) 1) (integer) 0
       2) (integer) 5460
       3) 1) "127.0.0.1"
          2) (integer) 7000
          3) "1ac9fbbfe11362e151204132e3d110b18139a1d9"
       4) 1) "127.0.0.1"
          2) (integer) 7003
          3) "2d19dda2a8a790d5636a664fe3ed54aa3dd7677c"
    

    redis-cluster-7002.log

    $ tail -30 redis-cluster-7002.log
    28746:C 31 May 2019 20:53:03.405 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
    28746:C 31 May 2019 20:53:03.407 # Redis version=5.0.4, bits=64, commit=00000000, modified=0, pid=28746, just started
    28746:C 31 May 2019 20:53:03.407 # Configuration loaded
    28747:M 31 May 2019 20:53:03.410 * Increased maximum number of open files to 10032 (it was originally set to 256).
    28747:M 31 May 2019 20:53:03.412 * Node configuration loaded, I'm a89a427b5fe8b2b0ef07ac8c6252dc3c8efa1f77
    28747:M 31 May 2019 20:53:03.413 * Running mode=cluster, port=7002.
    28747:M 31 May 2019 20:53:03.414 # Server initialized
    28747:M 31 May 2019 20:53:03.415 * DB loaded from disk: 0.001 seconds
    28747:M 31 May 2019 20:53:03.416 * Ready to accept connections
    
    # Reconfigure itself as a replica of the node with the given ID
    28747:M 31 May 2019 20:53:03.419 # Configuration change detected. Reconfiguring myself as a replica of 09792d31e728ad714a5a90bc7639f277d817fb4e
    28747:S 31 May 2019 20:53:03.419 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
    28747:S 31 May 2019 20:53:03.420 # Cluster state changed: ok
    # Connect to master 7005
    28747:S 31 May 2019 20:53:04.430 * Connecting to MASTER 127.0.0.1:7005
    # Start master-replica data synchronization
    28747:S 31 May 2019 20:53:04.431 * MASTER <-> REPLICA sync started
    28747:S 31 May 2019 20:53:04.431 * Non blocking connect for SYNC fired the event.
    28747:S 31 May 2019 20:53:04.432 * Master replied to PING, replication can continue...
    28747:S 31 May 2019 20:53:04.433 * Trying a partial resynchronization (request 8931dcb4de60e18b8f9835b25f828cebf564c1cf:1).
    28747:S 31 May 2019 20:53:04.441 * Full resync from master: 7318c71d3e107b0896c561f9f1c5294d43619178:249926
    28747:S 31 May 2019 20:53:04.441 * Discarding previously cached master state.
    28747:S 31 May 2019 20:53:04.513 * MASTER <-> REPLICA sync: receiving 192 bytes from master
    28747:S 31 May 2019 20:53:04.515 * MASTER <-> REPLICA sync: Flushing old data
    28747:S 31 May 2019 20:53:04.516 * MASTER <-> REPLICA sync: Loading DB in memory
    28747:S 31 May 2019 20:53:04.516 * MASTER <-> REPLICA sync: Finished with success
    
    

    redis-cluster-7005.log

    4394:M 31 May 2019 13:10:11.573 * Replication backlog freed after 3600 seconds without connected replicas.
    4394:M 31 May 2019 20:53:03.500 * Clear FAIL state for node a89a427b5fe8b2b0ef07ac8c6252dc3c8efa1f77: master without slots is reachable again.
    4394:M 31 May 2019 20:53:04.434 * Replica 127.0.0.1:7002 asks for synchronization
    4394:M 31 May 2019 20:53:04.434 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '8931dcb4de60e18b8f9835b25f828cebf564c1cf', my replication IDs are 'd1547b3a6d4eb61969a5cd19f55f907e2f18b10c' and '0000000000000000000000000000000000000000')
    4394:M 31 May 2019 20:53:04.436 * Starting BGSAVE for SYNC with target: disk
    4394:M 31 May 2019 20:53:04.440 * Background saving started by pid 28748
    28748:C 31 May 2019 20:53:04.448 * DB saved on disk
    4394:M 31 May 2019 20:53:04.511 * Background saving terminated with success
    4394:M 31 May 2019 20:53:04.513 * Synchronization with replica 127.0.0.1:7002 succeeded
    
