Recovering a production etcd server after a power failure


Author: 疯疯疯子子子 | Source: published 2020-05-19 18:39

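    The cluster at a customer site lost power unexpectedly, and we recovered it remotely around noon. Starting the etcd service failed with the following error: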

    member c77b7b06d2075637 has already been bootstrapped
    

    The references I found explain it as follows:
    One of the member was bootstrapped via discovery service. You must remove the previous data-dir to clean up the member information. Or the member will ignore the new configuration and start with the old configuration. That is why you see the mismatch.
    That makes the cause clear: startup fails because the member information recorded in the data-dir (/var/lib/etcd/default.etcd) does not match the information given by the options etcd is started with.
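
    A quick way to tell whether a data directory already holds bootstrapped member state is to look for the `member` subdirectory that etcd creates under its data-dir (it contains the `snap` and `wal` directories). The following is a minimal sketch; the function name `has_member_data` is hypothetical, and the path should be whatever `--data-dir` points at on the affected node.

```shell
# Return 0 if the given etcd data directory already contains member state.
# etcd keeps its snapshot and WAL files under <data-dir>/member.
has_member_data() {
    [ -d "$1/member" ]
}

# Example: report on the data directory named in the analysis above.
if has_member_data /var/lib/etcd/default.etcd; then
    echo "existing member data found"
else
    echo "no member data"
fi
```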

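    Solution: remove this node's etcd member from the cluster, delete its data (it can be resynchronized afterwards), and then re-add the node to the etcd cluster.
    1. List the current etcd members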

    export ETCDCTL_API=3
    etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem  member list
    c666144c29031acd, started, etcd-host0, https://20.140.249.65:2380, https://20.140.249.65:2379
    c77b7b06d2075637, started, etcd-host1, https://20.140.249.66:2380, https://20.140.249.66:2379
    f11a3a48abfa96dd, started, etcd-host2, https://20.140.249.67:2380, https://20.140.249.67:2379
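
    The member ID needed for the removal step can be picked out of this listing by name instead of being copied by hand. A small sketch, assuming the default comma-separated output of the v3 `member list` command shown above; `member_id_by_name` is a hypothetical helper name.

```shell
# Print the ID of the member whose name matches $1.
# Reads `etcdctl member list` output on stdin, where the fields are
# separated by ", ": ID, status, name, peer URLs, client URLs.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample input copied from the listing above.
printf '%s\n' \
  'c666144c29031acd, started, etcd-host0, https://20.140.249.65:2380, https://20.140.249.65:2379' \
  'c77b7b06d2075637, started, etcd-host1, https://20.140.249.66:2380, https://20.140.249.66:2379' \
  | member_id_by_name etcd-host1
# prints c77b7b06d2075637
```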
    

    2. Remove the failing member

    export ETCDCTL_API=3
    [root@ga-k8s1 data]# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem  member remove c77b7b06d2075637
    Member c77b7b06d2075637 removed from cluster 7ab1847bce8f7723
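
    Every etcdctl call in this procedure repeats the same endpoint and TLS flags, so a small wrapper can keep the commands short and less error-prone. This is only a sketch: `etcdctl3` is a hypothetical name, and the certificate paths are the ones used in the commands above.

```shell
# Wrapper that supplies the v3 API setting, endpoint and TLS flags used
# in this article; any further arguments are passed through to etcdctl.
etcdctl3() {
    ETCDCTL_API=3 etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/ssl/ca.pem \
        --cert=/etc/etcd/ssl/etcd.pem \
        --key=/etc/etcd/ssl/etcd-key.pem \
        "$@"
}
```

    With the wrapper defined, the removal above becomes `etcdctl3 member remove c77b7b06d2075637`.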
    

    3. Edit /usr/lib/systemd/system/etcd.service

    [Unit]
    Description=Etcd Server
    After=network.target
    After=network-online.target
    Wants=network-online.target
    Documentation=https://github.com/coreos
    
    [Service]
    Type=notify
    WorkingDirectory=/app/etcd/
    ExecStart=/usr/local/bin/etcd \
      --name=etcd-host1 \
      --data-dir=/app/etcd \
      --cert-file=/etc/etcd/ssl/etcd.pem \
      --key-file=/etc/etcd/ssl/etcd-key.pem \
      --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
      --peer-cert-file=/etc/etcd/ssl/etcd.pem \
      --peer-key-file=/etc/etcd/ssl/etcd-key.pem \
      --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
      --peer-client-cert-auth \
      --client-cert-auth \
      --initial-advertise-peer-urls=https://20.140.249.66:2380 \
      --listen-peer-urls=https://20.140.249.66:2380 \
      --listen-client-urls=https://20.140.249.66:2379,https://127.0.0.1:2379 \
      --advertise-client-urls=https://20.140.249.66:2379 \
      --initial-cluster-token=etcd-cluster-0 \
      --initial-cluster=etcd-host0=https://20.140.249.65:2380,etcd-host1=https://20.140.249.66:2380,etcd-host2=https://20.140.249.67:2380 \
      --initial-cluster-state=existing
    # Note: --initial-cluster-state was changed from new to existing.
    Restart=on-failure
    RestartSec=5
    LimitNOFILE=65536
    
    [Install]
    WantedBy=multi-user.target
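
    The only change this step needs in the unit file is the cluster-state flag, so it can also be made with a one-line substitution rather than by hand. A minimal sketch, assuming the flag is written exactly as `--initial-cluster-state=new` in the file; `set_cluster_state_existing` is a hypothetical helper name.

```shell
# Switch --initial-cluster-state from new to existing in the given unit file.
# The original file is kept next to it as <file>.bak.
set_cluster_state_existing() {
    sed -i.bak 's/--initial-cluster-state=new/--initial-cluster-state=existing/' "$1"
}
```

    It would be called as `set_cluster_state_existing /usr/lib/systemd/system/etcd.service`, followed by the `systemctl daemon-reload` that step 6 already runs.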
    

    4. Delete the data

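    rm -rf /var/lib/etcd/
    rm -rf /app/etcd/  # the data-dir and WorkingDirectory from the unit file
    mkdir -m 700 /app/etcd  # systemd cannot start the unit if WorkingDirectory does not exist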
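
    Deleting the data is irreversible, so a more cautious variant is to move the directory aside and delete it only once the node has rejoined and resynchronized. The sketch below is an alternative to the `rm -rf` above, not what was run on site; `set_aside_data_dir` and the `.bak.<timestamp>` naming are assumptions made for illustration.

```shell
# Move an etcd data directory aside instead of deleting it, then recreate
# an empty directory with mode 700 in its place.
# Prints the path of the backup copy on success.
set_aside_data_dir() {
    dir=${1%/}                                   # strip a trailing slash
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"      # timestamped backup name
    mv "$dir" "$backup" && mkdir -m 700 "$dir" && echo "$backup"
}
```

    For the layout in this article that would be `set_aside_data_dir /app/etcd/`.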
    

    5. Re-add the etcd node to the cluster

    export ETCDCTL_API=2
    etcdctl --endpoints=https://127.0.0.1:2379 --ca-file=/etc/kubernetes/ssl/ca.pem --cert-file=/etc/etcd/ssl/etcd.pem --key-file=/etc/etcd/ssl/etcd-key.pem  member add  etcd-host1 https://20.140.249.66:2380
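
    Steps 1 and 2 use the v3 API, while the command above switches to the v2 API and its flag names. For consistency, the same operation can be written with v3 syntax, where the member name is positional and the peer URL is passed with `--peer-urls`. A sketch under that assumption; `add_member_v3` is a hypothetical wrapper, and like the other etcdctl commands it has to run against a healthy member.

```shell
# Re-add a member using the etcdctl v3 API.
# $1 = member name, $2 = peer URL of the node being re-added.
add_member_v3() {
    ETCDCTL_API=3 etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/ssl/ca.pem \
        --cert=/etc/etcd/ssl/etcd.pem \
        --key=/etc/etcd/ssl/etcd-key.pem \
        member add "$1" --peer-urls="$2"
}
```

    For the node in this article the call would be `add_member_v3 etcd-host1 https://20.140.249.66:2380`.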
    

    6. Start etcd; the re-added node resynchronizes its data from the other two nodes

    systemctl daemon-reload && systemctl start etcd
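
    After starting the service it is worth confirming that the node actually came up, since resynchronization can take a moment. The following generic retry helper is a sketch, not part of the original procedure; `retry` is a hypothetical name, and the attempt count and delay are arbitrary.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Sleep only if another attempt will follow.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

    For example, `retry 10 3 systemctl is-active --quiet etcd` waits up to roughly 30 seconds for the unit to become active; the v3 commands `endpoint health` and `member list`, run with the same TLS flags as in step 1, can then confirm that the member is back in the cluster.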
    

Article link: https://www.haomeiwen.com/subject/hkobohtx.html