2. ETCD Cluster Deployment and Maintenance


Author: 一瓶多先生 | Published 2020-11-09 16:45


    ETCD is a highly available, distributed key-value store that can be used for service discovery. It implements the Raft consensus algorithm and is written in Go.

    01. Basic Cluster Deployment Information

    etcd_version: v3.4.6
    etcd_base_dir: /var/lib/etcd

    etcd_data_dir: "/var/lib/etcd/default.etcd"

    etcd_listen_port: "2379"

    etcd_peer_port: "2380"

    etcd_bin_dir: /srv/kubernetes/bin

    etcd_conf_dir: /srv/kubernetes/conf

    etcd_pki_dir: /srv/kubernetes/pki

    Deployment hosts:

    10.40.58.153

    10.40.58.154

    10.40.61.116

    02. ETCD Certificates

    Create the ETCD certificate signing request (CSR) file:

    cat > etcd-csr.json <<EOF
    {
      "CN": "etcd",
      "hosts": [
        "10.40.61.116",
        "10.40.58.153",
        "10.40.58.154",
        "127.0.0.1"
      ],
      "key": {
        "algo": "rsa",
        "size": 2048
      },
      "names": [
        {
          "C": "CN",
          "L": "BeiJing",
          "O": "kubernetes",
          "OU": "System",
          "ST": "BeiJing"
        }
      ]
    }
    EOF
    

    In hosts, list the IP addresses of the hosts where etcd will be deployed; if you will access etcd via domain names, include those as well.

    Create the ETCD certificate and private key:

    cfssl gencert \
      -ca=ca.pem \
      -ca-key=ca-key.pem \
      -config=ca-config.json \
      -profile=kubernetes \
      etcd-csr.json | cfssljson -bare etcd
    

    This generates the following two files:

    etcd-key.pem
    etcd.pem
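    Before distributing etcd.pem, it is worth confirming that its SANs cover every node IP. The sketch below is illustrative: since the cfssl-generated file is not reproduced here, it creates a throwaway self-signed certificate at a hypothetical path (/tmp/demo.pem) with the same SANs, just to demonstrate the openssl inspection you would run against the real etcd.pem:

```shell
# Hypothetical stand-in cert with the same SANs as etcd-csr.json
# (in practice, inspect the etcd.pem produced by cfssl instead).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/demo-key.pem -out /tmp/demo.pem -subj "/CN=etcd" \
  -addext "subjectAltName=IP:10.40.61.116,IP:10.40.58.153,IP:10.40.58.154,IP:127.0.0.1"

# Check that every node IP appears in the Subject Alternative Name field:
openssl x509 -in /tmp/demo.pem -noout -text | grep -A1 "Subject Alternative Name"
```

    If a node's IP is missing from the SAN list, clients connecting to that node over TLS will fail certificate verification.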
    

    03. Deployment

    Download the package

    Initialize etcd's runtime directories, then download and install the etcd binaries with the following commands:

    mkdir -p /var/lib/etcd/default.etcd
    mkdir -p /srv/kubernetes/bin 
    mkdir -p /srv/kubernetes/pki 
    mkdir -p  /srv/kubernetes/conf
    
    ETCD_VER=v3.4.6
    GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
    DOWNLOAD_URL=${GITHUB_URL}
    
    rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
    rm -rf /tmp/etcd-download-test && mkdir -p /tmp/etcd-download-test
    
    curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
    tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
    cp /tmp/etcd-download-test/{etcd,etcdctl} /srv/kubernetes/bin
    

    Maintain the configuration files

    Node 1

    NODENAME="py-modelo2o08cn-p005"
    THISIPADDRESS="10.40.61.116"
    CLUSTER="py-modelo2o08cn-p005=https://10.40.61.116:2380,\
        py-modelo2o08cn-p003=https://10.40.58.153:2380,\
        py-modelo2o08cn-p004=https://10.40.58.154:2380"
    

    Node 2

    NODENAME="py-modelo2o08cn-p003"
    THISIPADDRESS="10.40.58.153"
    CLUSTER="py-modelo2o08cn-p005=https://10.40.61.116:2380,\
        py-modelo2o08cn-p003=https://10.40.58.153:2380,\
        py-modelo2o08cn-p004=https://10.40.58.154:2380"
    

    Node 3

    NODENAME="py-modelo2o08cn-p004"
    THISIPADDRESS="10.40.58.154"
    CLUSTER="py-modelo2o08cn-p005=https://10.40.61.116:2380,\
        py-modelo2o08cn-p003=https://10.40.58.153:2380,\
        py-modelo2o08cn-p004=https://10.40.58.154:2380"
    

    Log in to each node in turn, export the corresponding environment variables above, then run the following command:

    cat > /srv/kubernetes/conf/etcd.yaml <<EOF
    name: ${NODENAME}
    wal-dir: 
    data-dir: /var/lib/etcd/default.etcd
    max-snapshots: 10 
    max-wals: 10 
    snapshot-count: 10
    
    listen-peer-urls: https://${THISIPADDRESS}:2380
    listen-client-urls: https://${THISIPADDRESS}:2379,https://127.0.0.1:2379
    
    advertise-client-urls: https://${THISIPADDRESS}:2379
    initial-advertise-peer-urls: https://${THISIPADDRESS}:2380
    initial-cluster: ${CLUSTER}
    initial-cluster-token: kube-etcd-cluster
    initial-cluster-state: new
    
    client-transport-security:
      cert-file: /srv/kubernetes/pki/etcd.pem
      key-file: /srv/kubernetes/pki/etcd-key.pem
      client-cert-auth: true
      trusted-ca-file: /srv/kubernetes/pki/ca.pem
      auto-tls: false
    
    peer-transport-security:
      cert-file: /srv/kubernetes/pki/etcd.pem
      key-file: /srv/kubernetes/pki/etcd-key.pem
      client-cert-auth: true
      trusted-ca-file: /srv/kubernetes/pki/ca.pem
      auto-tls: false
    
    debug: true
    logger: zap
    log-outputs: [stderr]
    EOF
    

    If you need etcd API v2 support, add enable-v2: true to the configuration.

    Manage etcd with systemd

    Create the service file:

    cat > /etc/systemd/system/etcd.service <<EOF
    [Unit]
    Description=Etcd Server
    After=network.target
    
    [Service]
    WorkingDirectory=/var/lib/etcd
    ExecStart=/srv/kubernetes/bin/etcd --config-file=/srv/kubernetes/conf/etcd.yaml
    Type=notify
    
    [Install]
    WantedBy=multi-user.target
    EOF
    

    Enable etcd to start at boot:

    sudo /bin/systemctl daemon-reload
    sudo /bin/systemctl enable etcd.service
    

    Start or stop etcd with the following commands:

    sudo systemctl start etcd.service
    sudo systemctl stop etcd.service
    

    Verify the cluster status

    To make etcdctl easier to use, configure an alias; after adding it, log out and log back in (or source ~/.bashrc):

    cat >>  ~/.bashrc << EOF 
    alias etcdctl="/srv/kubernetes/bin/etcdctl \
        --endpoints=https://10.40.58.153:2379,https://10.40.58.154:2379,https://10.40.61.116:2379 \
        --cacert=/srv/kubernetes/pki/ca.pem \
        --cert=/srv/kubernetes/pki/etcd.pem \
        --key=/srv/kubernetes/pki/etcd-key.pem"
    EOF
    

    Check the cluster health:

    etcdctl endpoint health
    

    Output:

    https://10.40.61.116:2379 is healthy: successfully committed proposal: took = 17.824976ms
    https://10.40.58.154:2379 is healthy: successfully committed proposal: took = 18.437575ms
    https://10.40.58.153:2379 is healthy: successfully committed proposal: took = 19.917812ms
    

    04. Architecture and Internals

    The Raft consensus algorithm

    • Node count

      The number of nodes an etcd cluster needs follows the quorum concept in Raft: the cluster can serve requests only while a majority of its members — (n+1)/2 for an odd cluster size n — are available. With 3 nodes, at least (3+1)/2 = 2 must be up, so the cluster tolerates only a single failed node.

    • Data writes

      An etcd cluster is typically made up of 3 to 5 nodes and uses the Raft consensus algorithm to stay consistent. Raft elects a leader, and the leader replicates and distributes data to the other members; when the leader fails, a new leader is elected and takes over replication. A client can connect to any single node of the cluster to read and write data.
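    The quorum rule above is plain integer arithmetic: a cluster of n members needs floor(n/2) + 1 of them up. A quick sketch:

```shell
# Quorum for common cluster sizes: quorum = n/2 + 1 (integer division);
# the cluster tolerates n - quorum simultaneous failures.
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  echo "nodes=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```

    This is also why etcd clusters use odd sizes: growing from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any additional failures.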

    API overview

    The etcd API can be divided into five groups:

    • Put(key, value) / Delete(key)

      The Put API writes a value under the given key.

      The Delete API deletes the given key.

    • Get(key) / Get(keyFrom, keyEnd)

      Get(key) returns the value of the specified key.

      Get(keyFrom, keyEnd) queries a range of keys.

    • Watch(key/keyPrefix)

      The Watch API subscribes to data changes in etcd in real time.

    • Transaction(if/then/else ops).Commit()

      etcd supports transactions: a set of operations is executed only when the specified conditions are met.

    • Leases: Grant/Revoke/KeepAlive

      A lease attaches a time-to-live (TTL) to keys: Grant creates a lease, Revoke deletes it together with the keys bound to it, and KeepAlive refreshes it so the attached keys do not expire.

    Data versioning

    Global versions

    • etcd has the concept of a term, which represents the tenure of the cluster leader; every leader change increments term by 1.

    • etcd also maintains a revision, the global version of the data; every mutating operation (create, update, delete) increments revision by 1. revision keeps growing monotonically even while term stays unchanged.

    Key-value versions

    • create_revision: the global revision at the time the key-value pair was created
    • mod_revision: the global revision of the most recent change to the key
    • version: a counter of how many times the key-value pair has been written

    Verification

    The following command shows a key's version information:

    $ etcdctl get name -w json | jq
    {
      "header": {
        "cluster_id": 9796312800751810000,
        "member_id": 13645171481868003000,
        "revision": 15,
        "raft_term": 54
      },
      "kvs": [
        {
          "key": "bmFtZQ==",
          "create_revision": 15,
          "mod_revision": 15,
          "version": 1,
          "value": "dG9t"
        }
      ],
      "count": 1
    }
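    In the JSON output, key and value are base64-encoded byte strings; they decode back to the plain text written earlier:

```shell
# Decode the base64 key/value fields from the response above:
echo "bmFtZQ==" | base64 -d; echo    # key   -> name
echo "dG9t" | base64 -d; echo        # value -> tom
```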
    

    Modify the key with the command below, then compare the two responses:

    $ etcdctl put name alex
    OK
    $ etcdctl  get name -w json  |jq
    {
      "header": {
        "cluster_id": 9796312800751810000,
        "member_id": 13645171481868003000,
        "revision": 16,
        "raft_term": 54
      },
      "kvs": [
        {
          "key": "bmFtZQ==",
          "create_revision": 15,
          "mod_revision": 16,
          "version": 2,
          "value": "YWxleA=="
        }
      ],
      "count": 1
    }
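    If jq is not installed, the same fields can be extracted with python3. This sketch parses a captured response (a hypothetical literal here, matching the output above) and prints the decoded key together with its version fields:

```shell
# Parse a captured `etcdctl get -w json` response without jq:
resp='{"kvs":[{"key":"bmFtZQ==","create_revision":15,"mod_revision":16,"version":2,"value":"YWxleA=="}]}'
echo "$resp" | python3 -c '
import base64, json, sys
kv = json.load(sys.stdin)["kvs"][0]
print(base64.b64decode(kv["key"]).decode(),
      kv["create_revision"], kv["mod_revision"], kv["version"])
'
```

    This prints "name 15 16 2", i.e. the key name, its creation revision, its last-modified revision, and its write count.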
    

    Comparing the two responses: no new election took place, so raft_term is unchanged (54); the single modification raised the global revision by 1 (15 to 16); create_revision still records the revision at which the key was first written (15); mod_revision was updated to the revision of the latest change (16); and the key has now been written twice in total, so version equals 2.

    05. The etcdctl Command in Detail

    etcdctl is etcd's command-line client for interacting with the cluster. In etcd releases before v3.4, etcdctl defaults to the v2 API; set the following environment variable to use the v3 API (from v3.4 on, including the v3.4.6 deployed here, v3 is already the default):

    $ export ETCDCTL_API=3
    

    Writing data

    $ etcdctl put key value
    

    Querying data

    • Query a single key

      $ etcdctl  get key
      
    • Query a range of keys

      Write some test data:

      $ etcdctl put key0 value0
      OK
      $ etcdctl put key1 value1
      OK
      $ etcdctl put key2 value2
      OK
      $ etcdctl put key3 value3
      OK
      $ etcdctl put key4 value4
      OK
      $ etcdctl put key5 value5
      OK
      

      Query the range:

      $ etcdctl  get key0 key5
      key0
      value0
      key1
      value1
      key2
      value2
      key3
      value3
      key4
      value4
      

      As the output shows, the result does not include the end key: the range is half-open, [key_from, key_end).

    • Query keys by prefix

      etcdctl get --prefix key
      key
      value
      key0
      value0
      key1
      value1
      key2
      value2
      key3
      value3
      key4
      value4
      key5
      value5
      
    • Query a key at a specified revision

      $ etcdctl get name --rev=16
      name
      alex
      $ etcdctl get name --rev=15
      name
      tom
      
    • List all keys in the cluster

      etcdctl get /  --prefix --keys-only
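    The half-open range semantics shown in the range-query example above can be mimicked with plain lexicographic string comparison; this sketch filters a key list exactly as `etcdctl get key0 key5` does:

```shell
# etcd range queries are half-open: [key_from, key_end).
# Mimic `etcdctl get key0 key5` with string comparison in awk:
printf '%s\n' key0 key1 key2 key3 key4 key5 |
  awk '$0 >= "key0" && $0 < "key5"'
```

    This prints key0 through key4; key5 is excluded, matching the etcdctl output above.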
      

    06. Cluster Maintenance

    Add a node

    We will add 10.40.58.152 as a new node. First run the steps in "Download the package" on the new host to initialize its environment.

    • Update the certificates
    • Copy the TLS certificate and key pair
    • Restart the etcd cluster
    • Start the new instance

    Update the certificates

    Edit the CSR file to add the new host's IP address to the hosts field, then regenerate the certificate and private key:

    cat > etcd-csr.json <<EOF
    {
      "CN": "etcd",
      "hosts": [
        "10.40.61.116",
        "10.40.58.153",
        "10.40.58.154",
        "10.40.58.152"
      ],
      "key": {
        "algo": "rsa",
        "size": 2048
      },
      "names": [
        {
          "C": "CN",
          "L": "BeiJing",
          "O": "kubernetes",
          "OU": "System",
          "ST": "BeiJing"
        }
      ]
    }
    EOF
    

    Create the ETCD certificate and private key:

    cfssl gencert \
      -ca=ca.pem \
      -ca-key=ca-key.pem \
      -config=ca-config.json \
      -profile=kubernetes \
      etcd-csr.json | cfssljson -bare etcd
    

    This generates the following two files:

    etcd-key.pem
    etcd.pem
    

    Copy the TLS certificate and key pair

    rsync -auv etcd-key.pem etcd.pem ca.pem root@10.40.58.152:/srv/kubernetes/pki/
    rsync -auv etcd-key.pem etcd.pem ca.pem root@10.40.58.153:/srv/kubernetes/pki
    rsync -auv etcd-key.pem etcd.pem ca.pem root@10.40.58.154:/srv/kubernetes/pki
    rsync  -auv etcd-key.pem etcd.pem ca.pem /srv/kubernetes/pki/
    

    Restart the etcd cluster

    Restart etcd on each existing node so that the updated certificate takes effect:

    $ sudo systemctl  restart etcd.service
    

    Start the new instance

    Generate the etcd configuration with the commands below; when done, set up systemd as described in "Manage etcd with systemd" and start the node:

    NODENAME="py-modelo2o08cn-p002"
    THISIPADDRESS="10.40.58.152"
    CLUSTER="py-modelo2o08cn-p005=https://10.40.61.116:2380,\
        py-modelo2o08cn-p003=https://10.40.58.153:2380,\
        py-modelo2o08cn-p004=https://10.40.58.154:2380,\
        py-modelo2o08cn-p002=https://10.40.58.152:2380"
    

    On the new node, export the environment variables above, then run the following command:

    cat > /srv/kubernetes/conf/etcd.yaml <<EOF
    name: ${NODENAME}
    wal-dir: 
    data-dir: /var/lib/etcd/default.etcd
    max-snapshots: 10 
    max-wals: 10 
    
    listen-peer-urls: https://${THISIPADDRESS}:2380
    listen-client-urls: https://${THISIPADDRESS}:2379,https://127.0.0.1:2379
    
    advertise-client-urls: https://${THISIPADDRESS}:2379
    initial-advertise-peer-urls: https://${THISIPADDRESS}:2380
    initial-cluster: ${CLUSTER}
    initial-cluster-token: kube-etcd-cluster
    initial-cluster-state: existing
    
    client-transport-security:
      cert-file: /srv/kubernetes/pki/etcd.pem
      key-file: /srv/kubernetes/pki/etcd-key.pem
      client-cert-auth: true
      trusted-ca-file: /srv/kubernetes/pki/ca.pem
      auto-tls: false
    
    peer-transport-security:
      cert-file: /srv/kubernetes/pki/etcd.pem
      key-file: /srv/kubernetes/pki/etcd-key.pem
      client-cert-auth: true
      trusted-ca-file: /srv/kubernetes/pki/ca.pem
      auto-tls: false
    
    debug: true
    logger: zap
    log-outputs: [stderr]
    EOF
    

    Add the member

    Register the new member on an existing node (do this before the new instance is started, so that it can join with initial-cluster-state: existing):

    etcdctl member add py-modelo2o08cn-p002 --peer-urls=https://10.40.58.152:2380
    

    Remove a node

    Get the member ID:

    $ etcdctl  member list
    690eb5228cd49828, started, py-modelo2o08cn-p002, https://10.40.58.152:2380, https://10.40.58.152:2379, false
    7064f95d4211e35b, started, py-modelo2o08cn-p003, https://10.40.58.153:2380, https://10.40.58.153:2379, false
    b54dd19729976a3f, started, py-modelo2o08cn-p004, https://10.40.58.154:2380, https://10.40.58.154:2379, false
    bd5d632ae4086bfd, started, py-modelo2o08cn-p005, https://10.40.61.116:2380, https://10.40.61.116:2379, false
    

    Remove the member:

    $ etcdctl member remove 690eb5228cd49828
    Member 690eb5228cd49828 removed from cluster 87f37e96d56c7453
    

    Check the cluster health:

    $ etcdctl  endpoint health
    https://10.40.61.116:2379 is healthy: successfully committed proposal: took = 17.168768ms
    https://10.40.58.154:2379 is healthy: successfully committed proposal: took = 21.879205ms
    https://10.40.58.153:2379 is healthy: successfully committed proposal: took = 21.980464ms
    

    Data backup

    --snapshot-count: the number of committed transactions that triggers a snapshot to be saved to disk. Before v3.2 the default was 10,000 entries; from v3.2 on it is 100,000. Do not set this too low or too high: too low causes frequent I/O pressure, while too high leads to high memory usage and slow etcd garbage collection. A value between 100,000 and 200,000 is recommended.

    --max-snapshots '5': the maximum number of snapshot files to retain.

    Configure snapshots by adding the following to the etcd.yaml file; here snapshot-count is set to 10 for testing:

    max-snapshots: 10 
    max-wals: 10 
    snapshot-count: 10
    

    After about ten commits, the log shows that a snapshot has been saved:

    Apr  5 10:34:11 py-modelo2o08cn-p003 etcd: {"level":"info","ts":"2020-04-05T10:34:11.480+0800","caller":"etcdserver/server.go:2381","msg":"saved snapshot","snapshot-index":64}
    

    Snapshot files are stored under /var/lib/etcd/default.etcd/member/snap.

    Data recovery

    The test case for this section is a 3-node cluster in which one node (10.40.61.116) lost all of its original files: we deleted everything under its data-dir and stopped the service. The cluster status then looks like this:

    $ etcdctl endpoint status
    {"level":"warn","ts":"2020-04-05T14:38:36.309+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://10.40.61.116:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.40.61.116:2379: connect: connection refused\""}
    Failed to get the status of endpoint https://10.40.61.116:2379 (context deadline exceeded)
    https://10.40.58.153:2379, 7064f95d4211e35b, 3.4.6, 20 kB, true, false, 5, 25, 25,
    https://10.40.58.154:2379, b54dd19729976a3f, 3.4.6, 20 kB, false, false, 5, 25, 25,
    

    Create a snapshot

    Run the following on one of the healthy, running nodes:

    • Write test data
    $ etcdctl put b 1
    
    • Create the snapshot
    $ etcdctl snapshot save snapshot.db
    
    • Copy the snapshot to the 10.40.61.116 host
    $ rsync -auv  snapshot.db root@10.40.61.116:/root
    

    Restore the snapshot

    On the failed node, restore the data directory from the snapshot, then start etcd again. Note that etcdctl snapshot restore writes to ./default.etcd by default; pass --data-dir (and, where the member's cluster metadata must be rebuilt, --name, --initial-cluster, and --initial-advertise-peer-urls) as appropriate:

    $ etcdctl snapshot restore snapshot.db --data-dir /var/lib/etcd/default.etcd
    

    Verify

    $ etcdctl endpoint health
    $ etcdctl get b
    

    Q&A

    Q:

    $ /srv/kubernetes/bin/etcdctl --endpoints=https://10.40.58.153:2379,https://10.40.58.154:2379,https://10.40.61.116:2379  --cert-file=/srv/kubernetes/pki/etcd.pem --key-file=/srv/kubernetes/pki/etcd-key.pem  --ca-file /srv/kubernetes/pki/ca.pem --debug  ls /
    Error:  client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint
    
    
    $ curl -X GET https://10.40.61.116:2379/v2/members  --cacert /root/certificated/ca.pem  --cert  /root/certificated/etcd.pem  --key /root/certificated/etcd-key.pem
    404 page not found
    

    A:
    The errors above occur because the cluster does not have API v2 enabled. Add v2 support in /srv/kubernetes/conf/etcd.yaml:

    enable-v2: true
    

    Original: https://www.haomeiwen.com/subject/dvfqbktx.html