
Loki Log System Distributed Deployment in Practice, Part 2: etcd

Author: kong62 | Published 2020-12-03 09:37

    Overview

    This deployment uses etcd as the k/v store for Loki's ring; etcd runs as a 3-node cluster.
    Note: Loki itself does not expire its k/v entries, and its constant rewrites of the ring key build up a very large history. I didn't notice this at first, and it blew up my etcd overnight.
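    For context, this is roughly how Loki's ring is pointed at this etcd cluster (a minimal sketch, assuming the etcd service name and namespace from the install below):

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: etcd
            etcd:
              endpoints:
                - etcd.grafana.svc.cluster.local:2379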

    Installation

    # helm repo add bitnami https://charts.bitnami.com/bitnami
    # helm repo update
    # helm search repo etcd
    NAME                    CHART VERSION   APP VERSION     DESCRIPTION                                       
    alicloud/etcd-operator  0.7.0           0.7.0           CoreOS etcd-operator Helm chart for Kubernetes    
    alicloud/zetcd          0.1.6           0.0.3           CoreOS zetcd Helm chart for Kubernetes            
    azure/etcd-operator     0.11.2          0.9.4           DEPRECATED CoreOS etcd-operator Helm chart for ...
    azure/zetcd             0.1.11          0.0.3           DEPRECATED CoreOS zetcd Helm chart for Kubernetes 
    bitnami/etcd            5.2.1           3.4.14          etcd is a distributed key value store that prov...
    stable/etcd-operator    0.11.2          0.9.4           DEPRECATED CoreOS etcd-operator Helm chart for ...
    stable/zetcd            0.1.11          0.0.3           DEPRECATED CoreOS zetcd Helm chart for Kubernetes 
    
    # helm pull bitnami/etcd --version 5.2.1
    # ll |grep etcd
    -rw-r--r--  1 root    root       34939 Nov 26 15:17 etcd-5.2.1.tgz
    

    Create the configuration:

    # helm show values etcd-5.2.1.tgz
    # cat > etcd-config.yaml <<EOF
    global:
      imageRegistry: ops-harbor.hupu.io/k8s
      #imagePullSecrets:
      #  - myRegistryKeySecretName
      storageClass: alicloud-disk-efficiency-cn-hangzhou-g
    
    image:
      registry: ops-harbor.hupu.io/k8s
      repository: etcd
      #registry: docker.io
      #repository: bitnami/minideb
      tag: 3.4.14
      pullPolicy: IfNotPresent
      debug: false
    
    volumePermissions:
      enabled: false
      image:
        registry: ops-harbor.hupu.io/k8s
        #repository: bitnami/minideb
        repository: minideb
        tag: buster
        pullPolicy: IfNotPresent
      #resources:
      #  limits:
      #    cpu: 100m
      #    memory: 128Mi
      #  requests: 
      #    cpu: 100m
      #    memory: 128Mi
    
    statefulset:
      replicaCount: 3
      updateStrategy: RollingUpdate
      podManagementPolicy: Parallel
    
    allowNoneAuthentication: true
    
    envVarsConfigMap: etcd-env-ext
    
    auth:
      rbac:
        enabled: false
    
      client:
        secureTransport: false
        useAutoTLS: false
        enableAuthentication: false
        certFilename: cert.pem
        certKeyFilename: key.pem
        caFilename: ""
    
      peer:
        secureTransport: false
        useAutoTLS: false
        enableAuthentication: false
        certFilename: cert.pem
        certKeyFilename: key.pem
        caFilename: ""
    
    securityContext:
      enabled: true
      fsGroup: 1001
      runAsUser: 1001
    
    clusterDomain: cluster.local
    
    etcd:
      initialClusterState: ""
    
    service:
      type: ClusterIP
      port: 2379
      clientPortNameOverride: ""
      peerPort: 2380
      peerPortNameOverride: ""
      nodePorts:
        clientPort: ""
        peerPort: ""
      annotations: {}
    
    persistence:
      enabled: true
      storageClass: "-"
      accessModes:
        - ReadWriteOnce
      size: 20Gi
    
    pdb:
      enabled: false
      # minAvailable: 1
      # maxUnavailable: 1
    
    resources:
      limits:
        cpu: 2000m
        memory: 4Gi
      requests: 
        cpu: 100m
        memory: 256Mi
    
    podAntiAffinityPreset: soft
    affinity: 
      # pod anti-affinity
      podAntiAffinity:
        # hard anti-affinity: never schedule two etcd pods on the same node
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - etcd
          topologyKey: "kubernetes.io/hostname"
        # soft (preferred) anti-affinity, kept for reference
        #preferredDuringSchedulingIgnoredDuringExecution:
        #- podAffinityTerm:
        #    labelSelector:
        #      matchExpressions:
        #      - key: app
        #        operator: In
        #        values:
        #        - loki
        #    topologyKey: kubernetes.io/hostname
        #  weight: 100
    
    livenessProbe:
      enabled: true
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3
    
    readinessProbe:
      enabled: true
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3
    
    metrics:
      enabled: true
      podAnnotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2379"
      serviceMonitor:
        enabled: true
    
    startFromSnapshot:
      enabled: false
    
    disasterRecovery:
      enabled: false
      debug: true
      cronjob:
        schedule: "*/30 * * * *"
        historyLimit: 1
        snapshotHistoryLimit: 1
        podAnnotations: {}
      pvc:
        size: 20Gi
        storageClassName: alicloud-disk-efficiency-cn-hangzhou-g
    EOF
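
    Before installing, the chart can be rendered locally to sanity-check these values (a sketch):

    # helm template etcd etcd-5.2.1.tgz -f etcd-config.yaml | less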
    

    Tuning etcd

    Create the environment-variable ConfigMap:
    Note: in etcd 3.4, ETCDCTL_API=3 and --enable-v2=false became the defaults. To use the v2 API, set ETCDCTL_API=2 when invoking etcdctl (e.g. ETCDCTL_API=2 etcdctl), and the server must also have v2 enabled, which is why the ConfigMap below sets ETCD_ENABLE_V2="true".

    # kubectl create configmap etcd-env-ext \
                               --from-literal=ETCD_HEARTBEAT_INTERVAL=150 \
                               --from-literal=ETCD_MAX_REQUEST_BYTES=10485760 \
                               --from-literal=ETCD_QUOTA_BACKEND_BYTES=8589934592 \
                               --from-literal=ETCD_SNAPSHOT_COUNT=10000 \
                               --from-literal=ETCD_AUTO_COMPACTION_RETENTION=1 \
                               --from-literal=ETCD_ELECTION_TIMEOUT=1500 \
                               --from-literal=ETCD_ENABLE_V2=true \
                               -n grafana 
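
    With ETCD_ENABLE_V2=true on the server, the v2 API can be exercised once the cluster is up by selecting the v2 client explicitly (a sketch; the pod name comes from the install below):

    # kubectl exec -it -n grafana etcd-0 -- env ETCDCTL_API=2 etcdctl --endpoints=http://127.0.0.1:2379 ls /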
    

    Install:

    # helm upgrade --install -f etcd-config.yaml etcd etcd-5.2.1.tgz --set auth.rbac.enabled=false -n grafana
    # kubectl get pod -n grafana |grep etcd    
    etcd-0                                1/1     Running   0          91s
    etcd-1                                1/1     Running   0          91s
    etcd-2                                1/1     Running   0          91s
    
    # kubectl get svc -n grafana |grep etcd      
    etcd                                     ClusterIP   172.21.10.82    <none>        2379/TCP,2380/TCP         97s
    etcd-headless                            ClusterIP   None            <none>        2379/TCP,2380/TCP         97s
    

    Verify:

    # kubectl exec -it -n grafana etcd-0 -- etcdctl endpoint status
    127.0.0.1:2379, 39df1d045f231667, 3.4.14, 20 kB, false, false, 4, 1694, 1694, 
    
    # kubectl exec -it -n grafana etcd-0 -- etcdctl member list
    39df1d045f231667, started, etcd-0, http://etcd-0.etcd-headless.grafana.svc.cluster.local:2380, http://etcd-0.etcd-headless.grafana.svc.cluster.local:2379, false
    487d1b1ef7d2ac8d, started, etcd-2, http://etcd-2.etcd-headless.grafana.svc.cluster.local:2380, http://etcd-2.etcd-headless.grafana.svc.cluster.local:2379, false
    7b7cf2cee115321e, started, etcd-1, http://etcd-1.etcd-headless.grafana.svc.cluster.local:2380, http://etcd-1.etcd-headless.grafana.svc.cluster.local:2379, false
    
    # kubectl exec -it -n grafana etcd-0 -- etcdctl get --prefix / --keys-only
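
    Once Loki is connected, the ring lives under the collectors/ring key (it shows up in the error logs later in this post). A quick way to eyeball the size of the current value (a sketch):

    # kubectl exec -it -n grafana etcd-0 -- etcdctl get collectors/ring --print-value-only | wc -c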
    

    Even with auto compaction enabled, the data volume after a period of running was astonishing: larger than the etcd data of one of my entire k8s clusters.


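    The DB size can be checked at any time from inside a pod (a sketch):

    # kubectl exec -it -n grafana etcd-0 -- etcdctl endpoint status --write-out=table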

    Troubleshooting

    Error 1:

    # kubectl exec -it -n grafana etcd-0 -- etcdctl get --prefix / --keys-only
    {"level":"warn","ts":"2020-11-26T09:19:29.188Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-33699fc2-2bf7-4f4b-b16a-84e7175bf451/127.0.0.1:2379","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: user name is empty"}
    Error: etcdserver: user name is empty
    

    Solution:
    See https://github.com/bitnami/charts/issues/2433
    This is caused by the chart's RBAC being enabled; it has to be turned off:

    # helm uninstall etcd -n grafana 
    # kubectl delete pvc -n grafana data-etcd-1 data-etcd-0 data-etcd-2
    # helm upgrade --install -f etcd-config.yaml etcd etcd-5.2.1.tgz --set auth.rbac.enabled=false -n grafana
    

    Error 2:

    # kubectl logs -f -n grafana etcd-0 
    2020-11-27 00:56:46.986881 W | etcdserver: failed to apply request "header:<ID:12433723910218610446 > txn:<compare:<key:\"collectors/ring\" version:141738 > success:<request_put:<key:\"collectors/ring\" value_size:13501 >> failure:<>>" with response "" took (2.864µs) to execute, err is etcdserver: no space
    2020-11-27 00:56:46.992294 W | etcdserver: failed to apply request "header:<ID:3611453715176272468 > txn:<compare:<key:\"collectors/ring\" version:141738 > success:<request_put:<key:\"collectors/ring\" value_size:13629 >> failure:<>>" with response "" took (3.626µs) to execute, err is etcdserver: no space
    

    Solution:
    After running on k8s for a while, the cluster became unusable. Investigation showed that the etcd data had hit the 2G limit, which blocks all further writes.
    There are three ways to make the store writable again, each detailed below:

    1. Raise the 2G quota
    2. Compact historical revisions manually
    3. Enable automatic compaction

    Raising the 2G quota:
    Note: the 2G default is the maximum storage size at which etcd still preserves its performance guarantees, so do not raise it arbitrarily.

    --quota-backend-bytes
    

    Size of the etcd backend database. The default quota is 2G; once the data reaches it, writes are rejected until the history is compacted. The officially suggested maximum is 8G.
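
    The ConfigMap above sets this via ETCD_QUOTA_BACKEND_BYTES=8589934592, i.e. 8 GiB. The flag form is equivalent (a sketch):

    # etcd --quota-backend-bytes=$((8 * 1024 * 1024 * 1024))   # 8589934592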

    --max-request-bytes
    

    Maximum size of a client request accepted by etcd (carried as a Raft message). The default is 1.5M; 10M is the officially recommended value.
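
    Likewise, ETCD_MAX_REQUEST_BYTES=10485760 in the ConfigMap is exactly 10 MiB (a sketch of the flag form):

    # etcd --max-request-bytes=$((10 * 1024 * 1024))   # 10485760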

    --auto-compaction-retention=1
    

    Compacts the key-value history every hour, keeping one hour of revisions. This goes a long way toward keeping the cluster stable and reduces memory and disk usage.
    Because etcd is a multi-version store, the revision history grows with every write. By default old revisions are never cleaned up, so once the data reaches 2G nothing more can be written until the history is compacted manually.
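
    In etcd 3.4 the retention unit depends on the compaction mode; in the default periodic mode, 1 means one hour (a sketch):

    # etcd --auto-compaction-mode=periodic --auto-compaction-retention=1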

    Manual etcd disk cleanup:
    Show the etcd endpoint status, including DB size and quota usage:

    # ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key --cacert=/x/x.crt endpoint status --write-out="table"
    

    List active alarms:

    # ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key --cacert=/x/x.crt alarm list
    

    Get the current revision of the etcd keyspace:

    # rev=$(ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key  --cacert=/x/x.crt endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
    # echo $rev
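
    If jq is available, extracting the revision is cleaner (a sketch, same placeholder endpoints and cert paths):

    # rev=$(ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key --cacert=/x/x.crt endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')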
    

    Compact the old revisions up to that revision:

    # ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key --cacert=/x/x.crt compact $rev
    

    Defragment to return the freed space to the filesystem (defrag briefly blocks the member, so run it one endpoint at a time):

    # ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key --cacert=/x/x.crt defrag
    

    Disarm the NOSPACE alarm:

    # ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key --cacert=/x/x.crt alarm disarm
    

    Check the quota and DB size again:

    # ETCDCTL_API=3 etcdctl --endpoints=https://x.x.x.x:2379 --cert=/x/x.pem --key=/x/x.key --cacert=/x/x.crt endpoint status --write-out="table"
    
