
Setting up GPU monitoring on k8s

Author: 流月汐志 | Published 2020-04-24 14:25

    Deployment architecture

    Deployment method: Kubernetes
    Node monitoring and GPU monitoring:

    • node-exporter + gpu-metrics-exporter
    • prometheus + grafana

    GPU monitoring

    Project used:
    pod-gpu-metrics-exporter

    Prerequisites

    • NVIDIA Tesla drivers = R384+ (download from the NVIDIA Driver Downloads page)
    • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
    • Set the default container runtime to nvidia
    • Kubernetes version = 1.13
    • Enable KubeletPodResources in /etc/default/kubelet: KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true (a short sketch of applying this follows the list)
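
    A minimal sketch of applying that kubelet flag on a kubeadm-provisioned Ubuntu node (the file path and restart step are assumptions; adjust for your setup):

    # /etc/default/kubelet  (assumed location on a kubeadm-provisioned Ubuntu node)
    KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true

    # restart kubelet so the pod-resources socket under /var/lib/kubelet/pod-resources becomes available
    sudo systemctl restart kubelet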

    Installing the environment

    Install script (Ubuntu):
    install-nvidia-docker.sh

    #!/bin/bash
    
    # sudo password, passed as the first argument
    pwd=$1
    
    if [[ -z ${pwd} ]]
    then
        echo "please run [bash $0 <pwd>]"
        exit 1
    fi
    
    # install docker (sudo -S reads the password from stdin)
    echo ${pwd} | sudo -S apt-get update
    
    echo ${pwd} | sudo -S apt-get install -y curl && \
    curl -fsSL https://get.docker.com -o get-docker.sh && \
    echo ${pwd} | sudo -S sh get-docker.sh
    echo ${pwd} | sudo -S usermod -aG docker digisky
    echo ${pwd} | sudo -S systemctl enable docker
    # add the nvidia-docker apt repository
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    
    echo ${pwd} | sudo -S apt-get update && sudo apt-get install -y nvidia-container-toolkit nvidia-container-runtime
    # nvidia-container-runtime: make nvidia the default docker runtime
    echo ${pwd} | sudo -S cp -f daemon.json /etc/docker/daemon.json
    
    echo ${pwd} | sudo -S systemctl restart docker
    # gpu-monitoring-tools-master
    

    daemon.json

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "registry-mirrors": ["https://vs2fctcq.mirror.aliyuncs.com"]
    }
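
    To confirm the default runtime took effect after docker restarts, something like the following should work (the CUDA image tag is only an example):

    docker info | grep -i "default runtime"     # expect: Default Runtime: nvidia
    docker run --rm nvidia/cuda:10.1-base nvidia-smi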
    

    pod-gpu-metrics-exporter.yaml

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app.kubernetes.io/name: gpu-metrics-exporter
        app.kubernetes.io/version: latest
      name: gpu-metrics-exporter
      namespace: monitor
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: pod-gpu-metrics-exporter
      template:
        metadata:
          labels:
            app.kubernetes.io/name: pod-gpu-metrics-exporter
            app.kubernetes.io/part-of: gpu-metrics-exporter
            app.kubernetes.io/version: latest
          name: pod-gpu-metrics-exporter
        spec:
          containers:
          - image: xxx/pod-gpu-metrics-exporter:latest
            imagePullPolicy: Always
            name: pod-nvidia-gpu-metrics-exporter
            ports:
            - containerPort: 9400
              hostPort: 59101
              name: gpu-port
              protocol: TCP
            volumeMounts:
            - mountPath: /var/lib/kubelet/pod-resources
              name: pod-gpu-resources
              readOnly: true
            - mountPath: /run/prometheus
              name: device-metrics
              readOnly: true
          - image: xxx/dcgm-exporter:latest
            imagePullPolicy: Always
            name: nvidia-dcgm-exporter
            volumeMounts:
            - mountPath: /run/prometheus
              name: device-metrics
          dnsPolicy: ClusterFirst
    #      imagePullSecrets:
    #      - name: hub-out
          restartPolicy: Always
          volumes:
          - hostPath:
              path: /var/lib/kubelet/pod-resources
              type: ""
            name: pod-gpu-resources
          - emptyDir:
              medium: Memory
            name: device-metrics
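
    Once the DaemonSet is running, a quick sanity check against a GPU node (the node IP is a placeholder; 59101 is the hostPort defined above):

    kubectl -n monitor get pods -l app.kubernetes.io/name=pod-gpu-metrics-exporter -o wide
    curl -s http://<node-ip>:59101/metrics | grep dcgm_gpu_utilization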
    

    Collected metrics explained

    Metric                      Description
    dcgm_fan_speed_percent      GPU fan speed (%)
    dcgm_sm_clock               GPU SM clock (MHz)
    dcgm_memory_clock           GPU memory clock (MHz)
    dcgm_gpu_temp               GPU temperature (°C)
    dcgm_power_usage            GPU power usage (W)
    dcgm_pcie_tx_throughput     Total bytes transmitted over PCIe TX (KB)
    dcgm_pcie_rx_throughput     Total bytes received over PCIe RX (KB)
    dcgm_pcie_replay_counter    Total number of PCIe replays
    dcgm_gpu_utilization        GPU utilization (%)
    dcgm_mem_copy_utilization   GPU memory utilization (%)
    dcgm_enc_utilization        GPU encoder utilization (%)
    dcgm_dec_utilization        GPU decoder utilization (%)
    dcgm_xid_errors             Value of the last XID error on the GPU
    dcgm_power_violation        Throttling duration due to the power cap (µs)
    dcgm_thermal_violation      Throttling duration due to thermal constraints (µs)
    dcgm_sync_boost_violation   Throttling duration due to sync-boost constraints (µs)
    dcgm_fb_free                Free GPU framebuffer memory (MiB)
    dcgm_fb_used                Used GPU framebuffer memory (MiB)
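
    A couple of example PromQL expressions over these metrics, for Grafana panels or alert rules (the label names and the 85 °C threshold are only illustrative assumptions):

    # average GPU utilization per node over the last 5 minutes
    avg by (instance) (avg_over_time(dcgm_gpu_utilization[5m]))
    # any GPU running hotter than 85 °C
    dcgm_gpu_temp > 85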

    Node monitoring

    Reference YAML

    The modified YAML below has been tested successfully:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
      name: node-exporter
      namespace: monitor
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: node-exporter
      template:
        metadata:
          labels:
            app.kubernetes.io/name: node-exporter
        spec:
          containers:
          - args:
            - --web.listen-address=0.0.0.0:59100
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
            - --no-collector.wifi
            - --no-collector.hwmon
            - --collector.filesystem.ignored-mount-points=^/(var.*|run.*|boot.*|snap.*|dev|proc|sys|var/lib/docker/.+)($|/)
            - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
            image: xxx/node-exporter:latest
            imagePullPolicy: IfNotPresent
            name: node-exporter
            ports:
            # when hostNetwork is true, containerPort and hostPort must be identical
            - containerPort: 59100
              hostPort: 59100
              name: node-port
              protocol: TCP
            resources:
              limits:
                cpu: 250m
                memory: 180Mi
              requests:
                cpu: 102m
                memory: 180Mi
            securityContext:
              readOnlyRootFilesystem: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /host/proc
              name: proc
            - mountPath: /host/sys
              name: sys
            - mountPath: /host/root
              mountPropagation: HostToContainer
              name: root
              readOnly: true
          # the following settings let the exporter read real host-level data
          hostIPC: true
          hostNetwork: true
          hostPID: true
          # secret for the private image registry
    #      imagePullSecrets:
    #      - name: hub-out
          nodeSelector:
            beta.kubernetes.io/os: linux
          restartPolicy: Always
          volumes:
          - hostPath:
              path: /proc
              type: ""
            name: proc
          - hostPath:
              path: /sys
              type: ""
            name: sys
          - hostPath:
              path: /
              type: ""
            name: root
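
    After the node-exporter DaemonSet is up, each node should expose metrics on the host network (the node IP is a placeholder; 59100 is the hostPort defined above):

    curl -s http://<node-ip>:59100/metrics | head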
    

    prometheus

    Scrape targets are discovered via file_sd_configs; the target list lives in prometheus-etc.json (a sketch of that file follows the config below).
    prometheus.yaml

    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus-dev'
        file_sd_configs:
        - files:
          - prometheus-etc.json
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['192.168.20.75:9093']
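
    A minimal sketch of the prometheus-etc.json targets file referenced above (the addresses are placeholders; the ports match the hostPort values used by the two DaemonSets):

    [
      {
        "targets": ["<node-ip>:59100", "<node-ip>:59101"],
        "labels": {
          "env": "dev"
        }
      }
    ]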
    

    grafana

    Reference sites:
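
    On the Grafana side, Prometheus only needs to be added as a data source; a minimal provisioning sketch, assuming a standard Grafana install (the file path and URL are placeholders):

    # /etc/grafana/provisioning/datasources/prometheus.yaml (assumed path)
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://<prometheus-host>:9090
        isDefault: true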
