Kubernetes Nvidia GPU Monitor &

Author: Anoyi | Published 2021-03-05 18:53

    ▶ Export Metrics

    1. Prerequisites

    2. Label the GPU servers

    kubectl label nodes <node-name> device_type=gpu
    

    3. Run DCGM Exporter on the GPU nodes

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: dcgm-exporter
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          k8s-app: dcgm-exporter
      template:
        metadata:
          labels:
            k8s-app: dcgm-exporter
        spec:
          nodeSelector:
            device_type: gpu
          hostNetwork: true
          hostPID: true
          containers:
            - name: dcgm-exporter
              image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04"
              imagePullPolicy: Always
              securityContext:
                capabilities:
                  add:
                    - SYS_ADMIN
              ports:
                - name: metrics
                  containerPort: 9400
                  hostPort: 9400
    

    For more details, see https://github.com/NVIDIA/gpu-monitoring-tools

    4. Test fetching the metrics

    The previous step exposes port 9400 on the host:

    curl <host-ip>:9400/metrics
    

    The metrics look like the following. This output comes from a single server with two GPUs:

    # HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
    # TYPE DCGM_FI_DEV_SM_CLOCK gauge
    # HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
    # TYPE DCGM_FI_DEV_MEM_CLOCK gauge
    # HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
    # TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
    # HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
    # TYPE DCGM_FI_DEV_GPU_TEMP gauge
    # HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
    # TYPE DCGM_FI_DEV_POWER_USAGE gauge
    ......
    
    DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 1290
    DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 877
    DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 39
    DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 42
    DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 57.555000
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 154680858400
    ......
    
    DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 1290
    DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 877
    DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 40
    DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 43
    DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 55.157000
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 148793918798
    .....
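
To work with this output programmatically, the text format can be split into metric name, labels, and value. The following is a minimal sketch, not part of the original setup: the function name `parse_metrics` and the regular expressions are illustrative assumptions, and the parser only handles simple labelled lines like the ones shown above (no escaped quotes or braces inside label values).

```python
import re

# Matches sample lines such as: NAME{label="value",...} 42
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple sketch does not understand
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Two lines copied from the exporter output shown above.
SAMPLE_OUTPUT = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 42
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 43
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_OUTPUT):
        print(name, "gpu", labels["gpu"], "=", value)
```

In practice the text would come from the `curl <host-ip>:9400/metrics` call above rather than from an embedded string.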
    

    ▶ Collect Metrics with Prometheus

    1. Create the ConfigMap

    Each job corresponds to one GPU server:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      namespace: kube-system
    data:
      prometheus.yml: |
        scrape_configs:
        - job_name: 'metrics-gpu-1'
          honor_labels: true
          static_configs:
            - targets: ['<host01-ip>:9400']
              labels:
                instance: GN1
        - job_name: 'metrics-gpu-2'
          honor_labels: true
          static_configs:
            - targets: ['<host02-ip>:9400']
              labels:
                instance: GN2
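
Because every GPU server gets its own job, the `prometheus.yml` body grows by one near-identical stanza per host. One way to avoid hand-editing it is to render the text from a host list. This is only a sketch: `build_scrape_config` is a hypothetical helper, and the IP addresses are placeholders from the documentation range, not values from the original setup.

```python
def build_scrape_config(hosts, port=9400):
    """Render a prometheus.yml body with one job per (instance_label, host_ip) pair."""
    lines = ["scrape_configs:"]
    for index, (instance, host_ip) in enumerate(hosts, start=1):
        lines += [
            f"- job_name: 'metrics-gpu-{index}'",
            "  honor_labels: true",
            "  static_configs:",
            f"    - targets: ['{host_ip}:{port}']",
            "      labels:",
            f"        instance: {instance}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical placeholder addresses; substitute the real GPU host IPs.
    print(build_scrape_config([("GN1", "192.0.2.11"), ("GN2", "192.0.2.12")]))
```

The printed text has the same shape as the `prometheus.yml` value in the ConfigMap above and would need to be indented to sit under that key.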
    

    2. Deploy Prometheus

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
      namespace: kube-system
    spec:
      replicas: 1
      revisionHistoryLimit: 3
      selector:
        matchLabels:
          k8s-app: prometheus
      template:
        metadata:
          labels:
            k8s-app: prometheus
        spec:
          volumes:
            - name: prometheus
              configMap:
                name: prometheus-config
          serviceAccountName: admin-user
          containers:
            - name: prometheus
              image: "prom/prometheus:latest"
              volumeMounts:
                - name: prometheus
                  mountPath: /etc/prometheus/
              imagePullPolicy: Always
              ports:
                - containerPort: 9090
                  protocol: TCP
    

    3. Create the Prometheus Service

    kind: Service
    apiVersion: v1
    metadata:
      labels:
        k8s-app: prometheus
      name: prometheus-service
      namespace: kube-system
    spec:
      ports:
        - port: 9090
          targetPort: 9090
      selector:
        k8s-app: prometheus
    

    ▶ Visualize Metrics with Grafana

    1. Deploy Grafana

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: grafana
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: grafana
      template:
        metadata:
          labels:
            k8s-app: grafana
        spec:
          containers:
            - name: grafana
              image: grafana/grafana:latest
              env:
                - name: GF_SECURITY_ADMIN_PASSWORD
                  value: <your-password>
                - name: GF_SECURITY_ADMIN_USER
                  value: <your-username>
              ports:
                - containerPort: 3000
                  protocol: TCP
    

    2. Create the Grafana Service

    kind: Service
    apiVersion: v1
    metadata:
      labels:
        k8s-app: grafana
      name: grafana-service
      namespace: kube-system
    spec:
      ports:
        - port: 3000
          targetPort: 3000
          nodePort: 31111
      selector:
        k8s-app: grafana
      type: NodePort
    

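    3. Open Grafana

    Web address: http://<kubernetes-node-ip>:31111/. The username and password are the ones set in step 1 (GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD).

    4. Add a data source

    Click Settings -> Data Sources -> Add data source -> Prometheus. Example configuration: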

    • Name: Prometheus
    • Default: Yes
    • URL: http://prometheus-service:9090
    • Access: Server
    • HTTP Method: GET

    Click Save & Test to connect Grafana to the Prometheus data.
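
The same data source can also be described as a JSON document instead of being clicked together in the UI. The sketch below builds such a document. It assumes Grafana's data source HTTP API (`POST /api/datasources`) and assumes that the UI's "Server" access mode corresponds to `"access": "proxy"`; the helper name is hypothetical, and the field names should be checked against the Grafana version actually deployed.

```python
import json


def prometheus_datasource_payload(name="Prometheus",
                                  url="http://prometheus-service:9090"):
    """Build a JSON-serializable body describing the Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",                  # assumed equivalent of "Server" in the UI
        "isDefault": True,                  # Default: Yes
        "jsonData": {"httpMethod": "GET"},  # HTTP Method: GET
    }


if __name__ == "__main__":
    print(json.dumps(prometheus_datasource_payload(), indent=2))
```

Under those assumptions, the printed JSON could be sent to `http://<kubernetes-node-ip>:31111/api/datasources` using the admin credentials from step 1.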

    5. Build a custom GPU monitoring dashboard

    For example, to display GPU temperature:

    # HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
    DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 42
    

    To query the temperature of each GPU, use the query DCGM_FI_DEV_GPU_TEMP.
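
The same query can be checked outside Grafana through the Prometheus instant-query endpoint (`/api/v1/query`). The sketch below only builds the URL and decodes a response. The helper names are hypothetical, and `EXAMPLE_RESPONSE` is a hand-written illustration of the response shape for a vector result, not output captured from a real server.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def temperatures_by_gpu(response_body):
    """Map (instance, gpu) -> temperature from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    result = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # the sample value arrives as a string
        result[(labels.get("instance"), labels.get("gpu"))] = float(value)
    return result


# Hand-written example; the timestamp is arbitrary.
EXAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "DCGM_FI_DEV_GPU_TEMP", "instance": "GN1", "gpu": "0"},
             "value": [1614941580, "42"]},
            {"metric": {"__name__": "DCGM_FI_DEV_GPU_TEMP", "instance": "GN1", "gpu": "1"},
             "value": [1614941580, "43"]},
        ],
    },
})

if __name__ == "__main__":
    print(instant_query_url("http://prometheus-service:9090", "DCGM_FI_DEV_GPU_TEMP"))
    print(temperatures_by_gpu(EXAMPLE_RESPONSE))
```

The `instance` label here is the one assigned per job in the Prometheus ConfigMap, which is what lets a panel tell GPUs on different servers apart.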

    Other queries:

    • GPU count: count(DCGM_FI_DEV_SM_CLOCK)
    • Overall GPU memory utilization: sum(DCGM_FI_DEV_FB_USED) / (sum(DCGM_FI_DEV_FB_FREE) + sum(DCGM_FI_DEV_FB_USED))
    • GPU power draw: DCGM_FI_DEV_POWER_USAGE
    • GPU memory temperature: DCGM_FI_DEV_MEMORY_TEMP
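
To make the memory utilization expression concrete, the sketch below mirrors its arithmetic on plain numbers. The per-GPU readings are made-up values for illustration, and `memory_utilization` is a hypothetical helper, not something provided by Prometheus or DCGM.

```python
def memory_utilization(used, free):
    """Mirror sum(FB_USED) / (sum(FB_FREE) + sum(FB_USED)) over per-GPU values."""
    total_used = sum(used)
    total = total_used + sum(free)
    if total == 0:
        raise ValueError("no framebuffer memory reported")
    return total_used / total


if __name__ == "__main__":
    # Hypothetical framebuffer readings for two GPUs, in the same unit.
    used = [4096, 12288]
    free = [12288, 4096]
    print(f"{memory_utilization(used, free):.1%}")  # prints 50.0%
```

The expression yields a ratio between 0 and 1, so a Grafana panel would need either a percent-of-one unit or a multiplication by 100 in the query to show it as a percentage.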
