▶ Export Metrics
1. Prerequisites
- NVIDIA Tesla drivers >= R384 (download from the NVIDIA Driver Downloads page)
- nvidia-docker version > 2.0 (see how to install it and its prerequisites)
- Optionally, configure Docker to use nvidia as the default runtime
- NVIDIA device plugin for Kubernetes (see how to install; a quick check is shown below)
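Assuming the device plugin is installed, a quick sanity check (the node name is a placeholder) is to confirm that a GPU node advertises the nvidia.com/gpu resource:
kubectl describe node <node-name> | grep nvidia.com/gpu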
2. Label the GPU nodes
kubectl label nodes <node-name> device_type=gpu
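To confirm the label took effect:
kubectl get nodes -l device_type=gpu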
3. Run DCGM Exporter on the GPU nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: dcgm-exporter
  template:
    metadata:
      labels:
        k8s-app: dcgm-exporter
    spec:
      nodeSelector:
        device_type: gpu
      hostNetwork: true
      hostPID: true
      containers:
      - name: dcgm-exporter
        image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04"
        imagePullPolicy: Always
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
        ports:
        - name: metrics
          containerPort: 9400
          hostPort: 9400
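To roll out the DaemonSet, save the manifest above (the file name here is just an example) and apply it, then check that an exporter pod is running on every labeled node:
kubectl apply -f dcgm-exporter.yaml
kubectl -n kube-system get pods -l k8s-app=dcgm-exporter -o wide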
For more details, see https://github.com/NVIDIA/gpu-monitoring-tools
4. Test the metrics endpoint
The previous step exposes port 9400 on each GPU host:
curl <host-ip>:9400/metrics
The output looks like the following; this example is from a single server with two GPUs:
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
......
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 1290
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 39
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 42
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 57.555000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 154680858400
......
DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 1290
DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 40
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 43
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 55.157000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 148793918798
.....
▶ Collect Metrics with Prometheus
1. Create a ConfigMap
Each job corresponds to one GPU server:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'metrics-gpu-1'
      honor_labels: true
      static_configs:
      - targets: ['<host01-ip>:9400']
        labels:
          instance: GN1
    - job_name: 'metrics-gpu-2'
      honor_labels: true
      static_configs:
      - targets: ['<host02-ip>:9400']
        labels:
          instance: GN2
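Apply the ConfigMap (again, the file name is just an example):
kubectl apply -f prometheus-config.yaml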
2. Deploy Prometheus
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      volumes:
      - name: prometheus
        configMap:
          name: prometheus-config
      serviceAccountName: admin-user
      containers:
      - name: prometheus
        image: "prom/prometheus:latest"
        volumeMounts:
        - name: prometheus
          mountPath: /etc/prometheus/
        imagePullPolicy: Always
        ports:
        - containerPort: 9090
          protocol: TCP
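Note that this manifest references the admin-user service account, which must already exist in kube-system. Apply the Deployment and wait for it to become available (file name is an example):
kubectl apply -f prometheus-deployment.yaml
kubectl -n kube-system rollout status deployment/prometheus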
3. Create a Prometheus Service
kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus-service
  namespace: kube-system
spec:
  ports:
  - port: 9090
    targetPort: 9090
  selector:
    k8s-app: prometheus
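After the Service is created, one way to check that Prometheus is scraping both exporters is to port-forward the Service and open the targets page; this sketch assumes you saved the manifest as prometheus-service.yaml:
kubectl apply -f prometheus-service.yaml
kubectl -n kube-system port-forward svc/prometheus-service 9090:9090
# then open http://localhost:9090/targets in a browser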
▶ Visualize Metrics with Grafana
1. Deploy Grafana
kind: Deployment
apiVersion: apps/v1
metadata:
  name: grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest
        env:
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: <your-password>
        - name: GF_SECURITY_ADMIN_USER
          value: <your-username>
        ports:
        - containerPort: 3000
          protocol: TCP
2. Create a Grafana Service
kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: grafana
  name: grafana-service
  namespace: kube-system
spec:
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 31111
  selector:
    k8s-app: grafana
  type: NodePort
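Apply the Grafana Deployment and Service (file names are examples) and confirm the NodePort is exposed:
kubectl apply -f grafana-deployment.yaml -f grafana-service.yaml
kubectl -n kube-system get svc grafana-service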
3. Access Grafana
Web address: http://<kubernetes-node-ip>:31111/ ; log in with the username and password configured in step 1.
4. Add a Data Source
Click Settings -> Data Sources -> Add data source -> Prometheus. Example configuration:
- Name: Prometheus
- Default: Yes
- URL: http://prometheus-service:9090
- Access: Server
- HTTP Method: GET
Click Save & Test to connect the Prometheus data source.
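If you prefer to script this step, the same data source can be created through Grafana's HTTP API; this is only a sketch, reusing the admin credentials and NodePort from the earlier steps, and note that the UI's "Server" access mode corresponds to "proxy" in the API:
curl -X POST http://<your-username>:<your-password>@<kubernetes-node-ip>:31111/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://prometheus-service:9090", "access": "proxy", "isDefault": true}'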
5. Build a custom GPU dashboard
For example, to display GPU temperature:
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 42
To query each GPU's temperature, use the expression DCGM_FI_DEV_GPU_TEMP.
Other useful queries:
- GPU count: count(DCGM_FI_DEV_SM_CLOCK)
- Overall GPU memory utilization: sum(DCGM_FI_DEV_FB_USED) / (sum(DCGM_FI_DEV_FB_FREE) + sum(DCGM_FI_DEV_FB_USED))
- GPU power usage: DCGM_FI_DEV_POWER_USAGE
- GPU memory temperature: DCGM_FI_DEV_MEMORY_TEMP
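These series can also be aggregated in PromQL; for example (a sketch, not from the original article, using the instance label set in the scrape config):
# average GPU temperature per server
avg by (instance) (DCGM_FI_DEV_GPU_TEMP)
# memory utilization as a percentage, per GPU
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100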