- install
- configuration
- rules
- PromQL
- Federation
- Pushgateway
- Remote write
install
wget https://github.com/prometheus/prometheus/releases/download/v2.3.2/prometheus-2.3.2.linux-amd64.tar.gz -O prometheus-2.3.2.tar.gz
tar xf prometheus-2.3.2.tar.gz -C /data/
ln -s /data/prometheus-2.3.2.linux-amd64 /data/prometheus
nohup /data/prometheus/prometheus --config.file="/data/prometheus/prometheus.yml" --storage.tsdb.retention=1d --web.enable-lifecycle &
# validate config and rules
promtool check rules rules/host.yml
promtool check config prometheus.yml
# reload
curl -XPOST http://localhost:9090/-/reload
# UI
http://127.0.0.1:9090
configuration
command-line
./prometheus --config.file=prometheus.yml --storage.tsdb.path=/data --web.enable-lifecycle
-h # help
--web.enable-lifecycle # allow reloading the config via HTTP POST to /-/reload
--web.console.templates=consoles/ # path to the console templates
--storage.tsdb.retention.time=15d # delete data older than 15 days
configure-file
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  scrape_timeout: 10s
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'
rule_files:
  - "first.rules"
  - "my/*.rules"
scrape_configs:
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          group: 'localhost'
# remote write
remote_write:
  - url: http://remote1/push
    name: drop_expensive
    write_relabel_configs:
      - source_labels: [__name__]
        regex: expensive.*
        action: drop
# remote read
remote_read:
  - url: http://remote1/read
    read_recent: true
    name: default
alerting:
  alertmanagers:
    - scheme: https
      static_configs:
        - targets:
            - "1.2.3.4:9093"
            - "1.2.3.5:9093"
rules
go get github.com/prometheus/prometheus/cmd/promtool
promtool check rules /path/to/example.rules.yml
recording rules
Purpose: precompute expressions ahead of time and save the results as new time series. Queries become much faster, which suits dashboards; mind the evaluation interval.
groups:
  - name: example-recording-rules
    rules:
      - record: job:http_inprogress_requests:sum # recording rule names use colons by convention
        expr: sum(http_inprogress_requests) by (job)
alerting rules
Purpose: define alert conditions and send alerts to a third party (e.g. Alertmanager).
groups:
  - name: example-alerting-rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m # how long expr must keep matching before the alert goes from pending to firing
        labels: # extra labels to attach; existing keys are overwritten
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
PromQL
Query expression data types (query examples)
- Instant vector: a set of time series, each with a single sample at one point in time.
  Label matchers:
    =  : Select labels that are exactly equal to the provided string.
    != : Select labels that are not equal to the provided string.
    =~ : Select labels that regex-match the provided string.
    !~ : Select labels that do not regex-match the provided string.
  http_requests_total{environment=~"staging|testing|development",method!="GET"}
- Range vector: a set of time series with a range of samples over a time window; an offset can be applied.
  Duration units:
    s - seconds
    m - minutes
    h - hours
    d - days
    w - weeks
    y - years
  http_requests_total{job="prometheus"}[5m]
  sum(http_requests_total{method="GET"} offset 5m) // GOOD.
example
# counter metrics
increase(node_cpu[2m]) / 120 # increase over two minutes divided by 120s = average per-second growth
rate(node_cpu[2m]) # average per-second growth over two minutes; "long tail" problem: a brief 100% CPU spike is averaged away
irate(node_cpu[2m]) # instantaneous per-second growth, based on the last two samples in the window
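A quick way to see the difference between rate and irate is to recompute them by hand on synthetic counter samples (a Python sketch, not Prometheus code; timestamps and values are made up):

```python
# Synthetic counter samples as (timestamp_seconds, counter_value).
# Steady growth of 1/s, then a spike of 330 in the final 30s window.
samples = [(0, 0), (30, 30), (60, 60), (90, 90), (120, 420)]

def prom_rate(samples):
    # rate(): average per-second increase over the whole range.
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

def prom_irate(samples):
    # irate(): per-second increase from the last two samples only.
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

print(prom_rate(samples))   # 3.5  -> the spike is diluted across the window
print(prom_irate(samples))  # 11.0 -> the spike is visible
```

This is why irate is preferred for fast-moving counters on detailed graphs, while rate suits longer-term trends and alerting.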
# aggregation
sum(container_memory_rss{instance="10.51.1.126:10250"}) by (pod_name) # memory RSS per pod
sum(http_requests_total) # total HTTP requests
topk(5,http_requests_total) # top 5 series by request count
sum by (handler)(topk(5,http_requests_total)) # top 5 request counts, keeping only the handler label
# dynamic label rewriting
# 1. derive a new label with a regex
label_replace(v instant-vector, dst_label string, replacement string, src_label string, regex string)
label_replace(up, "host", "$1", "instance", "(.*):.*")
up{host="localhost",instance="localhost:8080",job="cadvisor"} # result gains a host label
# 2. derive a new label by joining existing ones
label_join(v instant-vector, dst_label string, separator string, src_label_1 string, src_label_2 string, ...)
label_join(up,"info","&","instance","job")
up{instance="localhost:8080",job="cadvisor",info="localhost:8080&cadvisor"}
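The semantics of the two label functions can be mimicked in a few lines of Python to make them concrete (a sketch; it assumes the regex is matched against the full source-label value, as PromQL's anchored regexes do):

```python
import re

def label_replace(labels, dst, replacement, src, regex):
    # PromQL anchors the regex against the whole value of src;
    # $1, $2... refer to capture groups (Python's re uses \1, so convert).
    m = re.fullmatch(regex, labels.get(src, ""))
    if not m:
        return dict(labels)  # no match: series unchanged
    return {**labels, dst: m.expand(replacement.replace("$", "\\"))}

def label_join(labels, dst, sep, *srcs):
    # Concatenate the values of the source labels with the separator.
    return {**labels, dst: sep.join(labels.get(s, "") for s in srcs)}

up = {"instance": "localhost:8080", "job": "cadvisor"}
print(label_replace(up, "host", "$1", "instance", "(.*):.*"))  # adds host="localhost"
print(label_join(up, "info", "&", "instance", "job"))          # adds info="localhost:8080&cadvisor"
```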
Federation
Purpose: one Prometheus server scrapes selected data from other Prometheus servers.
Scenarios:
- Hierarchical federation: delegation. Tree-like: a top-level server collects pre-aggregated data from subordinate servers.
- Cross-service federation: division of labor. Data collected by each server ends up in the same database, so a single server can query what all the others scrape.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    # do not overwrite any labels exposed by the source server
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 'source-prometheus-1:9090'
        - 'source-prometheus-2:9090'
        - 'source-prometheus-3:9090'
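The '{__name__=~"job:.*"}' matcher only pulls series whose names start with "job:"; such series are typically produced on the source servers by recording rules. A sketch (the group and rule names here are made up):

```yaml
groups:
  - name: federation-aggregates
    rules:
      # aggregate away instance-level detail before federation
      - record: job:http_requests_total:sum
        expr: sum(http_requests_total) by (job)
```

Federating only these pre-aggregated "job:" series keeps the top-level server's ingestion volume small.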
Pushgateway
Use case: collecting metrics from service-level batch jobs, which push their metrics instead of being scraped.
# command-line test
cat <<EOF | curl --data-binary @- http://127.0.0.1:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF
# query
some_metric
some_metric{exported_instance="some_instance",exported_job="some_job",instance="localhost:9091",job="pushgateway",label="val1"} 42
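The exported_instance/exported_job labels in the query result appear because Prometheus renames conflicting target labels by default; for a Pushgateway you usually want the pushed job/instance kept instead, via honor_labels. A config sketch (target address assumed):

```yaml
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true # keep pushed job/instance instead of exported_job/exported_instance
    static_configs:
      - targets: ['127.0.0.1:9091']
```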
Remote write
How it works: Prometheus sends samples over HTTP to an adapter, which writes them into third-party storage (here Elasticsearch, via prometheusbeat).
# adapter config (writes to Elasticsearch): prometheusbeat.yml
prometheusbeat:
  listen: ":8080"
  context: "/prometheus"
......
output.elasticsearch:
  hosts: ["localhost:9200"]
# prometheus config
remote_write:
  - url: "http://localhost:8080/prometheus"