Prometheus configuration

Author: xyz098 | Published 2020-01-19 18:09
  • install
  • configuration
  • rules
  • PromQL
  • Federation
  • Pushgateway
  • Remote write

install

wget https://github.com/prometheus/prometheus/releases/download/v2.3.2/prometheus-2.3.2.linux-amd64.tar.gz -O prometheus-2.3.2.tar.gz
tar xf prometheus-2.3.2.tar.gz -C /data/

# ln -s TARGET LINK_NAME: /data/prometheus points at the versioned directory
ln -s /data/prometheus-2.3.2.linux-amd64 /data/prometheus
nohup /data/prometheus/prometheus --config.file="/data/prometheus/prometheus.yml" --storage.tsdb.retention=1d --web.enable-lifecycle &

# validate rule and config files
promtool check rules rules/host.yml  
promtool check config prometheus.yml 

# reload
curl -XPOST  http://localhost:9090/-/reload     

# UI
http://127.0.0.1:9090

configuration

command-line

./prometheus --config.file=prometheus.yml --storage.tsdb.path=/data --web.enable-lifecycle

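-h   # help
--web.enable-lifecycle              # enable shutdown and reload over HTTP (POST /-/reload)
--web.console.templates consoles/   # directory containing the console templates
--storage.tsdb.retention.time 15d   # how long to keep data; older data is deleted (older releases such as v2.3.2 use --storage.tsdb.retention)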

configure-file

conf.good.yml

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  scrape_timeout:      10s
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - "first.rules"
  - "my/*.rules"

scrape_configs:
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          group: 'localhost'

# Remote write
remote_write:
  - url: http://remote1/push
    name: drop_expensive
    write_relabel_configs:
    - source_labels: [__name__]
      regex:         expensive.*
      action:        drop

# Remote read
remote_read:
  - url: http://remote1/read
    read_recent: true
    name: default

alerting:
  alertmanagers:
  - scheme: https
    static_configs:
    - targets:
      - "1.2.3.4:9093"
      - "1.2.3.5:9093"
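The `write_relabel_configs` rule in the `remote_write` block drops every series whose `__name__` matches `expensive.*` before it is sent. The Python sketch below mimics only that one `action: drop` rule, assuming the standard relabeling behavior: source label values joined with `;`, and the regex fully anchored. It is an illustration, not Prometheus's implementation; `should_drop` and `apply_drop_rule` are hypothetical names.

```python
import re


def should_drop(labels, source_labels, regex, separator=";"):
    """Return True if an `action: drop` rule matches this series (simplified)."""
    # Source label values are joined with the separator (";" by default);
    # a missing label contributes an empty string.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Relabeling regexes are fully anchored, so use fullmatch rather than search.
    return re.fullmatch(regex, value) is not None


def apply_drop_rule(batch, source_labels, regex):
    """Keep only the series that survive the drop rule before they are sent."""
    return [s for s in batch if not should_drop(s, source_labels, regex)]


batch = [
    {"__name__": "expensive_query_seconds", "job": "api"},
    {"__name__": "cheap_counter_total", "job": "api"},
    {"__name__": "not_expensive_total", "job": "api"},
]
kept = apply_drop_rule(batch, ["__name__"], "expensive.*")
print([s["__name__"] for s in kept])  # ['cheap_counter_total', 'not_expensive_total']
```

Because of the anchoring, `not_expensive_total` is kept: `expensive.*` must match the whole name, not just a substring.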

rules

go get github.com/prometheus/prometheus/cmd/promtool
promtool check rules /path/to/example.rules.yml

recording rules

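Purpose: precompute frequently used or expensive expressions and save the results as new time series. Queries against them are faster, which suits dashboards shown on large screens; choose the evaluation interval with care.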

groups:
  - name: example-recording-rules
    rules:
    - record: job:http_inprogress_requests:sum       # recording rule names use colons: level:metric:operations
      expr: sum(http_inprogress_requests) by (job)
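The colons in `job:http_inprogress_requests:sum` follow the `level:metric:operations` naming convention for recording rules. As a small sketch, the Python helper below checks that a name is a valid metric name and splits it into those three parts. `split_recording_name` is a hypothetical helper; the convention is a recommendation, not something Prometheus enforces.

```python
import re

# Valid Prometheus metric names (a recording rule's output is a metric name too).
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def split_recording_name(name):
    """Split a recording rule name that follows level:metric:operations."""
    if not METRIC_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid metric name: {name!r}")
    parts = name.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected level:metric:operations, got {name!r}")
    level, metric, operations = parts
    return {"level": level, "metric": metric, "operations": operations}


print(split_recording_name("job:http_inprogress_requests:sum"))
# {'level': 'job', 'metric': 'http_inprogress_requests', 'operations': 'sum'}
```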

alerting rules

Purpose: define alert conditions; when they fire, alerts are sent to an external service such as Alertmanager.

groups:
- name: example-alerting-rules
  rules:
  - alert: InstanceDown
    expr: up == 0
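    for: 5m    # how long expr must keep returning the series before the alert moves from pending to firing
    labels:    # extra labels to add; keys that already exist are overwritten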
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
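The `for: 5m` clause makes an alert wait in the *pending* state until its expression has held for five minutes. The Python sketch below models that state change for a single series, assuming `evaluation_interval: 15s` as in the global config. It is simplified: real Prometheus also reports a resolved alert to Alertmanager instead of just going quiet. `AlertState` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert series across rule evaluations (simplified)."""

    for_seconds: float
    active_since: Optional[float] = None  # when expr first returned this series

    def evaluate(self, now: float, expr_true: bool) -> str:
        if not expr_true:
            # Condition cleared: reset (Prometheus would also mark it resolved).
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        # Pending until the condition has held for at least `for`.
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Simulate `up == 0` starting at the third evaluation, with `for: 5m`.
alert = AlertState(for_seconds=300)
states = [alert.evaluate(step * 15.0, expr_true=(step >= 2)) for step in range(25)]
print(states[1], states[2], states[21], states[22])  # inactive pending pending firing
```

The condition starts at t=30s, so the alert fires at t=330s, the first evaluation at least 300s later.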

PromQL

Selector data types (query examples)

  • Instant vector

    A single sample per time series, all at the same timestamp.

    • =: Select labels that are exactly equal to the provided string.
    • !=: Select labels that are not equal to the provided string.
    • =~: Select labels that regex-match the provided string.
    • !~: Select labels that do not regex-match the provided string.
    http_requests_total{environment=~"staging|testing|development",method!="GET"}
    
  • Range vector

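    A range of samples per time series over a time window; an offset can shift the window into the past. Duration units: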

    • s - seconds
    • m - minutes
    • h - hours
    • d - days
    • w - weeks
    • y - years
    http_requests_total{job="prometheus"}[5m]
    sum(http_requests_total{method="GET"} offset 5m)   # offset must directly follow the selector
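The selectors above combine label matchers with a range and an optional offset. As a rough Python sketch, assuming the documented behavior that a missing label compares as an empty string and that `=~`/`!~` regexes are fully anchored, this is how a series is matched and how the window moves with `offset`. Window boundary handling is simplified and has differed between Prometheus versions. All function names are hypothetical.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def matches(labels, matchers):
    """Apply (name, op, value) matchers with op in =, !=, =~, !~."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator {op!r}")
        if not ok:
            return False
    return True


def parse_duration(text):
    """Parse a single-unit duration such as '5m' or '1d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if not match:
        raise ValueError(f"unsupported duration {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def range_select(samples, eval_time, window, offset="0s"):
    """Samples (t, v) in the window (end - window, end], end = eval_time - offset."""
    end = eval_time - parse_duration(offset)
    start = end - parse_duration(window)
    return [(t, v) for t, v in samples if start < t <= end]


labels = {"__name__": "http_requests_total", "environment": "staging", "method": "POST"}
selector = [("environment", "=~", "staging|testing|development"), ("method", "!=", "GET")]
print(matches(labels, selector))  # True

samples = [(t, float(t)) for t in range(0, 1200, 60)]  # one sample per minute
# Like `[5m] offset 5m` evaluated at t=1140: the window ends at 840.
print(range_select(samples, eval_time=1140, window="5m", offset="5m"))
```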
    

example

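# counter metrics
increase(node_cpu[2m]) / 120   # per-second average increase of CPU time over 2 minutes (equivalent to rate)
rate(node_cpu[2m])             # per-second average rate over 2 minutes; "long-tail problem": a brief 100% CPU spike is averaged away
irate(node_cpu[2m])            # instantaneous rate, computed from the last two samples in the 2-minute window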
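To see why `rate` smooths out a short spike while `irate` reacts to it, here is a simplified Python calculation. It assumes a CPU-seconds counter sampled every 15s and treats any drop in value as a counter reset. Unlike real Prometheus, it does not extrapolate to the window edges. `simple_increase`, `simple_rate` and `simple_irate` are hypothetical names.

```python
def simple_increase(samples):
    """Total counter increase over (t, value) samples, handling resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A lower value means the counter restarted from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase across the whole window (no extrapolation)."""
    return simple_increase(samples) / (samples[-1][0] - samples[0][0])


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    delta = (v2 - v1) if v2 >= v1 else v2
    return delta / (t2 - t1)


# 2 minutes of a CPU-seconds counter: 10% busy, then 100% busy in the last 15s.
times = list(range(0, 121, 15))
values = [1.5 * i for i in range(8)] + [25.5]
samples = list(zip(times, values))
print(round(simple_rate(samples), 4), simple_irate(samples))  # 0.2125 1.0
```

The 2-minute average reports about 21% CPU while the last-two-samples rate shows 100%, which is the "long-tail" effect described above.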

# aggregation
sum(container_memory_rss{instance="10.51.1.126:10250"}) by (pod_name)  # RSS memory per pod
sum(http_requests_total)    # total number of HTTP requests
topk(5,http_requests_total)  # the 5 series with the highest request counts
sum by (handler)(topk(5,http_requests_total))  # the top 5 series, summed per handler (only the handler label is kept)
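The nesting in `sum by (handler)(topk(5, ...))` matters: `topk` first picks the largest series globally, then `sum by` collapses them to one value per `handler`. The Python sketch below reproduces that for one evaluation step, with made-up sample data. `sum_by` and `topk` are hypothetical stand-ins, not a PromQL engine.

```python
import heapq
from collections import defaultdict


def sum_by(series, by):
    """sum by (<labels>): add up series that share the same values for `by`."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)


def topk(k, series):
    """topk(k, v): the k series with the largest values, labels preserved."""
    return heapq.nlargest(k, series, key=lambda item: item[1])


series = [
    ({"handler": "/api", "instance": "a"}, 100.0),
    ({"handler": "/api", "instance": "b"}, 80.0),
    ({"handler": "/login", "instance": "a"}, 50.0),
    ({"handler": "/metrics", "instance": "a"}, 10.0),
    ({"handler": "/login", "instance": "b"}, 5.0),
]
print(sum_by(series, []))                  # like sum(...): {(): 245.0}
print(sum_by(topk(3, series), ["handler"]))  # /api -> 180.0, /login -> 50.0
```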

# dynamic label replacement
# 1. create a new label from a regex match on an existing label
label_replace(v instant-vector, dst_label string, replacement string, src_label string, regex string)    
label_replace(up, "host", "$1", "instance",  "(.*):.*")
up{host="localhost",instance="localhost:8080",job="cadvisor"}  # result: a host label is added
# 2. create a new label by joining the values of existing labels
label_join(v instant-vector, dst_label string, separator string, src_label_1 string, src_label_2 string, ...)
label_join(up,"info","&","instance","job")
up{instance="localhost:8080",job="cadvisor",info="localhost:8080&cadvisor"} 
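The following Python functions mimic `label_replace` and `label_join` on a single label set. They assume the documented semantics: the regex is fully anchored against the source label, a non-matching series is returned unchanged, and an empty result removes the destination label. Expansion handles `$1` and `${name}` but not the `$$` escape. The functions are illustrative only.

```python
import re


def _expand(match, template):
    """Expand $1 / ${name} references in the replacement (simplified)."""
    def repl(ref):
        name = ref.group(1) or ref.group(2)
        key = int(name) if name.isdigit() else name
        try:
            return match.group(key) or ""
        except IndexError:  # unknown group expands to ""
            return ""
    return re.sub(r"\$\{(\w+)\}|\$(\w+)", repl, template)


def label_replace(labels, dst, replacement, src, regex):
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: series returned unchanged
    result = dict(labels)
    value = _expand(match, replacement)
    if value:
        result[dst] = value
    else:
        result.pop(dst, None)  # an empty value removes the label
    return result


def label_join(labels, dst, separator, *src_labels):
    result = dict(labels)
    value = separator.join(labels.get(name, "") for name in src_labels)
    if value:
        result[dst] = value
    else:
        result.pop(dst, None)
    return result


up = {"__name__": "up", "instance": "localhost:8080", "job": "cadvisor"}
print(label_replace(up, "host", "$1", "instance", "(.*):.*")["host"])  # localhost
print(label_join(up, "info", "&", "instance", "job")["info"])  # localhost:8080&cadvisor
```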

Federation

Purpose: a Prometheus server scrapes selected time series from other Prometheus servers.

Use cases:

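  • Hierarchical federation: delegation. Tree-like: higher-level servers collect aggregated series from the servers below them.
  • Cross-service federation: division of labor. A server scrapes selected series from another service's Prometheus server, so alerting and queries can use both datasets from a single server.
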
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s

    # do not overwrite the labels exposed by the source server
    honor_labels: true
    metrics_path: '/federate'

    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'

    static_configs:
      - targets:
        - 'source-prometheus-1:9090'
        - 'source-prometheus-2:9090'
        - 'source-prometheus-3:9090'
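Each entry under `'match[]'` becomes one repeated `match[]` query parameter on the `/federate` URL that the `federate` job scrapes. The Python sketch below builds those URLs for the three source servers. It can be handy for testing a selector by hand before putting it into `params`. `federate_url` is a hypothetical helper, and the `http` scheme is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(target, selectors, scheme="http", metrics_path="/federate"):
    """Build the scrape URL: one URL-encoded match[] parameter per selector."""
    query = urlencode([("match[]", s) for s in selectors])
    return f"{scheme}://{target}{metrics_path}?{query}"


SELECTORS = ['{job="prometheus"}', '{__name__=~"job:.*"}']
TARGETS = ["source-prometheus-1:9090", "source-prometheus-2:9090", "source-prometheus-3:9090"]
urls = [federate_url(t, SELECTORS) for t in TARGETS]
print(urls[0])
# Decoding the query string gives back the original selectors.
print(parse_qs(urlsplit(urls[0]).query)["match[]"])
```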

Pushgateway

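Use case: capturing metrics from service-level batch jobs, which may not live long enough to be scraped.

# test from the command line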
cat <<EOF | curl --data-binary @- http://127.0.0.1:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF
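The same push can be made without curl. This Python sketch builds the `/metrics/job/<job>/instance/<instance>` grouping path and a text-format body, then POSTs it with the standard library, like `curl --data-binary`. Assumptions: label values contain no `/` (the Pushgateway has a special encoding for that, not handled here), and HELP text is not escaped. `push_url`, `exposition` and `push` are hypothetical names. In real code, the official `prometheus_client` library's `push_to_gateway` is the usual choice.

```python
import urllib.request
from urllib.parse import quote


def push_url(base, job, grouping=None):
    """Build /metrics/job/<job>{/<label>/<value>} as in the curl example."""
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in (grouping or {}).items():
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    return base.rstrip("/") + path


def escape_label_value(value):
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def exposition(metrics):
    """Render (name, type, help, labels, value) tuples in the text format."""
    lines = []
    for name, metric_type, help_text, labels, value in metrics:
        if help_text:
            lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
        lines.append(f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}")
    return "\n".join(lines) + "\n"  # every line must end with "\n"


def push(base, job, grouping, body):
    """POST the body (requires a running Pushgateway)."""
    request = urllib.request.Request(
        push_url(base, job, grouping), data=body.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


body = exposition([
    ("some_metric", "counter", "", {"label": "val1"}, 42),
    ("another_metric", "gauge", "Just an example.", {}, 2398.283),
])
url = push_url("http://127.0.0.1:9091", "some_job", {"instance": "some_instance"})
print(url)
print(body, end="")
# push("http://127.0.0.1:9091", "some_job", {"instance": "some_instance"}, body)
```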

# query
some_metric
some_metric{exported_instance="some_instance",exported_job="some_job",instance="localhost:9091",job="pushgateway",label="val1"} 42
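The `exported_job` and `exported_instance` labels in the query result come from label conflicts. The pushed series already carry `job` and `instance`, and so does the scrape target (`job="pushgateway"`). With the default `honor_labels: false`, the scraped values are renamed. The Python sketch below is a simplified model of that rule, using the labels from this example. `attach_target_labels` is a hypothetical name. Setting `honor_labels: true` on the Pushgateway scrape job, as in the federation config, keeps the pushed values instead.

```python
def attach_target_labels(scraped, target, honor_labels=False):
    """Combine labels of a scraped series with the scrape target's labels (simplified)."""
    result = dict(scraped)
    for name, value in target.items():
        existing = result.get(name, "")
        if honor_labels:
            if not existing:
                result[name] = value  # scraped value wins; target only fills gaps
            continue
        if existing:
            # Conflict: keep the scraped value as exported_<name> (prefix repeated if taken).
            new_name = "exported_" + name
            while new_name in result:
                new_name = "exported_" + new_name
            result[new_name] = existing
        result[name] = value
    return result


scraped = {"__name__": "some_metric", "label": "val1", "job": "some_job", "instance": "some_instance"}
target = {"job": "pushgateway", "instance": "localhost:9091"}
print(attach_target_labels(scraped, target))                     # exported_job / exported_instance
print(attach_target_labels(scraped, target, honor_labels=True))  # pushed labels kept as-is
```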

Remote write

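How it works: the Prometheus server sends (pushes) its samples over HTTP to the remote_write URL; a third-party adapter receives them and writes them to its own storage.

# adapter config that writes to Elasticsearch: prometheusbeat.yml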
prometheusbeat:
  listen: ":8080"
  context: "/prometheus"
 ......
output.elasticsearch:
  hosts: ["localhost:9200"]
  
# Prometheus config (remote_write is a list)
remote_write:
  - url: "http://localhost:8080/prometheus"

      Article link: https://www.haomeiwen.com/subject/xvzrzctx.html