prometheus

Author: 后知不觉1 | Published: 2023-01-09 19:05

    1. Download and unpack the packages

    #server
    wget  https://github.com/prometheus/prometheus/releases/download/v2.41.0/prometheus-2.41.0.linux-amd64.tar.gz
    #alertmanager
    wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
    #nodeExporter
    wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
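
The download step can also be scripted. The following is a minimal sketch, assuming the URL pattern of the three wget commands above holds for all three projects; the helper names `release_url` and `unpack` are made up for illustration, and `unpack` should only be used on archives you trust.

```python
import tarfile
from pathlib import Path

# Versions taken from the wget commands above.
RELEASES = {
    "prometheus": "2.41.0",
    "alertmanager": "0.25.0",
    "node_exporter": "1.5.0",
}


def release_url(project: str, version: str, platform: str = "linux-amd64") -> str:
    """Build a GitHub release URL following the pattern of the wget commands."""
    filename = f"{project}-{version}.{platform}.tar.gz"
    return f"https://github.com/prometheus/{project}/releases/download/v{version}/{filename}"


def unpack(tarball: Path, dest: Path) -> None:
    """Extract a downloaded tarball into dest, creating dest if it is missing."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as archive:
        archive.extractall(dest)


if __name__ == "__main__":
    # Print the three download URLs; nothing is downloaded here.
    for project, version in RELEASES.items():
        print(release_url(project, version))
```

Each tarball unpacks into a directory named after the release, so it still has to be moved or renamed to match the /opt paths used in the unit files later on.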
    

    2. Install prometheus-server

    2.1. Configure prometheus-server
    # my global config
    global:
      scrape_interval: 10s # Scrape targets every 10 seconds. The default is every 1 minute.
      evaluation_interval: 10s # Evaluate alerting rules every 10 seconds. The default is every 1 minute.
    
    # Address of Alertmanager
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['127.0.0.1:9093']
    
    rule_files:
      - "/opt/prometheus/rules_conf/*.yml" 
    
    scrape_configs:
      # Address of the Prometheus server itself
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]
    
      # Service discovery can be done in several ways; here hosts are discovered from a file
      - job_name: "node-exporter-discovery" 
        file_sd_configs:
          - refresh_interval: 1m
            files:
            - /opt/prometheus/node_conf/node_exporter.yaml
    

    Format of the node_exporter.yaml target file

    - targets:
      - 127.0.0.1:9100
      labels:
        idc: prd
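
When the host list is long, the target file can be generated rather than edited by hand. Below is a small sketch, not part of the original setup: `render_file_sd` is a hypothetical helper that emits the same YAML layout as the example above, and it assumes plain host names and label values that need no YAML quoting.

```python
def render_file_sd(hosts, labels, port=9100):
    """Render one file_sd target group as YAML text.

    Assumes hosts and label values are simple strings that need no quoting.
    """
    lines = ["- targets:"]
    lines += [f"  - {host}:{port}" for host in hosts]
    lines.append("  labels:")
    lines += [f"    {key}: {value}" for key, value in sorted(labels.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the example target file shown above.
    print(render_file_sd(["127.0.0.1"], {"idc": "prd"}), end="")
```

Writing the result to /opt/prometheus/node_conf/node_exporter.yaml is enough; with file-based discovery Prometheus picks up changes to that file without a restart.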
    
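    2.3. Start prometheus-server
    2.3.1. Add the systemd service (the unit runs as the prometheus user and group, which must already exist)
    cat > /usr/lib/systemd/system/prometheus.service << 'EOF'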
    [Unit]
    Description=Prometheus
    After=network.target
    
    [Service]
    User=prometheus
    Group=prometheus
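    ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml --web.enable-lifecycle --storage.tsdb.max-block-duration=2d --storage.tsdb.min-block-duration=2h --storage.tsdb.path=/data1/prometheus/data --storage.tsdb.retention.time=15d --log.level=info
    
    [Install]
    WantedBy=multi-user.target
    EOF
    

    Notes

    --web.enable-lifecycle  # enable reloading the configuration through an HTTP request
    --storage.tsdb.max-block-duration=2d  # maximum time span of a single TSDB block: 2d
    --storage.tsdb.min-block-duration=2h  # minimum time span of a single TSDB block: 2h
    --storage.tsdb.retention.time=15d  # how long TSDB data is kept; the default is 15d (this flag replaces the deprecated --storage.tsdb.retention)
    

    This set of flags resolved an out-of-memory problem that occurred when TSDB block files were mmapped.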

    2.3.2. Reload systemd and start the service
    systemctl daemon-reload
    systemctl enable prometheus.service --now
    systemctl start prometheus.service
    

    3. Install Alertmanager

    3.1. Configure Alertmanager
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 2d
      receiver: 'web.hook'   # receiver that alerts are routed to
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5210/'   # endpoint of the notification service from section 4
    
    3.2. Add the systemd service
    cat > /usr/lib/systemd/system/alertmanager.service << 'EOF' 
    [Unit]
    Description=alertmanager
    After=network.target
    
    [Service]
    User=prometheus
    Group=prometheus
    ExecStart=/opt/prometheus-alertmanager/alertmanager --config.file=/opt/prometheus-alertmanager/alertmanager.yml
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    3.3. Reload systemd and start the service
    systemctl daemon-reload
    systemctl enable alertmanager.service --now
    systemctl start alertmanager.service
    

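    4. The alert notification service

    This requires a Python 3 environment; for installation, see the article on installing miniconda3 on Linux.
    A simple Python script is used here. It depends on: https://github.com/keijack/python-simple-http-server
    Installation instructions: https://pypi.org/project/simple-http-server

    The script receives Alertmanager's webhook requests and converts each one into the format the notification center accepts. Alertmanager cannot invoke a script directly, which is why a small HTTP service sits in between. The script also imports requests.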

    from simple_http_server import route, server
    from simple_http_server import Request
    import json
    import requests
    
    # Endpoint of the notification center (placeholder from the original article)
    ALERT_URL = "http://xxxxxx.com/noticeSend"
    HEADERS = {"Content-Type": "application/json"}
    
    @route("/", method=["GET", "POST", "PUT"])
    def index(req=Request()):
        # Alertmanager posts a JSON document whose "alerts" field lists the alerts
        request_data = json.loads(str(req.body, "utf-8"))
        alerts_data = request_data["alerts"]
    
        # Collect the host part of every alert's instance label ("ip:port")
        tmp_ip_arr = []
        for item in alerts_data:
            ip = item["labels"]["instance"].split(":")[0]
            tmp_ip_arr.append(ip)
    
        ip_str = ",".join(tmp_ip_arr)
        # Only the summary of the first alert in the group is forwarded
        content = alerts_data[0]["annotations"]["summary"]
    
        tmp_data = {}
        tmp_data["userIds"] = "123123"
        tmp_data["sms"] = {
            "templateParamList": [ip_str, content]
        }
        print(json.dumps(tmp_data))
        alert_req = requests.post(ALERT_URL, data=json.dumps(tmp_data), headers=HEADERS)
        if alert_req.status_code == 200:
            return {"msg": "alert success"}
        else:
            return {"msg": "alert failed"}
    
    
    def main(*args):
        # Listen on the port that the Alertmanager webhook URL points at
        server.start(port=5210)
    
    if __name__ == "__main__":
        main()
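
To make the conversion easier to test without a running server, the payload handling can be pulled into a pure function. The sketch below is an illustration under assumptions: `SAMPLE_PAYLOAD` is a hand-written, trimmed stand-in for an Alertmanager webhook body (a real one carries more fields), and `build_notice` is a hypothetical helper. Unlike the script above, it removes duplicate hosts and tolerates missing fields.

```python
# Trimmed, hand-written stand-in for an Alertmanager webhook body.
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HostOutOfDiskSpace", "instance": "10.0.0.1:9100"},
            "annotations": {"summary": "out of disk space"},
        },
        {
            "status": "firing",
            "labels": {"alertname": "HostOutOfDiskSpace", "instance": "10.0.0.2:9100"},
            "annotations": {"summary": "out of disk space"},
        },
        {
            "status": "firing",
            "labels": {"alertname": "HostOutOfDiskSpace", "instance": "10.0.0.1:9100"},
            "annotations": {"summary": "out of disk space"},
        },
    ],
}


def build_notice(payload, user_ids="123123"):
    """Convert a webhook payload into the notification-center request body."""
    alerts = payload.get("alerts", [])
    # Keep only the host part of "host:port" instance labels.
    hosts = [
        alert.get("labels", {}).get("instance", "unknown").split(":")[0]
        for alert in alerts
    ]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_hosts = list(dict.fromkeys(hosts))
    # As in the script above, only the first alert's summary is forwarded.
    summary = alerts[0].get("annotations", {}).get("summary", "") if alerts else ""
    return {
        "userIds": user_ids,
        "sms": {"templateParamList": [",".join(unique_hosts), summary]},
    }


if __name__ == "__main__":
    print(build_notice(SAMPLE_PAYLOAD))
```

Because the route above groups by alertname, every alert in one request shares the same rule, which is why forwarding a single summary is usually enough.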
    

    5. Install node_exporter

    5.1. Unpack node_exporter to the target location
    5.2. Add the node_exporter service
    cat > /usr/lib/systemd/system/nodeExporter.service << 'EOF' 
    [Unit]
    Description=nodeExporter
    After=network.target
    
    [Service]
    User=prometheus
    Group=prometheus
    ExecStart=/opt/node_exporter/node_exporter
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    5.3. Reload systemd and start the service
    systemctl daemon-reload
    systemctl enable nodeExporter.service --now
    systemctl start nodeExporter.service
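
Once the three services are running, a quick check confirms that each one answers over HTTP. This is a sketch under assumptions: all three run on the local host with their default ports (9090, 9093, 9100), and `http_status` and `check_all` are illustrative helper names. Prometheus and Alertmanager are probed on their `/-/healthy` endpoint, node_exporter on `/metrics`.

```python
import urllib.error
import urllib.request

# Assumed local addresses; adjust hosts and ports to your deployment.
ENDPOINTS = {
    "prometheus": "http://127.0.0.1:9090/-/healthy",
    "alertmanager": "http://127.0.0.1:9093/-/healthy",
    "node_exporter": "http://127.0.0.1:9100/metrics",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return None


def check_all(endpoints, fetch=http_status):
    """Map each service name to True when its endpoint returns HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling `check_all(ENDPOINTS)` returns a dictionary such as `{"prometheus": True, ...}`; the `fetch` parameter exists so the logic can be exercised without network access.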
    

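    6. Alert rule configuration

    6.1. Edit the alert rules

    Edit the rule files that the rule_files setting of the prometheus-server configuration points to. The file used here is general.yml.

    This reference is fairly complete: https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware

    Demo: these are host alert rules for node_exporter; trim them as needed.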

    groups:
    
    - name: NodeExporter
    
      rules:
    
        - alert: HostOutOfMemory
          expr: 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Host out of memory
            description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    
    
        - alert: HostUnusualNetworkThroughputIn
          expr: 'sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100'
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: unusual network throughput in
            description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostUnusualNetworkThroughputOut
          expr: 'sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100'
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: unusual network throughput out
            description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostUnusualDiskReadRate
          expr: 'sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50'
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: unusual disk read rate
            description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostUnusualDiskWriteRate
          expr: 'sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: unusual disk write rate
            description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostOutOfDiskSpace
          expr: '(node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: out of disk space
            description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostDiskWillFillIn24Hours
          expr: '(node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: disk will fill
            description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostOutOfInodes
          expr: 'node_filesystem_files_free / node_filesystem_files * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: out of inodes
            description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostInodesWillFillIn24Hours
          expr: 'node_filesystem_files_free / node_filesystem_files * 100 < 10 and predict_linear(node_filesystem_files_free[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: inodes will fill
            description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostUnusualDiskReadLatency
          expr: 'rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: unusual disk read latency
            description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostUnusualDiskWriteLatency
          expr: 'rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0'
          for: 20m
          labels:
            severity: warning
          annotations:
            summary: unusual disk write latency
            description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostHighCpuLoad
          expr: '100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80'
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: high CPU load
            description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostCpuIsUnderUtilized
          expr: '100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20'
          for: 1w
          labels:
            severity: info
          annotations:
            summary: CPU is under utilized
            description: "CPU load is < 20% for 1 week. Consider reducing the number of CPUs.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostCpuStealNoisyNeighbor
          expr: 'avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10'
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: CPU steal noisy neighbor
            description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostCpuHighIowait
          expr: 'avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 5'
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Host CPU high iowait
            description: "CPU iowait > 5%. A high iowait means that you are disk or network bound.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostContextSwitching
          expr: '(rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000'
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Host context switching
            description: "Context switching is growing on node (> 1000 / s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostSwapIsFillingUp
          expr: '(1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Host swap is filling up
            description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
     
      
    
    
        - alert: HostRaidDiskFailure
          expr: 'node_md_disks{state="failed"} > 0'
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Host RAID disk failure
            description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    
        - alert: HostOomKillDetected
          expr: 'increase(node_vmstat_oom_kill[1m]) > 0'
          for: 0m
          labels:
            severity: warning
          annotations:
            summary:  OOM kill detected
            description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: HostEdacCorrectableErrorsDetected
          expr: 'increase(node_edac_correctable_errors_total[1m]) > 0'
          for: 0m
          labels:
            severity: info
          annotations:
            summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
            description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last minute.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
    
    
     
        - alert: HostRequiresReboot
          expr: 'node_reboot_required > 0'
          for: 4h
          labels:
            severity: info
          annotations:
            summary: Host requires reboot (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} requires a reboot.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    
        - alert: node-exporter-down
          expr: up == 0
          for: 1m
          labels:
            severity: info
          annotations:
            summary: "{{ $labels.instance }} is down"
            description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
            value: "{{ $value }}"
            instance: "{{ $labels.instance }}"
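
The two "will fill in 24 hours" rules rely on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it. The sketch below illustrates that idea with an ordinary least-squares fit; it is an approximation written for this article, not Prometheus's actual implementation, and the function name and sample values are made up.

```python
def predict_linear_sketch(samples, eval_time, horizon_seconds):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Returns the predicted value horizon_seconds after eval_time.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


if __name__ == "__main__":
    # Hypothetical free space in GiB sampled over one hour: 100 -> 90 -> 80.
    samples = [(0, 100.0), (1800, 90.0), (3600, 80.0)]
    predicted = predict_linear_sketch(samples, eval_time=3600, horizon_seconds=24 * 3600)
    # A negative prediction corresponds to the "< 0" condition in the rule.
    print(predicted < 0)
```

In this example the filesystem loses 20 GiB per hour, so the line crosses zero long before the 24-hour horizon and the rule condition would hold.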
    
    6.2. Reload the alert rules

    This requires that Prometheus was started with --web.enable-lifecycle; without that flag, prometheus-server must be restarted to pick up the changes.

    curl -XPOST http://127.0.0.1:9090/-/reload
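
The same reload can be triggered from Python, for example at the end of a script that regenerates rule files. This is a minimal sketch using only the standard library; the helper names are illustrative, and the base URL assumes Prometheus listens on the local host as configured above.

```python
import urllib.error
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:9090"):
    """Build the POST request that asks Prometheus to reload its configuration."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url="http://127.0.0.1:9090", timeout=10.0):
    """Return True if Prometheus accepted the reload, False otherwise."""
    try:
        with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers HTTP error statuses (for example a rejected config) and connection failures.
        return False
```

A `False` result is worth checking against the Prometheus log, since a syntax error in a rule file causes the reload to be rejected and the previous rules stay in effect.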
    

    End. The disk-full alert was received successfully.
