Prometheus Alertmanager Alerting Component


Author: YichenWong | Published 2017-03-30 19:36

    Prometheus Alertmanager

    Overview

    Alertmanager and Prometheus are two separate components. The Prometheus server sends alerts to Alertmanager according to its alerting rules; Alertmanager then applies silencing, inhibition, and aggregation, and dispatches notifications via email, PagerDuty, HipChat, and other integrations.

    The main steps to set up alerting and notifications:

    • Install and configure Alertmanager
    • Configure Prometheus to talk to Alertmanager via the -alertmanager.url flag
    • Create alerting rules in Prometheus

    Introduction to Alertmanager and Its Mechanisms

    Alertmanager handles alerts sent by clients such as the Prometheus server. It takes care of deduplicating and grouping them, and routing them to the correct receiver, such as email or Slack. Alertmanager also supports grouping, silencing, and inhibition mechanisms.

    Grouping

    Grouping categorizes alerts of a similar nature into a single notification. This is especially useful when many systems fail at once and hundreds or thousands of alerts may fire simultaneously.
    For example, suppose dozens or hundreds of instances of a service are running when a network partition occurs, and half of the service instances can no longer reach the database. If the Prometheus alerting rules are configured to fire an alert for every service instance, hundreds of alerts end up being sent to Alertmanager.

    As a user, however, you only want to see a single notification while still being able to see exactly which service instances are affected. Alertmanager can therefore be configured to group such alerts together and send one relatively compact notification.

    Grouping of alerts, the timing of grouped notifications, and the receivers for those alerts are configured via a routing tree in the Alertmanager configuration file.
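    As a rough illustration (not Alertmanager's actual implementation), grouping amounts to bucketing alerts by the values of their group_by labels and sending one notification per bucket:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels; one
    (compact) notification would be sent per resulting group."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((name, alert.get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"alertname": "InstanceDown", "cluster": "eu1", "instance": "a"},
    {"alertname": "InstanceDown", "cluster": "eu1", "instance": "b"},
    {"alertname": "InstanceDown", "cluster": "us1", "instance": "c"},
]
# Three alerts collapse into two groups, one per cluster.
groups = group_alerts(alerts, ["cluster", "alertname"])
```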

    Inhibition

    Inhibition is the concept of suppressing notifications for certain alerts when certain other alerts are already firing (for example, an unreachable network causing a cascade of connection alerts from dependent services).

    For example, when an alert fires because an entire cluster network is unreachable, Alertmanager can be configured in advance to mute all other alerts triggered as a consequence. This prevents notifications for hundreds or thousands of alerts that are unrelated to the actual problem.

    Inhibition is also configured through Alertmanager's configuration file.

    Silences

    Silences are a straightforward way to simply mute alerts for a given period of time. A silence is configured based on matchers, just like the routing tree. Incoming alerts are checked against the matchers (equality or regular-expression matches); if they match, no notifications are sent out for those alerts.

    (A visual routing tree editor is available to help build the routing tree.)

    Silences are configured in the web interface of Alertmanager.

    Configuring Alertmanager

    Alertmanager is configured via command-line flags and a configuration file. The command-line flags configure immutable system parameters, while the configuration file defines inhibition rules, notification routing, and notification receivers.

    To view all available command-line flags, run alertmanager -h.
    Alertmanager can reload its configuration at runtime. If the new configuration is not well-formed, the change is not applied and the syntax error is logged. A configuration reload is triggered by sending a SIGHUP to the process or an HTTP POST request to the /-/reload endpoint.
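    For example, a hot reload could be triggered from Python like this (a small sketch; it assumes Alertmanager is listening on its default address http://localhost:9093):

```python
import urllib.request

def reload_url(base_url):
    """Build the hot-reload endpoint from an Alertmanager base URL."""
    return base_url.rstrip("/") + "/-/reload"

def trigger_reload(base_url="http://localhost:9093"):
    """POST an empty body to /-/reload; Alertmanager re-reads its
    configuration file and logs an error if it is malformed."""
    req = urllib.request.Request(reload_url(base_url), data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```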

    Configuration File

    To specify which configuration file to load, use the -config.file flag. The file is written in YAML, following the scheme described below. Bracketed parameters are optional; for parameters that are not lists, the value is set to the stated default.

    Generic placeholders are defined as follows:

    • <duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
    • <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
    • <labelvalue>: a string of unicode characters
    • <filepath>: a valid file path
    • <boolean>: a boolean that can take the values true or false
    • <string>: a regular string
    • <tmpl_string>: a string that is template-expanded before usage
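    The <duration> placeholder, for instance, can be validated against its regular expression (a small illustrative helper, not part of Alertmanager itself):

```python
import re

# Pattern for the <duration> placeholder: digits plus a unit suffix.
DURATION_RE = re.compile(r"^[0-9]+(ms|[smhdwy])$")

def is_duration(value):
    """Return True if value is a valid <duration>, e.g. '5m' or '100ms'."""
    return bool(DURATION_RE.match(value))
```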

    Parameters in the global section are valid in all other configuration contexts. They serve as defaults for other configuration sections and can be overridden there.

    global:
      # ResolveTimeout is the time after which an alert is declared resolved
      # if it has not been updated.
      [ resolve_timeout: <duration> | default = 5m ]
    
      # The default SMTP From header field.
      [ smtp_from: <tmpl_string> ]
      # The default SMTP smarthost used for sending emails.
      [ smtp_smarthost: <string> ]
      # SMTP authentication information.
      [ smtp_auth_username: <string> ]
      [ smtp_auth_password: <string> ]
      [ smtp_auth_secret: <string> ]
      # The default SMTP TLS requirement.
      [ smtp_require_tls: <bool> | default = true ]
    
      # The API URL to use for Slack notifications.
      [ slack_api_url: <string> ]
    
      [ pagerduty_url: <string> | default = "https://events.pagerduty.com/generic/2010-04-15/create_event.json" ]
      [ opsgenie_api_host: <string> | default = "https://api.opsgenie.com/" ]
    
    # Files from which custom notification template definitions are read.
    # The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
    templates:
      [ - <filepath> ... ]
    
    # The root node of the routing tree.
    route: <route>
    
    # A list of notification receivers.
    receivers:
      - <receiver> ...
    
    # A list of inhibition rules.
    inhibit_rules:
      [ - <inhibit_rule> ... ]
    

    Routing (route)

    A route block defines a node in a routing tree and its children. Unset optional configuration parameters are inherited from the parent node.

    Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. it has no configured matchers). It then traverses the child nodes. If continue is set to false, traversal stops after the first matching child; if continue is true, the alert continues matching against subsequent siblings. If an alert does not match any children of a node (no matching child nodes, or none exist), the alert is handled based on the configuration parameters of the current node.
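    The traversal described above can be sketched as follows (a simplified model using only equality matchers; real Alertmanager also supports match_re, per-node grouping, and timing parameters):

```python
def match_routes(node, labels):
    """Collect the receivers an alert with the given label set is
    dispatched to, walking the routing tree depth-first.
    'continue: false' stops at the first matching child; if no child
    matches, the alert stays at the current node's receiver."""
    hits = []
    for child in node.get("routes", []):
        matchers = child.get("match", {})
        if all(labels.get(k) == v for k, v in matchers.items()):
            hits.extend(match_routes(child, labels))
            if not child.get("continue", False):
                break
    return hits or [node.get("receiver")]

# A routing tree mirroring the example configuration below.
tree = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "database-pager", "match": {"service": "mysql"}},
        {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    ],
}
```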

    Route configuration format

    # The receiver for alerts matching this node.
    [ receiver: <string> ]
    
    # Labels by which to group incoming alerts.
    [ group_by: '[' <labelname>, ... ']' ]
    
    # Whether an alert should continue matching subsequent sibling nodes.
    [ continue: <boolean> | default = false ]
    
    # A set of equality matchers an alert has to fulfill to match the node.
    match:
      [ <labelname>: <labelvalue>, ... ]
    
    # A set of regex-matchers an alert has to fulfill to match the node.
    match_re:
      [ <labelname>: <regex>, ... ]
    
    # How long to initially wait to send a notification for a group
    # of alerts. Allows to wait for an inhibiting alert to arrive or collect
    # more initial alerts for the same group. (Usually ~0s to few minutes.)
    [ group_wait: <duration> ]
    
    # How long to wait before sending a notification about new alerts that
    # are added to a group of alerts for which an initial notification has
    # already been sent. (Usually ~5min or more.)
    [ group_interval: <duration> ]
    
    # How long to wait before sending a notification again if it has already
    # been sent successfully for an alert. (Usually ~3h or more).
    [ repeat_interval: <duration> ]
    
    # Zero or more child routes.
    routes:
      [ - <route> ... ]
    

    Example:

    # The root route with all parameters, which are inherited by the child
    # routes if they are not overwritten.
    route:
      receiver: 'default-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [cluster, alertname]
      # All alerts that do not match the following child routes
      # will remain at the root node and be dispatched to 'default-receiver'.
      routes:
      # All alerts with service=mysql or service=cassandra
      # are dispatched to the database pager.
      - receiver: 'database-pager'
        group_wait: 10s
        match_re:
          service: mysql|cassandra
      # All alerts with the team=frontend label match this sub-route.
      # They are grouped by product and environment rather than cluster
      # and alertname.
      - receiver: 'frontend-pager'
        group_by: [product, environment]
        match:
          team: frontend
    

    Inhibition rules (inhibit_rule)

    An inhibition rule mutes alerts matching one set of matchers while alerts matching another set of matchers exist. Both alerts must share a set of equal label values.

    Inhibition configuration format

    # Matchers that have to be fulfilled in the alerts to be muted.
    target_match:
      [ <labelname>: <labelvalue>, ... ]
    target_match_re:
      [ <labelname>: <regex>, ... ]
    
    # Matchers for which one or more alerts have to exist for the
    # inhibition to take effect.
    source_match:
      [ <labelname>: <labelvalue>, ... ]
    source_match_re:
      [ <labelname>: <regex>, ... ]
    
    # Labels that must have an equal value in the source and target
    # alert for the inhibition to take effect.
    [ equal: '[' <labelname>, ... ']' ]
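    In effect, an alert is muted when it matches target_match and at least one active alert matches source_match with equal values on the equal labels. A simplified sketch (equality matchers only; real Alertmanager also supports the _re variants):

```python
def inhibited(target, source_alerts, rule):
    """Return True if the target alert (a dict of labels) is muted by
    the rule, given the list of currently active source alerts."""
    # The muted alert must match target_match.
    if not all(target.get(k) == v
               for k, v in rule.get("target_match", {}).items()):
        return False
    # Some active alert must match source_match and agree on 'equal' labels.
    for src in source_alerts:
        if all(src.get(k) == v
               for k, v in rule.get("source_match", {}).items()) \
           and all(src.get(l) == target.get(l)
                   for l in rule.get("equal", [])):
            return True
    return False

# Hypothetical rule: a cluster-wide outage mutes warnings in that cluster.
rule = {
    "source_match": {"alertname": "ClusterUnreachable"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}
active = [{"alertname": "ClusterUnreachable", "cluster": "eu1"}]
```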
    

    Receivers (receiver)

    As the name suggests, a receiver configures where notifications for alerts are sent.

    Generic configuration format

    # The unique name of the receiver.
    name: <string>
    
    # Configurations for several notification integrations.
    email_configs:
      [ - <email_config>, ... ]
    pagerduty_configs:
      [ - <pagerduty_config>, ... ]
    slack_configs:
      [ - <slack_config>, ... ]
    opsgenie_configs:
      [ - <opsgenie_config>, ... ]
    webhook_configs:
      [ - <webhook_config>, ... ]
    

    Email receiver email_config

    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = false ]
    
    # The email address to send notifications to.
    to: <tmpl_string>
    # The sender address.
    [ from: <tmpl_string> | default = global.smtp_from ]
    # The SMTP host through which emails are sent.
    [ smarthost: <string> | default = global.smtp_smarthost ]
    
    # The HTML body of the email notification.
    [ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ] 
    
    # Further headers email header key/value pairs. Overrides any headers
    # previously set by the notification implementation.
    [ headers: { <string>: <tmpl_string>, ... } ]
    
    

    Slack receiver slack_config

    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = true ]
    
    # The Slack webhook URL.
    [ api_url: <string> | default = global.slack_api_url ]
    
    # The channel or user to send notifications to.
    channel: <tmpl_string>
    
    # API request data as defined by the Slack webhook API.
    [ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
    [ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
    [ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
    [ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
    [ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
    [ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
    [ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]
    

    Webhook receiver webhook_config

     # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = true ]
    
     # The endpoint to send HTTP POST requests to.
    url: <string>
    

    Alertmanager sends HTTP POST requests to the configured endpoint in the following format:

    {
      "version": "3",
      "groupKey": <number>     // key identifying the group of alerts (e.g. to deduplicate)
      "status": "<resolved|firing>",
      "receiver": <string>,
      "groupLabels": <object>,
      "commonLabels": <object>,
      "commonAnnotations": <object>,
      "externalURL": <string>,  // backling to the Alertmanager.
      "alerts": [
        {
          "labels": <object>,
          "annotations": <object>,
          "startsAt": "<rfc3339>",
          "endsAt": "<rfc3339>"
        },
        ...
      ]
    }
    

    You can add a DingTalk webhook to deliver alerts via DingTalk. Since DingTalk expects a specific POST payload, a simple data-forwarding script does the translation.

    from flask import Flask
    from flask import request
    import json
    from urllib.request import Request, urlopen
    
    app = Flask(__name__)
    
    # DingTalk robot webhook; replace the access_token with your own.
    DINGTALK_URL = 'https://oapi.dingtalk.com/robot/send?access_token=xxxx'
    
    @app.route('/', methods=['POST'])
    def send():
        # Alertmanager POSTs the JSON payload shown above.
        alert_data(request.get_data(as_text=True))
        return 'ok'
    
    def alert_data(data):
        # Wrap the raw alert payload in a DingTalk text message and forward it.
        send_data = json.dumps({'msgtype': 'text', 'text': {'content': data}})
        req = Request(DINGTALK_URL, send_data.encode('utf-8'))
        req.add_header('Content-Type', 'application/json')
        return urlopen(req).read()
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0')
    
    

    Alerting rules

    Alerting rules allow you to define alert conditions based on the Prometheus expression language and to send notifications about firing alerts to an external service.

    Defining alerting rules

    Alerting rules are defined in the following format:

    ALERT <alert name>
      IF <expression>
      [ FOR <duration> ]
      [ LABELS <label set> ]
      [ ANNOTATIONS <label set> ]
    
    • The optional FOR clause causes Prometheus to wait for a certain duration between first encountering a new expression output vector element (such as an instance with a high HTTP error rate) and counting an alert as firing for this element. Elements that are active but not yet firing are in the pending state.

    • The LABELS clause allows specifying a set of additional labels to attach to the alert. Any existing conflicting labels are overwritten. Label values can be templated.

    • The ANNOTATIONS clause specifies labels used to store longer additional information, such as alert descriptions or runbook links. Annotation values can be templated.

    • Templating: label and annotation values can be templated using console templates. The $labels variable holds the label key/value pairs of an alert instance, and $value holds the evaluated value of an alert instance.

      # To insert a firing element's label values:
      {{ $labels.<labelname> }}
      # To insert the numeric expression value of the firing element:
      {{ $value }}
      

    Example alerting rules:

    # Alert for any instance that is unreachable for >5 minutes.
    ALERT InstanceDown
      IF up == 0
      FOR 5m
      LABELS { severity = "page" }
      ANNOTATIONS {
        summary = "Instance {{ $labels.instance }} down",
        description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
      }
    
    # Alert for any instance that has a median request latency >1s.
    ALERT APIHighRequestLatency
      IF api_http_request_latencies_second{quantile="0.5"} > 1
      FOR 1m
      ANNOTATIONS {
        summary = "High request latency on {{ $labels.instance }}",
        description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
      }
    

    Inspecting alerts at runtime

    To manually inspect which alerts are active (pending or firing), navigate to the "Alerts" tab of your Prometheus instance's web interface.

    For pending and firing alerts, Prometheus also stores synthetic time series of the form ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}. The sample value is set to 1 as long as the alert is in the indicated active (pending or firing) state, and a single 0 value gets written out when an alert transitions from active to inactive state. Once inactive, the time series does not get further updates.

    Sending alert notifications

    Prometheus's alerting rules are good at figuring out what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed on top of the simple alert definitions to add summarization, notification rate limiting, silencing, and similar features. In the Prometheus ecosystem, Alertmanager takes on this role. Prometheus periodically sends information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications. The Alertmanager instance is configured via the -alertmanager.url command-line flag.


        Source: https://www.haomeiwen.com/subject/qlvpottx.html