[Monitoring] Flume

Author: whaike | Published 2019-03-22 18:32 | Read 57 times


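There are several ways to monitor Flume; see the official documentation for details. I first tried the JMX approach. Although I could see the MBean objects from VM, I did not find out how to set up a separate Prometheus exporter for each agent, and I did not investigate further, so I chose the simpler JSON approach.

The setup chosen for this monitoring: JSON + flume_exporter, with Prometheus + InfluxDB + Grafana as the backend.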

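Step 1: Expose the metrics

When starting each agent, add the startup parameters: -Dflume.monitoring.type=http -Dflume.monitoring.port=9601
The port can be any free port (it must be a valid port number, at most 65535). Once the agent is running, an HTTP request to that port returns the metrics as JSON.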
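
To check what an agent exposes before wiring up an exporter, the JSON endpoint can be read directly. The following is a minimal sketch, assuming the agent listens on localhost:9601 and serves its metrics under /metrics (the same URL form used in the exporter config in this post). The helper names fetch_flume_metrics and flatten_metrics are made up for illustration, and SAMPLE is a hand-written document in the shape Flume's JSON reporting uses, not output captured from a real agent.

```python
import json
from urllib.request import urlopen


def fetch_flume_metrics(url, timeout=5.0):
    """Fetch and decode the JSON served by a Flume agent's HTTP monitoring port."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def flatten_metrics(doc):
    """Flatten {"CHANNEL.c1": {"ChannelSize": "12"}} into
    {("CHANNEL", "c1", "ChannelSize"): "12"} for easier lookup."""
    flat = {}
    for component, values in doc.items():
        # Keys look like "<TYPE>.<component name>", e.g. "SOURCE.r1".
        kind, _, name = component.partition(".")
        for metric, value in values.items():
            flat[(kind, name, metric)] = value
    return flat


# Hand-written sample (assumption): metric values arrive as strings.
SAMPLE = {
    "CHANNEL.c1": {"ChannelSize": "12", "ChannelCapacity": "100"},
    "SOURCE.r1": {"EventReceivedCount": "340"},
}

if __name__ == "__main__":
    # Offline demonstration on the sample document.
    for key, value in sorted(flatten_metrics(SAMPLE).items()):
        print(key, value)
    # Against a running agent (assumed URL):
    # print(flatten_metrics(fetch_flume_metrics("http://localhost:9601/metrics")))
```

Keeping the fetch and the flattening separate makes it possible to test the parsing without a running agent.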

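Step 2: Configure the exporter

With the metrics exposed, you could write your own code to collect them; here I use flume_exporter, an open-source project from the community.
flume_exporter needs two configuration files, both available on GitHub. metrics.yml sets which metrics are exported; the metric names are the same as the names used by the MBean objects.
config.yml sets which agents to monitor. I have 10 agents to monitor, so my configuration looks like this (the ports are arbitrary):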

agents:
- name: "flume-agents"
  enabled: true
# multiple urls can be separated by ,
  urls:
    - "http://localhost:9601/metrics"
    - "http://localhost:9602/metrics"
    - "http://localhost:9603/metrics"
    - "http://localhost:9604/metrics"
    - "http://localhost:9605/metrics"
    - "http://localhost:9606/metrics"
    - "http://localhost:9607/metrics"
    - "http://localhost:9608/metrics"
    - "http://localhost:9609/metrics"
    - "http://localhost:9610/metrics"

config.yml gives no way to assign a different name to each agent. For example, I tried the following:

# Example usage:
# Flume JSON Reporting metrics
agents:
- name: "flume-agents"
  enabled: true
# multiple urls can be separated by ,
  urls:
#    - "http://localhost:9601/metrics"
    - "http://localhost:9602/metrics"
    - "http://localhost:9603/metrics"
    - "http://localhost:9604/metrics"
    - "http://localhost:9605/metrics"
    - "http://localhost:9606/metrics"
    - "http://localhost:9607/metrics"
    - "http://localhost:9608/metrics"
    - "http://localhost:9609/metrics"
    - "http://localhost:9610/metrics"

- name: "mytestlogs"
  enabled: true
  urls: ["http://localhost:9601/metrics"]

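The metric labels it produces are the same as with the first configuration, and I found nothing else in the source code that could be used to specify labels, so this cannot be changed here.
Start the exporter: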

./flume_exporter --metric-file=./metrics.yml --config-file=./config.yml

At this point, visiting http://192.16.22.13:9360/metrics shows all the metrics.

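Step 3: Add it to Prometheus

Because step 2 offers no way to set an alias for each agent, the collected data cannot be told apart by agent, so the fix has to happen in the Prometheus configuration. Conveniently, the label host="localhost:<port>" can be used for this, so the relevant part of prometheus.yml is: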

scrape_configs:
  - job_name: mx-discovery
    file_sd_configs:
      - files:
        - '/etc/prometheus/fileconfig/mx-nodes.json'
    metric_relabel_configs:
      - source_labels: [host]
        regex: 'localhost:9601'
        replacement: mytestname1
        target_label: logs
        action: replace
      - source_labels: [host]
        regex: 'localhost:9602'
        replacement: mytestname2
        target_label: logs
        action: replace

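The action: replace line in this config can actually be omitted, because replace is the default action.

I use file-based service discovery here, so mx-nodes.json contains: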

    [
        {
            "targets": ["192.16.22.13:9360"],
            "labels": {
                "alias": "bc-u-app-2",
                "job": "flume"
            }
        }
    ]

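With these settings, the metrics queried from Prometheus carry a label that distinguishes the different agents.

Step 4: Build the dashboards

The metrics used for the graphs are listed below.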
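
To make the effect of the metric_relabel_configs rules concrete, here is a small simulation of the replace action on one sample's label set. It is only a sketch of the matching behaviour, not Prometheus code: the function name apply_replace_rules and the rule dictionaries are invented for illustration, and capture-group references in the replacement are deliberately not handled. The one detail it relies on is that Prometheus anchors relabel regexes, which re.fullmatch mimics.

```python
import re

# The two rules from prometheus.yml, rewritten as plain dictionaries.
RULES = [
    {"source_label": "host", "regex": "localhost:9601",
     "replacement": "mytestname1", "target_label": "logs"},
    {"source_label": "host", "regex": "localhost:9602",
     "replacement": "mytestname2", "target_label": "logs"},
]


def apply_replace_rules(labels, rules):
    """Return a copy of `labels` after applying replace-style rules in order."""
    out = dict(labels)
    for rule in rules:
        value = out.get(rule["source_label"], "")
        # The regex must match the whole label value, not just a prefix.
        if re.fullmatch(rule["regex"], value):
            out[rule["target_label"]] = rule["replacement"]
    return out


if __name__ == "__main__":
    print(apply_replace_rules({"host": "localhost:9601"}, RULES))
    print(apply_replace_rules({"host": "localhost:9603"}, RULES))
```

A sample whose host value matches none of the rules keeps its labels unchanged, so each of the 10 agents needs its own rule to receive a logs label.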

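Metric descriptions (the three tables below come from https://www.cnblogs.com/fengzzi/p/10033739.html; they will be removed on request if this infringes)
Source metrics

objectName (varies with the actual setup)	Metric	Description
org.apache.flume.source:type=r1	OpenConnectionCount	Number of connections currently held open with clients or sinks
org.apache.flume.source:type=r1	AppendBatchAcceptedCount	Total number of batches successfully committed to the channel
org.apache.flume.source:type=r1	AppendBatchReceivedCount	Total number of event batches received
org.apache.flume.source:type=r1	AppendAcceptedCount	Number of events accepted one at a time
org.apache.flume.source:type=r1	AppendReceivedCount	Total number of events received in batches of a single event
org.apache.flume.source:type=r1	EventAcceptedCount	Total number of events successfully written to the channel
org.apache.flume.source:type=r1	EventReceivedCount	Total number of events the source has received so far
org.apache.flume.source:type=r1	StartTime	Time the source started, in milliseconds
org.apache.flume.source:type=r1	StopTime	Time the source stopped, in milliseconds; 0 means it is still running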

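Channel metrics

objectName (varies with the actual setup)	Metric	Description
org.apache.flume.channel:type=c1	EventPutAttemptCount	Total number of events the source attempted to write to the channel
org.apache.flume.channel:type=c1	EventPutSuccessCount	Total number of events successfully written to the channel and committed
org.apache.flume.channel:type=c1	EventTakeAttemptCount	Total number of times the sink attempted to take events from the channel
org.apache.flume.channel:type=c1	EventTakeSuccessCount	Total number of events the sink successfully read from the channel
org.apache.flume.channel:type=c1	ChannelSize	Number of events currently in the channel
org.apache.flume.channel:type=c1	ChannelCapacity	Capacity of the channel
org.apache.flume.channel:type=c1	ChannelFillPercentage	Percentage of the channel that is filled
org.apache.flume.channel:type=c1	StartTime	Time the channel started, in milliseconds
org.apache.flume.channel:type=c1	StopTime	Time the channel stopped, in milliseconds; 0 means it is still running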

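Sink metrics

objectName (varies with the actual setup)	Metric	Description
org.apache.flume.sink:type=k1	ConnectionCreatedCount	Number of connections created
org.apache.flume.sink:type=k1	ConnectionClosedCount	Number of connections closed
org.apache.flume.sink:type=k1	ConnectionFailedCount	Number of connections closed because of an error
org.apache.flume.sink:type=k1	BatchEmptyCount	Number of batches that contained 0 events; indicates the source is writing data more slowly than the sink processes it
org.apache.flume.sink:type=k1	BatchUnderflowCount	Number of batches that contained fewer events than the batch size
org.apache.flume.sink:type=k1	BatchCompleteCount	Number of batches that contained exactly the batch size of events
org.apache.flume.sink:type=k1	EventDrainAttemptCount	Total number of events the sink attempted to write to storage
org.apache.flume.sink:type=k1	EventDrainSuccessCount	Total number of events the sink successfully wrote to storage
org.apache.flume.sink:type=k1	StartTime	Time the sink started, in milliseconds
org.apache.flume.sink:type=k1	StopTime	Time the sink stopped, in milliseconds; 0 means it is still running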
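
Several of these counters are more useful on a dashboard when combined. The sketch below derives three quantities from the metrics in the tables above. They are illustrative choices, not necessarily the panels used in the original setup, and the function names are made up. It assumes the values arrive as strings, as in the agent's JSON output.

```python
def channel_fill_percentage(channel):
    """ChannelSize as a percentage of ChannelCapacity."""
    capacity = float(channel["ChannelCapacity"])
    if capacity == 0:
        return 0.0
    return 100.0 * float(channel["ChannelSize"]) / capacity


def sink_drain_success_ratio(sink):
    """Share of events the sink tried to write that were actually written."""
    attempts = float(sink["EventDrainAttemptCount"])
    if attempts == 0:
        return None  # nothing attempted yet, so the ratio is undefined
    return float(sink["EventDrainSuccessCount"]) / attempts


def source_unaccepted_events(source):
    """Events the source received but has not written to the channel."""
    return float(source["EventReceivedCount"]) - float(source["EventAcceptedCount"])


if __name__ == "__main__":
    # Hand-written example values (assumption), not data from a real agent.
    print(channel_fill_percentage({"ChannelSize": "25", "ChannelCapacity": "100"}))
    print(sink_drain_success_ratio(
        {"EventDrainAttemptCount": "100", "EventDrainSuccessCount": "90"}))
    print(source_unaccepted_events(
        {"EventReceivedCount": "340", "EventAcceptedCount": "335"}))
```

A steadily rising fill percentage, or a growing gap between received and accepted events, is the kind of trend these panels are meant to surface.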