Grafana Dashboard Design for Our Self-Built Monitoring System


Author: 天草二十六_简村人 | Published 2022-07-28 15:02

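1. Monitoring System

[Figure: monitoring overview dashboard]
  • Business metrics monitoring
  • Application monitoring (JVM, API request volume and latency, thread pools, database connection pools, Hystrix)
  • Middleware monitoring, including databases (MQ, ES, Redis, MySQL, MongoDB)
  • Machine monitoring (CPU, memory, disk, network, connection count)

Other monitoring

  • nginx/Kong logs
  • JVM logs
  • Database slow-query logs
  • Custom logs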

2. Grafana Data Source Integration

The data sources displayed in Grafana are not limited to Prometheus:

  • Prometheus
  • Elasticsearch
  • MongoDB

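3. Machine Monitoring

Machine monitoring was originally done with Zabbix, but querying through Zabbix's own UI is very inconvenient, so we plan to integrate Zabbix into Grafana.

  • Current values (summary)
  • CPU
  • Memory
  • Disk space
  • Network traffic
  • TCP connection count (this one is very important)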

4. Key Panels

Grafana dashboards are generally split into summary data and detail data: show the summary first, then the detailed trends.

4.1 API Calls and Latency

  • Total requests
sum(increase(http_server_requests_seconds_count{application="$application"}[1m]))
  • Total errors
sum(increase(http_server_requests_seconds_count{application="$application",status=~"5.."}[1m])) by(application)
  • Error rate (%)
100*(sum(increase(http_server_requests_seconds_count{application="$application", status=~"5.."}[1m])) by(application)) / (sum(increase(http_server_requests_seconds_count{application="$application"}[1m])) by(application))
  • Peak QPS (unlike the JVM dashboard, everything here is per service, whereas the JVM dashboard is per instance)
sum(rate(http_server_requests_seconds_count{application="$application"}[1m]))
  • Total number of endpoints
count(sum(increase(http_server_requests_seconds_count{application="$application"}[1m])) by(uri))
  • QPS trend

Supports filtering by service and URI.

sum(irate(http_server_requests_seconds_count{application="$application", uri=~"$uri"}[1m]))
  • Error trend

    [Figure: Summary]
sum(irate(http_server_requests_seconds_count{application="$application", uri=~"$uri", status=~"5.."}[1m]))
  • Requests taking longer than 5 seconds (a key focus)

This is computed by subtraction: the total request count minus the count in the cumulative le="5.0" bucket.

sum(increase(http_server_requests_seconds_count{application="$application"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="5.0"}[1m])) by (application)
  • Ranking endpoints by request volume

    [Figure: Summary 2]
topk(10, sum by(uri, method) (increase(http_server_requests_seconds_count{application="$application"}[1m])))
  • Latency distribution across 1-3 s, 3-5 s, and over 5 s (modeled on Pinpoint; Prometheus records every request, whereas APM tools usually sample, so the accuracy differs)

    [Figure: Summary 3]
# Over 5 seconds
sum(increase(http_server_requests_seconds_count{application="$application"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="5.0"}[1m])) by (application)
# 3 to 5 seconds
sum(increase(http_server_requests_seconds_bucket{application="$application",le="5.0"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="3.0"}[1m])) by (application)
# 1 to 3 seconds
sum(increase(http_server_requests_seconds_bucket{application="$application",le="3.0"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="1.0"}[1m])) by (application)
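
The subtraction works because Prometheus histogram buckets are cumulative: the series with le="5.0" counts every request that finished within 5 seconds, including those already counted under le="3.0" and le="1.0". As a hedged sketch built on the same metric, the queries below add the under-1-second band and express the over-5-second count as a percentage. They assume the le="1.0" and le="5.0" buckets are published exactly as in the queries above; the le="+Inf" bucket is part of every Prometheus histogram and counts all requests.

```promql
# Under 1 second: the le="1.0" bucket is cumulative, so no subtraction is needed
sum(increase(http_server_requests_seconds_bucket{application="$application",le="1.0"}[1m])) by (application)

# Share (%) of requests slower than 5 seconds
100 * (
  sum(increase(http_server_requests_seconds_bucket{application="$application",le="+Inf"}[1m])) by (application)
  - sum(increase(http_server_requests_seconds_bucket{application="$application",le="5.0"}[1m])) by (application)
) / sum(increase(http_server_requests_seconds_bucket{application="$application",le="+Inf"}[1m])) by (application)
```

When the service received no requests in the window, the division yields NaN and the panel shows no data for that interval.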
  • Error details

    [Figure: Error trend]
# Supports filtering by instance
# Error rate (5xx requests per second)
rate(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri", status=~"5.."}[1m])

# Error count
sum(increase(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri", status=~"5.."}[1m])) by(uri)
  • QPS (1-minute window)

    [Figure: QPS (1-minute window)]
irate(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri"}[1m])
  • Endpoint response time (90th percentile, in milliseconds)

    [Figure: Endpoint response time (90th percentile)]
http_server_requests_seconds{application="$application",quantile="0.9",instance=~"$instance",status="200", uri=~"$uri"} * 1000
  • Endpoint request count (1 minute)

    [Figure: Endpoint request count]
sum(increase(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri"}[1m])) by(uri,method)
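
The quantile="0.9" series used for the response-time panel is computed inside each application instance, so it cannot be meaningfully combined across instances. Where the histogram buckets are published (the _bucket queries earlier in this section suggest they are), a service-wide 90th percentile can be estimated from them instead. This is a sketch under that assumption, not the dashboard's actual query; with only a few coarse bucket boundaries the estimate will be rough.

```promql
# Estimated 90th-percentile response time per URI, in milliseconds,
# aggregated across all instances of the service
histogram_quantile(0.90, sum(rate(http_server_requests_seconds_bucket{application="$application", uri=~"$uri"}[1m])) by (le, uri)) * 1000
```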

4.2 Hystrix

Modeled on Hystrix Turbine: the main focus is failed calls such as circuit-breaker trips, followed by the call count and latency of each interface. Thread pool monitoring is covered as well.

[Figure: Query filters]
[Figure: Variables]
label_values(application)
label_values(hystrix_circuit_breaker_open{application="$application"}, group)
# Chained query; $group is the previous variable
label_values(hystrix_latency_execution_seconds{application="$application",group="$group"}, key)
label_values(hystrix_circuit_breaker_open{application="$application"}, instance)
  • Total requests
sum(increase(hystrix_execution_total{application=~"$application"}[1m]))
  • Total successes
sum(increase(hystrix_execution_total{application=~"$application",event="success"}[1m]))
  • Success rate
  • Top K by total requests
topk(10,sum by (group)(increase(hystrix_execution_total{application=~"$application"}[1m])))
[Figures: Overall status; Top 10 groups; Top 10 keys]
topk(10,sum by (group)(increase(hystrix_execution_total{application=~"$application", event!~"success"}[1m])))
sum by (key)(increase(hystrix_execution_total{application=~"$application", event!~"success"}[1m]))
sum by (event)(increase(hystrix_execution_total{application=~"$application"}[1m]))
[Figure: Response time analysis]
sum by(group,key, event)(increase(hystrix_execution_total{application=~"$application",key=~"$key"}[1m]))

hystrix_latency_execution_seconds{application=~"$application",instance=~"$instance",key=~"$key",quantile="0.9"} * 1000
  • Request event statistics

    [Figure: Request event statistics]
sum(increase(hystrix_execution_total{application=~"$application",event!~"success",group=~"$group",key=~"$key"}[1m])) by (application, event)
# Timeout events; other event types are queried the same way
sum(increase(hystrix_execution_total{application=~"$application",event="timeout",group=~"$group",key=~"$key"}[1m])) by (group)
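
The success-rate panel listed in this section has no query attached, but it can be derived from the two totals. The query below is a sketch that divides the success count by the overall count, reusing the metric and labels from the queries above; it is an assumption about how the panel could be built, not necessarily the original query.

```promql
# Success rate (%) = successful executions / all executions
100 * sum(increase(hystrix_execution_total{application=~"$application",event="success"}[1m]))
  / sum(increase(hystrix_execution_total{application=~"$application"}[1m]))
```

Both sides aggregate away every label, so the division matches on the empty label set and returns a single series.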

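4.3 JVM Logs

Goal: periodically identify the services with the highest error counts.

  • Count log entries by level (error/warn/info/debug) for the top-k services
[Figures: two untitled screenshots]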

4.4 Java Thread Pools

[Figure: Java thread pool monitoring]
# Thread pool size
executor_pool_size_threads{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}

# Active threads
executor_active_threads{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}

# Task queue size
executor_queued_tasks{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}

# Completed tasks
sum(increase(executor_completed_tasks_total{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}[1m]))

# Remaining task queue capacity (a gauge, so it is read directly rather than through increase())
executor_queue_remaining_tasks{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}
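
Pool size and active threads are easier to read side by side as a single utilization figure. The query below is a sketch that divides one gauge by the other, assuming both series carry identical label sets (application, instance, name) so that the two sides match one to one; it is not part of the original dashboard.

```promql
# Thread pool utilization (%): active threads relative to the current pool size
100 * executor_active_threads{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}
  / executor_pool_size_threads{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}
```

A pool that has not started any threads yet reports a size of 0, in which case the result is NaN.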

4.5 Kong Developer Platform

[Figure: untitled screenshot]
# Total requests
sum(increase(kong_http_requests_total{}[1m]))

# Total consumers
count(sum(increase(kong_http_requests_total{}[1m])) by(consumer_name))

# Total endpoints
count(sum(increase(kong_http_requests_total{}[1m])) by(path))

# Peak QPS
sum(rate(kong_http_requests_total{consumer_name=~"$consumerName", path=~"$path"}[1m]))

# Top 10 consumers
topk(10,sum by (consumer_name)(increase(kong_http_requests_total[1m])))

# Top 5 endpoints by request count
topk(5,sum by (path)(increase(kong_http_requests_total[1m])))

# QPS
sum(rate(kong_http_requests_total{consumer_name=~"$consumerName", path=~"$path"}[1m])) by (consumer_name, path)

# Request count
sum(increase(kong_http_requests_total{consumer_name=~"$consumerName", path=~"$path"}[1m])) by (consumer_name, path)

# 3 to 10 seconds
sum(increase(kong_http_request_duration_seconds_bucket{le="10.0",path=~"$path"}[1m]))-sum(increase(kong_http_request_duration_seconds_bucket{le="03.0",path=~"$path"}[1m]))

# 1 to 3 seconds
sum(increase(kong_http_request_duration_seconds_bucket{le="03.0",path=~"$path"}[1m]))-sum(increase(kong_http_request_duration_seconds_bucket{le="01.0",path=~"$path"}[1m]))

# 100 milliseconds to 1 second
sum(increase(kong_http_request_duration_seconds_bucket{le="01.0",path=~"$path"}[1m]))-sum(increase(kong_http_request_duration_seconds_bucket{le="00.1",path=~"$path"}[1m]))

# Under 100 milliseconds
sum(increase(kong_http_request_duration_seconds_bucket{le="00.1",path=~"$path"}[1m]))

# Response time p90 (milliseconds)
histogram_quantile(0.90, sum(rate(kong_http_request_duration_seconds_bucket{path=~"$path"}[1m])) by (le, path)) * 1000
[Figures: Request count; Response time]
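
The latency bands above stop at 10 seconds. The remaining band can be sketched with the same subtraction pattern, assuming this metric is exposed as a standard Prometheus histogram (which always includes an le="+Inf" bucket counting every request) and that the 10-second bucket is labeled le="10.0" as in the queries above.

```promql
# Over 10 seconds: all requests (le="+Inf") minus those that finished within 10 seconds
sum(increase(kong_http_request_duration_seconds_bucket{le="+Inf",path=~"$path"}[1m]))-sum(increase(kong_http_request_duration_seconds_bucket{le="10.0",path=~"$path"}[1m]))
```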

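5. Query Filters

Under Settings --> Variables, Name is the identifier used in query expressions and Label is the display name of the variable, which may be in Chinese.

[Figure: Query filters]
[Figure: Variables]
[Figure: Adding a variable]
label_values(application)
  • Another example is the uri variable, which is displayed on the dashboard as "接口URI" (endpoint URI)

    [Figure: uri variable]
label_values(http_server_requests_seconds_count{application="$application", uri!~"/health|/mgm/health|/mgm/prometheus|/\\*\\*|/webjars/\\*\\*|/swagger-resources|root|NOT_FOUND|/\\*\\*/favicon.ico"},uri)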

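Other operations

Screenshots only, for illustration.

  • Grouping

    [Figure: Adding a row and a panel]
    [Figure: Add a new row]
    [Figure: Dragging a panel]
[Figures: Data sources; Zabbix data source]

Note that the Zabbix URL is http://{ip}/api_jsonrpc.php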

Article link: https://www.haomeiwen.com/subject/qrcjirtx.html