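1. Monitoring system
[Figure: monitoring dashboard overview]
- Business metrics
- Application monitoring (JVM, API request volume and latency, thread pools, database connection pools, Hystrix)
- Middleware, including databases (MQ, Elasticsearch, Redis, MySQL, MongoDB)
- Host monitoring (CPU, memory, disk, network, connection count)
Other monitoring
- nginx/Kong logs
- JVM logs
- Database slow-query logs
- Custom logs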
2. Grafana data source integration
Grafana can display data from more than just Prometheus:
- Prometheus
- Elasticsearch
- MongoDB
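3. Host monitoring
Host monitoring was originally done with Zabbix, but its built-in UI is awkward for querying, so the plan is to integrate Zabbix into Grafana.
- Current values (summary)
- CPU
- Memory
- Disk space
- Network traffic
- TCP connection count (this one is important)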
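4. Key panels
Grafana dashboards are generally split into summary data and detail data: show the summary first, then the detailed trends.
4.1 API calls and latency
- Total requests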
sum(increase(http_server_requests_seconds_count{application="$application"}[1m] ))
- Total errors
sum(increase(http_server_requests_seconds_count{application="$application",status=~"5.."}[1m] )) by(application)
- Error rate (%)
100*(sum(increase(http_server_requests_seconds_count{application="$application", status=~"5.."}[1m] )) by(application)) / (sum(increase(http_server_requests_seconds_count{application="$application"}[1m] )) by(application))
- Peak QPS (unlike the JVM dashboard, which is per node, everything here is aggregated per service)
sum(rate(http_server_requests_seconds_count{application="$application"}[1m]))
- Total number of APIs (distinct URIs)
count(sum(increase(http_server_requests_seconds_count{application="$application"}[1m] )) by(uri))
- QPS trend
Can be filtered by service and URI.
sum(irate(http_server_requests_seconds_count{application="$application", uri=~"$uri"}[1m]))
- Error count trend
[Figure: summary panel]
sum(irate(http_server_requests_seconds_count{application="$application", uri=~"$uri", status=~"5.."}[1m]))
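- Requests slower than 5 seconds (worth close attention)
Computed by subtraction: the total request count minus the count in the le="5.0" bucket.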
sum(increase(http_server_requests_seconds_count{application="$application"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="5.0"}[1m])) by (application)
- APIs ranked by request volume
[Figure: summary panel 2]
topk(10, sum by(uri, method) (increase(http_server_requests_seconds_count{application="$application"}[1m])))
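- Latency distribution across 1-3 s, 3-5 s, and over 5 s (modeled on Pinpoint; Prometheus counts every request, whereas APM tools usually sample, so the two differ in accuracy)
[Figure: summary panel 3]
// Over 5 s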
sum(increase(http_server_requests_seconds_count{application="$application"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="5.0"}[1m])) by (application)
// 3 s to 5 s
sum(increase(http_server_requests_seconds_bucket{application="$application",le="5.0"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="3.0"}[1m])) by (application)
// 1 s to 3 s
sum(increase(http_server_requests_seconds_bucket{application="$application",le="3.0"}[1m])) by (application)-sum(increase(http_server_requests_seconds_bucket{application="$application",le="1.0"}[1m])) by (application)
- Error details
[Figure: error trend]
// Can be filtered by node (instance)
// Error rate (5xx requests per second)
rate(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri", status=~"5.."}[1m])
// Error count
sum(increase(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri", status=~"5.."}[1m] )) by(uri)
- QPS (1-minute window)
[Figure: QPS (1-minute window)]
irate(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri"}[1m])
- API response time (90th percentile, in ms)
[Figure: API response time (90th percentile)]
http_server_requests_seconds{application="$application",quantile="0.9",instance=~"$instance",status="200", uri=~"$uri"} * 1000
- API request volume (per minute)
[Figure: API request volume]
sum(increase(http_server_requests_seconds_count{application="$application", instance=~"$instance", uri=~"$uri"}[1m])) by(uri,method)
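The latency-band queries in this section rely on Prometheus histogram buckets being cumulative: each le bucket counts every request at or below that bound, so a band is the difference of two adjacent buckets, and the slowest band is the total count minus the largest bucket. A minimal Python sketch of that arithmetic follows; the helper name latency_bands and all the counts are made up for illustration.

```python
def latency_bands(total, cumulative):
    """Split cumulative histogram counts into per-band counts.

    total      -- value of the _count series (all requests in the window)
    cumulative -- {upper_bound_seconds: requests with latency <= bound}
    """
    bounds = sorted(cumulative)
    bands = {}
    # Band between two adjacent bounds = difference of the cumulative counts.
    for lower, upper in zip(bounds, bounds[1:]):
        bands[f"{lower:g}-{upper:g}s"] = cumulative[upper] - cumulative[lower]
    # Everything above the largest bound = total minus the largest bucket.
    bands[f">{bounds[-1]:g}s"] = total - cumulative[bounds[-1]]
    return bands


# Illustrative numbers only (not taken from a real service).
example = latency_bands(total=1000, cumulative={1.0: 900, 3.0: 960, 5.0: 990})
print(example)
```

This only works if the histogram actually exposes buckets at the bounds being subtracted (here 1, 3, and 5 seconds).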
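4.2 Hystrix
Modeled on Hystrix Turbine: focus first on failed calls such as circuit-breaker trips, then on call counts and latency, and finally on thread pool monitoring.
[Figure: query filters]
[Figure: variable definitions]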
label_values(application)
label_values(hystrix_circuit_breaker_open{application="$application"}, group)
// Chained query: $group comes from the previous variable
label_values(hystrix_latency_execution_seconds{application="$application",group="$group"}, key)
label_values(hystrix_circuit_breaker_open{application="$application"}, instance)
- Total requests
sum(increase(hystrix_execution_total{application=~"$application"}[1m]))
- Total successes
sum(increase(hystrix_execution_total{application=~"$application",event="success"}[1m]))
- Success rate
- Top K by total requests
topk(10,sum by (group)(increase(hystrix_execution_total{application=~"$application"}[1m])))
[Figure: overall status]
[Figure: top 10 by group]
[Figure: top 10 by key]
topk(10,sum by (group)(increase(hystrix_execution_total{application=~"$application", event!~"success"}[1m])))
sum by (key)(increase(hystrix_execution_total{application=~"$application", event!~"success"}[1m]))
sum by (event)(increase(hystrix_execution_total{application=~"$application"}[1m]))
[Figure: response time analysis]
sum by(group,key, event)(increase(hystrix_execution_total{application=~"$application",key=~"$key"}[1m]))
hystrix_latency_execution_seconds{application=~"$application",instance=~"$instance",key=~"$key",quantile="0.9"} * 1000
- Request event statistics
[Figure: request event statistics]
sum(increase(hystrix_execution_total{application=~"$application",event!~"success",group=~"$group",key=~"$key"}[1m])) by (application, event)
// Timeout events; other event types are analogous
sum(increase(hystrix_execution_total{application=~"$application",event="timeout",group=~"$group",key=~"$key"}[1m])) by (group)
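The success-rate item listed earlier in this section has no query in the original. One way to think about it is successes divided by all execution events in the same window. The Python sketch below does that arithmetic on made-up per-event counts, shaped like the result of grouping hystrix_execution_total by event. The function name and numbers are assumptions, and because a single call may emit more than one event (for example a failure plus a fallback), the result is best treated as an approximation.

```python
def success_rate(event_counts):
    """Return the percentage of 'success' events, or None if there was no traffic."""
    total = sum(event_counts.values())
    if total == 0:
        return None  # avoid division by zero when the window is empty
    return 100.0 * event_counts.get("success", 0) / total


# Made-up counts for a one-minute window, keyed by the event label.
counts = {"success": 940, "timeout": 30, "failure": 20, "short_circuited": 10}
print(success_rate(counts))
```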
4.3 JVM logs
Goal: periodically identify the services with the most errors.
- Count log entries per level (error/warn/info/debug) for the top K services
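A minimal sketch of the aggregation described above, assuming log entries have already been parsed into (service, level) pairs. The record shape, the function name top_services_by_errors, and the sample data are all hypothetical; where the logs are stored and how they are parsed is outside this sketch.

```python
from collections import Counter, defaultdict


def top_services_by_errors(records, k):
    """Return [(service, {level: count})] for the k services with the most errors."""
    per_service = defaultdict(Counter)
    for service, level in records:
        per_service[service][level.lower()] += 1
    # Rank services by error count, highest first; ties are broken by name.
    ranked = sorted(per_service, key=lambda s: (-per_service[s]["error"], s))
    return [(s, dict(per_service[s])) for s in ranked[:k]]


# Hypothetical parsed log records: (service name, log level).
records = [
    ("order", "ERROR"), ("order", "ERROR"), ("order", "INFO"),
    ("pay", "ERROR"), ("pay", "WARN"),
    ("user", "INFO"),
]
top2 = top_services_by_errors(records, 2)
print(top2)
```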
4.4 Java thread pools
[Figure: Java thread pool monitoring]
// Pool size
executor_pool_size_threads{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}
// Active threads
executor_active_threads{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}
// Queued tasks
executor_queued_tasks{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}
// Completed tasks (per minute)
sum(increase(executor_completed_tasks_total{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}[1m]))
// Remaining queue capacity (a gauge, so it is read directly; increase() is meant for counters)
executor_queue_remaining_tasks{application=~"$application",instance=~"$instance",name=~"$theadPoolName"}
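The thread pool values above are often easier to reason about as ratios. The sketch below derives two from a single sample: how busy the pool is (active / pool size) and how full the queue is (queued / (queued + remaining)). It assumes the remaining-capacity value reports the free slots of a bounded queue; the function name and the sample numbers are hypothetical.

```python
def pool_saturation(pool_size, active, queued, remaining):
    """Return (thread_utilization, queue_fill) as fractions between 0 and 1."""
    utilization = active / pool_size if pool_size else 0.0
    capacity = queued + remaining  # total slots of a bounded queue
    queue_fill = queued / capacity if capacity else 0.0
    return utilization, queue_fill


# Made-up sample: 20 threads, 15 busy, 50 tasks waiting, 150 free queue slots.
utilization, queue_fill = pool_saturation(pool_size=20, active=15, queued=50, remaining=150)
print(f"threads busy: {utilization:.0%}, queue full: {queue_fill:.0%}")
```

A queue that keeps filling while all threads are busy is usually the earlier warning sign of the two.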
4.5 Kong developer platform
// Total requests
sum(increase(kong_http_requests_total{}[1m] ))
// Total consumers
count(sum(increase(kong_http_requests_total{}[1m] )) by(consumer_name))
// Total APIs (distinct paths)
count(sum(increase(kong_http_requests_total{}[1m] )) by(path))
// Peak QPS
sum(rate(kong_http_requests_total{consumer_name=~"$consumerName", path=~"$path"}[1m]))
// Top 10 consumers
topk(10,sum by (consumer_name)(increase(kong_http_requests_total[1m])))
// Top 5 APIs by request count
topk(5,sum by (path)(increase(kong_http_requests_total[1m])))
// QPS
sum(rate(kong_http_requests_total{consumer_name=~"$consumerName", path=~"$path"}[1m])) by (consumer_name, path)
// Request count
sum(increase(kong_http_requests_total{consumer_name=~"$consumerName", path=~"$path"}[1m])) by (consumer_name, path)
// 3 s to 10 s
sum(increase(kong_http_request_duration_seconds_bucket{le="10.0",path=~"$path"}[1m]))-sum(increase(kong_http_request_duration_seconds_bucket{le="03.0",path=~"$path"}[1m]))
// 1 s to 3 s
sum(increase(kong_http_request_duration_seconds_bucket{le="03.0",path=~"$path"}[1m]))-sum(increase(kong_http_request_duration_seconds_bucket{le="01.0",path=~"$path"}[1m]))
// 100 ms to 1 s
sum(increase(kong_http_request_duration_seconds_bucket{le="01.0",path=~"$path"}[1m]))-sum(increase(kong_http_request_duration_seconds_bucket{le="00.1",path=~"$path"}[1m]))
// Under 100 ms
sum(increase(kong_http_request_duration_seconds_bucket{le="00.1",path=~"$path"}[1m]))
// Response time p90 (ms)
histogram_quantile(0.90, sum(rate(kong_http_request_duration_seconds_bucket{path=~"$path"}[1m])) by (le, path)) * 1000
[Figure: request count]
[Figure: response time]
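histogram_quantile estimates a percentile from cumulative buckets by locating the bucket that contains the target rank and interpolating linearly inside it. The Python sketch below is a simplified illustration of that idea, not the Prometheus implementation: it ignores the +Inf bucket and other edge cases, and the bucket bounds and counts are made up.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound; the lowest bucket is assumed
    to start at 0. Simplified: no +Inf bucket handling.
    """
    total = buckets[-1][1]
    rank = q * total  # position of the target observation
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Linear interpolation inside the bucket holding the rank.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Made-up cumulative counts: 800 requests <= 0.1 s, 950 <= 1 s, 1000 <= 3 s.
p90_seconds = estimate_quantile(0.90, [(0.1, 800), (1.0, 950), (3.0, 1000)])
print(round(p90_seconds * 1000), "ms")
```

Because the estimate is interpolated, its precision depends on how finely the bucket bounds are spaced around the percentile of interest.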
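5. Query filters
[Figure: query filters]
Variables are defined under Settings-->Variables. Name is the identifier referenced in queries; Label is the display name of the variable and may contain Chinese characters.
[Figure: Variables list]
[Figure: adding a variable]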
label_values(application)
- As another example, the uri variable is shown on the dashboard under the label "接口URI" ("API URI").
[Figure: uri variable]
label_values(http_server_requests_seconds_count{application="$application", uri!~"/health|/mgm/health|/mgm/prometheus|/\\*\\*|/webjars/\\*\\*|/swagger-resources|root|NOT_FOUND|/\\*\\*/favicon.ico"},uri)
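Prometheus regex matchers are fully anchored, so uri!~"a|b" drops a URI only when the whole value equals one of the alternatives. The Python sketch below mimics that with re.fullmatch on a made-up list of URIs. Python's re module is not the RE2 engine Prometheus uses, but the two should agree on a simple pattern like this one; the function name is hypothetical.

```python
import re

# Same alternatives as the variable query above; in PromQL the backslashes
# are doubled because they sit inside a double-quoted string.
EXCLUDED = re.compile(
    r"/health|/mgm/health|/mgm/prometheus|/\*\*|/webjars/\*\*"
    r"|/swagger-resources|root|NOT_FOUND|/\*\*/favicon.ico"
)


def visible_uris(uris):
    """Keep only URIs that do NOT fully match the exclusion pattern."""
    return [u for u in uris if EXCLUDED.fullmatch(u) is None]


# Made-up label values.
sample = ["/api/orders", "/health", "/healthz", "/**", "/webjars/**", "root"]
print(visible_uris(sample))
```

Note that /healthz survives the filter: it only shares a prefix with /health, and an anchored match requires the entire value to match.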
Other operations
Screenshots only, for illustration.
- Grouping
[Figure: adding a row and panels]
[Figure: Add a new row]
[Figure: dragging panels]
Note: the Zabbix API address is http://{ip}/api_jsonrpc.php
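A quick way to check that address is to call apiinfo.version, a Zabbix API method that needs no authentication. The sketch below only builds the JSON-RPC 2.0 request with the Python standard library. The host 192.0.2.10 is a placeholder, and the sending step is left commented out because it needs a reachable Zabbix server.

```python
import json
import urllib.request


def build_version_request(host):
    """Build (but do not send) a JSON-RPC request for apiinfo.version."""
    payload = {"jsonrpc": "2.0", "method": "apiinfo.version", "params": [], "id": 1}
    return urllib.request.Request(
        url=f"http://{host}/api_jsonrpc.php",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json-rpc"},
        method="POST",
    )


request = build_version_request("192.0.2.10")  # placeholder address
print(request.full_url)
# To actually send it against a reachable server:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.load(response))
```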