Reference: https://esrally.readthedocs.io/en/latest/summary_report.html
1、summary report
Rally only produces a summary report. If you want to watch the metric values in real time, you have to implement that yourself or use an existing tool.
Example: a report for the geopoint track using the in-memory metrics store, printed to standard output. The report contains many metrics; grouped by the different operations, they can be split into several parts, as shown below:
| Lap | Metric | Task | Value | Unit |
|------:|---------------------------------------------------------------:|--------------:|------------:|-------:|
| All | Cumulative indexing time of primary shards | | 34.2505 | min |
| All | Min cumulative indexing time across primary shards | | 6.63023 | min |
| All | Median cumulative indexing time across primary shards | | 6.83152 | min |
| All | Max cumulative indexing time across primary shards | | 7.12722 | min |
| All | Cumulative indexing throttle time of primary shards | | 0 | min |
| All | Min cumulative indexing throttle time across primary shards | | 0 | min |
| All | Median cumulative indexing throttle time across primary shards | | 0 | min |
| All | Max cumulative indexing throttle time across primary shards | | 0 | min |
| All | Cumulative merge time of primary shards | | 25.9476 | min |
| All | Cumulative merge count of primary shards | | 416 | |
| All | Min cumulative merge time across primary shards | | 4.70943 | min |
| All | Median cumulative merge time across primary shards | | 5.15057 | min |
| All | Max cumulative merge time across primary shards | | 5.81993 | min |
| All | Cumulative merge throttle time of primary shards | | 4.27752 | min |
| All | Min cumulative merge throttle time across primary shards | | 0.720717 | min |
| All | Median cumulative merge throttle time across primary shards | | 0.808833 | min |
| All | Max cumulative merge throttle time across primary shards | | 1.03858 | min |
| All | Cumulative refresh time of primary shards | | 6.58592 | min |
| All | Cumulative refresh count of primary shards | | 2428 | |
| All | Min cumulative refresh time across primary shards | | 1.252 | min |
| All | Median cumulative refresh time across primary shards | | 1.3497 | min |
| All | Max cumulative refresh time across primary shards | | 1.35478 | min |
| All | Cumulative flush time of primary shards | | 0.1466 | min |
| All | Cumulative flush count of primary shards | | 15 | |
| All | Min cumulative flush time across primary shards | | 0.0212833 | min |
| All | Median cumulative flush time across primary shards | | 0.0315167 | min |
| All | Max cumulative flush time across primary shards | | 0.0388833 | min |
| All | Median CPU usage | | 300.5 | % |
| All | Total Young Gen GC | | 105.401 | s |
| All | Total Old Gen GC | | 10.115 | s |
| All | Store size | | 2.97451 | GB |
| All | Translog size | | 2.00234e-07 | GB |
| All | Index size | | 2.97451 | GB |
| All | Total written | | 29.766 | GB |
| All | Heap used for segments | | 13.3071 | MB |
| All | Heap used for doc values | | 0.00948334 | MB |
| All | Heap used for terms | | 11.2716 | MB |
| All | Heap used for norms | | 0 | MB |
| All | Heap used for points | | 0.582964 | MB |
| All | Heap used for stored fields | | 1.44304 | MB |
| All | Segment count | | 96 | |
| All | Min Throughput | index-append | 68976.4 | docs/s |
| All | Median Throughput | index-append | 72087.8 | docs/s |
| All | Max Throughput | index-append | 75291.9 | docs/s |
| All | 50th percentile latency | index-append | 528.702 | ms |
| All | 90th percentile latency | index-append | 782.233 | ms |
| All | 99th percentile latency | index-append | 1167.04 | ms |
| All | 99.9th percentile latency | index-append | 1962.36 | ms |
| All | 99.99th percentile latency | index-append | 2501.22 | ms |
| All | 100th percentile latency | index-append | 2634.9 | ms |
| All | 50th percentile service time | index-append | 528.702 | ms |
| All | 90th percentile service time | index-append | 782.233 | ms |
| All | 99th percentile service time | index-append | 1167.04 | ms |
| All | 99.9th percentile service time | index-append | 1962.36 | ms |
| All | 99.99th percentile service time | index-append | 2501.22 | ms |
| All | 100th percentile service time | index-append | 2634.9 | ms |
| All | error rate | index-append | 0 | % |
| All | Min Throughput | polygon | 2.01 | ops/s |
| All | Median Throughput | polygon | 2.01 | ops/s |
| All | Max Throughput | polygon | 2.01 | ops/s |
| All | 50th percentile latency | polygon | 93.6485 | ms |
| All | 90th percentile latency | polygon | 99.4864 | ms |
| All | 99th percentile latency | polygon | 109.385 | ms |
| All | 100th percentile latency | polygon | 110.976 | ms |
| All | 50th percentile service time | polygon | 93.2 | ms |
| All | 90th percentile service time | polygon | 99.042 | ms |
| All | 99th percentile service time | polygon | 108.945 | ms |
| All | 100th percentile service time | polygon | 110.524 | ms |
| All | error rate | polygon | 0 | % |
| All | Min Throughput | bbox | 2.01 | ops/s |
| All | Median Throughput | bbox | 2.01 | ops/s |
| All | Max Throughput | bbox | 2.01 | ops/s |
| All | 50th percentile latency | bbox | 98.1866 | ms |
| All | 90th percentile latency | bbox | 103.392 | ms |
| All | 99th percentile latency | bbox | 119.742 | ms |
| All | 100th percentile latency | bbox | 122.896 | ms |
| All | 50th percentile service time | bbox | 97.7447 | ms |
| All | 90th percentile service time | bbox | 102.939 | ms |
| All | 99th percentile service time | bbox | 119.302 | ms |
| All | 100th percentile service time | bbox | 122.41 | ms |
| All | error rate | bbox | 0 | % |
| All | Min Throughput | distance | 5.02 | ops/s |
| All | Median Throughput | distance | 5.02 | ops/s |
| All | Max Throughput | distance | 5.02 | ops/s |
| All | 50th percentile latency | distance | 18.3639 | ms |
| All | 90th percentile latency | distance | 19.5332 | ms |
| All | 99th percentile latency | distance | 23.3447 | ms |
| All | 100th percentile latency | distance | 24.0361 | ms |
| All | 50th percentile service time | distance | 18.1461 | ms |
| All | 90th percentile service time | distance | 19.2916 | ms |
| All | 99th percentile service time | distance | 23.1039 | ms |
| All | 100th percentile service time | distance | 23.8031 | ms |
| All | error rate | distance | 0 | % |
| All | Min Throughput | distanceRange | 0.42 | ops/s |
| All | Median Throughput | distanceRange | 0.42 | ops/s |
| All | Max Throughput | distanceRange | 0.42 | ops/s |
| All | 50th percentile latency | distanceRange | 181642 | ms |
| All | 90th percentile latency | distanceRange | 208871 | ms |
| All | 99th percentile latency | distanceRange | 215444 | ms |
| All | 100th percentile latency | distanceRange | 216281 | ms |
| All | 50th percentile service time | distanceRange | 2347.54 | ms |
| All | 90th percentile service time | distanceRange | 2523.76 | ms |
| All | 99th percentile service time | distanceRange | 2631.35 | ms |
| All | 100th percentile service time | distanceRange | 2635.94 | ms |
| All | error rate | distanceRange | 0 | % |
----------------------------------
[INFO] SUCCESS (took 2022 seconds)
----------------------------------
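For reference, a report like the one above comes out of an invocation along these lines (a sketch assuming the Rally 1.x CLI used for this run; adjust the distribution version, track and challenge to your setup):
esrally --distribution-version=5.5.2 --track=geopoint --challenge=append-no-conflicts --car=defaults
The summary is printed to standard output when the race finishes; --report-file can additionally write it to a file.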
The metrics fall into two main groups:
- index report
- operations report
To keep the following examples concise, assume these shell variables are set:
host=172.17.0.2
index=customer
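As a quick sanity check that the variables point at a reachable cluster and an existing index (a sketch; the cat indices API is used only for verification here):
curl -s "http://${host}:9200/_cat/indices/${index}?v"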
Cumulative indexing time of primary shards [Important]
The total cumulative indexing time of all primary shards. The data comes from the indices stats API, which only exposes indexing time at the index level:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.indexing.index_time_in_millis
Metric: indexing_total_time
Note that this is not wall-clock time but the CPU time consumed by all indexing threads added together. For example, if M indexing threads run for N minutes, this metric reports M*N minutes, not N minutes.
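As a worked check against the report above: the per-shard cumulative indexing times range from about 6.63 to 7.13 minutes, so with 5 primary shards (an assumption inferred from the numbers; the shard count is not shown in the report) the total is roughly 5 × 6.85 ≈ 34.25 minutes, which matches the reported 34.2505 min.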
Cumulative indexing time across primary shards
The minimum, median, and maximum of the cumulative indexing time of individual primary shards. The data comes from the indices stats API at the shard level:
curl -s "http://${host}:9200/${index}/_stats?level=shards&pretty" | jq .indices.${index}.shards
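Rally computes these aggregates itself, but you can reproduce them from the shard-level stats, for example (a sketch assuming the index is named customer and jq is available; routing.primary filters out replica copies):
curl -s "http://${host}:9200/${index}/_stats?level=shards" | jq '[.indices.customer.shards[][] | select(.routing.primary) | .indexing.index_time_in_millis] | {min: min, median: (sort | .[(length/2|floor)]), max: max}'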
Cumulative indexing throttle time of primary shards
The total time indexing was throttled, summed over all primary shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.indexing.throttle_time_in_millis
Again, this is not wall-clock time but the CPU time consumed by the indexing threads while the index was being throttled.
Cumulative indexing throttle time across primary shards
The minimum, median, and maximum of the cumulative indexing throttle time of individual primary shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=shards&pretty" | jq .indices.${index}.shards
Cumulative merge time of primary shards [Important]
The total merge time of all primary shards, again measured as CPU time consumed by the merge threads. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.merges.total_time_in_millis
Metric: merges_total_time
Cumulative merge count of primary shards [Important]
The total number of merges performed on primary shards (not every shard necessarily merges). The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.merges.total
Cumulative merge time across primary shards
The minimum, median, and maximum of the cumulative merge time of individual primary shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=shards&pretty" | jq .indices.${index}.shards
Cumulative refresh time of primary shards [Important]
The total refresh time of all primary shards, again measured as CPU time consumed by the refresh threads. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.refresh.total_time_in_millis
Cumulative refresh count of primary shards [Important]
The total number of refreshes across all primary shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.refresh.total
Cumulative refresh time across primary shards
The minimum, median, and maximum of the cumulative refresh time of individual primary shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=shards&pretty" | jq .indices.${index}.shards
Cumulative flush time of primary shards [Important]
The total time all primary shards spent flushing buffered operations to disk, again measured as CPU time consumed by the flush threads. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.flush.total_time_in_millis
Cumulative flush count of primary shards
The total number of flushes across all primary shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.flush.total
Cumulative flush time across primary shards
The minimum, median, and maximum of the cumulative flush time of individual primary shards, again measured as CPU time. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=shards&pretty" | jq .indices.${index}.shards
Cumulative merge throttle time of primary shards
The total time merges of primary shards were throttled, again measured as CPU time consumed by the merge threads. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=indices&pretty" | jq .indices.${index}.primaries.merges.total_throttled_time_in_millis
Cumulative merge throttle time across primary shards
The minimum, median, and maximum of the cumulative merge throttle time of individual primary shards, again measured as CPU time. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats?level=shards&pretty" | jq .indices.${index}.shards
Merge time (X)
Merging touches several kinds of data structures; X stands for one of the following:
- postings
- stored fields
- doc values
- norms
- vectors
- points
Lucene reports the time spent on each kind of merge, and this is tracked via logging. Passing --car-params="verbose_iw_logging_enabled:true"
enables that logging; the corresponding configuration file is:
~/.rally/benchmarks/teams/default/cars/v1/vanilla/templates/config/log4j2.properties
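For example, a run with this logging enabled might look like the following (a sketch; the track and car simply match the example in this post):
esrally --distribution-version=5.5.2 --track=geopoint --car=defaults --car-params="verbose_iw_logging_enabled:true"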
ML processing time
Literally, the machine learning processing time: the time a machine learning job spends processing a single bucket, reported as:
- minimum
- mean
- median
- maximum
metrics key: ml_processing_time
Total Young Gen GC [Important]
The total young-generation GC time across all nodes of the cluster. The data comes from the nodes stats API:
curl -s "http://${host}:9200/_nodes/stats/jvm?pretty" | jq .nodes
metrics key: node_total_young_gen_gc_time
Total Old Gen GC [Important]
The total old-generation GC time across all nodes of the cluster. The data comes from the nodes stats API:
curl -s "http://${host}:9200/_nodes/stats/jvm?pretty" | jq .nodes
metrics key: node_total_old_gen_gc_time
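To see where these totals come from, you can sum the per-node collector times yourself (a sketch; the field names are the standard nodes stats JVM section):
curl -s "http://${host}:9200/_nodes/stats/jvm" | jq '[.nodes[].jvm.gc.collectors.young.collection_time_in_millis] | add'
curl -s "http://${host}:9200/_nodes/stats/jvm" | jq '[.nodes[].jvm.gc.collectors.old.collection_time_in_millis] | add'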
Index size [Moderate]
The total index size across all nodes after the benchmark has finished. Index size = Store size + Translog size; replica shards are not included.
metrics key: final_index_size_bytes
Store size
The size of the index files in bytes, excluding the translog and replica shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats/store?level=indices" | jq .indices.${index}.primaries.store.size_in_bytes
metrics key: store_size_in_bytes
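A quick worked check against the race.json shown below: store_size is 3193856176 bytes, and 3193856176 / 1024^3 ≈ 2.97451, so the "GB" in the summary table is the binary unit (GiB) rather than 10^9 bytes.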
Translog size
The size of the transaction log files in bytes. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats/translog?level=indices" | jq .indices.${index}.primaries.translog.size_in_bytes
metrics key: translog_size_in_bytes
Total written
The amount of data written to disk during the benchmark:
- on Linux, only writes from the Elasticsearch process are reported
- on Mac OS X, writes from all processes are reported
metrics key: disk_io_write_bytes
Heap used for X
The heap used by the segments of all primary shards, in bytes, broken down into:
- doc values
- terms
- norms
- points
- stored fields
The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats/segments?level=shards&pretty" | jq .indices.${index}.primaries.segments
metrics keys: segments_*_in_bytes
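To get the same breakdown straight from the cluster, each item has its own field in the segments section (a sketch assuming the index is named customer; the field names are the ES 5.x/6.x ones):
curl -s "http://${host}:9200/${index}/_stats/segments?level=shards" | jq '.indices.customer.primaries.segments | {terms: .terms_memory_in_bytes, doc_values: .doc_values_memory_in_bytes, norms: .norms_memory_in_bytes, points: .points_memory_in_bytes, stored_fields: .stored_fields_memory_in_bytes}'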
Segment count [Moderate]
The number of segments across all primary shards. The data comes from the indices stats API:
curl -s "http://${host}:9200/${index}/_stats/segments?level=shards&pretty" | jq .indices.${index}.primaries.segments.count
metrics key: segments_count
Throughput [Key]
The throughput of each operation, i.e. requests per second. Rally reports the minimum, median, and maximum throughput for each task.
metrics key: throughput
Latency [Key]
The lifecycle of a request consists of the following stages:
1. the client submits the request to Elasticsearch
2. the request waits until Elasticsearch picks it up
3. Elasticsearch accepts the request and starts processing it
4. Elasticsearch processes the request
5. Elasticsearch starts returning the response to the client
6. Elasticsearch finishes returning the response
Latency is the time spent across all of these stages, i.e. from submitting a request until the response has been completely received, in ms.
For each operation, latency is reported at the following percentiles:
- 50%
- 90%
- 99%
- 99.9%
- 99.99%
- 100%
These percentiles are only reported when there are more than 5 requests; otherwise they would not be meaningful.
metrics key: latency
Service time
The total time from when Elasticsearch starts processing a request until the response has been fully returned; of the request lifecycle above, it covers only stages 3-6.
Latency and service time are easy to mix up. In general we use latency and rarely look at service time (see the worked example after the metrics key below).
Like latency, service time is reported at the same six percentiles.
metrics key: service_time
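A worked example from the report above shows the difference: for distanceRange the 50th percentile service time is about 2.3 s, yet the 50th percentile latency is about 182 s. The cluster cannot sustain the target throughput for that operation, so requests queue up in Rally's schedule, and the time spent waiting before Elasticsearch even sees a request counts towards latency but not towards service time. For operations where the target throughput is met (e.g. polygon or bbox), latency and service time are almost identical.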
Error rate [Important]
The error rate of each operation, in %.
Any failed request counts as an error, including HTTP 4xx/5xx responses, network errors, and so on. If the error rate is > 0, check the logs to find out what caused the failures.
Every service_time record carries a meta.success flag marking whether the request succeeded; Rally derives the error rate from that flag.
metrics key: service_time
2、Local report files
These metrics are also saved to local files, which you can use to build your own reports (a jq sketch follows the JSON below).
Example:
~/.rally/benchmarks/races/2019-07-10-11-42-55/race.json
{
"trial-id": "1329796e-c61a-40e2-b5c2-56e915e9d123",
"results": {
"indexing_throttle_time_per_shard": {
"max": 0,
"unit": "ms",
"min": 0,
"median": 0
},
"merge_part_time_stored_fields": null,
"merge_part_time_postings": null,
"merge_part_time_points": null,
"old_gc_time": 10115,
"merge_throttle_time_per_shard": {
"max": 62315,
"unit": "ms",
"min": 43243,
"median": 48530
},
"segment_count": 96,
"indexing_throttle_time": 0,
"flush_count": 15,
"young_gc_time": 105401,
"refresh_count": 2428,
"merge_time_per_shard": {
"max": 349196,
"unit": "ms",
"min": 282566,
"median": 309034
},
"total_time_per_shard": {
"max": 427633,
"unit": "ms",
"min": 397814,
"median": 409891
},
"store_size": 3193856176,
"median_cpu_usage": 300.5,
"merge_part_time_doc_values": null,
"op_metrics": [
{
"throughput": {
"mean": 72469.83788189304,
"max": 75291.84695429394,
"unit": "docs/s",
"min": 68976.42460010575,
"median": 72087.81317903922
},
"operation": "index-append",
"latency": {
"50_0": 528.701514005661,
"90_0": 782.2325207293034,
"99_0": 1167.0357184298318,
"99_9": 1962.3588057384227,
"99_99": 2501.217666936475,
"100_0": 2634.900774806738,
"mean": 564.4015841809908
},
"error_rate": 0.0,
"task": "index-append",
"service_time": {
"50_0": 528.701514005661,
"90_0": 782.2325207293034,
"99_0": 1167.0357184298318,
"99_9": 1962.3588057384227,
"99_99": 2501.217666936475,
"100_0": 2634.900774806738,
"mean": 564.4015841809908
}
},
{
"throughput": {
"mean": 2.0066067405365007,
"max": 2.008148376258046,
"unit": "ops/s",
"min": 2.0054495592111383,
"median": 2.0065018842024935
},
"operation": "polygon",
"latency": {
"50_0": 93.64854823797941,
"90_0": 99.48644153773786,
"99_0": 109.38462147489192,
"100_0": 110.97554676234722,
"mean": 94.74336672574282
},
"error_rate": 0.0,
"task": "polygon",
"service_time": {
"50_0": 93.20002049207687,
"90_0": 99.04197864234449,
"99_0": 108.94488280639054,
"100_0": 110.5240248143673,
"mean": 94.30186055600643
}
},
{
"throughput": {
"mean": 2.0065356066431304,
"max": 2.008013747372341,
"unit": "ops/s",
"min": 2.0053142795678984,
"median": 2.006503138210337
},
"operation": "bbox",
"latency": {
"50_0": 98.18658325821161,
"90_0": 103.39184030890465,
"99_0": 119.74237725138666,
"100_0": 122.89570085704327,
"mean": 98.93894387409091
},
"error_rate": 0.0,
"task": "bbox",
"service_time": {
"50_0": 97.74468280375004,
"90_0": 102.93922554701567,
"99_0": 119.30210115388037,
"100_0": 122.40998260676861,
"mean": 98.49681122228503
}
},
{
"throughput": {
"mean": 5.018613647231882,
"max": 5.022730414910589,
"unit": "ops/s",
"min": 5.01536813892288,
"median": 5.018371813955537
},
"operation": "distance",
"latency": {
"50_0": 18.363934941589832,
"90_0": 19.53315138816834,
"99_0": 23.344674017280344,
"100_0": 24.036075919866562,
"mean": 18.68119029328227
},
"error_rate": 0.0,
"task": "distance",
"service_time": {
"50_0": 18.14611628651619,
"90_0": 19.29163541644812,
"99_0": 23.10389801859856,
"100_0": 23.803113028407097,
"mean": 18.451415169984102
}
},
{
"throughput": {
"mean": 0.4193458177488859,
"max": 0.42008646673051614,
"unit": "ops/s",
"min": 0.4184097157477222,
"median": 0.41945384373297073
},
"operation": "distanceRange",
"latency": {
"50_0": 181642.2101240605,
"90_0": 208871.21991384774,
"99_0": 215443.5841869004,
"100_0": 216281.42643161118,
"mean": 181446.70384759083
},
"error_rate": 0.0,
"task": "distanceRange",
"service_time": {
"50_0": 2347.544995136559,
"90_0": 2523.755827359855,
"99_0": 2631.3503402471542,
"100_0": 2635.939357802272,
"mean": 2371.6406882554293
}
}
],
"memory_segments": 13953522,
"refresh_time": 395155,
"memory_stored_fields": 1513136,
"ml_processing_time": [],
"merge_part_time_norms": null,
"flush_time_per_shard": {
"max": 2333,
"unit": "ms",
"min": 1277,
"median": 1891
},
"memory_norms": 0,
"merge_time": 1556855,
"memory_terms": 11819160,
"node_metrics": [
{
"startup_time": 8.711096312850714,
"node": "rally-node-0"
}
],
"refresh_time_per_shard": {
"max": 81287,
"unit": "ms",
"min": 75120,
"median": 80982
},
"total_time": 2055030,
"merge_count": 416,
"translog_size": 215,
"memory_points": 611282,
"memory_doc_values": 9944,
"flush_time": 8796,
"bytes_written": 31961006080,
"index_size": 3193858129,
"merge_part_time_vectors": null,
"merge_throttle_time": 256651
},
"challenge": "append-no-conflicts",
"trial-timestamp": "20190710T114255Z",
"pipeline": "from-distribution",
"user-tags": {},
"rally-version": "1.1.0",
"total-laps": 1,
"car": [
"defaults"
],
"track": "geopoint",
"cluster": {
"distribution-version": "5.5.2",
"revision": "b2f0c09",
"nodes": [
{
"cpu": {
"allocated_processors": 4,
"available_processors": 4
},
"host_name": "127.0.0.1",
"plugins": [],
"node_name": "rally-node-0",
"ip": "127.0.0.1",
"memory": {
"total_bytes": 4144381952
},
"jvm": {
"version": "1.8.0_102",
"vendor": "Oracle Corporation"
},
"fs": [
{
"type": "rootfs",
"mount": "/ (rootfs)",
"spins": "unknown"
}
],
"os": {
"version": "3.10.0-229.el7.x86_64",
"name": "Linux"
}
}
],
"node-count": 1,
"distribution-flavor": "oss"
},
"environment": "local"
}
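Since the file is plain JSON, pulling out your own summary is straightforward (a sketch; the path is just the example race directory above):
jq '.results.op_metrics[] | {task, throughput: .throughput.median, latency_p50: .latency["50_0"], error_rate}' ~/.rally/benchmarks/races/2019-07-10-11-42-55/race.json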
3、Report templates
The templates for track reports ship inside the esrally Python module, for example:
python-3.5.2/lib/python3.5/site-packages/esrally/resources/metrics-template.json
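If you are unsure where the module is installed, Python can print the resources directory for you (a sketch; esrally here is whatever version is installed locally):
python3 -c "import esrally, os; print(os.path.join(os.path.dirname(esrally.__file__), 'resources'))"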