尴尴尬尬
作为一个菜鸡数据库运维(AKA:删库大魔头),MySQL性能问题几乎成了我们最恐惧的梦魇。无论是zabbix还是prometheus+grafana,即便加载了再多percona的“套娃”,监控数据依然只是个五彩斑斓但看着却非常闹心的趋势图,程式化的性能指标真很难帮助我们准确得找出问题根因。而面对MySQL性能问题,如何快速有效的定位和分析几乎成为了每一位菜鸡绕不过的挑战。
简单粗暴且有效
面对挑战,我们需要的应该是“一把40米长的大刀”......虽然这话听起来有些戏谑,但真实的情况就是我们正需要这种简单粗暴且有效的方法论,而我喜欢把这种方法论称为:两头堵。
即从系统
和进程
入手,尝试分析找到问题症结。
先堵:系统
系统层面的分析思路是基于进程调用栈的信息分析,这个领域大家肯定听说过火焰图
。
作为一只菜鸡中的战斗鸡,火焰图有这么几个知识点:
啥原理
-
硬件中的PMU单元(performance monitor unit)
瞎解释:PMU可以简单理解为硬件中的一个“事件计数阀门”,允许软件针对某类硬件事件设置阀门数值,随后处理器开始对该类事件计数。当计数值超过预制数值后,阀门关闭产生中断。
举板栗: cache miss 达到某个值后,PMU 便能产生相应的中断。捕获这些中断,就可以考察程序对这些硬件特性的利用效率。 -
Tracepoint
瞎解释:即在内核源代码中的一些 hook,在特定的代码被运行时触发,这一特性可以被各种 trace或debug 工具所利用。
用啥整
大概都知道的Perf
怎么整
Perf的使用说明、指导文档、安装指南一类的东西网上已经有太多太多了,此处不再赘述。
有道是:内事不决问度娘,外事不决问Go爹!
重点看“平头”
即看火焰图顶层的哪个函数占据的宽度最大,这就是问题所在。而常见的火焰图类型包括On-CPU、Off-CPU和Memory等等。
- On-CPU
cpu占用过高,执行中的时间通常又分为用户态时间user和系统态时间sys。 - Off-CPU
cpu过低,利用率不高。等待下一轮CPU,或者等待I/O、锁、换页等等。 - Memory
程序如果出现内存泄漏,同样也可以使用内存级别火焰图快速分析问题。
上个满大街都是的示图
CPU-MySQL-Crop-500二堵:进程
这个层面的分析思路其实就是基于MySQL的资源消耗分布展开的,从多个角度分析可能存在的性能瓶颈。
CPU
大消耗场景
大多数MySQL进程CPU消耗较大都是因为慢事务造成的,具体场景分为:
- SQL扫全表
- SQL扫描数据量过大
- 内存排序
- 磁盘排序
- 锁争用等待
- 等待磁盘IO
SQL执行状态
短平快的通过Show Processlist查看MySQL中正在执行的SQL,通过SQL执行状态值即可初步识别问题SQL
- Sending data:长期处于此状态,可能是查询未命中有效索引,导致SQL执行慢得一批
- Copying to tmp table:长期处于此状态,可能是临时结果集过大
- Copying to tmp table to disk:而出现此状态时,说明临时结果集过大,并超出临时内存大小已经需要落盘
- Sorting resule:长期处于此状态,说明排序操作会引起过多资源消耗
- Locked:这个如果我也要解释,就只能证明我脑残的认为其他人也脑残
- Using filesort:同Sorting result
SQL定位
以上的操作只是简单的缩小问题SQL范围,而最终在面对大并发排序至CPU狂飙的场景时,可以通过top sql快速定位耗时和消耗分布,进而为问题定位找到有力证据
- 统计耗时最长的SQL Top10
13:16:21 HowUger@HowUger_MySQL_t01:mysql> SELECT EVENT_ID, TRUNCATE(TIMER_WAIT/1000000000000,6) as Duration, SQL
_TEXT FROM performance_schema.events_statements_history_long order by
Duration desc limit 10;
- 根据event_id查看SQL的具体耗时阶段
13:16:47 HowUger@HowUger_MySQL_t01:mysql> SELECT event_name AS Stage, TRUNCATE(TIMER_WAIT/1000000000000,6) AS Du
ration FROM performance_schema.events_stages_history_long WHERE NESTIN
G_EVENT_ID=1576;
+--------------------------------+----------+
| Stage | Duration |
+--------------------------------+----------+
| stage/sql/starting | 0.000097 |
| stage/sql/checking permissions | 0.000014 |
| stage/sql/Opening tables | 0.000019 |
| stage/sql/init | 0.000026 |
| stage/sql/System lock | 0.000009 |
| stage/sql/optimizing | 0.000006 |
| stage/sql/statistics | 0.000014 |
| stage/sql/preparing | 0.000012 |
| stage/sql/Sorting result | 0.000007 |
| stage/sql/executing | 0.000005 |
| stage/sql/Sending data | 0.000009 |
| stage/sql/Creating sort index | 0.028948 |
| stage/sql/end | 0.000005 |
| stage/sql/query end | 0.000006 |
| stage/sql/closing tables | 0.000008 |
| stage/sql/freeing items | 0.000019 |
| stage/sql/cleaning up | 0.000001 |
+--------------------------------+----------+
- 统计排序耗时SQL Top10
13:25:22 HowUger@HowUger_MySQL_t01:mysql> SELECT DIGEST_TEXT, COUNT_STAR,SUM_SORT_ROWS,TRUNCATE(AVG_TIMER_WAIT/1
000000000000,6) as AVG_TIMER_WAIT,SUM_ROWS_EXAMINED,SUM_CREATED_TMP_TA
BLES,SUM_CREATED_TMP_DISK_TABLES,SUM_ROWS_SENT,FIRST_SEEN, LAST_SEEN F
ROM performance_schema.events_statements_summary_by_digest ORDER BY SU
M_SORT_ROWS DESC limit 10 \G
***************************[ 1. row ]***************************
DIGEST_TEXT | SELECT TABLE_NAME , COLUMN_NAME FROM `information_schema` . `columns` WHERE `table_schema` = ? ORDER BY TABLE_NAME , `ordinal_position`
COUNT_STAR | 5
SUM_SORT_ROWS | 1430
AVG_TIMER_WAIT | 0.001987
SUM_ROWS_EXAMINED | 2860
SUM_CREATED_TMP_TABLES | 5
SUM_CREATED_TMP_DISK_TABLES | 5
SUM_ROWS_SENT | 1430
FIRST_SEEN | 2020-01-17 14:23:56
LAST_SEEN | 2020-01-20 14:44:44
...
Memory
初入行时经常遇到的问题是:MySQL进程OOM,然后被系统Kill了
这对于新人菜鸡来说真的很不友好,而且我们往往又没有什么定位思路。
随着工作逐步深入,我们会发现:MySQL的驻留内存问题其实主要集中在下面两个场景中:
- 独享线程内存
- 全局共享内存
作为高级玩家的你看到这里可能会说:这还用你说!
是的,MySQL的驻留内存其实就是由这两部分组成。
Mysql Server Memory Usage= Sum of Global Buffers + (number of Connection * Per thread memory variables)
独享线程内存
先看一看独享线程内存相关的内存参数配置,有以下这么几个:
thread_stack
sort_buffer_size
join_buffer_size
read_buffer_size
read_rnd_buffer_size
net_buffer_length
bulk_insert_buffer_size
tmp_table_size
...
面对内存高耗场景,我们可以使用下面的SQL快速找出哪条线程在搞事情:
- 统计线程的内存消耗
12:30:05 HowUger@HowUger_MySQL_t01:sys> select b.thd_id, b.user, current_count_used,current_allocated, current_avg_alloc, current_max_alloc,total_allocated,current_statement from memory_by_thread_by_current
_bytes a,session b where a.thread_id = b.thd_id limit 5;
+--------+--------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+-------------------------------------------------------------------+
| thd_id | user | current_count_used | current_allocated | current_avg_alloc | current_max_alloc | total_allocated | current_statement |
+--------+--------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+-------------------------------------------------------------------+
| 61 | test_user@192.168.11.234 | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | <null> |
| 91 | test_user@192.168.11.228 | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | <null> |
| 157 | HowUger@HowUger_Cli_vm01 | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | select b.thd_id, b.user, curre ... a.thread_id = b.thd_id limit 5 |
| 26 | sql/event_scheduler | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | <null> |
+--------+--------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+-------------------------------------------------------------------+
- 一个隐藏的配置坑:net_buffer_length
这个值是一个根据需求可动态调整的配置,最大可到为max_allowed_packet。所以,这就是为什么我们总是强调SQL的结果集要尽量小,同时别把max_allowed_packet配置的过大。
因为你真的会遇到,一个线程吃掉max_allowed_packet等值大小内存的情况。
全局共享内存
关于全局共享内存,其实可以罗列很多很多内存参数配置,例如
thread_cache_size
table_open_cache
table_definition_cache
binlog_cache_size
innodb_log_buffer_size
innodb_buffer_pool_size
innodb_additional_mem_pool_size
back_log
...
但是,这部分最主要的内存消耗是在:innodb_buffer_pool_size!
MySQL 的 Buffer Pool机制是比较特别的,通常情况下Buffer Pool驻留内存会逐渐增大,且最终无限接近配置文件中的innodb_buffer_pool_size大小。
即使我们找到了占用buffer较多的表,并且TrunCate了它们,但是被吃掉的Buffer Pool也很难被吐出来(淡然5.7以后可以动态调整这个倒霉参数了)。
- 统计Buffer Pool内存占用表Top10**
12:35:03 HowUger@HowUger_MySQL_t01:sys> select * from innodb_buffer_stats_by_table order by pages desc limit 10;
+---------------+--------------------+------------+------------+-------+--------------+-----------+-------------+
| object_schema | object_name | allocated | data | pages | pages_hashed | pages_old | rows_cached |
+---------------+--------------------+------------+------------+-------+--------------+-----------+-------------+
| InnoDB System | SYS_TABLES | 45.66 MiB | 42.14 MiB | 2922 | 0 | 495 | 146593 |
| testserv | variable | 960.00 KiB | 823.27 KiB | 60 | 0 | 60 | 3181 |
| testserv | testcase | 928.00 KiB | 599.99 KiB | 58 | 0 | 25 | 1560 |
| mysql | help_topic | 496.00 KiB | 392.08 KiB | 31 | 0 | 31 | 515 |
| mysql | help_keyword | 144.00 KiB | 80.22 KiB | 9 | 0 | 9 | 512 |
| mysql | innodb_index_stats | 128.00 KiB | 61.33 KiB | 8 | 0 | 8 | 569 |
| InnoDB System | SYS_COLUMNS | 96.00 KiB | 52.12 KiB | 6 | 0 | 5 | 804 |
| mysql | help_relation | 80.00 KiB | 35.44 KiB | 5 | 0 | 5 | 1397 |
| testserv | t_generallock | 48.00 KiB | 419 bytes | 3 | 0 | 3 | 6 |
| InnoDB System | SYS_FOREIGN | 32.00 KiB | 0 bytes | 2 | 0 | 2 | 0 |
+---------------+--------------------+------------+------------+-------+--------------+-----------+-------------+
如您所见所想,最好的控制就是量身定制的buffer pool size,不大不小刚刚好。
IO
对于MySQL的IO性能,主要由每张表的访问频率和具体读写数据组成。
MySQL内的逻辑IO请求
即统计业务在对应表上的访问频率
- 上SQL
13:15:58 HowUger@HowUger_MySQL_t01:performance_schema> SELECT object_schema AS tb_schema, object_name AS table_name, count_star AS rows_io_total, count_read AS rows_read,
count_write AS rows_write, count_fetch AS rows_fetchs, count_insert AS rows_inserts, count_update AS rows_upd
ates, count_delete AS rows_deletes, CONCAT(ROUND(sum_timer_fetch / 3600000000000000, 2), 'h') AS fetch_latency, CONC
AT(ROUND(sum_timer_insert / 3600000000000000, 2), 'h') AS insert_latency, CONCAT(ROUND(sum_timer_update / 3600000000000000, 2), 'h') AS upd
ate_latency, CONCAT(ROUND(sum_timer_delete / 3600000000000000, 2), 'h') AS delete_latency FROM table_io_waits_summary_by_table O
RDER BY sum_timer_wait DESC limit 10 ;
+-----------+-------------------+---------------+-----------+------------+-------------+--------------+--------------+--------------+---------------+----------------+----------------+----------------+
| tb_schema | table_name | rows_io_total | rows_read | rows_write | rows_fetchs | rows_inserts | rows_updates | rows_deletes | fetch_latency | insert_latency | update_latency | delete_latency |
+-----------+-------------------+---------------+-----------+------------+-------------+--------------+--------------+--------------+---------------+----------------+----------------+----------------+
| testserv | testcase | 6015 | 6014 | 1 | 6014 | 0 | 1 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| testserv | variable | 13718 | 13603 | 115 | 13603 | 0 | 115 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| testserv | aw_name_select | 45105 | 45105 | 0 | 45105 | 0 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| testserv | t_generallock | 8 | 7 | 1 | 7 | 1 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| sys | sys_config | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| testserv | aw_instance_back | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| testserv | aw_instance_new | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| test_1220 | t_test01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| testserv | aw_lib_repository | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
| testserv | basic_aw_back | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00h | 0.00h | 0.00h | 0.00h |
+-----------+-------------------+---------------+-----------+------------+-------------+--------------+--------------+--------------+---------------+----------------+----------------+----------------+
MySQL内的物理IO请求
即统计MySQL相关物理文件的实际IO读写情况
- 上SQL
13:16:13 HowUger@HowUger_MySQL_t01:sys> select * from io_global_by_file_by_bytes limit 10;
+----------------------------------------+------------+------------+-----------+-------------+---------------+-----------+------------+-----------+
| file | count_read | total_read | avg_read | count_write | total_written | avg_write | total | write_pct |
+----------------------------------------+------------+------------+-----------+-------------+---------------+-----------+------------+-----------+
| @@datadir/master-bin.000003 | 39115 | 764.94 MiB | 20.03 KiB | 1 | 1 bytes | 1 bytes | 764.94 MiB | 0.00 |
| @@datadir/ibtmp1 | 0 | 0 bytes | 0 bytes | 7232 | 124.81 MiB | 17.67 KiB | 124.81 MiB | 100.00 |
| @@datadir/ibdata1 | 328 | 7.14 MiB | 22.29 KiB | 446 | 13.28 MiB | 30.49 KiB | 20.42 MiB | 65.03 |
| @@datadir/mysql/proc.MYD | 3180 | 9.35 MiB | 3.01 KiB | 0 | 0 bytes | 0 bytes | 9.35 MiB | 0.00 |
| @@datadir/testserv/aw_name_select.ibd | 166 | 2.64 MiB | 16.29 KiB | 4 | 64.00 KiB | 16.00 KiB | 2.70 MiB | 2.31 |
| @@datadir/testserv/variable.ibd | 63 | 1.03 MiB | 16.76 KiB | 52 | 832.00 KiB | 16.00 KiB | 1.84 MiB | 44.07 |
| @@datadir/testserv/testcase.ibd | 61 | 1.00 MiB | 16.79 KiB | 1 | 16.00 KiB | 16.00 KiB | 1.02 MiB | 1.54 |
| @@datadir/ib_logfile0 | 7 | 68.50 KiB | 9.79 KiB | 267 | 600.00 KiB | 2.25 KiB | 668.50 KiB | 89.75 |
| @@datadir/mysql/help_topic.ibd | 38 | 656.00 KiB | 17.26 KiB | 0 | 0 bytes | 0 bytes | 656.00 KiB | 0.00 |
| @@datadir/mysql/help_keyword.ibd | 13 | 256.00 KiB | 19.69 KiB | 0 | 0 bytes | 0 bytes | 256.00 KiB | 0.00 |
+----------------------------------------+------------+------------+-----------+-------------+---------------+-----------+------------+-----------+
当访问频率与物理文件读写量的统计信息结合后,找到IO问题点也就不难了。
挖坑学习两不误
关于MySQL内存,可学习的东西还有很多。而从源码角度解读,将有助于菜鸡更准确的理解MySQL的内存分配机制。
网友评论