死锁发现
- 2018-01-18 14:10:03 线上环境批量更新库存的地方出现了死锁
- 2018-01-25 16:50:03 线上环境批量更新库存的地方出现了死锁
org.springframework.dao.DeadlockLoserDataAccessException: PreparedStatementCallback; SQL [UPDATE goods_inventory SET goods_num = ? WHERE wms_type = ? AND barCode = ?]; Deadlock found when trying to get lock; try restarting transaction; nested exception is java.sql.BatchUpdateException: Deadlock found when trying to get lock; try restarting transaction
at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:263)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
...
死锁原因
死锁:一般是事务相互等待对方资源,最后形成环路造成的。如下图示,左图两辆车造成未造成死锁了,右图四辆车造成了死锁
image
步骤 | T1 | T2 |
---|---|---|
1 | begin; | begin; |
2 | update xxx set name = 'aaa' where id = 1; | |
3 | update xxx set name = 'bbb' where id = 2; | |
4 | update xxx set name = 'ccc' where id = 1; | |
5 | update xxx set name = 'ddd' where id = 2; |
死锁分析:
- 根据异常栈信息,死锁出现时,执行的SQL是:UPDATE goods_inventory SET goods_num = ? WHERE wms_type = ? AND barCode = ?。该SQL使用batchUpdate执行,一次会更新几千条数据
- 线上,还有一个地方:订单下单的时候,会批量扣减库存,该处的SQL也是使用batchUpdate执行。SQL如下:
UPDATE goods_inventory SET goods_num = goods_num - ? WHERE barCode = ? AND wms_type = ? - 根据死锁出现的时间,两次,都恰好是整10分出现,而线上有一个定时任务以10分钟的频率同步一次库存。
根据1、3可以猜测,造成死锁的原因是两个地方同时对一个表进行了大的事务操作(batchUpdate),而又互相等待对方资源
为了进一步验证猜测,需要对MySQL死锁日志进行分析,在MySQL上执行如下命令
show engine innodb status
得到的日志如下(无关信息已被忽略):
------------------------
LATEST DETECTED DEADLOCK
------------------------
2018-01-24 11:00:03 7f186658a700
*** (1) TRANSACTION:
TRANSACTION 436005840, ACTIVE 0.104 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 6 lock struct(s), heap size 1184, 7 row lock(s), undo log entries 27
LOCK BLOCKING MySQL thread id: 940390784 block 823017822
MySQL thread id 823017822, OS thread handle 0x7f1864fbf700, query id 4007854070 10.26.162.33 aixuexidball updating
UPDATE goods_inventory SET goods_num = goods_num - 2 WHERE wms_type = 1 AND barCode = 'XSBSTG618H'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005840 lock_mode X locks rec but not gap waiting
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000aa0; asc ;;
*** (2) TRANSACTION:
TRANSACTION 436005809, ACTIVE 1.314 sec starting index read
mysql tables in use 1, locked 1
15 lock struct(s), heap size 2936, 1659 row lock(s), undo log entries 33
MySQL thread id 940390784, OS thread handle 0x7f186658a700, query id 4007854164 10.26.162.33 aixuexidball updating
UPDATE goods_inventory SET goods_num = 1438 WHERE wms_type = 1 AND barCode = 'XSBSTG318H'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap
Record lock, heap no 4 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4753523250593131313751; asc GSR2PY1117Q;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 800001af; asc ;;
Record lock, heap no 7 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4753523250593131313753; asc GSR2PY1117S;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 800001b0; asc ;;
Record lock, heap no 10 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4753523250593131313843; asc GSR2PY1118C;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000d0c; asc ;;
Record lock, heap no 13 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4753523250593131313848; asc GSR2PY1118H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000bf2; asc ;;
Record lock, heap no 16 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4753523254473130313751; asc GSR2TG1017Q;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 800001b1; asc ;;
...
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000aa0; asc ;;
...
Record lock, heap no 388 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4757523250593130313843; asc GWR2PY1018C;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80001589; asc ;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap waiting
Record lock, heap no 317 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 10; hex 58534253544733313848; asc XSBSTG318H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000aa2; asc ;;
*** WE ROLL BACK TRANSACTION (1)
下面将逐段解释(由# 标识的文字是笔者注释)
第一部分是上次死锁发生的时间
------------------------
LATEST DETECTED DEADLOCK
------------------------
2018-01-24 11:00:03 7f186658a700
#这是上次发生死锁的时间
- 接着是发生死锁的第一个事务信息
*** (1) TRANSACTION:
TRANSACTION 436005840, ACTIVE 0.104 sec starting index read
# 事务436005840,事务处于活跃状态0.104s,正在使用索引读取数据行
mysql tables in use 1, locked 1
# 事务436005840正在使用1个table,且涉及锁的也是1个table
LOCK WAIT 6 lock struct(s), heap size 1184, 7 row lock(s), undo log entries 27
# 等待6把锁,占用内存1184字节,涉及7行记录(注:如果事务已经锁定了几行数据,这里将会有一行信息显示出锁定结构的数目和堆大小,堆是为了持有这些行锁而占用的内存大小)
LOCK BLOCKING MySQL thread id: 940390784 block 823017822
# 线程ID:940390784阻塞在线程ID:823017822上
MySQL thread id 823017822, OS thread handle 0x7f1864fbf700, query id 4007854070 10.26.162.33 aixuexidball updating
# 该事务的线程ID信息,操作系统句柄信息,连接来源、用户
UPDATE goods_inventory SET goods_num = goods_num - 2 WHERE wms_type = 1 AND barCode = 'XSBSTG618H'
# 事务涉及的SQL
- 当死锁发生时,第一个事务正在等待的锁信息
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
# 第一个事务正在等待锁被授予
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005840 lock_mode X locks rec but not gap waiting
# 等待的锁是一个record lock,空间id是4141,页面号是21,在页面的456bits处。锁发生在表managesystemv1.0.goods_inventory的索引idx_barcode_wms上,是一个X锁,但不是GAP锁
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
# record lock的heap no 位置
0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000aa0; asc ;;
- 接着是发生死锁的第二个事务信息
*** (2) TRANSACTION:
TRANSACTION 436005809, ACTIVE 1.314 sec starting index read
# 事务436005809,事务处于活跃状态1.314s,正在使用索引读取数据行
mysql tables in use 1, locked 1
# 事务436005809正在使用1个table,且涉及锁的也是1个table
15 lock struct(s), heap size 2936, 1659 row lock(s), undo log entries 33
# 涉及15把锁,占用内存2936字节,锁定了1659行记录
MySQL thread id 940390784, OS thread handle 0x7f186658a700, query id 4007854164 10.26.162.33 aixuexidball updating
# 该事务的线程ID信息,操作系统句柄信息,连接来源、用户
UPDATE goods_inventory SET goods_num = 1438 WHERE wms_type = 1 AND barCode = 'XSBSTG318H'
# 第二个事务涉及的SQL
4.事务二持有的锁信息
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap
# 持有的锁是一个record lock,空间id是4141,页面号是21,在页面的456bits处。锁发生在表managesystemv1.0.goods_inventory的索引idx_barcode_wms上,是一个X锁,但不是GAP锁
Record lock, heap no 4 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4753523250593131313751; asc GSR2PY1117Q;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 800001af; asc ;;
...
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
# record lock的heap no 位置
# 这个就是事务1在等待的锁
0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000aa0; asc ;;
...
Record lock, heap no 388 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 11; hex 4757523250593130313843; asc GWR2PY1018C;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80001589; asc ;;
5.当死锁发生时,第二个事务正在等待的锁等信息
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4141 page no 21 n bits 456 index `idx_barcode_wms` of table `managesystemv1.0`.`goods_inventory` trx id 436005809 lock_mode X locks rec but not gap waiting
Record lock, heap no 317 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
# 在等着heap no为317的锁
0: len 10; hex 58534253544733313848; asc XSBSTG318H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000aa2; asc ;;
*** WE ROLL BACK TRANSACTION (1)
# 事务1被回滚,如果两个事务回滚的开销不同(undo数量不同),回滚开销最小的那个事务。如果两个事务的回滚开销一样,选择后提交的事务进行回滚。
由2和4可以看出,事务1在等待Record lock, heap no 341,而这个锁正好被事务2持有。
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
*** (2) HOLDS THE LOCK(S):
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 10; hex 58534253544736313848; asc XSBSTG618H;;
1: len 4; hex 80000001; asc ;;
2: len 4; hex 80000aa0; asc ;;
由5可以看出,事务2在等待Record lock, heap no 317,虽然日志里没有明确告知谁持有该锁,但这个锁应该恰好被事务1持有。InnoDB不会显示所有持有和等待的锁,但是,它显示了足够的信息来帮助我们定位问题。
*** (2) HOLDS THE LOCK(S):
Record lock, heap no 341 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
因此,可以得到如下结论:
- T1持有no317,向T2索要no341
- T2持有no341,向T1索要no317
sequenceDiagram
T1(no317)->>T2(no341): 给我no341的锁
T2(no341)->>T1(no317): 给我no317的锁
T1(no317)->>T2(no341): 你先给我,一会我再给你
T2(no341)->>T1(no317): 不,你先给我,我再给你
最终,证实了最初的猜想是正确的。
如何尽可能避免死锁
- 以固定的顺序访问表和行
- 大事务拆分成小事务。大事务更倾向于死锁,如果业务允许,将大事务拆小。
- 在同一个事务中,尽可能做到一次锁定所需要的所有资源,减少死锁概率。
- 降低隔离级别。如果业务允许,将隔离级别调低也是较好的选择,比如将隔离级别从RR调整为RC,可以避免掉很多因为gap锁造成的死锁。
- 为表添加合理的索引。如果不走索引,将会为表的每一行记录添加上锁,増加死锁的概率。
网友评论