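1. Symptoms
Every system and site that mounts moosefs started returning large numbers of 502 errors, and eventually the systems and websites went down completely.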
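2. Analysis
Logging in to one of the servers that was returning 502 errors showed a very high system load, yet CPU and memory usage were normal and local disk I/O was normal too. netstat showed a very large backlog in the Recv-Q column, which pointed to the MFS distributed file system as the source of the problem.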
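The Recv-Q check described above can be sketched in Python. This is a minimal illustration, not the procedure the author ran: the function name `backlogged_sockets` and the `min_recv_q` threshold of 1000 bytes are assumptions chosen for the example, and it expects text in the usual Linux `netstat -ant` column layout.

```python
def backlogged_sockets(netstat_output, min_recv_q=1000):
    """Return (recv_q, local, remote, state) for TCP sockets with a large Recv-Q.

    `min_recv_q` is an arbitrary illustrative threshold, not a MooseFS value.
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers, UDP and UNIX-socket lines
        if not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        if recv_q >= min_recv_q:
            hits.append((recv_q, fields[3], fields[4], fields[5]))
    # Largest backlog first
    return sorted(hits, reverse=True)
```

On a live host the input could come from `subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout`. If the sockets with the largest backlog all point at the MFS master, that supports the conclusion that the clients are waiting on MFS rather than on local resources.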
The moosefs master log contained the following errors:
Oct 10 13:01:49 VM_200_107 mfsmaster[31003]: connection with client(ip:172.16.200.81) has been closed by peer
Oct 10 13:01:49 VM_200_107 mfsmaster[31003]: connection with client(ip:172.16.115.33) has been closed by peer
## The connection between the client and the master was dropped
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.102) has been closed by peer
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.103) has been closed by peer
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.104) has been closed by peer
## The connection between the ChunkServer and the Master was dropped
Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: chunkserver disconnected - ip: 172.16.200.102, port: 9422, usedspace: 463146741760 (431.34 GiB), totalspace: 2054567444480 (1913.47 GiB)
Oct 4 08:01:30 VM_200_107 mfsmaster[2843]: chunkserver disconnected - ip: 172.16.200.103, port: 9422, usedspace: 459528495104 (427.97 GiB), totalspace: 2054567444480 (1913.47 GiB)
Oct 4 08:01:31 VM_200_107 mfsmaster[2843]: chunkserver disconnected - ip: 172.16.200.104, port: 9422, usedspace: 461537153024 (429.84 GiB), totalspace: 2054567444480 (1913.47 GiB)
## The ChunkServers disconnected
Oct 10 13:01:52 VM_200_107 mfsmaster[31003]: connection with ML(127.0.0.1) has been closed by peer
## The connection between the Metalogger and the Master was dropped
Oct 10 13:01:52 VM_200_107 mfsmaster[31003]: chunkserver register begin (packet version: 5) - ip: 172.16.200.102, port: 9422
Oct 10 13:01:52 VM_200_107 mfsmetalogger[31700]: connection was reset by Master
Oct 10 13:01:53 VM_200_107 mfsmaster[31003]: chunkserver register begin (packet version: 5) - ip: 172.16.200.103, port: 9422
Oct 10 13:01:54 VM_200_107 mfsmaster[31003]: chunkserver register end (packet version: 5) - ip: 172.16.200.102, port: 9422, usedspace: 490665570304 (456.97 GiB), totalspace: 2054567444480
Oct 10 13:01:54 VM_200_107 mfsmaster[31003]: chunkserver register end (packet version: 5) - ip: 172.16.200.103, port: 9422, usedspace: 486941249536 (453.50 GiB), totalspace: 2054567444480
## The chunkservers reconnect
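To see how the disconnects cluster in time, the "closed by peer" lines can be counted per minute and per peer type. The following is a rough sketch that assumes the syslog line format shown in the excerpts above; the helper name `summarize_disconnects` and the regular expression are illustrative and not part of MooseFS.

```python
import re
from collections import Counter

# Matches lines such as:
# "Oct 4 08:01:28 VM_200_107 mfsmaster[2843]: connection with CS(172.16.200.102) has been closed by peer"
PEER_CLOSED = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<time>\d\d:\d\d:\d\d)\s+\S+\s+"
    r"mfsmaster\[\d+\]: connection with (?P<peer>client|CS|ML)"
    r"\((?:ip:)?(?P<ip>[\d.]+)\) has been closed by peer"
)


def summarize_disconnects(lines):
    """Count dropped connections keyed by (month, day, HH:MM, peer type)."""
    counts = Counter()
    for line in lines:
        match = PEER_CLOSED.match(line)
        if match:
            minute = match.group("time")[:5]  # "08:01:28" -> "08:01"
            key = (match.group("month"), int(match.group("day")), minute, match.group("peer"))
            counts[key] += 1
    return counts
```

Run against the excerpts above, every counted disconnect falls in minute 01 of its hour (08:01 on Oct 4, 13:01 on Oct 10), which is the pattern worth looking for in a full log.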
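Every hour, on the hour, the master forks a child process that dumps the in-memory data to disk as a snapshot. When the data set is small or the disk is fast, this does not affect the master's responsiveness. When the data set is large or the disk is busy (and the master is also serving many requests), the snapshot-writing process saturates the disk, and the other master process blocks while writing the changelog. A blocked master stops answering its peers, which is consistent with the clients, chunkservers and metalogger all dropping their connections about a minute past the hour (08:01 and 13:01 in the log above).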
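The fork-and-snapshot pattern described above can be illustrated with a small Python sketch. It is a toy model of the mechanism, not MooseFS code: the function `snapshot_with_fork`, the JSON format, and the file names are all assumptions made for the example, and it relies on `os.fork`, so it only runs on Unix-like systems.

```python
import json
import os
import tempfile


def snapshot_with_fork(metadata, snapshot_path, changelog_path):
    """Fork a child that dumps `metadata` while the parent keeps logging changes."""
    pid = os.fork()
    if pid == 0:
        # Child: sees the metadata exactly as it was at fork time (copy-on-write).
        status = 1
        try:
            with open(snapshot_path, "w") as snap:
                json.dump(metadata, snap)
                snap.flush()
                os.fsync(snap.fileno())  # the large write that keeps the disk busy
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code path

    # Parent: keeps serving requests and appending to the changelog.
    metadata["new_file"] = 1
    with open(changelog_path, "a") as log:
        log.write("CREATE new_file\n")
        log.flush()
        os.fsync(log.fileno())  # on a saturated disk, this is the call that stalls
    _, wait_status = os.waitpid(pid, 0)
    return wait_status == 0


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    ok = snapshot_with_fork(
        {"existing_file": 1},
        os.path.join(workdir, "metadata.snapshot"),
        os.path.join(workdir, "changelog"),
    )
    print("snapshot written:", ok)
```

Both processes write to the same disk. With a small data set the child finishes quickly; with a large one, the parent's changelog `fsync` has to compete with the snapshot write, which is the blocking described above.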
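3. Fixes and improvements
The fix is to use faster disks (SSD) or more memory (so that a newly written snapshot does not have to be flushed to disk immediately). As a temporary workaround, enlarge the swap partition.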