1. 3 stuck requests are blocked > 4096 sec. Implicated osds 75
Note: "3 requests are blocked > 4096 sec" can happen during data migration: a client is still accessing the data, and before that access completes the data gets migrated to another OSD, so the request is blocked. This also affects users.
Fix:
# ceph health detail                       # find the blocked requests implicating osd.75
# ceph osd tree                            # find the host that osd.75 lives on
# systemctl restart ceph-osd@75.service    # restart the corresponding OSD daemon on that host
Wait for Ceph to finish recovery on that OSD; the cluster then returns to normal.
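A minimal shell sketch of the same workflow; ceph osd find is an alternative way to locate the host, and <that-host> is a placeholder for the host it reports:
OSD_ID=75                                  # implicated OSD from "ceph health detail"
ceph osd find ${OSD_ID}                    # prints the OSD's IP and CRUSH location, including its host
ssh <that-host> sudo systemctl restart ceph-osd@${OSD_ID}.service
watch -n 10 ceph -s                        # watch recovery until the cluster is healthy again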
2. Diagnose a disk failure and remove the failed disk from the Ceph cluster
Symptom: ceph reports an osd down.
Fix:
1. Restart the OSD service on that node
systemctl restart ceph-osd@<ID>.service
systemctl restart ceph-osd.target
2. Sync the time
service ntp restart
3. Check whether the network is healthy
4. Check whether the disk backing the node's OSD is healthy
ceph-volume lvm list    # shows which device backs each OSD
lsblk
5. Remove the failed disk from the Ceph cluster (a combined script follows after step 6)
sudo ceph osd out <id>                  # mark the failed OSD out of the cluster
sudo ceph osd crush remove osd.<id>     # remove it from the CRUSH map
sudo ceph auth del osd.<id>             # delete its cephx key
sudo ceph osd rm osd.<id>               # remove it from the OSD map
6. Replace the disk with a new one and add it back to the cluster
cd /home/ubuntu/ceph
sudo ceph-deploy disk zap <IP> /dev/sdf
sudo ceph-deploy osd create --bluestore <IP> --data /dev/sdf
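A minimal sketch that wraps steps 5 and 6 into one script; OSD_ID, HOST and DEV are placeholders for your environment, and the working directory is the ceph-deploy directory used above:
#!/usr/bin/env bash
set -euo pipefail
OSD_ID=11            # id of the failed OSD (placeholder)
HOST=node01          # host holding the failed disk (placeholder)
DEV=/dev/sdf         # replacement device (placeholder)

sudo ceph osd out "${OSD_ID}"                 # mark the OSD out
sudo ceph osd crush remove "osd.${OSD_ID}"    # remove it from the CRUSH map
sudo ceph auth del "osd.${OSD_ID}"            # delete its cephx key
sudo ceph osd rm "osd.${OSD_ID}"              # remove it from the OSD map

# after the disk has been physically replaced:
cd /home/ubuntu/ceph
sudo ceph-deploy disk zap "${HOST}" "${DEV}"                       # wipe the new disk
sudo ceph-deploy osd create --bluestore "${HOST}" --data "${DEV}"  # create the new OSD on it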
3. Ceph client requests respond slowly
# ceph health detail reports the following errors:
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
mds<hostname>(mds.0): Client <hostname> failing to respond to capability release client_id: 5374472
MDS_SLOW_REQUEST 1 MDSs report slow requests
mds<hostname>(mds.0): 2 slow requests are blocked > 30 sec
Fix:
Evict the session with this client ID (reference: https://blog.csdn.net/zuoyang1990/article/details/98530070):
$ ceph daemon mds.<hostname> session ls | grep 284951
$ ceph tell mds.<hostname> session evict id=284951
If the command fails with the following error:
$ ceph tell mds.<hostname> session evict id=284951
2020-08-13 10:45:03.869 7f271b7fe700 0 client.306366 ms_handle_reset on 10.100.21.95:6800/1646216103
2020-08-13 10:45:03.881 7f2730ff9700 0 client.316415 ms_handle_reset on 10.100.21.95:6800/1646216103
Error EAGAIN: MDS is replaying log
The command must be run on the mds.0 node itself; otherwise this client cannot be found.
Migrate the node's workload elsewhere, reboot the node, remount the shared volume, then let it accept tasks again.
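A minimal sketch of the eviction above, run on the node hosting the active mds.0; the client id 5374472 is the one from the health output, and mds.<hostname> is the same placeholder as above:
$ ceph mds stat                                         # confirm which daemon is the active mds.0
$ ceph daemon mds.<hostname> session ls | grep 5374472  # confirm the stuck session exists
$ ceph tell mds.<hostname> session evict id=5374472     # evict the stuck client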
4. 1 MDSs report slow requests
This usually recovers by itself after a while. If it does not recover for a long time (see the polling sketch below), handle it as follows:
Restarting the mon usually resolves it: $ systemctl restart ceph-mon.target
If that does not help, restart the MDS: $ systemctl restart ceph-mds@${HOSTNAME}
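A minimal polling sketch to check whether the warning clears on its own before restarting any daemon; MDS_SLOW_REQUEST is the health code behind this warning:
# poll health once a minute; stop as soon as the slow-request warning is gone
while ceph health detail | grep -q MDS_SLOW_REQUEST; do
    sleep 60
done
echo "slow requests cleared"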
5. 1 full osd(s) 2 nearfull osd(s)
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#no-free-drive-space
$ ceph osd dump | grep full_ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
Fix:
$ ceph osd reweight 4 0.85                     # manually adjust the OSD's reweight
$ ceph osd reweight-by-utilization 110 0.3 10  # adjust automatically (a dry-run preview sketch follows below)
$ ceph osd crush reweight osd.11 0.5           # adjust the CRUSH WEIGHT
Alternatively, expand CephFS capacity or delete data that is no longer needed.
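It is worth previewing the automatic reweight before applying it; a minimal sketch, assuming ceph osd test-reweight-by-utilization (the dry-run counterpart) is available in your release:
$ ceph osd df                                        # compare each OSD's %USE against the ratios above
$ ceph osd test-reweight-by-utilization 110 0.3 10   # dry run: reports what would be changed
$ ceph osd reweight-by-utilization 110 0.3 10        # apply the change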
6. clock skew detected on mon.[ hostname ]
8 osds down
# Check whether the network is having problems:
ping $hostname
9 packets transmitted, 4 received, 55% packet loss, time 7998ms    # packet loss observed
# After fixing the network problem, restart the OSDs on the affected node
$ sudo systemctl restart ceph-osd.target
$ sudo systemctl restart ceph-mon.target
$ sudo systemctl restart ntp    # re-sync the node's clock
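A minimal sketch for confirming that the skew really is a time-sync problem before restarting daemons; ceph time-sync-status and ntpq are assumed to be available:
$ ceph time-sync-status        # per-mon clock skew as seen by the lead monitor
$ ntpq -p                      # check whether this node's NTP peers are reachable and in sync
$ ping -c 10 <mon-hostname>    # re-check for packet loss after fixing the network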