Solving Ceph "ERROR: unable to open OSD superblock"

Author: 咏箢 | Published 2018-12-03 15:28

    Problem description

    On a single-node Ceph cluster, the machine room lost power unexpectedly over the weekend. On Monday the cluster showed a large number of unhealthy PGs (pg down, stale, etc.) and many OSDs down. Starting the OSDs by hand failed, and the OSD logs were full of errors such as: ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-7: (2) No such file or directory.

    Ceph version

    [root@hyhive /]# ceph version
    ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
    [root@hyhive /]# 
    

    Initial troubleshooting

    1. Check the OSD status

    [root@hyhive osd]# ceph osd tree
    ID WEIGHT   TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY 
    -5 18.56999 root hdd_root                                       
    -6 18.56999     host hdd_host                                   
     1  1.85699         osd.1        down        0          1.00000 
     2  1.85699         osd.2        down        0          1.00000 
     3  1.85699         osd.3          up  1.00000          1.00000 
     4  1.85699         osd.4        down        0          1.00000 
     5  1.85699         osd.5          up  1.00000          1.00000 
     6  1.85699         osd.6          up  1.00000          1.00000 
     7  1.85699         osd.7        down        0          1.00000 
     8  1.85699         osd.8        down        0          1.00000 
     9  1.85699         osd.9        down        0          1.00000 
    10  1.85699         osd.10       down        0          1.00000 
    -3  0.88399 root ssd_root                                       
    -4  0.88399     host ssd_host                                   
     0  0.44199         osd.0        down        0          1.00000 
    11  0.44199         osd.11       down  1.00000          1.00000 
    -1 19.45399 root default                                        
    -2 19.45399     host hyhive                                     
     0  0.44199         osd.0        down        0          1.00000 
     1  1.85699         osd.1        down        0          1.00000 
     2  1.85699         osd.2        down        0          1.00000 
     3  1.85699         osd.3          up  1.00000          1.00000 
     4  1.85699         osd.4        down        0          1.00000 
     5  1.85699         osd.5          up  1.00000          1.00000 
     6  1.85699         osd.6          up  1.00000          1.00000 
     7  1.85699         osd.7        down        0          1.00000 
     8  1.85699         osd.8        down        0          1.00000 
     9  1.85699         osd.9        down        0          1.00000 
    10  1.85699         osd.10       down        0          1.00000 
    11  0.44199         osd.11       down  1.00000          1.00000 
    [root@hyhive osd]# 
    [root@hyhive osd]# 
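    To pull just the down OSDs out of that output, a one-liner like the following can help (not part of the original article; it assumes the plain-text ceph osd tree layout shown above):

    # Print each down OSD once (OSDs appear under several CRUSH roots above).
    ceph osd tree | awk '$3 ~ /^osd\./ && $4 == "down" {print $3}' | sort -u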
    
    

    2. Check the mount status of each disk

    [root@hyhive osd]# lsblk
    NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda               8:0    0 138.8G  0 disk 
    ├─sda1            8:1    0     1M  0 part 
    ├─sda2            8:2    0   475M  0 part /boot
    └─sda3            8:3    0 131.9G  0 part 
      ├─centos-root 253:0    0  99.9G  0 lvm  /
      └─centos-swap 253:1    0    32G  0 lvm  [SWAP]
    sdb               8:16   0 447.1G  0 disk 
    ├─sdb1            8:17   0 442.1G  0 part 
    └─sdb2            8:18   0     5G  0 part 
    sdc               8:32   0 447.1G  0 disk 
    ├─sdc1            8:33   0 442.1G  0 part 
    └─sdc2            8:34   0     5G  0 part 
    sdd               8:48   0   1.8T  0 disk 
    ├─sdd1            8:49   0   1.8T  0 part 
    └─sdd2            8:50   0     5G  0 part 
    sde               8:64   0   1.8T  0 disk 
    ├─sde1            8:65   0   1.8T  0 part 
    └─sde2            8:66   0     5G  0 part 
    sdf               8:80   0   1.8T  0 disk 
    ├─sdf1            8:81   0   1.8T  0 part /var/lib/ceph/osd/ceph-3
    └─sdf2            8:82   0     5G  0 part 
    sdg               8:96   0   1.8T  0 disk 
    ├─sdg1            8:97   0   1.8T  0 part 
    └─sdg2            8:98   0     5G  0 part 
    sdh               8:112  0   1.8T  0 disk 
    ├─sdh1            8:113  0   1.8T  0 part /var/lib/ceph/osd/ceph-5
    └─sdh2            8:114  0     5G  0 part 
    sdi               8:128  0   1.8T  0 disk 
    ├─sdi1            8:129  0   1.8T  0 part /var/lib/ceph/osd/ceph-6
    └─sdi2            8:130  0     5G  0 part 
    sdj               8:144  0   1.8T  0 disk 
    ├─sdj1            8:145  0   1.8T  0 part 
    └─sdj2            8:146  0     5G  0 part 
    sdk               8:160  0   1.8T  0 disk 
    ├─sdk1            8:161  0   1.8T  0 part 
    └─sdk2            8:162  0     5G  0 part 
    sdl               8:176  0   1.8T  0 disk 
    ├─sdl1            8:177  0   1.8T  0 part 
    └─sdl2            8:178  0     5G  0 part 
    sdm               8:192  0   1.8T  0 disk 
    ├─sdm1            8:193  0   1.8T  0 part 
    └─sdm2            8:194  0     5G  0 part 
    [root@hyhive osd]# 
    

    The lsblk output shows that many of the OSD data partitions are no longer mounted. This looks related to the failure and is confirmed in the next step.

    3. Check the ceph-osd log
    After a manual attempt to start the osd.7 service failed, its log was inspected; it repeatedly reports the following error:

    [root@hyhive osd]# vim /var/log/ceph/ceph-osd.7.log
    
    2018-12-03 13:05:49.951385 7f060066f800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-7: (2) No such file or directory
    2018-12-03 13:05:50.159180 7f6ddf984800  0 set uid:gid to 167:167 (ceph:ceph)
    2018-12-03 13:05:50.159202 7f6ddf984800  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 27941
    2018-12-03 13:05:50.159488 7f6ddf984800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-7: (2) No such file or directory
    2018-12-03 13:10:52.974345 7f1405dab800  0 set uid:gid to 167:167 (ceph:ceph)
    2018-12-03 13:10:52.974368 7f1405dab800  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 34223
    2018-12-03 13:10:52.974634 7f1405dab800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-7: (2) No such file or directory
    2018-12-03 13:10:53.123099 7f0f7af13800  0 set uid:gid to 167:167 (ceph:ceph)
    2018-12-03 13:10:53.123120 7f0f7af13800  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 34295
    2018-12-03 13:10:53.123365 7f0f7af13800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-7: (2) No such file or directory
    2018-12-03 13:10:53.275191 7f4a49579800  0 set uid:gid to 167:167 (ceph:ceph)
    2018-12-03 13:10:53.275212 7f4a49579800  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 34356
    2018-12-03 13:10:53.275464 7f4a49579800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-7: (2) No such file or directory
    

    Listing the contents of /var/lib/ceph/osd/ceph-7 shows that the directory is empty.
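    Which data directories are affected can be checked in one go; the loop below is a sketch added for reference (it assumes util-linux's mountpoint(1) is available) and is not part of the original write-up:

    # Report every OSD data directory whose backing partition is not mounted.
    for d in /var/lib/ceph/osd/ceph-*; do
        mountpoint -q "$d" || echo "$d: not mounted"
    done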

    Resolution steps

    • Method 1

    Approach:
    1. Mount one data partition on a temporary directory.
    2. Read the OSD id it belongs to from its whoami file.
    3. Unmount the temporary mount.
    4. Mount the partition at the matching directory under /var/lib/ceph/osd/.
    5. Restart the OSD service.
    The commands look like this (a scripted sketch follows the transcript below):
    mount /dev/{partition} /mnt
    cat /mnt/whoami
    umount /mnt
    mount -t xfs /dev/{partition} /var/lib/ceph/osd/ceph-{osd_id}
    systemctl restart ceph-osd@{osd_id}


    A complete run of this procedure looks like this:

    [root@hyhive osd]# mount /dev/sdm1 /mnt
    [root@hyhive osd]# 
    [root@hyhive osd]# cd /mnt
    [root@hyhive mnt]# ls
    activate.monmap  ceph_fsid  fsid     journal_uuid  magic  store_version  systemd  whoami
    active           current    journal  keyring       ready  superblock     type
    [root@hyhive mnt]# 
    [root@hyhive mnt]# cat whoami
    10
    [root@hyhive mnt]# 
    [root@hyhive mnt]# cd ../
    [root@hyhive /]# umount /mnt
    [root@hyhive /]# 
    [root@hyhive /]# 
    # Note: the options after -o are tuned for our environment and can be omitted.
    [root@hyhive /]# mount -t xfs -o noatime,logbsize=128k /dev/sdm1 /var/lib/ceph/osd/ceph-10/   
    [root@hyhive /]# 
    [root@hyhive /]# lsblk
    NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda               8:0    0 138.8G  0 disk 
    ├─sda1            8:1    0     1M  0 part 
    ├─sda2            8:2    0   475M  0 part /boot
    └─sda3            8:3    0 131.9G  0 part 
      ├─centos-root 253:0    0  99.9G  0 lvm  /
      └─centos-swap 253:1    0    32G  0 lvm  [SWAP]
    sdb               8:16   0 447.1G  0 disk 
    ├─sdb1            8:17   0 442.1G  0 part 
    └─sdb2            8:18   0     5G  0 part 
    sdc               8:32   0 447.1G  0 disk 
    ├─sdc1            8:33   0 442.1G  0 part 
    └─sdc2            8:34   0     5G  0 part 
    sdd               8:48   0   1.8T  0 disk 
    ├─sdd1            8:49   0   1.8T  0 part 
    └─sdd2            8:50   0     5G  0 part 
    sde               8:64   0   1.8T  0 disk 
    ├─sde1            8:65   0   1.8T  0 part 
    └─sde2            8:66   0     5G  0 part 
    sdf               8:80   0   1.8T  0 disk 
    ├─sdf1            8:81   0   1.8T  0 part /var/lib/ceph/osd/ceph-3
    └─sdf2            8:82   0     5G  0 part 
    sdg               8:96   0   1.8T  0 disk 
    ├─sdg1            8:97   0   1.8T  0 part 
    └─sdg2            8:98   0     5G  0 part 
    sdh               8:112  0   1.8T  0 disk 
    ├─sdh1            8:113  0   1.8T  0 part /var/lib/ceph/osd/ceph-5
    └─sdh2            8:114  0     5G  0 part 
    sdi               8:128  0   1.8T  0 disk 
    ├─sdi1            8:129  0   1.8T  0 part /var/lib/ceph/osd/ceph-6
    └─sdi2            8:130  0     5G  0 part 
    sdj               8:144  0   1.8T  0 disk 
    ├─sdj1            8:145  0   1.8T  0 part 
    └─sdj2            8:146  0     5G  0 part 
    sdk               8:160  0   1.8T  0 disk 
    ├─sdk1            8:161  0   1.8T  0 part 
    └─sdk2            8:162  0     5G  0 part 
    sdl               8:176  0   1.8T  0 disk 
    ├─sdl1            8:177  0   1.8T  0 part 
    └─sdl2            8:178  0     5G  0 part 
    sdm               8:192  0   1.8T  0 disk 
    ├─sdm1            8:193  0   1.8T  0 part /var/lib/ceph/osd/ceph-10
    └─sdm2            8:194  0     5G  0 part 
    [root@hyhive /]# 
    
    [root@hyhive /]# systemctl restart ceph-osd@10
    Job for ceph-osd@10.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@10.service" and "journalctl -xe" for details.
    To force a start use "systemctl reset-failed ceph-osd@10.service" followed by "systemctl start ceph-osd@10.service" again.
    [root@hyhive /]# 
    [root@hyhive /]# systemctl reset-failed ceph-osd@10
    [root@hyhive /]# 
    [root@hyhive /]# systemctl restart ceph-osd@10
    [root@hyhive /]# 
    [root@hyhive /]# systemctl status ceph-osd@10
    ● ceph-osd@10.service - Ceph object storage daemon
       Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
       Active: active (running) since Mon 2018-12-03 13:15:27 CST; 6s ago
      Process: 39672 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
     Main PID: 39680 (ceph-osd)
       CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@10.service
               └─39680 /usr/bin/ceph-osd -f --cluster ceph --id 10 --setuser ceph --setgroup ceph
    
    Dec 03 13:15:27 hyhive systemd[1]: Starting Ceph object storage daemon...
    Dec 03 13:15:27 hyhive systemd[1]: Started Ceph object storage daemon.
    Dec 03 13:15:27 hyhive ceph-osd[39680]: starting osd.10 at :/0 osd_data /var/lib/ceph/osd/ceph-10 /var/lib/ceph/osd/ceph-10/journal
    [root@hyhive /]# 
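    The same steps can be wrapped in a small script. This is only a minimal sketch under the assumptions used above (XFS data partitions, the same mount options, a whoami file on the partition); the script name and DEV argument are illustrative:

    #!/bin/bash
    # Usage: ./recover_osd.sh /dev/sdX1   (illustrative name and argument)
    # Mount the data partition temporarily, read its OSD id from "whoami",
    # then remount it at its real location and restart the daemon.
    set -e
    DEV=$1

    mount "$DEV" /mnt                        # temporary mount
    OSD_ID=$(cat /mnt/whoami)                # OSD id stored on the partition
    umount /mnt

    mount -t xfs -o noatime,logbsize=128k "$DEV" "/var/lib/ceph/osd/ceph-${OSD_ID}"
    systemctl reset-failed "ceph-osd@${OSD_ID}" || true   # clear any start-limit state
    systemctl restart "ceph-osd@${OSD_ID}"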
    
    
    • Method 2

    After handling one disk this way, method 1 turned out to be a bit tedious: all the mounting and unmounting is only there to discover which OSD each disk belongs to. Is there a command that shows the disk-to-OSD mapping for every disk at once? Yes: ceph-disk list.
    Approach:
    1. List the OSD that each disk belongs to.
    2. Mount each partition according to that mapping.
    3. Restart the OSD services.
    The commands are (a scripted sketch follows the transcript below):
    ceph-disk list
    mount -t xfs /dev/{partition} /var/lib/ceph/osd/ceph-{osd_id}
    systemctl restart ceph-osd@{osd_id}


    A complete run looks like this:

    [root@hyhive /]# ceph-disk list
    /dev/dm-0 other, xfs, mounted on /
    /dev/dm-1 swap, swap
    /dev/sda :
     /dev/sda1 other, 21686148-6449-6e6f-744e-656564454649
     /dev/sda3 other, LVM2_member
     /dev/sda2 other, xfs, mounted on /boot
    /dev/sdb :
     /dev/sdb2 ceph journal, for /dev/sdb1
     /dev/sdb1 ceph data, prepared, cluster ceph, osd.11, journal /dev/sdb2
    /dev/sdc :
     /dev/sdc2 ceph journal, for /dev/sdc1
     /dev/sdc1 ceph data, prepared, cluster ceph, osd.0, journal /dev/sdc2
    /dev/sdd :
     /dev/sdd2 ceph journal, for /dev/sdd1
     /dev/sdd1 ceph data, prepared, cluster ceph, osd.1, journal /dev/sdd2
    /dev/sde :
     /dev/sde2 ceph journal, for /dev/sde1
     /dev/sde1 ceph data, prepared, cluster ceph, osd.2, journal /dev/sde2
    /dev/sdf :
     /dev/sdf2 ceph journal, for /dev/sdf1
     /dev/sdf1 ceph data, active, cluster ceph, osd.3, journal /dev/sdf2
    /dev/sdg :
     /dev/sdg2 ceph journal, for /dev/sdg1
     /dev/sdg1 ceph data, prepared, cluster ceph, osd.4, journal /dev/sdg2
    /dev/sdh :
     /dev/sdh2 ceph journal, for /dev/sdh1
     /dev/sdh1 ceph data, active, cluster ceph, osd.5, journal /dev/sdh2
    /dev/sdi :
     /dev/sdi2 ceph journal, for /dev/sdi1
     /dev/sdi1 ceph data, active, cluster ceph, osd.6, journal /dev/sdi2
    /dev/sdj :
     /dev/sdj2 ceph journal, for /dev/sdj1
     /dev/sdj1 ceph data, prepared, cluster ceph, osd.7, journal /dev/sdj2
    /dev/sdk :
     /dev/sdk2 ceph journal, for /dev/sdk1
     /dev/sdk1 ceph data, prepared, cluster ceph, osd.8, journal /dev/sdk2
    /dev/sdl :
     /dev/sdl2 ceph journal, for /dev/sdl1
     /dev/sdl1 ceph data, active, cluster ceph, osd.9, journal /dev/sdl2
    /dev/sdm :
     /dev/sdm2 ceph journal, for /dev/sdm1
     /dev/sdm1 ceph data, active, cluster ceph, osd.10, journal /dev/sdm2
    [root@hyhive /]# 
    
    [root@hyhive /]# mount -t xfs -o noatime,logbsize=128k /dev/sdk1 /var/lib/ceph/osd/ceph-8/
    [root@hyhive /]# systemctl reset-failed ceph-osd@8
    [root@hyhive /]# systemctl restart ceph-osd@8
    [root@hyhive /]# systemctl status ceph-osd@8
    ● ceph-osd@8.service - Ceph object storage daemon
       Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
       Active: active (running) since Mon 2018-12-03 13:21:23 CST; 28s ago
      Process: 48154 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
     Main PID: 48161 (ceph-osd)
       CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@8.service
               └─48161 /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph
    
    Dec 03 13:21:23 hyhive systemd[1]: Starting Ceph object storage daemon...
    Dec 03 13:21:23 hyhive systemd[1]: Started Ceph object storage daemon.
    Dec 03 13:21:23 hyhive ceph-osd[48161]: starting osd.8 at :/0 osd_data /var/lib/ceph/osd/ceph-8 /var/lib/ceph/osd/ceph-8/journal
    Dec 03 13:21:49 hyhive ceph-osd[48161]: 2018-12-03 13:21:49.959213 7f9eec9e6800 -1 osd.8 129653 log_to_monitors {default=true}
    [root@hyhive /]# 
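    Method 2 also lends itself to a loop over all data partitions. The following is only a sketch, assuming the jewel-era ceph-disk list output format shown above and the same mount options as before:

    #!/bin/bash
    # Parse "ceph-disk list" for lines like
    #   /dev/sdk1 ceph data, prepared, cluster ceph, osd.8, journal /dev/sdk2
    # and mount/restart every OSD whose data directory is not mounted yet.
    ceph-disk list 2>/dev/null | awk '/ceph data/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^osd\./) printf "%s %s\n", $1, substr($i, 5)
    }' | while read -r part id; do
        id=${id%,}                                  # drop the trailing comma
        dir=/var/lib/ceph/osd/ceph-${id}
        mountpoint -q "$dir" && continue            # skip OSDs that are already mounted
        mount -t xfs -o noatime,logbsize=128k "$part" "$dir"
        systemctl reset-failed "ceph-osd@${id}" || true
        systemctl restart "ceph-osd@${id}"
    done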
    
    

    Cluster state after recovery

    [root@hyhive /]# 
    [root@hyhive /]# ceph osd tree
    ID WEIGHT   TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY 
    -5 18.56999 root hdd_root                                       
    -6 18.56999     host hdd_host                                   
     1  1.85699         osd.1          up  1.00000          1.00000 
     2  1.85699         osd.2          up  1.00000          1.00000 
     3  1.85699         osd.3          up  1.00000          1.00000 
     4  1.85699         osd.4          up  1.00000          1.00000 
     5  1.85699         osd.5          up  1.00000          1.00000 
     6  1.85699         osd.6          up  1.00000          1.00000 
     7  1.85699         osd.7          up  1.00000          1.00000 
     8  1.85699         osd.8          up  1.00000          1.00000 
     9  1.85699         osd.9          up  1.00000          1.00000 
    10  1.85699         osd.10         up  1.00000          1.00000 
    -3  0.88399 root ssd_root                                       
    -4  0.88399     host ssd_host                                   
     0  0.44199         osd.0          up  1.00000          1.00000 
    11  0.44199         osd.11         up  1.00000          1.00000 
    -1 19.45399 root default                                        
    -2 19.45399     host hyhive                                     
     0  0.44199         osd.0          up  1.00000          1.00000 
     1  1.85699         osd.1          up  1.00000          1.00000 
     2  1.85699         osd.2          up  1.00000          1.00000 
     3  1.85699         osd.3          up  1.00000          1.00000 
     4  1.85699         osd.4          up  1.00000          1.00000 
     5  1.85699         osd.5          up  1.00000          1.00000 
     6  1.85699         osd.6          up  1.00000          1.00000 
     7  1.85699         osd.7          up  1.00000          1.00000 
     8  1.85699         osd.8          up  1.00000          1.00000 
     9  1.85699         osd.9          up  1.00000          1.00000 
    10  1.85699         osd.10         up  1.00000          1.00000 
    11  0.44199         osd.11         up  1.00000          1.00000 
    [root@hyhive /]# 
    [root@hyhive /]# ceph -s
        cluster 0eef9474-08c7-445e-98e9-35120d03bf19
         health HEALTH_WARN
                too many PGs per OSD (381 > max 300)
         monmap e1: 1 mons at {hyhive=192.168.3.1:6789/0}
                election epoch 23, quorum 0 hyhive
          fsmap e90: 1/1/1 up {0=hyhive=up:active}
         osdmap e129886: 12 osds: 12 up, 12 in
                flags sortbitwise,require_jewel_osds
          pgmap v39338084: 2288 pgs, 12 pools, 1874 GB data, 485 kobjects
                3654 GB used, 15800 GB / 19454 GB avail
                    2288 active+clean
      client io 4357 kB/s rd, 357 kB/s wr, 265 op/s rd, 18 op/s wr
    [root@hyhive /]# 
    
    
