Issues Related to Missing/Corrupt HDFS File Blocks

Author: 海边的贝壳林 | Published: 2018-05-10 15:45

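Ambari showed an HDFS alert, [alert]: Total Blocks:[*], Missing Blocks:[*], which meant that some file blocks were missing or corrupt. HDFS also failed to start, and the log kept looping on the following message.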

Retrying after 10 seconds. Reason: Execution of '/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://test01.bigdata.hbh:8020 -safemode get | grep 'Safe mode is OFF'' returned 1.

The NameNode stayed in safe mode:

[root@test01 ~]# sudo -u hdfs hdfs dfsadmin -fs hdfs://test01.bigdata.hbh:8020 -safemode get
Safe mode is ON

The NameNode UI showed the following description:

Safe mode is ON. The reported blocks 4156 needs additional 2 blocks to reach the threshold 1.0000 of total blocks 4157. The number of live datanodes 4 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.

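This means that, because some blocks are missing, the fraction of blocks reported by the DataNodes is below the safe mode threshold. The threshold is an HDFS setting (dfs.namenode.safemode.threshold-pct); the screenshot below, from Ambari's configuration page, shows it set to 100%, so not a single block may be missing. With a setting of 99%, safe mode would presumably not have been triggered.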


[Screenshot image.png: the safe mode threshold setting in Ambari's HDFS configuration]
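The threshold arithmetic can be checked with a few lines of Python. This is only a sketch under a simplified assumption: that the NameNode leaves safe mode once the number of reported blocks reaches floor(total blocks × threshold). It ignores the other exit conditions (minimum live DataNodes, the extension period) and does not try to reproduce the "additional 2 blocks" figure printed in the message. The function names are made up for illustration.

```python
import re

# Safe mode message copied from the NameNode UI output above (shortened).
MESSAGE = (
    "Safe mode is ON. The reported blocks 4156 needs additional 2 blocks "
    "to reach the threshold 1.0000 of total blocks 4157."
)


def parse_safemode_message(message):
    """Extract (reported, threshold, total) from the safe mode message."""
    match = re.search(
        r"reported blocks (\d+) .* threshold ([\d.]+) of total blocks (\d+)",
        message,
    )
    if match is None:
        raise ValueError("unrecognized safe mode message")
    reported, threshold, total = match.groups()
    return int(reported), float(threshold), int(total)


def blocks_required(total, threshold):
    """Assumed rule: floor(total * threshold) blocks must be reported."""
    return int(total * threshold)


def would_leave_safemode(reported, total, threshold):
    """True if the block condition alone would let safe mode end."""
    return reported >= blocks_required(total, threshold)


reported, threshold, total = parse_safemode_message(MESSAGE)
print(would_leave_safemode(reported, total, threshold))  # threshold from the message
print(would_leave_safemode(reported, total, 0.99))       # hypothetical 99% setting
```

Under this model, a threshold of 1.0 requires all 4157 blocks, so 4156 reported blocks are not enough, while 0.99 requires only 4115.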

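Problem description: the disks on the test cluster are small, only a few tens of GB. An earlier benchmark run filled them up, which caused data blocks to be lost. After that the system could not start properly and kept reporting that HDFS was in safe mode.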

Basics

What is safe mode

What triggers safe mode

Data that has lost some of its replicas

Check

[hdfs@test01 ~]$ hadoop fsck /user/root/.staging/job_1515575016190_0003/job.jar -files -blocks -locations -racks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&files=1&blocks=1&locations=1&racks=1&path=%2Fuser%2Froot%2F.staging%2Fjob_1515575016190_0003%2Fjob.jar
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /user/root/.staging/job_1515575016190_0003/job.jar at Fri Jan 26 16:11:15 CST 2018
/user/root/.staging/job_1515575016190_0003/job.jar 272019 bytes, 1 block(s):  Under replicated BP-1912246748-192.168.89.173-1513143837848:blk_1073751971_11222. Target Replicas is 10 but found 4 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
0. BP-1912246748-192.168.89.173-1513143837848:blk_1073751971_11222 len=272019 repl=4 [/default-rack/172.16.201.200:50010, /default-rack/172.16.201.201:50010, /default-rack/172.16.201.202:50010, /default-rack/172.16.201.204:50010]

Status: HEALTHY
 Total size:    272019 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  1 (avg. block size 272019 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   1 (100.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    3
 Average block replication: 4.0
 Corrupt blocks:        0
 Missing replicas:      6 (60.0 %)
 Number of data-nodes:      4
 Number of racks:       1
FSCK ended at Fri Jan 26 16:11:15 CST 2018 in 0 milliseconds


The filesystem under path '/user/root/.staging/job_1515575016190_0003/job.jar' is HEALTHY
[hdfs@test01 ~]$
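The numbers in this report fit together as follows: the file asks for 10 replicas, the cluster has only 4 DataNodes, so 6 replicas are missing, yet the block still has live replicas and fsck reports HEALTHY. The target of 10 is presumably the replication factor that MapReduce applies to job submission files. A small sketch of the arithmetic, assuming the percentage is simply missing replicas divided by target replicas (which matches this single-block output, but is not checked against the fsck source); the function name is made up:

```python
import re

# Under-replication line copied from the fsck output above.
LINE = (
    "Under replicated BP-1912246748-192.168.89.173-1513143837848:"
    "blk_1073751971_11222. Target Replicas is 10 but found 4 live replica(s), "
    "0 decommissioned replica(s) and 0 decommissioning replica(s)."
)


def replica_shortfall(line):
    """Return (target, live, missing, missing_percent) for one fsck line."""
    match = re.search(
        r"Target Replicas is (\d+) but found (\d+) live replica", line
    )
    if match is None:
        raise ValueError("not an under-replication line")
    target, live = int(match.group(1)), int(match.group(2))
    missing = target - live
    # Assumption: percentage = missing replicas / target replicas.
    return target, live, missing, 100.0 * missing / target


print(replica_shortfall(LINE))
```

On a 4-node cluster this file can never reach 10 replicas, so the under-replication warning here is harmless; no data has been lost.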

Corrupt data

[hdfs@test01 ~]$ hadoop fsck /apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793 -files -blocks -locations -racks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&files=1&blocks=1&locations=1&racks=1&path=%2Fapps%2Fhbase%2Fdata%2FoldWALs%2Ftest01.bigdata.hbh%252C16020%252C1515637923065.default.1515745933793
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793 at Fri Jan 26 16:12:22 CST 2018
/apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793 91 bytes, 1 block(s):
/apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753448
 MISSING 1 blocks of total size 91 B
0. BP-1912246748-192.168.89.173-1513143837848:blk_1073753448_12711 len=91 MISSING!

Status: CORRUPT
 Total size:    91 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  1 (avg. block size 91 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:  1 (100.0 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:    1
  MISSING BLOCKS:   1
  MISSING SIZE:     91 B
  CORRUPT BLOCKS:   1
  ********************************
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    3
 Average block replication: 0.0
 Corrupt blocks:        1
 Missing replicas:      0
 Number of data-nodes:      4
 Number of racks:       1
FSCK ended at Fri Jan 26 16:12:22 CST 2018 in 1 milliseconds


The filesystem under path '/apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793' is CORRUPT
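The two reports differ in one important way: an under-replicated block still has at least one live replica that HDFS can copy from, while a MISSING block has none, so it cannot be repaired automatically. The sketch below pulls the relevant counters out of an fsck summary so the two cases can be told apart in a script. It assumes the summary labels look exactly like the ones in the output above; the function names are made up for illustration.

```python
import re

# A few summary lines copied from the CORRUPT report above.
SUMMARY = """\
Status: CORRUPT
 Total blocks (validated):  1 (avg. block size 91 B)
  MISSING BLOCKS:   1
  CORRUPT BLOCKS:   1
 Under-replicated blocks:   0 (0.0 %)
"""


def summarize_fsck(text):
    """Return the status plus missing and under-replicated block counts."""

    def count(label):
        # A label that does not appear (e.g. in a HEALTHY report) counts as 0.
        pattern = rf"^\s*{re.escape(label)}:\s+(\d+)"
        match = re.search(pattern, text, re.MULTILINE)
        return int(match.group(1)) if match else 0

    # "Status:" is not always at the start of a line in piped output.
    status = re.search(r"Status:\s*(\w+)", text)
    return {
        "status": status.group(1) if status else "UNKNOWN",
        "missing_blocks": count("MISSING BLOCKS"),
        "under_replicated": count("Under-replicated blocks"),
    }


def needs_manual_action(summary):
    """Missing blocks have no live replica, so HDFS cannot re-replicate them."""
    return summary["missing_blocks"] > 0


print(summarize_fsck(SUMMARY))
```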

Fixing the problem

Find out exactly which files, on which DataNode, are missing or corrupt

[root@test01 ~]# sudo -u hdfs hdfs fsck /apps/hbase/data/oldWALs/ | egrep -v '^\.+$' | egrep -v '^$'
Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&path=%2Fapps%2Fhbase%2Fdata%2FoldWALs
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /apps/hbase/data/oldWALs at Thu Feb 08 09:57:58 CST 2018
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753450
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: MISSING 1 blocks of total size 91 B..
/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753446
/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606: MISSING 1 blocks of total size 91 B.Status: CORRUPT
 Total size:    182 B
 Total dirs:    1
 Total files:   2
 Total symlinks:        0
 Total blocks (validated):  2 (avg. block size 91 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:  2 (100.0 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:    2
  MISSING BLOCKS:   2
  MISSING SIZE:     182 B
  CORRUPT BLOCKS:   2
  ********************************
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    3
 Average block replication: 0.0
 Corrupt blocks:        2
 Missing replicas:      0
 Number of data-nodes:      4
 Number of racks:       1
FSCK ended at Thu Feb 08 09:57:58 CST 2018 in 1 milliseconds
The filesystem under path '/apps/hbase/data/oldWALs' is CORRUPT
[root@test01 ~]# sudo -u hdfs hadoop fs -rm /apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606
18/02/08 09:58:17 INFO fs.TrashPolicyDefault: Moved: 'hdfs://test01.bigdata.hbh:8020/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606' to trash at: hdfs://test01.bigdata.hbh:8020/user/hdfs/.Trash/Current/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606
[root@test01 ~]# sudo -u hdfs hdfs fsck /apps/hbase/data/oldWALs/ | egrep -v '^\.+$' | egrep -v '^$'
Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&path=%2Fapps%2Fhbase%2Fdata%2FoldWALs
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /apps/hbase/data/oldWALs at Thu Feb 08 09:58:24 CST 2018
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753450
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: MISSING 1 blocks of total size 91 B.Status: CORRUPT
 Total size:    91 B
 Total dirs:    1
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  1 (avg. block size 91 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:  1 (100.0 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:    1
  MISSING BLOCKS:   1
  MISSING SIZE:     91 B
  CORRUPT BLOCKS:   1
  ********************************
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    3
 Average block replication: 0.0
 Corrupt blocks:        1
 Missing replicas:      0
 Number of data-nodes:      4
 Number of racks:       1
FSCK ended at Thu Feb 08 09:58:24 CST 2018 in 1 milliseconds
The filesystem under path '/apps/hbase/data/oldWALs' is CORRUPT
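With more than a handful of corrupt files, reading the paths off the fsck output by hand gets tedious. Below is a minimal sketch that collects them from saved fsck output, assuming the "CORRUPT blockpool ... block ..." line format shown above; the function name and regular expression are made up for illustration. (hdfs fsck also has a -list-corruptfileblocks option that prints a similar list directly.)

```python
import re

# Lines copied from the first fsck run in the transcript above.
FSCK_OUTPUT = """\
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753450
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: MISSING 1 blocks of total size 91 B..
/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753446
/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606: MISSING 1 blocks of total size 91 B.Status: CORRUPT
"""

# Matches "<path>: CORRUPT blockpool <pool> block <blk_id>".
CORRUPT_LINE = re.compile(
    r"^(?P<path>/.+?): CORRUPT blockpool \S+ block (?P<block>blk_\d+)\s*$"
)


def corrupt_files(fsck_output):
    """Map each corrupt file path to the block IDs fsck reported for it."""
    found = {}
    for line in fsck_output.splitlines():
        match = CORRUPT_LINE.match(line)
        if match:
            found.setdefault(match.group("path"), []).append(match.group("block"))
    return found


for path, blocks in corrupt_files(FSCK_OUTPUT).items():
    # Only print the paths; deciding what to delete is left to the operator.
    print(path, blocks)
```

The script deliberately stops at listing the files. As the transcript shows, hadoop fs -rm moved the file to the trash rather than deleting it outright, and removing a corrupt file is a decision best made one path at a time.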
