Hadoop input split size vs block

Author: SeanC52111 | Published: 2017-10-09 19:27 | Views: 0

    The answer by @user1668782 is a great explanation for the question and I'll try to give a graphical depiction of it.
    Assume we have a 400MB file which consists of 4 records (e.g. a 400MB CSV file with 4 rows of 100MB each).

    If the HDFS block size is configured as 128MB, the 4 records will not be distributed evenly among the blocks: the file occupies four blocks (128MB, 128MB, 128MB and 16MB), and the block boundaries fall in the middle of records.

    Block 1 contains the entire first record and a 28MB chunk of the second record.
    If a mapper is run on Block 1 alone, it cannot process the second record, because the block holds only the first 28MB of it.
    This is exactly the problem that input splits solve: input splits respect logical record boundaries.
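The block layout described above can be checked with a little arithmetic. The following is a minimal sketch in plain Python, not Hadoop code; the function name `block_layout` is made up for illustration, and sizes are given in MB to match the example.

```python
def block_layout(record_sizes, block_size):
    """Return, for each block, a list of (record_number, size_of_that_piece)."""
    # Compute the [start, end) range of every record in the file.
    records = []
    offset = 0
    for size in record_sizes:
        records.append((offset, offset + size))
        offset += size
    file_size = offset

    layout = []
    for block_start in range(0, file_size, block_size):
        # The last block may be shorter than block_size.
        block_end = min(block_start + block_size, file_size)
        pieces = []
        for number, (rec_start, rec_end) in enumerate(records, start=1):
            overlap = min(block_end, rec_end) - max(block_start, rec_start)
            if overlap > 0:
                pieces.append((number, overlap))
        layout.append(pieces)
    return layout


# Four records of 100MB each, stored in 128MB blocks.
layout = block_layout([100, 100, 100, 100], 128)
for block_number, pieces in enumerate(layout, start=1):
    print(f"Block {block_number}: {pieces}")
```

Only Block 1 starts on a record boundary; every other block begins in the middle of a record.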

    Let's assume the input split size is 200MB.
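As a side note on where a 200MB split size could come from when blocks are 128MB: Hadoop's `FileInputFormat` derives the split size from the block size and a configured minimum and maximum (the `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize` properties). The sketch below mirrors that rule in Python; the function name and the default values used here are illustrative assumptions, not part of the original answer.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    # Same rule as FileInputFormat: clamp the block size between min and max.
    return max(min_size, min(max_size, block_size))


block_size = 128 * MB
no_upper_limit = 2**63 - 1  # assumed stand-in for "no maximum configured"

# With a tiny minimum, the split size simply equals the block size.
default_split = compute_split_size(block_size, 1, no_upper_limit)

# Raising the minimum above the block size yields larger splits.
raised_split = compute_split_size(block_size, 200 * MB, no_upper_limit)

print(default_split // MB, raised_split // MB)
```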

    Input split 1 then contains both record 1 and record 2. Input split 2 does not start with record 2, since record 2 has already been assigned to input split 1; it starts with record 3.
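The assignment described above can be sketched as follows. This is a simplified model, assuming each record belongs to the split in which its first byte falls; it is not Hadoop's actual record reader, and `assign_records` is a hypothetical helper. Sizes are in MB.

```python
def assign_records(record_sizes, split_size):
    """Map split number -> record numbers whose first byte lies in that split."""
    assignment = {}
    offset = 0
    for number, size in enumerate(record_sizes, start=1):
        # A record is owned by the split that contains its starting offset.
        split_number = offset // split_size + 1
        assignment.setdefault(split_number, []).append(number)
        offset += size
    return assignment


# Four 100MB records with a 200MB split size.
splits_200 = assign_records([100, 100, 100, 100], 200)
print(splits_200)

# The same records if the split size equalled the 128MB block size.
splits_128 = assign_records([100, 100, 100, 100], 128)
print(splits_128)
```

In both cases every record is assigned to exactly one split, so no mapper is handed half a record.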

    This is why an input split is only a logical chunk of data: it points to start and end locations within blocks.
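To make the "logical chunk" idea concrete, here is a small sketch that treats a split as nothing more than a start offset and a length, and resolves it to the physical blocks it touches. The `LogicalSplit` class and `blocks_touched` function are illustrative names, not Hadoop's API; sizes are in MB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalSplit:
    """Hypothetical stand-in for a split: just a byte range, no data."""
    start: int
    length: int


def blocks_touched(split, block_size):
    """Return (block_number, offset_in_block, length) for each block piece."""
    pieces = []
    pos = split.start
    end = split.start + split.length
    while pos < end:
        block_index = pos // block_size
        block_end = (block_index + 1) * block_size
        piece_end = min(end, block_end)
        pieces.append((block_index + 1, pos - block_index * block_size, piece_end - pos))
        pos = piece_end
    return pieces


split_1 = LogicalSplit(start=0, length=200)
split_2 = LogicalSplit(start=200, length=200)

pieces_1 = blocks_touched(split_1, 128)
pieces_2 = blocks_touched(split_2, 128)
print(pieces_1)
print(pieces_2)
```

Split 1 covers all of Block 1 plus the first 72MB of Block 2; split 2 picks up at offset 72MB inside Block 2 and runs through Blocks 3 and 4.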

    Hope this helps.

    Article link: https://www.haomeiwen.com/subject/hmgpyxtx.html