Hadoop Tutorial: Streaming

Author: 逍遥ii | Published 2018-12-01 20:28

Hadoop Streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
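For instance, ordinary Unix tools can serve directly as mapper and reducer. A minimal sketch (directory names myInputDirs/myOutputDir are placeholders; the jar path assumes the Hadoop 1.2.1 layout used throughout this tutorial):

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input myInputDirs \
   -output myOutputDir \
   -mapper /bin/cat \
   -reducer /usr/bin/wc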

Python Example

For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. We have written Python scripts for the mapper and the reducer to run under Hadoop. One can also write the same in Perl and Ruby.

Mapper Phase Code

#!/usr/bin/python

import sys

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()
   # Break the line into words
   words = myline.split()
   # Iterate the words list
   for myword in words:
      # Write the results to standard output
      print '%s\t%s' % (myword, 1)

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
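Because the script simply reads STDIN and writes STDOUT, it can be tested locally before submitting any job. A quick sanity check (the sample input text is made up for illustration):

$ echo "the quick brown fox the" | /home/expert/hadoop-1.2.1/mapper.py
the	1
quick	1
brown	1
fox	1
the	1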

Reducer Phase Code

#!/usr/bin/python

import sys

current_word = ""
current_count = 0
word = ""

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()
   # Split the input we got from mapper.py
   word, count = myline.split('\t', 1)
   # Convert count variable to integer
   try:
      count = int(count)
   except ValueError:
      # Count was not a number, so silently ignore this line
      continue
   if current_word == word:
      current_count += count
   else:
      if current_word:
         # Write result to standard output
         print '%s\t%s' % (current_word, current_count)
      current_count = count
      current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
   print '%s\t%s' % (current_word, current_count)

Save the mapper and reducer code in files mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). As Python is indentation-sensitive, the same code can also be downloaded from the original article linked at the end of this page.
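The whole map/shuffle/reduce flow can also be simulated locally: Hadoop sorts the mapper output by key before it reaches the reducer, and that sorting is exactly what the reducer's current_word logic relies on. A local stand-in for the shuffle step (sample text invented for illustration):

$ echo "foo foo bar" | ./mapper.py | sort | ./reducer.py
bar	1
foo	2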

Execution of the WordCount Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path/mapper.py \
   -reducer <path/reducer.py

where "\" is used for line continuation for clear readability.

For example,

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py
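If the scripts exist only on the machine submitting the job, the -file option (described in the command table below) ships them to the compute nodes. A sketch of the same command in that form:

$ ./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input myinput \
   -output myoutput \
   -mapper mapper.py \
   -reducer reducer.py \
   -file /home/expert/hadoop-1.2.1/mapper.py \
   -file /home/expert/hadoop-1.2.1/reducer.py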

How Streaming Works

In the above example, both the mapper and the reducer are Python scripts that read input from standard input and emit output to standard output. The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.

When a script is specified for mappers, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, the entire line is considered the key and the value is null. However, this can be customized as per one's needs, as the snippet below illustrates.
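A minimal sketch of that default rule (not Hadoop's actual Java implementation, just the same logic restated in this tutorial's Python):

# Illustrative only: how streaming derives (key, value) from one output line
def split_key_value(line):
   # Prefix up to the first tab is the key; the rest is the value
   if '\t' in line:
      key, value = line.split('\t', 1)
      return (key, value)
   # No tab: the whole line is the key and the value is null
   return (line, None)

print split_key_value('hadoop\t1')   # ('hadoop', '1')
print split_key_value('hadoop')      # ('hadoop', None)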

When a script is specified for reducers, each reducer task launches the script as a separate process, and then the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.

Important Commands

Parameters and their descriptions:

-input directory/file-name
   Input location for mapper. (Required)
-output directory-name
   Output location for reducer. (Required)
-mapper executable or script or JavaClassName
   Mapper executable. (Required)
-reducer executable or script or JavaClassName
   Reducer executable. (Required)
-file file-name
   Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName
   Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName
   Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName
   Class that determines which reducer a key is sent to.
-combiner streamingCommand or JavaClassName
   Combiner executable for map output.
-cmdenv name=value
   Passes the environment variable to streaming commands.
-inputreader
   For backwards-compatibility: specifies a record reader class (instead of an input format class).
-verbose
   Verbose output.
-lazyOutput
   Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks
   Specifies the number of reducers.
-mapdebug
   Script to call when a map task fails.
-reducedebug
   Script to call when a reduce task fails.
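As a sketch of how several of these options combine in one submission: the combiner here reuses reducer.py, which works for word count because summing counts is associative, and MY_ENV_VAR is a made-up name purely for illustration.

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input myinput \
   -output myoutput \
   -mapper mapper.py \
   -reducer reducer.py \
   -combiner reducer.py \
   -numReduceTasks 2 \
   -cmdenv MY_ENV_VAR=value \
   -file mapper.py \
   -file reducer.py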

Original article: https://www.tutorialspoint.com/...
