Hadoop Streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Python Example
For Hadoop Streaming, we consider the word-count problem. Any job in Hadoop must have two phases: a mapper and a reducer. We have written code for the mapper and the reducer as Python scripts so that they can run under Hadoop. The same can also be written in Perl or Ruby.
Mapper Phase Code
#!/usr/bin/python

import sys

# Input takes from standard input
for myline in sys.stdin:
    # Remove whitespace either side
    myline = myline.strip()
    # Break the line into words
    words = myline.split()
    # Iterate the words list
    for myword in words:
        # Write the results to standard output
        print '%s\t%s' % (myword, 1)
Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
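The mapper can be sanity-checked without a cluster by simulating standard input. The following is a minimal sketch of the same word-count mapping logic (written in Python 3 syntax for the sketch; `map_words` is a hypothetical helper, not part of the scripts above):

```python
import io

def map_words(stream):
    """Emit one 'word<TAB>1' line per word, mirroring mapper.py."""
    out = []
    for line in stream:
        # Strip surrounding whitespace, split into words, emit (word, 1)
        for word in line.strip().split():
            out.append('%s\t%s' % (word, 1))
    return out

# Simulate stdin with an in-memory text stream
pairs = map_words(io.StringIO("the quick brown fox\nthe lazy dog\n"))
print('\n'.join(pairs))
```

Each input word produces its own tab-separated line, which is exactly the format the reducer below expects on its standard input.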
Reducer Phase Code
#!/usr/bin/python

from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Input takes from standard input
for myline in sys.stdin:
    # Remove whitespace either side
    myline = myline.strip()
    # Split the input we got from mapper.py
    word, count = myline.split('\t', 1)
    # Convert count variable to integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Save the mapper and reducer code in the files mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). Since Python is indentation sensitive, take care that the indentation is preserved exactly as shown above.
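Hadoop sorts the mapper output by key before it reaches the reducer (the shuffle phase); the reducer's running-count logic above relies on that ordering. A minimal sketch of the full map-sort-reduce flow (Python 3 syntax; the variable names are illustrative, not part of the scripts above):

```python
from itertools import groupby

# Lines as they would come out of mapper.py (unsorted)
mapper_output = ['the\t1', 'quick\t1', 'the\t1', 'dog\t1']

# The shuffle phase: sort by key so equal words become adjacent
sorted_pairs = sorted(line.split('\t', 1) for line in mapper_output)

# The reduce phase: sum counts per word, as reducer.py does line by line
counts = {}
for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
    counts[word] = sum(int(c) for _, c in group)

print(counts)
```

`itertools.groupby` only groups adjacent equal keys, which is why the sort step is essential; feeding unsorted pairs to the reducer would produce duplicate output lines for the same word.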
Execution of the WordCount Program
$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input input_dirs \
    -output output_dir \
    -mapper <path>/mapper.py \
    -reducer <path>/reducer.py
Where "\" is used for line continuation for clear readability.
For example:
./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py
How Streaming Works
In the above example, both the mapper and the reducer are Python scripts that read input from standard input and emit output to standard output. The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.
When a script is specified for the mappers, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, the entire line is considered the key and the value is null. However, this can be customized as needed.
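The default tab-based splitting rule described above can be sketched as follows (Python 3 syntax; `split_kv` is a hypothetical helper written for illustration, not part of Hadoop):

```python
def split_kv(line):
    """Mimic streaming's default split: key is the prefix up to the
    first tab, value is the remainder; no tab means the whole line is
    the key and the value is null (None here)."""
    if '\t' in line:
        key, value = line.split('\t', 1)
        return key, value
    return line, None

print(split_kv('hello\tworld\tagain'))  # only the first tab separates
print(split_kv('noseparator'))          # whole line becomes the key
```

Note that only the first tab acts as the separator, so values may themselves contain tab characters.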
When a script is specified for the reducers, each reducer task launches the script as a separate process, and the reducer is then initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.
Important Commands
Parameters | Description
---|---
-input directory/file-name | Input location for mapper. (Required)
-output directory-name | Output location for reducer. (Required)
-mapper executable or script or JavaClassName | Mapper executable. (Required)
-reducer executable or script or JavaClassName | Reducer executable. (Required)
-file file-name | Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName | Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName | Combiner executable for map output.
-cmdenv name=value | Passes the environment variable to streaming commands.
-inputreader | For backwards-compatibility: specifies a record reader class (instead of an input format class).
-verbose | Verbose output.
-lazyOutput | Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks | Specifies the number of reducers.
-mapdebug | Script to call when a map task fails.
-reducedebug | Script to call when a reduce task fails.