ChainMapper/ChainReducer: How They Work, with a Worked Example

Author: yanzhelee | Published: 2017-08-20 21:29



How ChainMapper/ChainReducer work

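ChainMapper and ChainReducer were introduced to support linear chains of Mappers. In other words, the Map or Reduce phase can contain several Mappers that behave like a Linux pipe: the output of one Mapper is fed directly into the next, forming a pipeline of the form [MAP+ REDUCE MAP*], that is, one or more Mappers, then a single Reducer, then zero or more Mappers. A typical ChainMapper/ChainReducer scenario is as follows.
In the Map phase, the data is processed by Mapper1 and then by Mapper2. In the Reduce phase, after shuffle and sort, the data is handed to the Reducer; the Reducer's output can then be passed on to further Mappers, and the final result is written to the HDFS output directory.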
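The pipe analogy can be made concrete without Hadoop. The following is a minimal plain-Java sketch, not Hadoop's implementation: `ChainSketch`, `Stage` and `chain` are hypothetical names, and records are simplified to plain strings. Each stage may emit zero or more records per input, and the composed stage hands every emitted record straight to the next stage in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Plain-Java illustration of chaining mappers; none of these names are Hadoop APIs. */
final class ChainSketch {

    /** A stage receives one record and emits zero or more records to the collector. */
    interface Stage {
        void map(String record, Consumer<String> out);
    }

    /** Composes the stages so that every record emitted by stage i is fed to stage i+1. */
    static Stage chain(List<Stage> stages) {
        // An empty chain simply forwards each record unchanged.
        Stage composed = (record, out) -> out.accept(record);
        // Build from the last stage backwards so each stage wraps the rest of the chain.
        for (int i = stages.size() - 1; i >= 0; i--) {
            Stage current = stages.get(i);
            Stage next = composed;
            composed = (record, out) -> current.map(record, r -> next.map(r, out));
        }
        return composed;
    }

    /** Runs every input record through the chain and collects the final output. */
    static List<String> run(List<Stage> stages, List<String> input) {
        List<String> result = new ArrayList<>();
        Stage pipeline = chain(stages);
        for (String record : input) {
            pipeline.map(record, result::add);
        }
        return result;
    }

    public static void main(String[] args) {
        Stage split = (line, out) -> { for (String w : line.split(" ")) out.accept(w); };
        Stage dropOf = (w, out) -> { if (!w.equals("of")) out.accept(w); };
        Stage tag = (w, out) -> out.accept(w + ":1");
        // prints [a:1, cup:1, tea:1]
        System.out.println(run(Arrays.asList(split, dropOf, tag), Arrays.asList("a cup of tea")));
    }
}
```

Note that no stage knows about its neighbours; only `chain` does, which mirrors the idea that chained Mappers can be written independently of one another.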

Note: in any MapReduce job, the Map and Reduce phases may contain any number of Mappers, but there can be only one Reducer.

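Chaining Mappers and a Reducer in this way can substantially reduce the amount of data sent over the network, because most of the computation happens locally inside a single task. Achieving the same result by running several separate MapReduce jobs one after another would move large amounts of intermediate data across the network, which is very time-consuming.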

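ChainMapper: official description

The ChainMapper class allows multiple Mapper subclasses to be used within a single Map task.

These Mappers are invoked much like a Linux pipe: the output of the first Mapper becomes the input of the second, the output of the second becomes the input of the third, and so on, until the output of the last Mapper becomes the output of the whole Map task.

The key feature is that the Mappers in the chain do not need to know that they are being executed in a chain. This makes it possible to write small, reusable, specialized Mappers and combine them to perform composite operations within a single task.

Note: when building the chain, the key/value pairs output by each Mapper are the input of the next Mapper or Reducer in the chain. It is assumed that all Mappers and the Reducer in the chain use matching output and input key and value classes, because the chaining code performs no conversion.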
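One way to reason about the type-matching rule is to write the declared types of each chain element side by side and compare neighbours. The sketch below is a hypothetical helper in plain Java, not part of Hadoop: `ChainTypeCheck`, `StageTypes` and `firstMismatch` are illustrative names, and `Long`/`String` stand in for `LongWritable`/`Text` so the example runs without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the type-matching rule; not a Hadoop class. */
final class ChainTypeCheck {

    /** Declared input and output key/value classes of one chain element. */
    static final class StageTypes {
        final String name;
        final Class<?> inKey;
        final Class<?> inValue;
        final Class<?> outKey;
        final Class<?> outValue;

        StageTypes(String name, Class<?> inKey, Class<?> inValue,
                   Class<?> outKey, Class<?> outValue) {
            this.name = name;
            this.inKey = inKey;
            this.inValue = inValue;
            this.outKey = outKey;
            this.outValue = outValue;
        }
    }

    /**
     * Returns the index of the first element whose input types cannot accept the
     * previous element's output types, or -1 if the whole chain is consistent.
     */
    static int firstMismatch(List<StageTypes> chain) {
        for (int i = 1; i < chain.size(); i++) {
            StageTypes prev = chain.get(i - 1);
            StageTypes cur = chain.get(i);
            boolean keyOk = cur.inKey.isAssignableFrom(prev.outKey);
            boolean valueOk = cur.inValue.isAssignableFrom(prev.outValue);
            if (!keyOk || !valueOk) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Long and String stand in for LongWritable and Text.
        List<StageTypes> ok = Arrays.asList(
            new StageTypes("AMap", Long.class, String.class, String.class, String.class),
            new StageTypes("BMap", String.class, String.class, Long.class, String.class));
        List<StageTypes> broken = Arrays.asList(
            new StageTypes("AMap", Long.class, String.class, String.class, String.class),
            new StageTypes("BMap", Long.class, String.class, Long.class, String.class));
        System.out.println(firstMismatch(ok));     // prints -1
        System.out.println(firstMismatch(broken)); // prints 1
    }
}
```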

Usage

...
Job job = Job.getInstance(conf);

Configuration mapAConf = new Configuration(false);
...
// AMap: (LongWritable, Text) -> (Text, Text)
ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
    Text.class, Text.class, mapAConf);

Configuration mapBConf = new Configuration(false);
...
// BMap: (Text, Text) -> (LongWritable, Text); its input types match AMap's output types
ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, mapBConf);

...

job.waitForCompletion(true);
...

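Parameters of addMapper

static void addMapper(Job job, Class<? extends Mapper> klass,
    Class<?> inputKeyClass, Class<?> inputValueClass,
    Class<?> outputKeyClass, Class<?> outputValueClass,
    Configuration mapperConf) throws IOException
## The parameters are:
# 1. job: the job
# 2. klass: the Mapper class to add
# 3. inputKeyClass: input key type of this Mapper
# 4. inputValueClass: input value type of this Mapper
# 5. outputKeyClass: output key type of this Mapper
# 6. outputValueClass: output value type of this Mapper
# 7. mapperConf: the Configuration for this Mapper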

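ChainReducer: official description

The ChainReducer class allows multiple Mappers to run after the Reducer within the same Reduce task. Each record output by the Reducer is passed as input to the first Mapper added through ChainReducer; the output of that Mapper becomes the input of the second, and so on. The output of the last Mapper is the output of the whole Reduce task and is written to disk.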

Usage

...
Job job = Job.getInstance(conf);
...

Configuration reduceConf = new Configuration(false);
...
// XReduce: (LongWritable, Text) -> (Text, Text)
ChainReducer.setReducer(job, XReduce.class, LongWritable.class, Text.class,
    Text.class, Text.class, reduceConf);

// CMap: (Text, Text) -> (LongWritable, Text)
ChainReducer.addMapper(job, CMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, null);

// DMap: (LongWritable, Text) -> (LongWritable, LongWritable)
ChainReducer.addMapper(job, DMap.class, LongWritable.class, Text.class,
    LongWritable.class, LongWritable.class, null);

...

job.waitForCompletion(true);
...

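Parameters of setReducer

static void setReducer(Job job, Class<? extends Reducer> klass,
    Class<?> inputKeyClass, Class<?> inputValueClass,
    Class<?> outputKeyClass, Class<?> outputValueClass,
    Configuration reducerConf)
## The parameters are:
# 1. job: the job
# 2. klass: the Reducer class to set
# 3. inputKeyClass: input key type of this Reducer
# 4. inputValueClass: input value type of this Reducer
# 5. outputKeyClass: output key type of this Reducer
# 6. outputValueClass: output value type of this Reducer
# 7. reducerConf: the Configuration for this Reducer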

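Example

Description

Find the high-frequency words in an article (only words that occur more than 5 times are collected), dropping prepositions and filtering out sensitive words.

Implementation

The Map task contains three chained Mappers, M1, M2 and M3. The Reduce task contains one Reducer, R1, and one Mapper, RM1.

Map task

M1 splits each line of text into words, M2 removes prepositions from M1's output, and M3 removes sensitive words from M2's output.

Reduce task

R1 counts the occurrences of each word delivered by the shuffle phase and hands the result to RM1, which outputs only the words that occur more than 5 times.

This design exists only to demonstrate how ChainMapper and ChainReducer are used, so please do not judge it too harshly.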
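Before running the job on a cluster, it can help to check the expected result of the M1 -> M2 -> M3 -> R1 -> RM1 flow on a tiny input. The following is a plain-Java walk-through under stated assumptions, not the Hadoop job itself: `CaseSketch` and `frequentWords` are hypothetical names, the filtered words `of` and `xxx` follow the demo Mappers, and a `TreeMap` stands in for the sorted, grouped output of the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java walk-through of the example's data flow; no Hadoop involved. */
final class CaseSketch {

    /** Returns the words whose count is strictly greater than the threshold. */
    static Map<String, Integer> frequentWords(List<String> lines, int threshold) {
        // Map phase: the three chained steps run on each record in turn.
        List<String> mapOutput = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {      // M1: split the line into words
                if (word.equals("of")) continue;        // M2: drop the word 'of'
                if (word.equals("xxx")) continue;       // M3: drop the sensitive word 'xxx'
                mapOutput.add(word);
            }
        }
        // Shuffle + R1: group by word and sum the counts.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : mapOutput) {
            counts.merge(word, 1, Integer::sum);
        }
        // RM1: keep only the words that exceed the threshold.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("tea of tea xxx tea", "tea tea tea of cup");
        System.out.println(frequentWords(lines, 5)); // prints {tea=6}
    }
}
```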

Code

Mapper1

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class Mapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper1 setup===========");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
        System.out.println("map1===========" + value.toString());
        // Split the line on single spaces
        String line = value.toString();
        String[] strArr = line.split(" ");

        // Emit (word, 1) for every token
        for (String w : strArr) {
            context.write(new Text(w), new IntWritable(1));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper1 cleanup===========");
    }
}

Mapper2

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Filters out prepositions. Word filtering is not the point of this article,
 * so for demonstration purposes only the single word 'of' is filtered.
 */
public class Mapper2 extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper2 setup===========");
    }

    @Override
    protected void map(Text key, IntWritable value, Context context)
    throws IOException, InterruptedException {
        System.out.println("map2==================" + key.toString() + ":" + value.toString());
        // Drop the word 'of'; pass everything else through unchanged
        if (!key.toString().equals("of")) {
            context.write(key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper2 cleanup===========");
    }
}

Mapper3

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Filters out sensitive words. Word filtering is not the point of this article,
 * so for demonstration purposes only the single word 'xxx' is filtered.
 */
public class Mapper3 extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper3 setup===========");
    }

    @Override
    protected void map(Text key, IntWritable value, Context context)
    throws IOException, InterruptedException {
        System.out.println("map3==================" + key.toString() + ":" + value.toString());
        // Drop the word 'xxx'; pass everything else through unchanged
        if (!key.toString().equals("xxx")) {
            context.write(key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper3 cleanup===========");
    }
}

Reducer1

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Created by yanzhe on 2017/8/18.
 */
public class Reducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        System.out.println("Reducer1 setup===========");
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
        int count = 0 ;
        for (IntWritable iw: values) {
            count += iw.get();
        }
        context.write(key, new IntWritable(count));
        System.out.println("reduce=========" + key.toString() + ":" + count);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("Reducer1 cleanup===========");
    }
}

ReducerMapper1

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Created by yanzhe on 2017/8/18.
 */
public class ReducerMapper1 extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        System.out.println("ReducerMapper1 setup===========");
    }

    @Override
    protected void map(Text key, IntWritable value, Context context)
    throws IOException, InterruptedException {
        // Forward only the words that occur more than 5 times
        if (value.get() > 5) {
            context.write(key, value);
        }

        System.out.println("reduceMap======" + key.toString() + ":" + value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("ReducerMapper1 cleanup===========");
    }
}

App

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Created by yanzhe on 2017/8/18.
 */
public class App {
    public static void main(String[] args) throws Exception {

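        // Hard-coded local input and output paths for the demo; remove this line
        // to take the paths from the command line instead.
        args = new String[]{"d:/java/mr/data/data.txt", "d:/java/mr/out"};

        Configuration conf = new Configuration();

        FileSystem fs = FileSystem.get(conf);

        // Delete the output directory if it already exists, because the job
        // refuses to start when the output path is present.
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }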

        Job job = Job.getInstance(conf) ;

        ChainMapper.addMapper(job,Mapper1.class, LongWritable.class, Text.class, Text.class, IntWritable.class,job.getConfiguration());

        ChainMapper.addMapper(job,Mapper2.class, Text.class,IntWritable.class, Text.class, IntWritable.class,job.getConfiguration());

        ChainMapper.addMapper(job,Mapper3.class, Text.class,IntWritable.class, Text.class, IntWritable.class,job.getConfiguration());

        ChainReducer.setReducer(job, Reducer1.class, Text.class, IntWritable.class, Text.class, IntWritable.class,job.getConfiguration());

        ChainReducer.addMapper(job, ReducerMapper1.class, Text.class,
                IntWritable.class, Text.class, IntWritable.class, job.getConfiguration());

        FileInputFormat.addInputPath(job,new Path(args[0]));

        FileOutputFormat.setOutputPath(job,outPath);

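        job.setNumReduceTasks(2);
        // Combiner1 and MyPartitioner are not defined in this article; provide your
        // own implementations or remove these two lines before compiling.
        job.setCombinerClass(Combiner1.class);
        job.setPartitionerClass(MyPartitioner.class);

        // Exit with a non-zero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);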

    }
}
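`App` sets two reduce tasks and registers `Combiner1` and `MyPartitioner`, neither of which appears in this article, so their real logic is unknown. Purely as an illustration of the kind of computation a partitioner performs, the plain-Java sketch below applies the hash-modulo rule used by Hadoop's default `HashPartitioner`. `PartitionSketch` and `partitionFor` are hypothetical names, and `String` stands in for `Text`, whose hash codes differ from `String`'s.

```java
/** Plain-Java illustration of hash partitioning; not the article's MyPartitioner. */
final class PartitionSketch {

    /** Maps a key to a reduce-task index in the range [0, numReduceTasks). */
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
        // hash code can never produce a negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With two reduce tasks, every word lands in partition 0 or 1,
        // and the same word always lands in the same partition.
        for (String word : new String[]{"tea", "cup", "milk"}) {
            System.out.println(word + " -> " + partitionFor(word, 2));
        }
    }
}
```

Because all occurrences of a word reach the same reduce task, each `Reducer1` instance sees the complete count for the words it owns, which is what the threshold in `ReducerMapper1` relies on.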
