The WordCount Program

By cccccttttyyy | Published 2018-07-27 13:20

This chapter only requires Hadoop running in local mode. Using the new API, we write a first MapReduce program, WordCount, and briefly explain how it works.

1. Create a Maven project and add the Hadoop dependency

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.8.4</version>
    </dependency>
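
This dependency goes inside the <dependencies> section of pom.xml. A minimal sketch of the full file (the com.example/wordcount coordinates are placeholders for your own project):

    <project xmlns="http://maven.apache.org/POM/4.0.0">
        <modelVersion>4.0.0</modelVersion>
        <!-- placeholder coordinates; use your own -->
        <groupId>com.example</groupId>
        <artifactId>wordcount</artifactId>
        <version>1.0-SNAPSHOT</version>
        <dependencies>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.8.4</version>
            </dependency>
        </dependencies>
    </project>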

2. Write the WordCount code

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Main
{
    public static class MyMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // value is one line of the input file; split it into words
            // and emit (word, 1) for each one
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
            }
        }
    }
    public static class MyReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // values holds all the 1s emitted for this word; sum them up
            int sum = 0;
            for (IntWritable val:values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(Main.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // args[0] is the input path, args[1] the output path (must not exist yet)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // run the job and block until it finishes; exit 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
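
To make the data flow concrete, suppose the input file held these two lines (a made-up example):

    hello world
    hello hadoop

The map phase emits (hello,1), (world,1), (hello,1), (hadoop,1); the shuffle groups and sorts these by key into (hadoop,[1]), (hello,[1,1]), (world,[1]); and the reduce phase writes one tab-separated line per key:

    hadoop  1
    hello   2
    world   1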
If you need to bring in third-party libraries, move the Job setup from the code above into a run method and launch it through the ToolRunner class; alternatively, to avoid passing extra arguments at execution time, package the job as a *-with-dependencies.jar.
    // additional imports on top of those in the previous listing
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Main now extends Configured and implements Tool;
    // MyMapper and MyReducer stay exactly as before
    public class Main extends Configured implements Tool {
        @Override
        public int run(String[] allArgs) throws Exception {
            // build the job from getConf() so options parsed by ToolRunner take effect
            Job job = Job.getInstance(getConf());
            job.setJarByClass(Main.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            // strip generic options such as -libjars and -D, keeping only positional arguments
            String[] args = new GenericOptionsParser(getConf(), allArgs).getRemainingArgs();
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Main(), args));
        }
    }
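
To build the *-with-dependencies.jar, one common approach (a sketch using the standard maven-assembly-plugin; adapt it to your own build) is to add the following to pom.xml:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>

Running mvn package assembly:single then produces a *-jar-with-dependencies.jar under target/.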

3. Code analysis
The two classes extending Mapper and Reducer follow a fixed pattern: their generic type parameters are determined by the input and output key/value types, and you override the map and reduce methods in them. map processes input records in key/value form (with TextInputFormat, the key is a line's byte offset and the value is the line itself), and its output is also key/value pairs. reduce receives each key produced by the map phase together with the collection of all values for that key, and after processing emits zero or more key/value pairs as output.
The new API uses the Job class to configure a job, submit it, control its execution, and monitor its progress. When third-party libraries are introduced, the ToolRunner.run() method is responsible for parsing the -libjars argument; it delegates this task to GenericOptionsParser.
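
With the ToolRunner version, a third-party jar can be passed at launch time, for example (mylib.jar is a placeholder for the actual dependency):

    ./hadoop jar MapReduceTest.jar Main -libjars mylib.jar /home/haduser/file/input/me /home/haduser/file/output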

4. Package the jar and run it in the local Hadoop environment
./hadoop jar MapReduceTest.jar Main /home/haduser/file/input/me /home/haduser/file/output
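
Note that the output directory must not already exist, or the job will fail. In local mode the results land on the local filesystem and can be inspected directly (part-r-00000 is the default name of the first reducer's output file):

    cat /home/haduser/file/output/part-r-00000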

5. Results

[Screenshots of the job run and its word-count output]
