Pig from Beginner to Expert 7: Implementing WordCount with Pig

Author: 金字塔下的小蜗牛 | Published: 2020-04-04 23:42

    1. Environment setup
    (1) Start the Hadoop cluster

    [root@bigdata ~]# start-all.sh
    [root@bigdata ~]# jps
    2096 NameNode
    2422 SecondaryNameNode
    2232 DataNode
    2586 ResourceManager
    2813 NodeManager
    3037 Jps
    (2) Start the HistoryServer

    [root@bigdata ~]# mr-jobhistory-daemon.sh start historyserver
    starting historyserver, logging to /root/trainings/hadoop-2.7.3/logs/mapred-root-historyserver-bigdata.out
    [root@bigdata ~]# jps
    2096 NameNode
    3123 Jps
    2422 SecondaryNameNode
    2232 DataNode
    2586 ResourceManager
    3084 JobHistoryServer
    2813 NodeManager
    (3) Start Pig in cluster (MapReduce) mode

    [root@bigdata ~]# pig
    18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
    18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
    18/09/26 00:02:32 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
    2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
    2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig_1537891352803.log
    2018-09-26 00:02:32,830 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
    2018-09-26 00:02:33,289 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2018-09-26 00:02:33,289 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://bigdata:9000
    2018-09-26 00:02:33,812 [main] INFO org.apache.pig.PigServer - Pig Script ID for the session: PIG-default-4db6a82f-0910-4950-a889-e1d7ee031cce
    2018-09-26 00:02:33,812 [main] WARN org.apache.pig.PigServer - ATS is disabled since yarn.timeline-service.enabled set to false
    grunt>
    (4) Upload the test data to HDFS

    grunt> copyFromLocal /root/input/data.txt /input
    grunt> cat /input/data.txt
    I love Beijing
    I love China
    Beijing is the capital of China
    2. The WordCount program
    (1) Load the data

    grunt> lines = load '/input/data.txt' as (line:chararray);
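The load statement above relies on Pig's default loader, PigStorage, which splits each line on tab characters. With a one-field schema, a line that contains a tab would typically keep only the text before the first tab. A minimal sketch of a more defensive variant, assuming the input might contain tabs, uses the built-in TextLoader instead:

```pig
-- TextLoader reads each input line as a single chararray field,
-- so tab characters inside a line do not split it into several fields.
lines = LOAD '/input/data.txt' USING TextLoader() AS (line:chararray);
```

For the three-line sample file used here, which has no tabs, both forms should load the same data.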
    (2) Tokenize the lines

    grunt> words = foreach lines generate flatten(TOKENIZE(line)) as word;
    2018-09-26 00:55:07,771 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
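To see what FLATTEN contributes in the tokenize step, the sketch below builds the relation both with and without it and prints the two schemas. The relation name `bags` and the field name `tokens` are illustrative and not part of the original session.

```pig
lines = LOAD '/input/data.txt' AS (line:chararray);

-- Without FLATTEN: one row per input line, holding a bag of tokens.
bags  = FOREACH lines GENERATE TOKENIZE(line) AS tokens;

-- With FLATTEN: the bag is unnested, giving one row per token.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- DESCRIBE only prints the schema; it does not launch a MapReduce job.
DESCRIBE bags;
DESCRIBE words;
```

Grouping by word only works on the flattened form, because GROUP needs one row per word to collect equal words together.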
    (3) Group by word

    grunt> grpd = group words by word;
    (4) Count each word

    grunt> cntd = foreach grpd generate group, COUNT(words);
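The count statement leaves its output fields with the default names. As a hedged variant, the sketch below names the two fields and sorts by frequency. The field name `cnt` and the relation name `sorted` are assumptions added for illustration.

```pig
lines = LOAD '/input/data.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd  = GROUP words BY word;

-- grpd has two fields: `group` (the word itself) and `words`
-- (a bag holding every row that carried that word).
cntd  = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;

-- Most frequent words first; ties are broken alphabetically.
sorted = ORDER cntd BY cnt DESC, word ASC;
DUMP sorted;
```

Naming the fields makes later statements such as ORDER or FILTER easier to read than positional references like `$1`.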
    (5) Print the result

    grunt> dump cntd;
    ...(MapReduce job log omitted)...
    (I,2)
    (is,1)
    (of,1)
    (the,1)
    (love,2)
    (China,2)
    (Beijing,2)
    (capital,1)
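    As shown above, WordCount in Pig Latin needs only four statements (load, tokenize, group, count) plus a dump to print the result. That is far less code than a hand-written MapReduce program, and it speeds up development considerably.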
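The same statements can also be saved as a script and run in batch mode instead of being typed into the grunt shell. The following is a sketch under stated assumptions: the file name `wordcount.pig`, the parameter names `in_path` and `out_path`, and the output directory `/output/wordcount` are all hypothetical.

```pig
-- wordcount.pig (hypothetical file name)
-- Default paths; either can be overridden on the command line with -param.
%default in_path  '/input/data.txt'
%default out_path '/output/wordcount'

lines = LOAD '$in_path' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd  = GROUP words BY word;
cntd  = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;

-- STORE writes the result to HDFS instead of printing it.
-- The job fails if the output directory already exists.
STORE cntd INTO '$out_path';
```

Such a script could be launched with `pig wordcount.pig`, or with `pig -param out_path=/output/wordcount2 wordcount.pig` to write to a different directory.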

    Article link: https://www.haomeiwen.com/subject/tepkdhtx.html