Installing and Using Pig

Author: 鹅鹅鹅_ | Published: 2019-01-01 11:52

I. Introduction to Pig

1. Pig and MapReduce

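When the business logic is complex, working directly with MapReduce becomes difficult: for example, you may need many preprocessing or transformation steps to fit the data into the MapReduce processing model. On top of that, writing MapReduce programs, then deploying and running the jobs, is time-consuming.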

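Pig fills this gap. It lets you concentrate on the data and the business logic rather than on format conversions and hand-written MapReduce programs. In essence, when you process data with Pig, Pig generates a series of MapReduce jobs behind the scenes to carry out the work, but this process is transparent to the user.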

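Compared with the Java MapReduce API, Pig offers a higher level of abstraction for processing large data sets. It provides richer data structures, typically multi-valued and nested, and a more powerful set of data transformation operations, including joins, which MapReduce does not offer as a built-in operation.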
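
To give a feel for what a join saves you from writing by hand, here is a minimal Python sketch of an inner join over two small relations. In Pig Latin this would be a single statement along the lines of C = JOIN A BY $0, B BY $0;. The function name inner_join and the sample data are hypothetical, and the sketch only illustrates the semantics, not how Pig actually executes a join on a cluster.

```python
from collections import defaultdict


def inner_join(left, right, left_key=0, right_key=0):
    """Join two lists of tuples on the given field positions."""
    # Index the right-hand relation by its join key.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Emit one concatenated tuple per matching pair of rows.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]

joined = inner_join(users, orders)
for row in joined:
    print(row)
```

Only user 1 appears in both relations, so only that user's rows survive the join; each output tuple carries the fields of both inputs side by side.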

2. Components of Pig

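Pig consists of two parts:
- A language for describing data flows, called Pig Latin.
- An execution environment for running Pig Latin programs. Two environments are currently available: local execution in a single JVM, and distributed execution on a Hadoop cluster.

Inside Pig, each operation or transformation processes its input and produces an output. These transformations are converted into a series of MapReduce jobs, and Pig spares programmers from needing to know how that conversion happens, so engineers can focus on the data rather than on execution details.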

II. Installing Pig

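Pig runs as a client-side application. Even if you plan to use Pig against a Hadoop cluster, you do not need to install anything on the cluster itself: Pig submits jobs from the local machine and interacts with Hadoop from there.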

1. Download and extract

[hadoop@master ~]$ wget http://mirror.bit.edu.cn/apache/pig/latest/pig-0.16.0.tar.gz
[hadoop@master ~]$ ls pig-0.16.0.tar.gz 
pig-0.16.0.tar.gz
[hadoop@master ~]$ tar xvf pig-0.16.0.tar.gz 
[hadoop@master ~]$ cd pig-0.16.0
[hadoop@master pig-0.16.0]$ ls
bin          conf     ivy      lib      LICENSE.txt             pig-0.16.0-core-h2.jar  scripts  test
build.xml    contrib  ivy.xml  lib-src  NOTICE.txt              README.txt              shims    tutorial
CHANGES.txt  docs     legacy   license  pig-0.16.0-core-h1.jar  RELEASE_NOTES.txt       src
[hadoop@master pig-0.16.0]$

2. Set environment variables

[hadoop@master pig-0.16.0]$ vim ~/.bash_profile
export PIG_INSTALL=/home/hadoop/pig-0.16.0
export PATH=$PATH:$PIG_INSTALL/bin
[hadoop@master pig-0.16.0]$ source ~/.bash_profile

3. Verify

Run the following command to check that Pig is available:
```
[hadoop@master pig-0.16.0]$ pig -help
```
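
If the command is not found, the usual cause is that the PATH change has not taken effect. As a rough cross-check, the following Python sketch looks up the pig launcher on PATH using the standard library's shutil.which. The helper name find_pig is hypothetical, and the sketch only checks that the launcher can be located, not that Pig itself runs correctly.

```python
import shutil


def find_pig(path=None):
    """Return the full path of the pig launcher, or None if it is not found.

    path is an optional search path string; None means use the PATH
    environment variable of the current process.
    """
    return shutil.which("pig", path=path)


location = find_pig()
if location is None:
    print("pig not found; check PIG_INSTALL and PATH in ~/.bash_profile")
else:
    print("pig found at", location)
```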

III. Running Pig

1. Two execution modes

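Local mode
Grunt is Pig's interactive shell. In local mode, Pig runs in a single JVM and accesses the local file system; this mode is suited to testing or to processing small data sets.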

[hadoop@master ~]$ pig -x local
17/04/21 18:31:07 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/04/21 18:31:07 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
grunt>

MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.

[hadoop@master ~]$ pig -x mapreduce
17/04/21 18:32:55 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/04/21 18:32:55 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
17/04/21 18:32:55 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
grunt>
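
Both modes can also run a saved script instead of opening the Grunt prompt, by passing the script file after the -x option. The Python sketch below builds such a command line and, optionally, runs it with subprocess. The helper names pig_command and run_script and the file name max_temp.pig are hypothetical, and the sketch assumes the pig launcher is on PATH as set up in the installation section.

```python
import subprocess

# The two execution modes described in this section.
VALID_MODES = ("local", "mapreduce")


def pig_command(mode, script_path):
    """Build the argument list for running a Pig script in the given mode."""
    if mode not in VALID_MODES:
        raise ValueError("unknown mode: " + mode)
    return ["pig", "-x", mode, script_path]


def run_script(mode, script_path):
    """Run the script and raise CalledProcessError if pig exits with an error."""
    return subprocess.run(pig_command(mode, script_path), check=True)


# Only print the command here; call run_script() on a machine where pig is installed.
print(pig_command("local", "max_temp.pig"))
```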

2. Running an example

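Let's run a simple example in MapReduce mode: extract the first column of the Linux /etc/passwd file, which prints every user name. First put /etc/passwd into HDFS (the relative path pigtest resolves under the current user's HDFS home directory):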

[hadoop@master ~]$ hdfs dfs -mkdir pigtest
[hadoop@master ~]$ hdfs dfs -put /etc/passwd pigtest/

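Then enter the Pig shell and run the following commands: load the file into relation A using ':' as the field delimiter, project the first column of A into B, and dump B.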

grunt> A = load 'pigtest/passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
2017-04-21 18:44:33,261 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-04-21 18:44:33,267 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-04-21 18:44:33,267 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(root)
(bin)
(daemon)
(adm)
(lp)
(sync)
(shutdown)
(halt)
(mail)
(uucp)
(operator)
(games)
(gopher)
(ftp)
(nobody)
(dbus)
(hacluster)
(rpc)
(oprofile)
(named)
(bacula)
(nscd)
# Store the result in HDFS
grunt> store B into 'hdfs_dir';
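
To make the two Pig statements concrete, here is a small Python sketch of the same logic: split each line on ':' and keep only the first field. The function names load_delimited and project_first and the two sample lines are hypothetical; this models what the statements compute, not how Pig runs them.

```python
def load_delimited(lines, delimiter=":"):
    """Model of load ... using PigStorage(':'): one tuple of fields per line."""
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]


def project_first(relation):
    """Model of foreach A generate $0: keep the first field, still as a tuple."""
    return [(row[0],) for row in relation]


# Two hypothetical lines in /etc/passwd format.
sample = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "bin:x:1:1:bin:/bin:/sbin/nologin\n",
]

A = load_delimited(sample)
B = project_first(A)
for row in B:
    print("(" + ",".join(row) + ")")
```

Each element of B is a one-field tuple, which is why the dump output above shows every user name wrapped in parentheses.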

3. Finding the maximum temperature per year

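We use the maximum-temperature problem as an example to show how Pig can compute the highest temperature for each year. Assume the data file contains the following (one record per line, tab-separated; these rows match the dump records output in the session further down):

1990	21
1990	18
1991	21
1992	30
1992	999
1990	23

Start Pig in local mode and enter the following commands one by one (note that each statement ends with a semicolon):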

records = load '/home/hadoop/pig/temperature.txt' as (year: chararray, temperature: int);
dump records;
describe records;
-- drop rows whose temperature is the placeholder value 999
valid_records = filter records by temperature != 999;
grouped_records = group valid_records by year;
dump grouped_records;
describe grouped_records;
max_temperature = foreach grouped_records generate group, MAX(valid_records.temperature);
dump max_temperature;

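-- Note: in the max_temperature statement, valid_records is a field name (the bag of rows inside each group), not the relation itself. The output of describe grouped_records shows the exact structure of grouped_records.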

[hadoop@master ~]$ pig -x local
grunt> records = load '/home/hadoop/pig/temperature.txt' as (year: chararray,temperature: int);
grunt> dump records;
(1990,21)
(1990,18)
(1991,21)
(1992,30)
(1992,999)
(1990,23)
grunt> describe records;
records: {year: chararray,temperature: int}
grunt> valid_records = filter records by temperature!=999;
grunt> grouped_records = group valid_records by year;
grunt> dump grouped_records;
grunt> describe grouped_records;
grouped_records: {group: chararray,valid_records: {(year: chararray,temperature: int)}}
grunt> max_temperature = foreach grouped_records generate group,MAX(valid_records.temperature);
grunt> dump max_temperature;
(1990,23)
(1991,21)
(1992,30)
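
For readers who want to trace the data flow step by step, the following Python sketch mirrors the filter, group and MAX stages on the same six records. The function name max_temperature_per_year and its missing parameter are hypothetical; the code is only a model of the logic and says nothing about how Pig schedules the work.

```python
from collections import defaultdict

# The same records that dump records prints in the session above.
records = [
    ("1990", 21), ("1990", 18), ("1991", 21),
    ("1992", 30), ("1992", 999), ("1990", 23),
]


def max_temperature_per_year(records, missing=999):
    """Return sorted (year, max temperature) pairs, ignoring placeholder rows."""
    # filter records by temperature != 999
    valid = [(year, temp) for year, temp in records if temp != missing]
    # group valid_records by year: each year maps to a bag of temperatures
    grouped = defaultdict(list)
    for year, temp in valid:
        grouped[year].append(temp)
    # foreach group generate group, MAX(...)
    return sorted((year, max(temps)) for year, temps in grouped.items())


for row in max_temperature_per_year(records):
    print(row)
```

The intermediate grouped dictionary plays the role of grouped_records: one entry per year, holding all valid readings for that year.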

IV. When to use Pig

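Pig is not suitable for every data processing task. Like MapReduce, it is designed for batch processing. If a query touches only a small part of a large data set, Pig will not perform well, because it has to scan the entire data set or a large portion of it.

With each new release, the performance gap between Pig and native MapReduce programs has narrowed, because the Pig development team implements Pig's relational operators with sophisticated algorithms. Unless you are willing to spend a great deal of time optimizing Java MapReduce code, writing queries in Pig Latin will save you time.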