Pig from Beginner to Master 6: Pig Latin Statements

Author: 金字塔下的小蜗牛 | Published 2020-04-04 23:42

    Pig Latin statements are similar to SQL statements, but they have their own syntax. This section introduces how to use them.

    1. Pig Latin Statements
    The commonly used Pig Latin statements are:

    load: load data and produce a relation (table).
    order ... by ... : sort.
    group ... by ... : group.
    filter ... by ... : filter, keeping the records that satisfy a condition, similar to a WHERE clause.
    generate: project columns, like the column list of a SELECT.
    foreach: apply an operation to every record in a relation.
    join: join operation, for multi-table queries.
    union: set operation that computes a union.
    dump: print the data to the screen.
    store: save the data to HDFS.
    Note:

    (1) Pig Latin statements are very similar to Spark's operators (APIs).

    (2) Running Pig Latin statements requires YARN's HistoryServer to be started.
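    The statements above chain together into data-flow pipelines. A minimal sketch, assuming a simplified four-column employee file (the file name and schema here are purely illustrative; the full emp.csv used below has eight columns):

    ```
    -- load a comma-delimited file with an explicit schema
    emp = load '/input/emp_small.csv' using PigStorage(',')
          as (empno:int, ename:chararray, sal:int, deptno:int);
    rich   = filter emp by sal > 2000;              -- keep high earners
    byDept = group rich by deptno;                  -- group them by department
    counts = foreach byDept generate group, COUNT(rich);
    sorted = order counts by group;
    dump sorted;                                    -- print the result to the screen
    ```

    Each statement only defines an alias (a named relation); nothing is computed until dump or store is reached.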

    2. Hands-on Examples
    2.1 Environment Preparation
    (1) Start the Hadoop cluster

    [root@bigdata ~]# start-all.sh
    [root@bigdata ~]# jps
    2096 NameNode
    2422 SecondaryNameNode
    2232 DataNode
    2586 ResourceManager
    2813 NodeManager
    3037 Jps
    (2) Start the HistoryServer

    [root@bigdata ~]# mr-jobhistory-daemon.sh start historyserver
    starting historyserver, logging to /root/trainings/hadoop-2.7.3/logs/mapred-root-historyserver-bigdata.out
    [root@bigdata ~]# jps
    2096 NameNode
    3123 Jps
    2422 SecondaryNameNode
    2232 DataNode
    2586 ResourceManager
    3084 JobHistoryServer
    2813 NodeManager
    (3) Start Pig in cluster (MapReduce) mode

    [root@bigdata ~]# pig
    18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
    18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
    18/09/26 00:02:32 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
    2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
    2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig_1537891352803.log
    2018-09-26 00:02:32,830 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
    2018-09-26 00:02:33,289 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2018-09-26 00:02:33,289 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://bigdata:9000
    2018-09-26 00:02:33,812 [main] INFO org.apache.pig.PigServer - Pig Script ID for the session: PIG-default-4db6a82f-0910-4950-a889-e1d7ee031cce
    2018-09-26 00:02:33,812 [main] WARN org.apache.pig.PigServer - ATS is disabled since yarn.timeline-service.enabled set to false
    grunt>
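    If no Hadoop cluster is available, Pig can also be started in local mode, which reads and writes the local filesystem instead of HDFS (useful for quick testing):

    ```
    pig -x local
    ```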
    (4) Upload the test data to HDFS

    grunt> copyFromLocal /root/input/emp.csv /input
    grunt> copyFromLocal /root/input/dept.csv /input

    grunt> ls /input
    hdfs://bigdata:9000/input/dept.csv<r 1> 84
    hdfs://bigdata:9000/input/emp.csv<r 1> 617

    grunt> cat /input/dept.csv
    10,ACCOUNTING,NEW YORK
    20,RESEARCH,DALLAS
    30,SALES,CHICAGO
    40,OPERATIONS,BOSTON

    grunt> cat /input/emp.csv
    7369,SMITH,CLERK,7902,1980/12/17,800,,20
    7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
    7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30
    7566,JONES,MANAGER,7839,1981/4/2,2975,,20
    7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
    7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30
    7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
    7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
    7839,KING,PRESIDENT,,1981/11/17,5000,,10
    7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30
    7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
    7900,JAMES,CLERK,7698,1981/12/3,950,,30
    7902,FORD,ANALYST,7566,1981/12/3,3000,,20
    7934,MILLER,CLERK,7782,1982/1/23,1300,,10
    2.2 Examples
    Example 1: Create the employee table, specifying the schema, data types, and delimiter (the default delimiter is tab; a CSV file uses commas as the delimiter).

    grunt> emp = load '/input/emp.csv' using PigStorage(',') as
    (empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
    View the table structure (displayed in a JSON-like format, which can be nested; when no schema is given, columns default to type bytearray):

    grunt> describe emp;
    emp: {empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int}
    View the table data:

    grunt> dump emp;
    (7369,SMITH,CLERK,7902,1980/12/17,800,,20)
    (7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
    (7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
    (7566,JONES,MANAGER,7839,1981/4/2,2975,,20)
    (7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
    (7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)
    (7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
    (7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)
    (7839,KING,PRESIDENT,,1981/11/17,5000,,10)
    (7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
    (7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)
    (7900,JAMES,CLERK,7698,1981/12/3,950,,30)
    (7902,FORD,ANALYST,7566,1981/12/3,3000,,20)
    (7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
    Create the department table:

    grunt> dept = load '/input/dept.csv' using PigStorage(',') as (deptno:int,dname:chararray,loc:chararray);
    grunt> describe dept;
    dept: {deptno: int,dname: chararray,loc: chararray}
    grunt> dump dept;
    (10,ACCOUNTING,NEW YORK)
    (20,RESEARCH,DALLAS)
    (30,SALES,CHICAGO)
    (40,OPERATIONS,BOSTON)
    Example 2: Query employee information: employee number, name, and salary.

    grunt> emp1 = foreach emp generate empno,ename,sal;
    grunt> dump emp1;
    (7369,SMITH,800)
    (7499,ALLEN,1600)
    (7521,WARD,1250)
    (7566,JONES,2975)
    (7654,MARTIN,1250)
    (7698,BLAKE,2850)
    (7782,CLARK,2450)
    (7788,SCOTT,3000)
    (7839,KING,5000)
    (7844,TURNER,1500)
    (7876,ADAMS,1100)
    (7900,JAMES,950)
    (7902,FORD,3000)
    (7934,MILLER,1300)
    Example 3: Query the employees in department 10:

    grunt> emp2 = filter emp by deptno == 10;    -- note: equality comparison uses two equals signs
    grunt> dump emp2;
    (7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
    (7839,KING,PRESIDENT,,1981/11/17,5000,,10)
    (7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
    Example 4: Query employee information sorted by monthly salary:

    grunt> emp3 = order emp by sal;
    grunt> dump emp3;
    (7369,SMITH,CLERK,7902,1980/12/17,800,,20)
    (7900,JAMES,CLERK,7698,1981/12/3,950,,30)
    (7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)
    (7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
    (7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
    (7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
    (7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
    (7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
    (7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
    (7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)
    (7566,JONES,MANAGER,7839,1981/4/2,2975,,20)
    (7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)
    (7902,FORD,ANALYST,7566,1981/12/3,3000,,20)
    (7839,KING,PRESIDENT,,1981/11/17,5000,,10)
    Example 5: Find the maximum salary of each department:

    grunt> emp41 = group emp by deptno;
    grunt> describe emp41;
    emp41: {group: int,emp: {(empno: int,ename: chararray,job: chararray,
    mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int)}}
    Note: the table emp41 contains the original table emp nested inside it (as a bag).
    grunt> dump emp41;
    (10,{(7934,MILLER,CLERK,7782,1982/1/23,1300,,10),
    (7839,KING,PRESIDENT,,1981/11/17,5000,,10),
    (7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)})
    (20,{(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20),
    (7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20),
    (7369,SMITH,CLERK,7902,1980/12/17,800,,20),
    (7566,JONES,MANAGER,7839,1981/4/2,2975,,20),
    (7902,FORD,ANALYST,7566,1981/12/3,3000,,20)})
    (30,{(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30),
    (7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30),
    (7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30),
    (7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30),
    (7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30),
    (7900,JAMES,CLERK,7698,1981/12/3,950,,30)})

    grunt> emp42 = foreach emp41 generate group, MAX(emp.sal);    -- note: built-in function names must be uppercase
    grunt> describe emp42;
    emp42: {group: int,int}
    grunt> dump emp42;
    (10,5000)
    (20,3000)
    (30,2850)
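    Besides MAX, Pig provides other built-in aggregate functions such as MIN, AVG, SUM, and COUNT, which follow the same pattern over a grouped relation. A sketch continuing from emp41 above:

    ```
    grunt> stats = foreach emp41 generate group, COUNT(emp), AVG(emp.sal);
    grunt> dump stats;
    ```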
    Example 6: Multi-table query: list each employee's department name and employee name.

    grunt> emp51 = join dept by deptno, emp by deptno;
    grunt> describe emp51;
    emp51: {dept::deptno: int,dept::dname: chararray,dept::loc: chararray,
    emp::empno: int,emp::ename: chararray,emp::job: chararray,emp::mgr: int,
    emp::hiredate: chararray,emp::sal: int,emp::comm: int,emp::deptno: int}
    grunt> dump emp51;
    (10,ACCOUNTING,NEW YORK,7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
    (10,ACCOUNTING,NEW YORK,7839,KING,PRESIDENT,,1981/11/17,5000,,10)
    (10,ACCOUNTING,NEW YORK,7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
    (20,RESEARCH,DALLAS,7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)
    (20,RESEARCH,DALLAS,7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)
    (20,RESEARCH,DALLAS,7369,SMITH,CLERK,7902,1980/12/17,800,,20)
    (20,RESEARCH,DALLAS,7566,JONES,MANAGER,7839,1981/4/2,2975,,20)
    (20,RESEARCH,DALLAS,7902,FORD,ANALYST,7566,1981/12/3,3000,,20)
    (30,SALES,CHICAGO,7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
    (30,SALES,CHICAGO,7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
    (30,SALES,CHICAGO,7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)
    (30,SALES,CHICAGO,7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
    (30,SALES,CHICAGO,7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
    (30,SALES,CHICAGO,7900,JAMES,CLERK,7698,1981/12/3,950,,30)

    grunt> emp52 = foreach emp51 generate dept::dname, emp::ename;
    grunt> dump emp52;
    (ACCOUNTING,MILLER)
    (ACCOUNTING,KING)
    (ACCOUNTING,CLARK)
    (RESEARCH,ADAMS)
    (RESEARCH,SCOTT)
    (RESEARCH,SMITH)
    (RESEARCH,JONES)
    (RESEARCH,FORD)
    (SALES,TURNER)
    (SALES,ALLEN)
    (SALES,BLAKE)
    (SALES,MARTIN)
    (SALES,WARD)
    (SALES,JAMES)
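    The join above is an inner join, so department 40 (OPERATIONS, which has no employees) does not appear in the result. Pig also supports outer joins; a sketch:

    ```
    grunt> emp53 = join dept by deptno left outer, emp by deptno;
    grunt> dump emp53;    -- department 40 now appears, with nulls in the emp columns
    ```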
    Example 7: Query the employees in departments 10 and 20 and save the result to HDFS.

    grunt> emp61 = filter emp by deptno == 10;
    grunt> emp62 = filter emp by deptno == 20;
    grunt> emp6 = union emp61, emp62;
    grunt> store emp6 into '/output/emp6.txt' using PigStorage(',');


    Output(s):
    Successfully stored 8 records (334 bytes) in: "/output/emp6.txt"


    grunt> ls /output/emp6.txt
    hdfs://bigdata:9000/output/emp6.txt/_SUCCESS<r 1> 0
    hdfs://bigdata:9000/output/emp6.txt/part-m-00000<r 1> 209
    hdfs://bigdata:9000/output/emp6.txt/part-m-00001<r 1> 125
    grunt> cat /output/emp6.txt
    7369,SMITH,CLERK,7902,1980/12/17,800,,20
    7566,JONES,MANAGER,7839,1981/4/2,2975,,20
    7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
    7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
    7902,FORD,ANALYST,7566,1981/12/3,3000,,20
    7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
    7839,KING,PRESIDENT,,1981/11/17,5000,,10
    7934,MILLER,CLERK,7782,1982/1/23,1300,,10
    Among Pig Latin statements, only dump and store trigger a MapReduce job; the other statements do not. This resembles Spark's lazy evaluation: in Spark, only action operators trigger computation, while transformation operators do not.
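    For example, the following statements only build the logical plan; no MapReduce job is submitted until dump runs:

    ```
    grunt> e1 = foreach emp generate empno, sal;    -- no job yet
    grunt> e2 = filter e1 by sal > 2000;            -- still no job
    grunt> dump e2;                                 -- a MapReduce job is submitted here
    ```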
