美文网首页
hive执行计划举例

hive执行计划举例

作者: DobeWang | 来源:发表于2018-06-06 18:18 被阅读0次

    执行计划例子:

    insert overwrite TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) ;

    OK

    ABSTRACT SYNTAX TREE:

      (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))

    STAGE DEPENDENCIES:

      Stage-1 is a root stage   /根

      Stage-0 depends on stages: Stage-1 /0依赖1

      Stage-2 depends on stages: Stage-0 /2依赖0

    STAGE PLANS:

    Stage: Stage-1

    Map Reduce//这个阶段是一个mapreduce作业      Alias -> Map Operator Tree:   //map操作树,对应map阶段

    t1

    TableScan   //扫描表获取数据   from加载表,描述中有行数和大小等

                alias: t1     //表别名

    Reduce Output Operator //这里描述map的输出,也就是reduce的输入。比如key,partition,sort等信息 

                  key expressions:  //t1表输出到reduce阶段的key信息

    expr: bar

    type: string

    sort order: +  //一个排序字段,这个排序字段是key=bar,多个排序字段多个+

    Map-reduce partition columns:  //partition的信息,由此也可以看出hive在join的时候会以join on后的列作为partition的列,以保证具有相同此列的值的行被分到同一个reduce中去

    expr: bar

    type: string

    tag: 0                         //对t1表打标签

    value expressions:   //t1表输出到reduce阶段的value信息

    expr: foo

    type: int

    expr: bar

    type: string

    t2

    TableScan

    alias: t2

    Reduce Output Operator

    key expressions:

    expr: bar

    type: string

    sort order: +

    Map-reduce partition columns:

    expr: bar

    type: string

    tag: 1

    value expressions:

    expr: foo

    type: int

    Reduce Operator Tree://reduce操作树,相当于reduce阶段Join Operator

    condition map:

    Inner Join 0 to 1

              condition expressions:

                0 {VALUE._col0} {VALUE._col1} //对应前面t1.bar, t1.foo

                1 {VALUE._col0} //对应前面t2.foo

              handleSkewJoin: false

              outputColumnNames: _col0, _col1, _col5

    Select Operator //筛选列,描述中有列名、类型,输出类型、大小等。

    expressions:

    expr: _col1

    type: string

    expr: _col0

    type: int

    expr: _col5

                      type: int

                outputColumnNames: _col0, _col1, _col2   //为临时结果字段按规则起的临时字段名

    File Output Operator //输出结果到临时文件中,描述介绍了压缩格式、输出文件格式。

    compressed: false

    GlobalTableId: 1

    table:

    input format: org.apache.hadoop.mapred.TextInputFormat

    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

    name: default.lpx

      Stage: Stage-0

    Move Operator //Stage-0简单把结果从临时目录,移动到表lpx相关的目录。

    tables:

    replace: true

    table:

    input format: org.apache.hadoop.mapred.TextInputFormat

    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

    name: default.lpx

    Stage: Stage-2

    Stats-Aggr Operator

    ========================================

    ========================================

    从信息头:

    STAGE DEPENDENCIES:

    Stage-1 is a root stage

    Stage-0 depends on stages: Stage-1

    Stage-2 depends on stages: Stage-0

    从这里可以看出Plan计划的Job任务结构,整个任务会分为3个Job执行,第一个Job将由Stage-1构成;

    第二个Job处理由Stage-0构成,Stage-0的处理必须依赖Stage-1阶段的结果;

    第三个Job处理由Stage-2构成,Stage-2的处理必须依赖Stage-0阶段的结果。

    下面分别解释Stage-1和Stage-0,执行SQL可以分成两步:(1)SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);

    (2)

    insert overwrite TABLE lpx;

    Stage: Stage-1对应一次完整的Map Reduce任务,包括:Map Operator Tree和Reduce Operator Tree两步操作,Map Operator Tree对应Map任务,Reduce Operator Tree对应Reduce任务。从Map Operator Tree阶段可以看出进行了两个并列的操作t1和t2,分别SELECT t1.bar, t1.foo FROM t1;和SELECT t2.foo FROM t2;而且两个Map任务分别产生了Reduce阶段的输入[Reduce Output Operator]。从Reduce Operator Tree分析可以看到如下信息,条件连接Map的输出以及通过预定义的输出格式生成符合default.lpx的存储格式的数据存储到HDFS中。在我们创建lpx表的时候,没有指定该表的存储格式,默认会以Text为存储格式,输入输出会以TextInputFormat与TextOutputFormat进行读写:

    table:

    input format: org.apache.hadoop.mapred.TextInputFormat

    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

    name: default.lpx

    input format的值对应org.apache.hadoop.mapred.TextInputFormat,这是因为在开始的Map阶段产生的临时输出文件是以TextOutputFormat格式保存的,自然Reduce的读取是由TextInputFormat格式处理读入数据。这些是由Hadoop的MapReduce处理细节来控制,而Hive只需要指定处理格式即可。

    Serde值为org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe类,这时这个对象的保存的值为_col0, _col1, _col2,也就是我们预期要查询的t1.bar, t1.foo, t2.foo,这个值具体的应该为_col0+表lpx设置的列分割符+_col1+表lpx设置的列分割符+_col2。outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat可以知道output的处理是使用该类来处理的。

    Stage-0对应上面提到的第二步操作。这时stage-1产生的临时处理文件举例如tmp,需要经过stage-0阶段操作处理到lpx表中。Move Operator代表了这并不是一个MapReduce任务,只需要调用MoveTask的处理就行,在处理之前会去检查输入文件是否符合lpx表的存储格式。

    hive执行计划作用

    分析作业执行过程,优化作业执行流程,提升作业执行效率;例如,数据过滤条件从reduce端提前到map端,有效减少map/reduce间shuffle数据量,提升作业执行效率;

    提前过滤数据数据集,减少不必要的读取操作;例如: hive join操作先于where条件顾虑,将分区条件放入on语句中,能够有效减少输入数据集;

    执行计划分析问题hql

    select a.*, b.cust_uid

    from ods_ad_bid_deliver_info b join mds_ad_algo_feed_monitor_data_table a

    where a.dt<=20140101 and a.dt<=20140108 and key='deliver_id_bucket_id' and a.dt=b.dt and a.key_slice=b.deliver_id

    ==========================================================================

    ==========================================================================

    执行计划:

    抽象语法树:

    ABSTRACT SYNTAX TREE:

      (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME ods_ad_bid_deliver_info) b) (TOK_TABREF (TOK_TABNAME mds_ad_algo_feed_monitor_data_table) a))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_ALLCOLREF (TOK_TABNAME a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) cust_uid))) (TOK_WHERE (and (and (and (and (<= (. (TOK_TABLE_OR_COL a) dt) 20140101) (<= (. (TOK_TABLE_OR_COL a) dt) 20140108)) (= (TOK_TABLE_OR_COL key) 'deliver_id_bucket_id')) (= (. (TOK_TABLE_OR_COL a) dt) (. (TOK_TABLE_OR_COL b) dt))) (= (. (TOK_TABLE_OR_COL a) key_slice) (. (TOK_TABLE_OR_COL b) deliver_id))))))

    STAGE DEPENDENCIES:

      Stage-1 is a root stage

      Stage-0 is a root stage

    STAGE PLANS:

      Stage: Stage-1

        Map Reduce

          Alias -> Map Operator Tree:

            a

             TableScan

                alias: a

               Filter Operator

                  predicate:

                      expr: (key = 'deliver_id_bucket_id') //按key指定值在map阶段过滤

                      type: boolean

                 Reduce Output Operator

                    sort order:

                    tag: 1

                    value expressions: //select *导致输出到reduce的数据是全部的列信息

                         expr: key

                          type: string

                          expr: key_slice

                          type: string

                          expr: billing_mode_slice

                          type: string

                          expr: bucket_id

                          type: string

                          expr: ctr

                          type: string

                          expr: ecpm

                          type: string

                          expr: auc

                          type: string

                          expr: pctr

                          type: string

                          expr: pctr_ctr

                          type: string

                          expr: total_pv

                          type: string

                          expr: total_click

                          type: string

                          expr: dt

                          type: string

            b

             TableScan

                alias: b

               Reduce Output Operator

                  sort order:

                  tag: 0

                  value expressions:

                        expr: deliver_id

                        type: string

                        expr: cust_uid

                        type: string

                        expr: dt

                        type: string

          Reduce Operator Tree:

           Join Operator

              condition map:

                   Inner Join 0 to 1

              condition expressions:

                0 {VALUE._col0} {VALUE._col6} {VALUE._col35}

                1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}

              handleSkewJoin: false

              outputColumnNames: _col0, _col6, _col35, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49

             Filter Operator

                predicate:

                   expr: (((((_col49 <= 20140101) and (_col49 <= 20140108)) and (_col38 = 'deliver_id_bucket_id')) and (_col49 = _col35)) and (_col39 = _col0))

                    type: boolean

               Select Operator

                  expressions:

                        expr: _col38

                        type: string

                        expr: _col39

                        type: string

                        expr: _col40

                        type: string

                        expr: _col41

                        type: string

                        expr: _col42

                        type: string

                        expr: _col43

                        type: string

                        expr: _col44

                        type: string

                        expr: _col45

                        type: string

                        expr: _col46

                        type: string

                        expr: _col47

                        type: string

                        expr: _col48

                        type: string

                        expr: _col49

                        type: string

                        expr: _col6

                        type: string

                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

                 File Output Operator

                    compressed: false

                    GlobalTableId: 0

                    table:

                        input format: org.apache.hadoop.mapred.TextInputFormat

                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

      Stage: Stage-0

        Fetch Operator

          limit: -1

    优化之后hql:

    select a.*, b.cust_uid

    from ods_ad_bid_deliver_info b

    join mds_ad_algo_feed_monitor_data_table a

    on(a.dt<=20140101 and a.dt<=20140108 and a.dt=b.dt and a.key_slice=b.deliver_id and a.key='deliver_id_bucket_id')

    =================================================================

    =================================================================

    执行计划:

    抽象语法树:

    STAGE DEPENDENCIES:

      Stage-1 is a root stage

      Stage-0 is a root stage

    STAGE PLANS:

      Stage: Stage-1

        Map Reduce

          Alias -> Map Operator Tree:

            a

              TableScan

                alias: a

               Filter Operator

                  predicate:

                      expr: (key = 'deliver_id_bucket_id')

                      type: boolean

                 Filter Operator

                    predicate:

                        expr: (dt <= 20140101)  //分区过滤条件在map端生效

                        type: boolean

                   Filter Operator

                      predicate:

                          expr: (dt <= 20140108)  //分区过滤条件在map端生效

                          type: boolean

                      Filter Operator

                        predicate:

                            expr: (key = 'deliver_id_bucket_id')

                            type: boolean

                        Reduce Output Operator

                          key expressions:

                                expr: dt

                                type: string

                                expr: key_slice

                                type: string

                          sort order: ++

                          Map-reduce partition columns:

                                expr: dt

                                type: string

                                expr: key_slice

                                type: string

                          tag: 1

                          value expressions:

                                expr: key

                                type: string

                                expr: key_slice

                                type: string

                                expr: billing_mode_slice

                                type: string

                                expr: bucket_id

                                type: string

                                expr: ctr

                                type: string

                                expr: ecpm

                                type: string

                                expr: auc

                                type: string

                                expr: pctr

                                type: string

                                expr: pctr_ctr

                                type: string

                                expr: total_pv

                                type: string

                                expr: total_click

                                type: string

                                expr: dt

                                type: string

            b

              TableScan

                alias: b

                Reduce Output Operator

                  key expressions:

                        expr: dt

                        type: string

                        expr: deliver_id

                        type: string

                  sort order: ++

                  Map-reduce partition columns:

                        expr: dt

                        type: string

                        expr: deliver_id

                        type: string

                  tag: 0

                  value expressions:

                        expr: cust_uid

                        type: string

          Reduce Operator Tree:

            Join Operator

              condition map:

                   Inner Join 0 to 1

              condition expressions:

                0 {VALUE._col6}

                1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}

              handleSkewJoin: false

              outputColumnNames: _col6, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49

              Select Operator

                expressions:

                      expr: _col38

                      type: string

                      expr: _col39

                      type: string

                      expr: _col40

                      type: string

                      expr: _col41

                      type: string

                      expr: _col42

                      type: string

                      expr: _col43

                     type: string

                      expr: _col44

                      type: string

                      expr: _col45

                      type: string

                      expr: _col46

                      type: string

                      expr: _col47

                      type: string

                      expr: _col48

                      type: string

                      expr: _col49

                      type: string

                      expr: _col6

                      type: string

                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

                File Output Operator

                  compressed: false

                  GlobalTableId: 0

                  table:

                      input format: org.apache.hadoop.mapred.TextInputFormat

                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

      Stage: Stage-0

        Fetch Operator

          limit: -1

    例子:

    select * from emp e

    left join dept d on e.deptno=d.deptno

    where d.dt='2018-06-04';

    花费时间:Time taken: 44.401 seconds, Fetched: 17 row(s)

    执行计划:

    STAGE DEPENDENCIES:

      Stage-4 is a root stage

      Stage-3 depends on stages: Stage-4

      Stage-0 depends on stages: Stage-3

    STAGE PLANS:

      Stage: Stage-4

        Map Reduce Local Work  /本地执行

          Alias -> Map Local Tables:

            d

              Fetch Operator

                limit: -1

          Alias -> Map Local Operator Tree:

            d

              TableScan

                alias: d

                Statistics: Num rows: 1 Data size: 168 Basic stats: PARTIAL Column stats: PARTIAL

    HashTable Sink Operator/ReduceSinkOperator将Map端的字段组合序列化为Reduce Key/value, Partition Key,只可能出现在Map阶段,同时也标志着Hive生成的MapReduce程序中Map阶段的结束。

                  keys:

                    0 deptno (type: string)

                    1 deptno (type: string)

      Stage: Stage-3

        Map Reduce

          Map Operator Tree:

              TableScan

                alias: e

                Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL

                Map Join Operator

                  condition map:

                       Left Outer Join0 to 1

                  keys:

                    0 deptno (type: string)

                    1 deptno (type: string)

                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15

                  Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                  Filter Operator

                    predicate: (_col15 = '2018-06-04') (type: boolean)

                    Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                    Select Operator

                      expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: str

    ing), _col12 (type: string), _col13 (type: string), _col14 (type: string), '2018-06-04' (type: string)                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

                      Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                      File Output Operator

                        compressed: false

                        Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                        table:

                            input format: org.apache.hadoop.mapred.TextInputFormat

                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

          Local Work:

            Map Reduce Local Work

      Stage: Stage-0

        Fetch Operator

          limit: -1

          Processor Tree:

            ListSink

    select * from emp e

    left join dept d on (e.deptno=d.deptno and  d.dt='2018-06-04');

    花费时间:Time taken: 23.804 seconds, Fetched: 17 row(s)

    STAGE DEPENDENCIES:

      Stage-4 is a root stage

      Stage-3 depends on stages: Stage-4

      Stage-0 depends on stages: Stage-3

    STAGE PLANS:

      Stage: Stage-4

        Map Reduce Local Work

          Alias -> Map Local Tables:

            d

              Fetch Operator

                limit: -1

          Alias -> Map Local Operator Tree:

            d

              TableScan

                alias: d

                filterExpr: (dt = '2018-06-04') (type: boolean)

                Statistics: Num rows: 1 Data size: 84 Basic stats: PARTIAL Column stats: PARTIAL

                HashTable Sink Operator

                  keys:

                    0 deptno (type: string)

                    1 deptno (type: string)

      Stage: Stage-3

        Map Reduce

          Map Operator Tree:

              TableScan

                alias: e

                Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL

                Map Join Operator

                  condition map:

                       Left Outer Join0 to 1

                  keys:

                    0 deptno (type: string)

                    1 deptno (type: string)

                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15

                  Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                  Select Operator

                    expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: strin

    g), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string)                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

                    Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                    File Output Operator

                      compressed: false

                      Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                      table:

                          input format: org.apache.hadoop.mapred.TextInputFormat

                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

          Local Work:

            Map Reduce Local Work

      Stage: Stage-0

        Fetch Operator

          limit: -1

          Processor Tree:

            ListSink

    相关文章

      网友评论

          本文标题:hive执行计划举例

          本文链接:https://www.haomeiwen.com/subject/bmkisftx.html