Note:
The examples below were run on Hive 2.1.1.
1. Overview of the Hive explain command
Hive's explain command displays the execution plan of a Hive sql statement; analyzing that plan is the starting point for optimizing the statement.
Syntax:
EXPLAIN [EXTENDED|CBO|AST|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION|ANALYZE] query
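For example (a sketch run against the test.ods_fact_sale table that is used later in this article; note that some of the options above, such as CBO, LOCKS and VECTORIZATION, were only added in Hive releases newer than 2.1.1):

-- basic execution plan, logical operators only
explain
select count(*) from test.ods_fact_sale;

-- verbose plan: file paths, table properties, statistics
explain extended
select count(*) from test.ods_fact_sale;

-- the input tables and partitions the query depends on
explain dependency
select count(*) from test.ods_fact_sale;

-- the privileges required to run the query
explain authorization
select count(*) from test.ods_fact_sale;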
A Hive query is converted into a sequence of stages (more precisely, a directed acyclic graph of stages). These stages may be map/reduce stages, or stages that perform metastore or file-system operations such as move and rename. The explain output has three parts:
- The abstract syntax tree (AST) of the query
- The dependencies between the different stages of the plan
- The description of each stage
The description of a stage shows a sequence of operators together with the metadata associated with each operator. The metadata may include the filter expression of a FilterOperator, the select expressions of a SelectOperator, or the output file name of a FileSinkOperator.
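For example, a query with a WHERE clause will produce a plan containing a Filter Operator whose predicate metadata is the filter expression (a sketch; the predicate below is hypothetical):

explain
select t.prod_name           -- appears as the expressions of a Select Operator
from ods_fact_sale t
where t.sale_nums > 100;     -- appears as the predicate of a Filter Operator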
2. A Hive explain example
A map/reduce stage itself has two parts:
- A mapping from table alias to Map Operator Tree: this tells the mappers which operator tree to call in order to process rows from a particular table, or from the result of a previous map/reduce stage.
- A Reduce Operator Tree: the operator tree that processes all the rows of the shuffle output of the map/reduce job.
Both parts appear in the example below.
Let's look at the output of explain extended for a simple group by statement:
hive>
> explain extended
> select t.sale_date ,
> t.prod_name ,
> sum(t.sale_nums) total_nums
> from ods_fact_sale t
> group by t.sale_date,t.prod_name ;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage                  // the Spark job
  Stage-0 depends on stages: Stage-1       // the client-side fetch of the result

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 914)    // shuffle from Map 1 into 914 reduce partitions
      DagName: root_20201230141411_24e56f4b-fdc1-461f-8ac9-81594cc3b1ad:1
      Vertices:
        Map 1
            Map Operator Tree:             // operators executed in the map phase of the job
                TableScan                  // reads the table's data
                  alias: t                 // alias of the table being read
                  Statistics: Num rows: 767830000 Data size: 30653263662 Basic stats: COMPLETE Column stats: NONE    // table statistics
                  GatherStats: false
                  Select Operator
                    expressions: sale_date (type: string), prod_name (type: string), sale_nums (type: int)
                    outputColumnNames: sale_date, prod_name, sale_nums
                    Statistics: Num rows: 767830000 Data size: 30653263662 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: sum(sale_nums)
                      keys: sale_date (type: string), prod_name (type: string)
                      mode: hash           // map-side partial aggregation in an in-memory hash table
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 767830000 Data size: 30653263662 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col1 (type: string)
                        null sort order: aa
                        sort order: ++     // both shuffle keys sorted ascending
                        Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                        Statistics: Num rows: 767830000 Data size: 30653263662 Basic stats: COMPLETE Column stats: NONE
                        tag: -1
                        value expressions: _col2 (type: bigint)
                        auto parallelism: false
            Path -> Alias:
              hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale [t]
            Path -> Partition:
              hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale
                Partition
                  base file name: ods_fact_sale
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","prod_name":"true","sale_date":"true","sale_nums":"true"}}
                    bucket_count -1
                    columns id,sale_date,prod_name,sale_nums
                    columns.comments
                    columns.types bigint:string:string:int
                    comment Imported by sqoop on 2020/11/25 19:14:01
                    field.delim \001
                    file.inputformat org.apache.hadoop.mapred.TextInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    line.delim
                    location hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale
                    name test.ods_fact_sale
                    numFiles 4
                    numRows 767830000
                    rawDataSize 30653263662
                    serialization.ddl struct ods_fact_sale { i64 id, string sale_date, string prod_name, i32 sale_nums}
                    serialization.format \001
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    totalSize 31421093662
                    transient_lastDdlTime 1608796507
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                    input format: org.apache.hadoop.mapred.TextInputFormat                    // input format of the source table (plain text, not a sequence file)
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat // output format of the source table (plain text, not a sequence file)
                    properties:
                      COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","prod_name":"true","sale_date":"true","sale_nums":"true"}}
                      bucket_count -1
                      columns id,sale_date,prod_name,sale_nums
                      columns.comments
                      columns.types bigint:string:string:int
                      comment Imported by sqoop on 2020/11/25 19:14:01
                      field.delim \001
                      file.inputformat org.apache.hadoop.mapred.TextInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      line.delim
                      location hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale
                      name test.ods_fact_sale
                      numFiles 4
                      numRows 767830000
                      rawDataSize 30653263662
                      serialization.ddl struct ods_fact_sale { i64 id, string sale_date, string prod_name, i32 sale_nums}
                      serialization.format \001
                      serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      totalSize 31421093662
                      transient_lastDdlTime 1608796507
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: test.ods_fact_sale
                  name: test.ods_fact_sale
            Truncated Path -> Alias:
              /test.db/ods_fact_sale [t]
        Reducer 2
            Execution mode: vectorized
            Needs Tagging: false
            Reduce Operator Tree:          // operators executed in the reduce phase of the job
              Group By Operator
                aggregations: sum(VALUE._col0)
                keys: KEY._col0 (type: string), KEY._col1 (type: string)
                mode: mergepartial         // merges the partial aggregates produced on the map side
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 383915000 Data size: 15326631831 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  directory: hdfs://nameservice1/tmp/hive/root/2c65817a-8195-495c-9218-5435913ad279/hive_2020-12-30_14-14-11_934_7035611081945658845-1/-mr-10000/.hive-staging_hive_2020-12-30_14-14-11_934_7035611081945658845-1/-ext-10001
                  NumFilesPerFileSink: 1
                  Statistics: Num rows: 383915000 Data size: 15326631831 Basic stats: COMPLETE Column stats: NONE
                  Stats Publishing Key Prefix: hdfs://nameservice1/tmp/hive/root/2c65817a-8195-495c-9218-5435913ad279/hive_2020-12-30_14-14-11_934_7035611081945658845-1/-mr-10000/.hive-staging_hive_2020-12-30_14-14-11_934_7035611081945658845-1/-ext-10001/
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      properties:
                        columns _col0,_col1,_col2
                        columns.types string:string:bigint
                        escape.delim \
                        hive.serialization.extend.additional.nesting.levels true
                        serialization.escape.crlf true
                        serialization.format 1
                        serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  TotalFiles: 1
                  GatherStats: false
                  MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1                            // -1 means the query has no LIMIT clause
      Processor Tree:
        ListSink
Time taken: 0.731 seconds, Fetched: 136 row(s)
hive>
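A practical way to use explain for tuning is to run it before and after a change and compare the two plans: which operators appear, and how the estimated Statistics shrink from one operator to the next. A sketch (the sale_date predicate is hypothetical):

explain
select t.sale_date, t.prod_name, sum(t.sale_nums) total_nums
from ods_fact_sale t
where t.sale_date >= '2020-01-01'   -- hypothetical filter
group by t.sale_date, t.prod_name;

Compared with the plan above, a Filter Operator should now appear under the TableScan, and the estimated row count feeding the Group By Operator and the shuffle should drop accordingly; differences like these are what to look for when tuning a Hive sql statement.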