美文网首页
kylin-剪枝优化工具-使用衍生维度

kylin-剪枝优化工具-使用衍生维度

作者: 李小李的路 | 来源:发表于2020-02-21 17:03 被阅读0次
    • 基于kylin-3.0

    衍生维度

    • 首先观察一下日期维度表kylin_cal_dt;
    CREATE TABLE `kylin_cal_dt`(
      `cal_dt` date COMMENT 'Date, PK', 
      `year_beg_dt` date COMMENT 'YEAR Begin Date', 
      `qtr_beg_dt` date COMMENT 'Quarter Begin Date', 
      `month_beg_dt` date COMMENT 'Month Begin Date', 
      `week_beg_dt` date COMMENT 'Week Begin Date', 
      `age_for_year_id` smallint, 
      `age_for_qtr_id` smallint, 
      `age_for_month_id` smallint, 
      `age_for_week_id` smallint, 
      `age_for_dt_id` smallint, 
      `age_for_rtl_year_id` smallint, 
      `age_for_rtl_qtr_id` smallint, 
      `age_for_rtl_month_id` smallint, 
      `age_for_rtl_week_id` smallint, 
      `age_for_cs_week_id` smallint, 
      `day_of_cal_id` int, 
      `day_of_year_id` smallint, 
      `day_of_qtr_id` smallint, 
      `day_of_month_id` smallint, 
      `day_of_week_id` int, 
      `week_of_year_id` tinyint, 
      `week_of_cal_id` int, 
      `month_of_qtr_id` tinyint, 
      `month_of_year_id` tinyint, 
      `month_of_cal_id` smallint, 
      `qtr_of_year_id` tinyint, 
      `qtr_of_cal_id` smallint, 
      `year_of_cal_id` smallint, 
      `year_end_dt` string, 
      `qtr_end_dt` string, 
      `month_end_dt` string, 
      `week_end_dt` string, 
      `cal_dt_name` string, 
      `cal_dt_desc` string, 
      `cal_dt_short_name` string, 
      `ytd_yn_id` tinyint, 
      `qtd_yn_id` tinyint, 
      `mtd_yn_id` tinyint, 
      `wtd_yn_id` tinyint, 
      `season_beg_dt` string, 
      `day_in_year_count` smallint, 
      `day_in_qtr_count` tinyint, 
      `day_in_month_count` tinyint, 
      `day_in_week_count` tinyint, 
      `rtl_year_beg_dt` string, 
      `rtl_qtr_beg_dt` string, 
      `rtl_month_beg_dt` string, 
      `rtl_week_beg_dt` string, 
      `cs_week_beg_dt` string, 
      `cal_date` string, 
      `day_of_week` string, 
      `month_id` string, 
      `prd_desc` string, 
      `prd_flag` string, 
      `prd_id` string, 
      `prd_ind` string, 
      `qtr_desc` string, 
      `qtr_id` string, 
      `qtr_ind` string, 
      `retail_week` string, 
      `retail_year` string, 
      `retail_start_date` string, 
      `retail_wk_end_date` string, 
      `week_ind` string, 
      `week_num_desc` string, 
      `week_beg_date` string, 
      `week_end_date` string, 
      `week_in_year_id` string, 
      `week_id` string, 
      `week_beg_end_desc_mdy` string, 
      `week_beg_end_desc_md` string, 
      `year_id` string, 
      `year_ind` string, 
      `cal_dt_mns_1year_dt` string, 
      `cal_dt_mns_2year_dt` string, 
      `cal_dt_mns_1qtr_dt` string, 
      `cal_dt_mns_2qtr_dt` string, 
      `cal_dt_mns_1month_dt` string, 
      `cal_dt_mns_2month_dt` string, 
      `cal_dt_mns_1week_dt` string, 
      `cal_dt_mns_2week_dt` string, 
      `curr_cal_dt_mns_1year_yn_id` tinyint, 
      `curr_cal_dt_mns_2year_yn_id` tinyint, 
      `curr_cal_dt_mns_1qtr_yn_id` tinyint, 
      `curr_cal_dt_mns_2qtr_yn_id` tinyint, 
      `curr_cal_dt_mns_1month_yn_id` tinyint, 
      `curr_cal_dt_mns_2month_yn_id` tinyint, 
      `curr_cal_dt_mns_1week_yn_ind` tinyint, 
      `curr_cal_dt_mns_2week_yn_ind` tinyint, 
      `rtl_month_of_rtl_year_id` string, 
      `rtl_qtr_of_rtl_year_id` tinyint, 
      `rtl_week_of_rtl_year_id` tinyint, 
      `season_of_year_id` tinyint, 
      `ytm_yn_id` tinyint, 
      `ytq_yn_id` tinyint, 
      `ytw_yn_id` tinyint, 
      `kylin_cal_dt_cre_date` string, 
      `kylin_cal_dt_cre_user` string, 
      `kylin_cal_dt_upd_date` string, 
      `kylin_cal_dt_upd_user` string)
    COMMENT 'Date Dimension Table'
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
    WITH SERDEPROPERTIES ( 
      'field.delim'=',', 
      'serialization.format'=',') 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      'hdfs://bigdata/user/hive/warehouse/kylin_cal_dt'
    TBLPROPERTIES (
      'transient_lastDdlTime'='1580892438')
    
    • 这是一个常见的时间维度表,里面充斥这各种用途的时间维度,如每个日期对应的星期,每个日期对应的月份等。这些维度可以被分析师用来灵活地进行各个时间粒度上的聚合分析,而不需要进行额外的上卷操作。
    • 但是如果为了这个目的一下子引入这么多维度,会导致Cube中的Cuboid的总数量呈爆炸式的增长,往往得不偿失。
    • 在实际的使用中,可以在维度中只放入这个维度表的主键(在底层实现中,我们更偏向使用事实表上的外键,因为在Inner Join的情况下,事实表外键和维度表主键一致的,而在left join的情况下事实表外键是维度表主键的超集),也就是说只物化按日期(CAL_DT)聚合的Cuboid。当用户需要按照更高的粒度如按周、按月来进行聚合时,在查询时会获取按日期聚合的Cuboid数据,并在查询引擎中实时地进行上卷操作,那么就达到了牺牲一部分运行时性能来节省Cube空间占用的目的。
    • kylin将这样的理念包装成一个简单的优化工具--衍生维度。将一个维度表上的维度设置为衍生维度,则这个维度不会参与预计算,而是使用维度表的主键(其实是事实表上相应的外键)来代替它。
    • kylin会在底层记录维度表主键与维度表其他维度之间的映射关系,以便在查询时能够动态地将维度表的主键"翻译"成这些非主键维度,并进行实时聚合。
    • 上面的描述虽然听起来有些复杂,但是使用起来其实非常简单,在创建Cube的Cube designer第二步添加维度的时候,选择Derived 而非 Normal,如下图所示:
      衍生维度
    • 衍生维度在Cube中不参加预计算,事实上如果前往Cube Designer的Advanced Setting,在Aggregation GroupsRowkeys部分也完全看不到这些衍生维度,甚至在这些地方也找不到维度表KYLIN_CAL_DT的主键,因为如前所述,Kylin实际上是用事实表上的外键作为作为这些衍生维度背后真正的有效维度的。在kylin官方案例 kylin-learn_kylin表结构解读
      KYLIN_CAL_DT通过以下方式链接:
    Join Condition:
    KYLIN_SALES.PART_DT = KYLIN_CAL_DT.CAL_DT
    
    • 因此,在Advanced Setting的Rowkeys部分就会看到PART_DT而看不到CAL_DT,更看不到哪些KYLIN_CAL_DT表上的衍生维度,如下图:
      Rowkey设计
    • 虽然衍生维度具有非常大的吸引力,但也不是说所有的维度表上的维度都得变成衍生维度,如果从维度表主键到某个维度表维度所需要的聚合工作量非常大,例如:从CAL_DTYEAR_BEG_DT基本上需要365:1的聚合量,那么将YEAR_BEG_DT作为一个普通的维度,而不是衍生维度可能是一种更好的选择。这种情况下,YEAR_BEG_DT会参与预计算,也会有一些包含YEAR_BEG_DT的Cuboid生成。

    相关文章

      网友评论

          本文标题:kylin-剪枝优化工具-使用衍生维度

          本文链接:https://www.haomeiwen.com/subject/mvtfqhtx.html