美文网首页程序猿阵线联盟-汇总各类技术干货PostgreSQL
PostgreSQL 源码解读(35)- 查询语句#20(查询优

PostgreSQL 源码解读(35)- 查询语句#20(查询优

作者: EthanHe | 来源:发表于2018-09-06 12:21 被阅读9次

    本节简单介绍了PG查询优化中对Having和Group By子句的简化处理。

    一、基本概念

    简化Having语句
    把Having中的约束条件,如满足可以提升到Where条件中的,则移动到Where子句中,否则仍保留在Having语句中.这样做的目的是因为Having过滤在Group by之后执行,如能把Having中的过滤提升到Where中,则可以提前执行"选择"运算,减少Group by的开销.
    以下语句,条件dwbh='1002'提升到Where中执行:

    testdb=# explain verbose select a.dwbh,a.xb,count(*) 
    testdb-# from t_grxx a 
    testdb-# group by a.dwbh,a.xb
    testdb-# having count(*) >= 1 and dwbh = '1002';
                                     QUERY PLAN                                  
    -----------------------------------------------------------------------------
     GroupAggregate  (cost=15.01..15.06 rows=1 width=84)
       Output: dwbh, xb, count(*)
       Group Key: a.dwbh, a.xb
       Filter: (count(*) >= 1) -- count(*) >= 1 仍保留在Having中
       ->  Sort  (cost=15.01..15.02 rows=2 width=76)
             Output: dwbh, xb
             Sort Key: a.xb
             ->  Seq Scan on public.t_grxx a  (cost=0.00..15.00 rows=2 width=76)
                   Output: dwbh, xb
                   Filter: ((a.dwbh)::text = '1002'::text) -- 提升到Where中,扫描时过滤Tuple
    (10 rows)
    

    如存在Group by & Grouping sets则不作处理:

    testdb=# explain verbose
    testdb-# select a.dwbh,a.xb,count(*) 
    testdb-# from t_grxx a 
    testdb-# group by 
    testdb-# grouping sets ((a.dwbh),(a.xb),())
    testdb-# having count(*) >= 1 and dwbh = '1002'
    testdb-# order by a.dwbh,a.xb;
                                      QUERY PLAN                                   
    -------------------------------------------------------------------------------
     Sort  (cost=28.04..28.05 rows=3 width=84)
       Output: dwbh, xb, (count(*))
       Sort Key: a.dwbh, a.xb
       ->  MixedAggregate  (cost=0.00..28.02 rows=3 width=84)
             Output: dwbh, xb, count(*)
             Hash Key: a.dwbh
             Hash Key: a.xb
             Group Key: ()
             Filter: ((count(*) >= 1) AND ((a.dwbh)::text = '1002'::text)) -- 扫描数据表后再过滤
             ->  Seq Scan on public.t_grxx a  (cost=0.00..14.00 rows=400 width=76)
                   Output: dwbh, grbh, xm, xb, nl
    (11 rows)
    

    简化Group by语句
    如Group by中的字段列表已包含某个表主键的所有列,则该表在Group by语句中的其他列可以删除,这样的做法有利于提升在Group by过程中排序或Hash的性能,减少不必要的开销.

    testdb=# explain verbose select a.dwbh,a.dwmc,count(*) 
    testdb-# from t_dwxx a 
    testdb-# group by a.dwbh,a.dwmc
    testdb-# having count(*) >= 1;
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     HashAggregate  (cost=13.20..15.20 rows=53 width=264)
       Output: dwbh, dwmc, count(*)
       Group Key: a.dwbh, a.dwmc -- 分组键为dwbh & dwmc
       Filter: (count(*) >= 1)
       ->  Seq Scan on public.t_dwxx a  (cost=0.00..11.60 rows=160 width=256)
             Output: dwmc, dwbh, dwdz
    (6 rows)
    
    testdb=# alter table t_dwxx add primary key(dwbh); -- 添加主键
    ALTER TABLE
    testdb=# explain verbose select a.dwbh,a.dwmc,count(*) 
    from t_dwxx a 
    group by a.dwbh,a.dwmc
    having count(*) >= 1;
                                  QUERY PLAN                               
    -----------------------------------------------------------------------
     HashAggregate  (cost=1.05..1.09 rows=1 width=264)
       Output: dwbh, dwmc, count(*)
       Group Key: a.dwbh -- 分组键只保留dwbh
       Filter: (count(*) >= 1)
       ->  Seq Scan on public.t_dwxx a  (cost=0.00..1.03 rows=3 width=256)
             Output: dwmc, dwbh, dwdz
    (6 rows)
    

    二、源码解读

    相关处理的源码位于文件subquery_planner.c中,主函数为subquery_planner,代码片段如下:

         /*
          * In some cases we may want to transfer a HAVING clause into WHERE. We
          * cannot do so if the HAVING clause contains aggregates (obviously) or
          * volatile functions (since a HAVING clause is supposed to be executed
          * only once per group).  We also can't do this if there are any nonempty
          * grouping sets; moving such a clause into WHERE would potentially change
          * the results, if any referenced column isn't present in all the grouping
          * sets.  (If there are only empty grouping sets, then the HAVING clause
          * must be degenerate as discussed below.)
          *
          * Also, it may be that the clause is so expensive to execute that we're
          * better off doing it only once per group, despite the loss of
          * selectivity.  This is hard to estimate short of doing the entire
          * planning process twice, so we use a heuristic: clauses containing
          * subplans are left in HAVING.  Otherwise, we move or copy the HAVING
          * clause into WHERE, in hopes of eliminating tuples before aggregation
          * instead of after.
          *
          * If the query has explicit grouping then we can simply move such a
          * clause into WHERE; any group that fails the clause will not be in the
          * output because none of its tuples will reach the grouping or
          * aggregation stage.  Otherwise we must have a degenerate (variable-free)
          * HAVING clause, which we put in WHERE so that query_planner() can use it
          * in a gating Result node, but also keep in HAVING to ensure that we
          * don't emit a bogus aggregated row. (This could be done better, but it
          * seems not worth optimizing.)
          *
          * Note that both havingQual and parse->jointree->quals are in
          * implicitly-ANDed-list form at this point, even though they are declared
          * as Node *.
          */
         newHaving = NIL;
         foreach(l, (List *) parse->havingQual)//存在Having条件语句
         {
             Node       *havingclause = (Node *) lfirst(l);//获取谓词
     
             if ((parse->groupClause && parse->groupingSets) ||
                 contain_agg_clause(havingclause) ||
                 contain_volatile_functions(havingclause) ||
                 contain_subplans(havingclause))
             {
                 /* keep it in HAVING */
                 //如果有Group&&Group Sets语句
                 //保持不变
                 newHaving = lappend(newHaving, havingclause);
             }
             else if (parse->groupClause && !parse->groupingSets)
             {
                 /* move it to WHERE */
                 //只有group语句,可以加入到jointree的条件中
                 parse->jointree->quals = (Node *)
                     lappend((List *) parse->jointree->quals, havingclause);
             }
             else//既没有group也没有grouping set,拷贝一份到jointree的条件中
             {
                 /* put a copy in WHERE, keep it in HAVING */
                 parse->jointree->quals = (Node *)
                     lappend((List *) parse->jointree->quals,
                             copyObject(havingclause));
                 newHaving = lappend(newHaving, havingclause);
             }
         }
         parse->havingQual = (Node *) newHaving;//调整having子句
     
         /* Remove any redundant GROUP BY columns */
         remove_useless_groupby_columns(root);//去掉group by中无用的数据列
    
    

    remove_useless_groupby_columns

     /*
      * remove_useless_groupby_columns
      *      Remove any columns in the GROUP BY clause that are redundant due to
      *      being functionally dependent on other GROUP BY columns.
      *
      * Since some other DBMSes do not allow references to ungrouped columns, it's
      * not unusual to find all columns listed in GROUP BY even though listing the
      * primary-key columns would be sufficient.  Deleting such excess columns
      * avoids redundant sorting work, so it's worth doing.  When we do this, we
      * must mark the plan as dependent on the pkey constraint (compare the
      * parser's check_ungrouped_columns() and check_functional_grouping()).
      *
      * In principle, we could treat any NOT-NULL columns appearing in a UNIQUE
      * index as the determining columns.  But as with check_functional_grouping(),
      * there's currently no way to represent dependency on a NOT NULL constraint,
      * so we consider only the pkey for now.
      */
     static void
     remove_useless_groupby_columns(PlannerInfo *root)
     {
         Query      *parse = root->parse;//查询树
         Bitmapset **groupbyattnos;//位图集合
         Bitmapset **surplusvars;//位图集合
         ListCell   *lc;
         int         relid;
     
         /* No chance to do anything if there are less than two GROUP BY items */
         if (list_length(parse->groupClause) < 2)//如果只有1个ITEMS,无需处理
             return;
     
         /* Don't fiddle with the GROUP BY clause if the query has grouping sets */
         if (parse->groupingSets)//存在Grouping sets,不作处理
             return;
     
         /*
          * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
          * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
          * that are GROUP BY items.
          */
         //用于分组的属性 
         groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
                                                (list_length(parse->rtable) + 1));
         foreach(lc, parse->groupClause)
         {
             SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
             TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
             Var        *var = (Var *) tle->expr;
     
             /*
              * Ignore non-Vars and Vars from other query levels.
              *
              * XXX in principle, stable expressions containing Vars could also be
              * removed, if all the Vars are functionally dependent on other GROUP
              * BY items.  But it's not clear that such cases occur often enough to
              * be worth troubling over.
              */
             if (!IsA(var, Var) ||
                 var->varlevelsup > 0)
                 continue;
     
             /* OK, remember we have this Var */
             relid = var->varno;
             Assert(relid <= list_length(parse->rtable));
             groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
                                                   var->varattno - FirstLowInvalidHeapAttributeNumber);
         }
     
         /*
          * Consider each relation and see if it is possible to remove some of its
          * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
          * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
          * of the column attnos of RTE k that are removable GROUP BY items.
          */
         surplusvars = NULL;         /* don't allocate array unless required */
         relid = 0;
         //如某个Relation的分组键中已含主键列,去掉其他列
         foreach(lc, parse->rtable)
         {
             RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
             Bitmapset  *relattnos;
             Bitmapset  *pkattnos;
             Oid         constraintOid;
     
             relid++;
     
             /* Only plain relations could have primary-key constraints */
             if (rte->rtekind != RTE_RELATION)
                 continue;
     
             /* Nothing to do unless this rel has multiple Vars in GROUP BY */
             relattnos = groupbyattnos[relid];
             if (bms_membership(relattnos) != BMS_MULTIPLE)
                 continue;
     
             /*
              * Can't remove any columns for this rel if there is no suitable
              * (i.e., nondeferrable) primary key constraint.
              */
             pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
             if (pkattnos == NULL)
                 continue;
     
             /*
              * If the primary key is a proper subset of relattnos then we have
              * some items in the GROUP BY that can be removed.
              */
             if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
             {
                 /*
                  * To easily remember whether we've found anything to do, we don't
                  * allocate the surplusvars[] array until we find something.
                  */
                 if (surplusvars == NULL)
                     surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
                                                          (list_length(parse->rtable) + 1));
     
                 /* Remember the attnos of the removable columns */
                 surplusvars[relid] = bms_difference(relattnos, pkattnos);
     
                 /* Also, mark the resulting plan as dependent on this constraint */
                 parse->constraintDeps = lappend_oid(parse->constraintDeps,
                                                     constraintOid);
             }
         }
     
         /*
          * If we found any surplus Vars, build a new GROUP BY clause without them.
          * (Note: this may leave some TLEs with unreferenced ressortgroupref
          * markings, but that's harmless.)
          */
         if (surplusvars != NULL)
         {
             List       *new_groupby = NIL;
     
             foreach(lc, parse->groupClause)
             {
                 SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
                 TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
                 Var        *var = (Var *) tle->expr;
     
                 /*
                  * New list must include non-Vars, outer Vars, and anything not
                  * marked as surplus.
                  */
                 if (!IsA(var, Var) ||
                     var->varlevelsup > 0 ||
                     !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
                                    surplusvars[var->varno]))
                     new_groupby = lappend(new_groupby, sgc);
             }
     
             parse->groupClause = new_groupby;
         }
     }
    

    三、参考资料

    planner.c

    相关文章

      网友评论

        本文标题:PostgreSQL 源码解读(35)- 查询语句#20(查询优

        本文链接:https://www.haomeiwen.com/subject/xyxxgftx.html