美文网首页SQL
SQL for Data Analysis - Lesson 4

SQL for Data Analysis - Lesson 4

作者: IntoTheVoid | 来源:发表于2018-08-26 20:00 被阅读15次

    子查询

    目的: 有时候,你要回答的问题无法通过直接处理数据库中的现有表格获得答案。但是,如果我们能通过现有的表格创建新的表格,我们就能查询这些新的表格,并回答我们的问题。

    子查询简介

    子查询表格表达式都是用来通过查询创建一个表格,然后再编写一个查询来与这个新创建的表格进行互动。每当我们需要使用现有表格创建新的表格,然后需要再次查询时,就表明我们需要使用某种子查询。

    某些查询(也成为内层查询或嵌套查询Subqueries)是可用于多个步骤的工具.

    以营销经理为例, 哪些渠道平均每天发送给Parch&Posey的流量最大?

    • 首先查询基础表, 确保数据可用
    • 统计每天每个渠道的所有事件
    • 将以上查询构建为子查询表, 要做到这一点, 首先将以上查询放入括号中, 并将其用在编写下个查询的FROM子句, 子查询需要别名, 在括号后面添加别名
    • 在创建的事件列中计算平均值, 由于子查询就像FROM 子句的一个表格一样, 我们将在子查询之后放置一个GROUP BY子句, 由于现在是根据这个新的聚合重新查询, 因此不再需要在子查询中使用ORDER BY语句
    SELECT channel,
           AVG(event_count) AS avg_event_count
    FROM
    (SELECT DATE_TRUNC('day', occurred_at) AS day,
           channel,
           COUNT(*) AS event_count
      FROM demo.web_events_full
    GROUP BY 1, 2
     ) sub
    
     GROUP BY 1
     ORDER BY 2 DESC
    

    以上代码, 首先运行内层查询, 然后针对内层查询创建的结果集来运行外层查询, 查询结果如下

    image.png

     
     

    子查询格式

    在使用子查询时,要让读者能够轻松地判断查询的哪个部分将一起执行。大部分人的做法是按照某种方式缩进子查询.

    • 格式糟糕的查询

    示例1: 无法判断查询的作用

    image.png

    示例2: 不是太糟糕, 但是还有更好的写法.

    image.png
    • 格式清晰的查询

    示例3:

    SELECT *
    FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
                    channel, COUNT(*) as events
          FROM web_events 
          GROUP BY 1,2
          ORDER BY 3 DESC) sub;
    

    示例4:

    SELECT *
    FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
                    channel, COUNT(*) as events
          FROM web_events 
          GROUP BY 1,2
          ORDER BY 3 DESC) sub
    GROUP BY channel
    ORDER BY 2 DESC;
    

     
     

    更多子查询内容

    子查询可以在查询中的若干地方使用, 无论是表名称, 列名称还是个别值, 尤其适用于条件逻辑, 可与WHEREJOIN子句协同使用, 或用于CASE语句的WHEN部分.

    例如, 你可能希望仅返回与Parch&Posies首个订单相同月份的订单

    要获取首个订单的日期, 你可以编写一个带有MIN函数的子查询, 以下代码获取首个订单的信息

    SELECT MIN(occurred_at) AS min
      FROM orders
    

    在以上代码中, 再添加一个DATE_TRUNC函数来获取首个订单的月份

    SELECT DATE_TRUNC('month', MIN(occurred_at)) AS min
      FROM orders
    

    最后使用上面的子查询, 来使用WHERE筛选订单表, 并按occurred_at列进行分类,

    SELECT *
      FROM orders
      WHERE DATE_TRUNC('month', occurred_at) =
            (SELECT DATE_TRUNC('month', MIN(occurred_at)) AS min
               FROM orders)
      ORDER BY occurred_at
    

    此查询可行的原因是子查询的结果只有一个单元格

    image.png

    大多数条件逻辑都可以和包含单个单元格结果的子查询一起使用, 但是IN是唯一一种, 当内层查询包含多个结果时可用的条件逻辑类型.

    提示
    注意,在条件语句中编写子查询时,不能包含别名。这是因为该子查询会被当做单个值(或者对于 IN 情况是一组值),而不是一个表格。

    同时注意,这里的查询对应的是单个值。如果我们返回了整个列,则需要使用 IN 来执行逻辑参数。如果我们要返回整个表格,则必须为该表格使用别名,并对整个表格执行其他逻辑。

     
     
    更多的子查询练习
    相关ERD

    image.png

    1.提供每个区域拥有最高销售额 (total_amt_usd) 的销售代表的姓名。
    步骤:

    首先,我要算出与每个销售代表相关的总销售额 (total_amt_usd),并且要得出他们所在的区域。以下查询提供了这一信息。
    SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
    FROM sales_reps s
    JOIN accounts a
    ON a.sales_rep_id = s.id
    JOIN orders o
    ON o.account_id = a.id
    JOIN region r
    ON r.id = s.region_id
    GROUP BY 1,2
    ORDER BY 3 DESC;

    接着,得出每个区域的最高销售额,然后使用该信息从最终结果中获取这些行。
    SELECT region_name, MAX(total_amt) total_amt
    FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
    FROM sales_reps s
    JOIN accounts a
    ON a.sales_rep_id = s.id
    JOIN orders o
    ON o.account_id = a.id
    JOIN region r
    ON r.id = s.region_id
    GROUP BY 1, 2) inner1
    GROUP BY 1;

    最后, 本质上,这是两个表格的连接,其中区域和销售额相匹配。完整代码如下

    SELECT t2.sales_rep_name, t2.region, t2.total_usd
      FROM (SELECT region, 
                   MAX(total_usd) AS total_usd
              FROM (SELECT r.name region, 
                           s.name sales_rep_name,
                           SUM(o.total_amt_usd) total_usd
                      FROM orders o
                      JOIN accounts a
                        ON o.account_id = a.id
                      JOIN sales_reps s
                        ON a.sales_rep_id = s.id
                      JOIN region r
                        ON s.region_id = r.id
                  GROUP BY 1,2) inner1
            GROUP BY 1) t1
    
    JOIN (SELECT r.name region, 
                 s.name sales_rep_name,
                 SUM(o.total_amt_usd) total_usd
            FROM orders o
            JOIN accounts a
              ON o.account_id = a.id
            JOIN sales_reps s
              ON a.sales_rep_id = s.id
            JOIN region r
              ON s.region_id = r.id
          GROUP BY 1,2
          ORDER BY 3 DESC) t2
    
    ON t1.region = t2.region AND t1.total_usd = t2.total_usd
    
    

    方案二:使用with设置子表(推荐)

    WITH t1 AS (
      SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
       FROM sales_reps s
       JOIN accounts a
       ON a.sales_rep_id = s.id
       JOIN orders o
       ON o.account_id = a.id
       JOIN region r
       ON r.id = s.region_id
       GROUP BY 1,2
       ORDER BY 3 DESC), 
    t2 AS (
       SELECT region_name, MAX(total_amt) total_amt
       FROM t1
       GROUP BY 1)
    SELECT t1.rep_name, t1.region_name, t1.total_amt
    FROM t1
    JOIN t2
    ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;
    

    2.对于具有最高销售额 (total_amt_usd) 的区域,总共下了多少个订单 (total count orders) ?

    SELECT t1.region, t1.total_usd, t2.total_orders
      FROM (SELECT region,
                   MAX(total_usd) AS total_usd
              FROM (SELECT r.name region, 
                           SUM(o.total) total_orders,
                           SUM(o.total_amt_usd) total_usd
                      FROM orders o
                      JOIN accounts a
                        ON o.account_id = a.id
                      JOIN sales_reps s
                        ON a.sales_rep_id = s.id
                      JOIN region r
                        ON s.region_id = r.id
                  GROUP BY 1) inner1
            GROUP BY 1
            ORDER BY 2 DESC
              LIMIT 1) t1
    备注:以上代码先找出最大销售的区域, 此为一个区域
      JOIN (SELECT r.name region, 
                   SUM(o.total) total_orders,
                   SUM(o.total_amt_usd) total_usd
              FROM orders o
              JOIN accounts a
                ON o.account_id = a.id
              JOIN sales_reps s
                ON a.sales_rep_id = s.id
              JOIN region r
                ON s.region_id = r.id
            GROUP BY 1) t2
        ON t1.region = t2.region
    备注: t1只有一个值,因此在和t2进行匹配时,只能匹配一行结果, 也就是我们需要的
    
    

    方案二

    WITH t1 AS (
       SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
       FROM sales_reps s
       JOIN accounts a
       ON a.sales_rep_id = s.id
       JOIN orders o
       ON o.account_id = a.id
       JOIN region r
       ON r.id = s.region_id
       GROUP BY r.name), 
    t2 AS (
       SELECT MAX(total_amt)
       FROM t1)
    SELECT r.name, COUNT(o.total) total_orders
    FROM sales_reps s
    JOIN accounts a
    ON a.sales_rep_id = s.id
    JOIN orders o
    ON o.account_id = a.id
    JOIN region r
    ON r.id = s.region_id
    GROUP BY r.name
    HAVING SUM(o.total_amt_usd) = (SELECT * FROM t2)
    

    3.对于购买标准纸张数量 (standard_qty) 最多的客户(在作为客户的整个时期内),有多少客户的购买总数依然更多?

    SELECT a.name
      FROM orders o
      JOIN accounts a
        ON a.id = o.account_id
    GROUP BY 1 
    HAVING SUM(o.total) > (SELECT a.name
                             FROM (SELECT a.name,
                                          SUM(o.standard_qty) standard,
                                          SUM(o.total) total
                                     FROM orders o
                                     JOIN accounts a
                                       ON o.account_id = a.id
                                   GROUP BY 1
                                   ORDER BY 2 DESC
                                    LIMIT 1) sub)
    
    sub子查询代表的是标准纸张数量 (standard_qty) 最多的客户, 
    使其和所有客户的所有订单数进行比较, 也就是HAVING语句.
    
    

    方案二

    WITH t1 AS (
      SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
      FROM accounts a
      JOIN orders o
      ON o.account_id = a.id
      GROUP BY 1
      ORDER BY 2 DESC
      LIMIT 1), 
    t2 AS (
      SELECT a.name
      FROM orders o
      JOIN accounts a
      ON a.id = o.account_id
      GROUP BY 1
      HAVING SUM(o.total) > (SELECT total FROM t1))
    SELECT COUNT(*)
    FROM t2;
    

    4.对于(在作为客户的整个时期内)总消费 (total_amt_usd) 最多的客户,他们在每个渠道上有多少 web_events?

    SELECT a.name, w.channel, COUNT(*)
    FROM accounts a
    JOIN web_events w
    ON a.id = w.account_id AND a.id =  (SELECT id
                         FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
                               FROM orders o
                               JOIN accounts a
                               ON a.id = o.account_id
                               GROUP BY a.id, a.name
                               ORDER BY 3 DESC
                               LIMIT 1) inner_table)
    GROUP BY 1, 2
    ORDER BY 3 DESC;
    

    方案二

    WITH t1 AS (
       SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
       FROM orders o
       JOIN accounts a
       ON a.id = o.account_id
       GROUP BY a.id, a.name
       ORDER BY 3 DESC
       LIMIT 1)
    SELECT a.name, w.channel, COUNT(*)
    FROM accounts a
    JOIN web_events w
    ON a.id = w.account_id AND a.id =  (SELECT id FROM t1)
    GROUP BY 1, 2
    ORDER BY 3 DESC;
    

    5.对于总消费前十名的客户,他们的平均终身消费 (total_amt_usd) 是多少?

    SELECT AVG(tot_spent)
    FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
          FROM orders o
          JOIN accounts a
          ON a.id = o.account_id
          GROUP BY a.id, a.name
          ORDER BY 3 DESC
           LIMIT 10) temp;
    
    
    

    方案二

    WITH t1 AS (
       SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
       FROM orders o
       JOIN accounts a
       ON a.id = o.account_id
       GROUP BY a.id, a.name
       ORDER BY 3 DESC
       LIMIT 10)
    SELECT AVG(tot_spent)
    FROM t1;
    

    6.比所有客户的平均消费高的企业平均终身消费 (total_amt_usd) 是多少?

    SELECT AVG(avg_amt)
    FROM (SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
        FROM orders o
        GROUP BY 1
        HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
                                   FROM orders o
                                   JOIN accounts a
                                   ON a.id = o.account_id)) temp_table;
    
    
    

    方案二

    WITH t1 AS (
       SELECT AVG(o.total_amt_usd) avg_all
       FROM orders o
       JOIN accounts a
       ON a.id = o.account_id),
    t2 AS (
       SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
       FROM orders o
       GROUP BY 1
       HAVING AVG(o.total_amt_usd) > (SELECT * FROM t1))
    SELECT AVG(avg_amt)
    FROM t2;
    

    相关文章

      网友评论

        本文标题:SQL for Data Analysis - Lesson 4

        本文链接:https://www.haomeiwen.com/subject/wlyoiftx.html