美文网首页
数仓--Hive-面试之Hive手写SQL案例

数仓--Hive-面试之Hive手写SQL案例

作者: 李小李的路 | 来源:发表于2019-05-10 23:33 被阅读0次

    既然是手写代码,那么就需要会手写,手写这个是重点

    1-请详细描述将一个有结构的文本文件student.txt导入到一个hive表中的步骤,及其关键字

    • 假设student.txt 有以下几列:id,name,gender三列
    • 1-创建数据库 create database student_info;
    • 2-创建hive表 student
    create external table student_info.student(
    id string comment '学生id',
    name string comment '学生姓名',
    gender string comment '学生性别'
    ) comment "学生信息表"
    row format delimited fields terminated by '\t'
    line terminated by '\n'
    stored as textfile
    location "/user/root/student";
    
    • 3-加载数据
    load data local inpath '/root/student.txt' into table student_info.student  location "/user/root/student" ;
    
    • 4- 进入hive-cli,查看相应的表结构
      select * from student_info.student limit 10;

    划重点:要回手写代码

    2-利用HQL实现以下功能

    2-1-创建表

    • 创建员工基本信息表(EmployeeInfo),字段包括(员工 ID,员工姓名,员工身份证号,性别,年龄,所属部门,岗位,入职公司时间,离职公司时间),分区字段为入职公司时间,其行分隔符为”\n “,字段分隔符为”\t “。其中所属部门包括行政部、财务部、研发部、教学部,其对应岗位包括行政经理、行政专员、财务经理、财务专员、研发工程师、测试工程师、实施工程师、讲师、助教、班主任等,时间类型值如:2018-05-10 11:00:00

    • 创建员工收入表(IncomeInfo),字段包括(员工 ID,员工姓名,收入金额,收入所属
      月份,收入类型,收入薪水的时间),分区字段为发放薪水的时间,其中收入类型包括薪资、奖金、公司福利、罚款四种情况 ; 时间类型值如:2018-05-10 11:00:00。

    注意:时间类型是2018-05-10 11:00:00,需要对字段进行处理

    • 创建员工基本信息表
    create external table test.employee_info(
    id string comment '员工id',
    name string comment '员工姓名',
    indentity_card string comment '身份证号',
    gender string comment '性别',
    department string comment '所属部门',
    post string comment '岗位',
    hire_date string comment '入职时间',
    departure_date string comment '离职时间'
    ) comment "员工基本信息表"
    partitioned by (day string comment "员工入职时间")
    row format delimited fields terminated by '\t'
    lines terminated by  '\n'
    stored as textfile 
    location '/user/root/employee';
    
    • 创建员工收入表
    create external table test.income_info(
    id string comment '员工id',
    name string comment '员工姓名',
    income_data string comment '收入',
    income_month string comment '收入所属月份',
    income_type string comment '收入类型',
    income_datetime string comment '收入薪水时间'
    ) comment '员工收入表'
    partitioned by (day string comment "员工发放薪水时间")
    row format delimited fields terminated by '\t'
    lines terminated by  '\n'
    stored as textfile 
    location '/user/root/income';
    

    2-2用 HQL 实现,求公司每年的员工费用总支出各是多少,并按年份降序排列?

    • 重点对时间类型 2018-05-10 11:00:00 进行内置函数处理
    • 需要读取income_info全量表,按照分区时间进行聚合,因为收入类型里面有罚款一项,所以需要在员工发放的钱中扣除罚款的钱。
    • 不采用join、对数据一次遍历输出结果,
    • 对于大数据量的情况下,要考虑对数据进行一次遍历求出结果
    select 
        income_year,(income_data-(nvl(penalty_data,0))) as company_cost
    from
    (
        -- 统计员工收入金额和罚款金额,输出 2019 500 10
        select 
            income_year,
            sum(case when income_type!='罚款' then data_total else 0 end) as income_data,
            sum(case when income_type='罚款' then data_total else 0 end) as penalty_data
        from
        (
        -- 按照年份、收入类型求收入金额
        select 
            year(to_date(income_datetime)) as income_year,
            income_type,
            sum(income_data) as data_total
        from
            test.income_info
        group by 
            year(to_date(income_datetime)) ,income_type
        ) tmp_a
        group by  tmp_a.income_year
    ) as  temp
    order by income_year desc;
    

    2-3用 HQL 实现,求各部门每年的员工费用总支出各是多少,并按年份降序,按部门的支出升序排列?

    • 保证对数据的一次遍历
    --根据id关联得出department,和消费类型
    select 
        income_year,department,
        (sum(case when income_type!='罚款' then income_data else 0 end) - sum(case when income_type='罚款' then income_data else 0 end) ) as department_cost
    from
    (
        -- 先对员工进行薪资类别的聚合统计
        select 
            id,year(to_date(income_datetime)) as income_year,income_type,sum(income_data) as income_data
        from 
            test.income_info
        group by 
        year(to_date(income_datetime)),id,income_type
    ) temp_a
    inner join
        test.employee_info b
    on
        temp_a.id=b.id
    group by
        department,income_year
    order by income_year desc , department_cost asc;
    

    2-4用 HQL 实现,求各部门历史所有员工费用总支出各是多少,按总支出多少排名降序,遇到值相等情况,不留空位。

    • 根据2-3中的中间结果进行修改
    • 注意历史上所有的数据
    select department,department_cost,dense_rank() over(order by department_cost desc) as cost_rank
    from
    (
    --根据id关联得出department,和消费类型
    select 
        department,
        (sum(case when income_type!='罚款' then income_data else 0 end) - sum(case when income_type='罚款' then income_data else 0 end) ) as department_cost
    from
    (
        -- 先对员工进行薪资类别的聚合统计
        select 
            id,income_type,sum(income_data) as income_data
        from 
            test.income_info
        group by 
        id,income_type
    ) temp_a
    inner join
        test.employee_info b
    on
        temp_a.id=b.id
    group by
        department
    ) tmp_c ;
    

    2-5 用 HQL 实现,创建并生成员工薪资收入动态变化表,即员工 ID,员工姓名,员工本月薪资,本月薪资发放时间,员工上月薪资,上月薪资发放时间。分区字段为本月薪资发放时间。

    • 感觉应该使用动态分区插入的特性?-但是不知道该怎么写
    • 先创建表,再采用insert into table **** select ***
    • 要考虑到离职和入职的员工,这一点需要考虑到,full join
    • 两张表进行full join,过滤day is null
    • 需要concat year month to_date内置函数处理
    • 这个题需要考虑的比较多
    create external table test.income_dynamic(
    id string comment '员工id',
    name string comment '员工姓名',
    income_data_current string comment '本月收入',
    income_datetime_current string comment 本月'收入薪水时间',
    income_data_last   string comment '上月收入',
    income_datetime_last string comment '上月收入薪水时间',
    ) comment '员工收入动态表'
    partitioned by (day string comment "员工本月发放薪水时间")
    row format delimited fields terminated by '\t'
    lines terminated by  '\n'
    stored as textfile 
    location '/user/root/income';
    -- ------------------------------------------------------------------------------
    -- 动态分区插入
    -- 插入语句
    -- 采用full join
    insert into table test.income_dynamic partition(day)
    select 
        (case when id_a is not null then id_a else id_b end ) as id,
        (case when name_a is not null then name_a else name_b end )  as name ,
        income_data,income_datetime,income_data_b,income_datetime_b,day
    from
        (
        -- 选出表中所有的数据
        select
            id as id_a,name as name_a,income_data,income_datetime,day,concat(year(to_date(day)),month(to_date(day))) as day_flag
        from 
            test.income_info
        where 
            income_type='薪资' ) tmp_a
    full outer join
        (
        -- 将表中的收到薪水的日期整体加一个月
        select
            id as id_b,name as name_b,income_data as income_data_b,income_datetime as  income_datetime_b,concat(year(add_months(to_date(day),1)),month(add_months(to_date(day),1))) as   month_flag
        from 
            test.income_info
        where 
            income_type='薪资'
        ) tmp_b
        on 
            tmp_a.day_flag=tmp_b.month_flag
        and 
            tmp_a.id_a=tmp_b.id_b
    where day is not null
    ;
    

    2-6 用 HQL 实现,薪资涨幅方面,2018 年 5 月份谁的工资涨的最多,谁的涨幅最大?

    • 再2-5的基础上做比较简单,仅仅利用select部分即可;或者是再2-5的基础上做就行

    Hive行列转换

    1、问题
    hive如何将
    a       b       1
    a       b       2
    a       b       3
    c       d       4
    c       d       5
    c       d       6
    变为:
    a       b       1,2,3
    c       d       4,5,6
    -------------------------------------------------------------------------------------------
    2、数据
    test.txt
    a       b       1 
    a       b       2 
    a       b       3 
    c       d       4 
    c       d       5 
    c       d       6
    ------------------------------------------------------------------------------------------- 
    3、答案
    1.建表
    drop table tmp_jiangzl_test;
    create table tmp_jiangzl_test
    (
    col1 string,
    col2 string,
    col3 string
    )
    row format delimited fields terminated by '\t'
    stored as textfile;
    -- 加载数据
    load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;
    2.处理
    select col1,col2,concat_ws(',',collect_set(col3)) 
    from tmp_jiangzl_test  
    group by col1,col2;
    ---------------------------------------------------------------------------------------
    collect_set/concat_ws语法参考链接:https://blog.csdn.net/waiwai3/article/details/79071544
    https://blog.csdn.net/yeweiouyang/article/details/41286469   [Hive]用concat_w实现将多行记录合并成一行
    ---------------------------------------------------------------------------------------
    二、列转行
    1、问题
    hive如何将
    a       b       1,2,3
    c       d       4,5,6
    变为:
    a       b       1
    a       b       2
    a       b       3
    c       d       4
    c       d       5
    c       d       6
    ---------------------------------------------------------------------------------------------
    2、答案
    1.建表
    
    drop table tmp_jiangzl_test;
    create table tmp_jiangzl_test
    (
    col1 string,
    col2 string,
    col3 string
    )
    row format delimited fields terminated by '\t'
    stored as textfile;
    处理:
    select col1, col2, col5
    from tmp_jiangzl_test a 
    lateral  view explode(split(col3,',')) b AS col5;
    ---------------------------------------------------------------------------------------
    lateral  view 语法参考链接:
    https://blog.csdn.net/clerk0324/article/details/58600284
    
    

    Hive实现wordcount

    1.创建数据库
    create database wordcount;
    2.创建外部表
    create external table word_data(line string) row format delimited fields terminated by ',' location '/home/hadoop/worddata';
    3.映射数据表
    load data inpath '/home/hadoop/worddata' into table word_data;
    4.这里假设我们的数据存放在hadoop下,路径为:/home/hadoop/worddata,里面主要是一些单词文件,内容大概为:
    hello man
    what are you doing now
    my running
    hello
    kevin
    hi man
    执行了上述hql就会创建一张表src_wordcount,内容是这些文件的每行数据,每行数据存在字段line中,select * from word_data;就可以看到这些数据
    
    5.根据MapReduce的规则,我们需要进行拆分,把每行数据拆分成单词,这里需要用到一个hive的内置表生成函数(UDTF):explode(array),参数是array,其实就是行变多列:
    
    create table words(word string);
    insert into table words select explode(split(line, " ")) as word from word_data;
    
    6.查看words表内容
    OK
    hello
    man
    what
    are
    you
    doing
    now
    my
    running
    hello
    kevin
    hi
    man
    split是拆分函数,跟java的split功能一样,这里是按照空格拆分,所以执行完hql语句,words表里面就全部保存的单个单词
    7.group by统计单词
        select word, count(*) from wordcount.words group by word;
    wordcount.words 库名称.表名称,group by word这个word是create table words(word string) 命令创建的word string
    
    结果:
    are     1
    doing   1
    hello   2
    hi      1
    kevin   1
    man     2
    my      1
    now     1
    running 1
    what    1
    you     1
    

    Hive取TopN

    • rank() over()
    • dense_rank() over()
    • row_number() over()

    求取指定状态下的订单id

    • 给一张订单表,统计只购买过面粉的用户;(重点在于仅仅购买过面粉的客户)
      eg:order:order_id,buyer_id,order_time.....

    在保证一次遍历的情况下,重点是O(1)复杂度

    select buyer_id
    from
    (
    select buyer_id,sum(case when order_id='面粉' then 0 else 1 end) as flag
    from order
    ) as tmp
    where flag=0;
    

    微博体系中互粉的有多少组

    • 在微博粉丝表中,互相关注的人有多少组,例如:A-->B;B-->A;A和B互粉,称为一组。
      表结构:id,keep_id,time.... (id,keep_id可作为联合主键)
    • 借助Hive进行实现
    select count(*)/2 as weibo_relation_number
    from
    (
      (select concat(id,keep_id) as flag from weibo_relation)
      union all  --全部合并到一起,不能提前去重
      (select concat(keep_id,id) as flag from weibo_relation)
    ) as tmp
    having count(flag) =2
    group by flag;
    

    购买了香蕉的人买了多少东西

    • 这个是一个很经典的问题,购买了香蕉的人买了多少东西
    • 数据还是延用上一个问题的数据和表结构,即理解为关注C的人总共关注了多少人
    • 仔细理解是需要对关注的人进行去重统计
    select count(distinct keep_id) as total_keep_id
    from weibo_relation
    where id
      in
    (select id from weibo_relation where keep_id='c')
    

    相关文章

      网友评论

          本文标题:数仓--Hive-面试之Hive手写SQL案例

          本文链接:https://www.haomeiwen.com/subject/rwdaoqtx.html