美文网首页Hadoop系我爱编程
Hadoop(二十六)HIVE的高级应用之用户行为归类

Hadoop(二十六)HIVE的高级应用之用户行为归类

作者: 文子轩 | 来源:发表于2018-06-21 22:44 被阅读15次

    一.HIVE的复合数据类型

    • collect_set 对字段进行表排序,对字段进行高级过滤
    • collect_list 数据可以重复
    • array_contains 对字段进行模糊匹配
    • sort_arrary 对数组进行排序
    • INPUT_FILE_NAME 位置中的文件名
    • BLOCK_OFFSET_INSIDE_FILE 在文件中位置

    二.广告用户归类管理

    • 数据表的设计


      图片.png

    黄色的是最初的两个表的数据click_log和ad_list表

    • 数据表结构及数据

      --========== click_log ==========--
      
      11    ad_101  2014-05-01 06:01:12.334+01
      22    ad_102  2014-05-01 07:28:12.342+01
      33    ad_103  2014-05-01 07:50:12.33+01
      11    ad_104  2014-05-01 09:27:12.33+01
      22    ad_103  2014-05-01 09:03:12.324+01
      33    ad_102  2014-05-02 19:10:12.343+01
      11    ad_101  2014-05-02 09:07:12.344+01
      35    ad_105  2014-05-03 11:07:12.339+01
      22    ad_104  2014-05-03 12:59:12.743+01
      77    ad_103  2014-05-03 18:04:12.355+01
      99    ad_102  2014-05-04 00:36:39.713+01
      33    ad_101  2014-05-04 19:10:12.343+01
      11    ad_101  2014-05-05 09:07:12.344+01
      35    ad_102  2014-05-05 11:07:12.339+01
      22    ad_103  2014-05-05 12:59:12.743+01
      77    ad_104  2014-05-05 18:04:12.355+01
      99    ad_105  2014-05-05 20:36:39.713+01
      

    创建表的语句和查询的语句 click_log

    CREATE EXTERNAL TABLE click_log (
        cookie_id  STRING
      , ad_id      STRING
      , ts         STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    

    ad_list的表

    ad_101  http://abcn.net/    catalog8|catalog1
    ad_102  http://www.abcn.net/    catalog6|catalog3
    ad_103  http://fxlive.de/   catalog7
    ad_104  http://fxlive.fr/   catalog5|catalog1|catalog4|catalog9
    ad_105  http://fxlive.eu/   atalog1|catalog4|catalog9
    ad_105  http://fxlive.eu/   
    

    创建表的语句

    CREATE EXTERNAL TABLE ad_list (
        ad_id    STRING
      , url      STRING
      , catalogs array<STRING>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY '|'
    

    查询的语句

    select click.cookie_id, click.ad_id, click.amount, ad_list_string.catalogs as orders
    from (
      select cookie_id, ad_id, count(1) as amount from click_log group by cookie_id, ad_id
    ) click
    join ad_list_string
    on (ad_list_string.ad_id = click.ad_id);
    
    
    select ad_id, catalog from ad_list LATERAL VIEW OUTER explode(catalogs) t AS catalog;
    select ad_id, collect_set(catalog) from ad_list LATERAL VIEW OUTER explode(catalogs) t AS catalog group by ad_id;
    
    
    select click.cookie_id, ad.catalog from click_log click
    left outer join (
      select ad_id, catalog from ad_list LATERAL VIEW OUTER explode(catalogs) t AS catalog
    ) ad
    on (click.ad_id = ad.ad_id);
    

    创建cookie_catalogs表是前面的两个表关联的作用

    create table cookie_cats as
    select click.cookie_id, ad.catalog, count(1) as weight from click_log click
    left outer join (
      select ad_id, catalog from ad_list LATERAL VIEW OUTER explode(catalogs) t AS catalog
    ) ad
    on (click.ad_id = ad.ad_id)
    group by click.cookie_id, ad.catalog
    order by cookie_id, weight desc;
    

    相关文章

      网友评论

        本文标题:Hadoop(二十六)HIVE的高级应用之用户行为归类

        本文链接:https://www.haomeiwen.com/subject/cfqhyftx.html