美文网首页我是程序员大数据数据库知识点
Data Lake Analytics + OSS数据文件格式处

Data Lake Analytics + OSS数据文件格式处

作者: 阿里云云栖号 | 来源:发表于2018-11-23 14:42 被阅读26次

    0. 前言

    Data Lake Analytics是Serverless化的云上交互式查询分析服务。用户可以使用标准的SQL语句,对存储在OSS、TableStore上的数据无需移动,直接进行查询分析。

    目前该产品已经正式登陆阿里云,欢迎大家申请试用,体验更便捷的数据分析服务。
    请参考https://help.aliyun.com/document_detail/70386.html 进行产品开通服务申请。

    在上一篇教程中,我们介绍了如何分析CSV格式的TPC-H数据集。除了纯文本文件(例如,CSV,TSV等),用户存储在OSS上的其他格式的数据文件,也可以使用Data Lake Analytics进行查询分析,包括ORC, PARQUET, JSON, RCFILE, AVRO甚至ESRI规范的地理JSON数据,还可以用正则表达式匹配的文件等。

    本文详细介绍如何根据存储在OSS上的文件格式使用Data Lake Analytics (下文简称 DLA)进行分析。DLA内置了各种处理文件数据的SerDe(Serialize/Deserilize的简称,目的是用于序列化和反序列化)实现,用户无需自己编写程序,基本上能选用DLA中的一款或多款SerDe来匹配您OSS上的数据文件格式。如果还不能满足您特殊文件格式的处理需求,请联系我们,尽快为您实现。

    1. 存储格式与SerDe

    用户可以依据存储在OSS上的数据文件进行建表,通过STORED AS 指定数据文件的格式。
    例如,

    CREATE EXTERNAL TABLE nation (
        N_NATIONKEY INT, 
        N_NAME STRING, 
        N_REGIONKEY INT, 
        N_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';
    

    建表成功后可以使用SHOW CREATE TABLE语句查看原始建表语句。

    mysql> show create table nation;
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | CREATE EXTERNAL TABLE `nation`(
      `n_nationkey` int,
      `n_name` string,
      `n_regionkey` int,
      `n_comment` string)
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '|'
    STORED AS `TEXTFILE`
    LOCATION
      'oss://test-bucket-julian-1/tpch_100m/nation'|
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (1.81 sec)
    

    下表中列出了目前DLA已经支持的文件格式,当针对下列格式的文件建表时,可以直接使用STORED AS,DLA会选择合适的SERDE/INPUTFORMAT/OUTPUTFORMAT。

    在指定了STORED AS 的同时,还可以根据具体文件的特点,指定SerDe (用于解析数据文件并映射到DLA表),特殊的列分隔符等。
    后面的部分会做进一步的讲解。

    2. 示例

    2.1 CSV文件

    CSV文件,本质上还是纯文本文件,可以使用STORED AS TEXTFILE。
    列与列之间以逗号分隔,可以通过ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 表示。

    普通CSV文件

    例如,数据文件oss://bucket-for-testing/oss/text/cities/city.csv的内容为

    Beijing,China,010
    ShangHai,China,021
    Tianjin,China,022
    

    建表语句可以为

    CREATE EXTERNAL TABLE city (
        city STRING, 
        country STRING, 
        code INT
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE 
    LOCATION 'oss://bucket-for-testing/oss/text/cities';
    

    使用OpenCSVSerde__处理引号__引用的字段

    OpenCSVSerde在使用时需要注意以下几点:

    1. 用户可以为行的字段指定字段分隔符、字段内容引用符号和转义字符,例如:WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "" );
    2. 不支持字段内嵌入的行分割符;
    3. 所有字段定义STRING类型;
    4. 其他数据类型的处理,可以在SQL中使用函数进行转换。
      例如,
    CREATE EXTERNAL TABLE test_csv_opencsvserde (
      id STRING,
      name STRING,
      location STRING,
      create_date STRING,
      create_timestamp STRING,
      longitude STRING,
      latitude STRING
    ) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    with serdeproperties(
    "separatorChar"=",",
    "quoteChar"="\"",
    "escapeChar"="\\"
    )
    STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
    

    自定义分隔符

    需要自定义列分隔符(FIELDS TERMINATED BY),转义字符(ESCAPED BY),行结束符(LINES TERMINATED BY)。
    需要在建表语句中指定

    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        ESCAPED BY '\\'
        LINES TERMINATED BY '\n'
    

    忽略CSV文件中的HEADER

    在csv文件中,有时会带有HEADER信息,需要在数据读取时忽略掉这些内容。这时需要在建表语句中定义skip.header.line.count。

    例如,数据文件oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl的内容如下:

    N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
    0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
    1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
    2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
    3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
    4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
    5|ETHIOPIA|0|ven packages wake quickly. regu|
    

    相应的建表语句为:

    CREATE EXTERNAL TABLE nation_header (
        N_NATIONKEY INT, 
        N_NAME STRING, 
        N_REGIONKEY INT, 
        N_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'
    TBLPROPERTIES ("skip.header.line.count"="1");
    

    skip.header.line.count的取值x和数据文件的实际行数n有如下关系:

    • 当x<=0时,DLA在读取文件时,不会过滤掉任何信息,即全部读取;
    • 当0
    • 当x>=n时,DLA在读取文件时,会过滤掉所有的文件内容。

    2.2 TSV文件

    与CSV文件类似,TSV格式的文件也是纯文本文件,列与列之间的分隔符为Tab。

    例如,数据文件oss://bucket-for-testing/oss/text/cities/city.tsv的内容为

    Beijing    China    010
    ShangHai    China    021
    Tianjin    China    022
    

    建表语句可以为

    CREATE EXTERNAL TABLE city (
        city STRING, 
        country STRING, 
        code INT
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
    STORED AS TEXTFILE 
    LOCATION 'oss://bucket-for-testing/oss/text/cities';
    

    2.3 多字符数据字段分割符文件

    假设您的数据字段的分隔符包含多个字符,可采用如下示例建表语句,其中每行的数据字段分割符为“||”,可以替换为您具体的分割符字符串。

    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
    with serdeproperties(
    "field.delim"="||"
    )
    

    示例:

    CREATE EXTERNAL TABLE test_csv_multidelimit (
      id STRING,
      name STRING,
      location STRING,
      create_date STRING,
      create_timestamp STRING,
      longitude STRING,
      latitude STRING
    ) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
    with serdeproperties(
    "field.delim"="||"
    )
    STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities/';
    

    2.4 JSON文件

    DLA可以处理的JSON文件通常以纯文本的格式存储,在建表时除了要指定STORED AS TEXTFILE, 还要定义SERDE。
    在JSON文件中,每行必须是一个完整的JSON对象。
    例如,下面的文件格式是不被接受的

    {"id": 123, "name": "jack", 
    "c3": "2001-02-03 12:34:56"}
    {"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
    {"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
    {"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
    

    需要改写成:

    {"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"}
    {"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
    {"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
    {"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
    

    不含嵌套的JSON数据

    建表语句可以写

    CREATE EXTERNAL TABLE t1 (id int, name string, c3 timestamp)
    STORED AS JSON
    LOCATION 'oss://path/to/t1/directory';
    
    含有嵌套的JSON文件

    使用struct和array结构定义嵌套的JSON数据。
    例如,用户原始数据(注意:无论是否嵌套,一条完整的JSON数据都只能放在一行上,才能被Data Lake Analytics处理):

    {       "DocId": "Alibaba",         "User_1": {             "Id": 1234,             "Username": "bob1234",          "Name": "Bob",          "ShippingAddress": {                    "Address1": "969 Wenyi West St.",                     "Address2": null,                       "City": "Hangzhou",                      "Province": "Zhejiang"           },              "Orders": [{                            "ItemId": 6789,                                 "OrderDate": "11/11/2017"                       },                      {                               "ItemId": 4352,                                 "OrderDate": "12/12/2017"                       }               ]       } }
    

    使用在线JSON格式化工具格式化后,数据内容如下:

    {
        "DocId": "Alibaba", 
        "User_1": {
            "Id": 1234, 
            "Username": "bob1234", 
            "Name": "Bob", 
            "ShippingAddress": {
                "Address1": "969 Wenyi West St.", 
                "Address2": null, 
                "City": "Hangzhou", 
                "Province": "Zhejiang"
            }, 
            "Orders": [
                {
                    "ItemId": 6789, 
                    "OrderDate": "11/11/2017"
                }, 
                {
                    "ItemId": 4352, 
                    "OrderDate": "12/12/2017"
                }
            ]
        }
    }
    

    则建表语句可以写成如下(注意:LOCATION中指定的路径必须是JSON数据文件所在的目录,该目录下的所有JSON文件都能被识别为该表的数据):

    CREATE EXTERNAL TABLE json_table_1 (
        docid string,
        user_1 struct<
                id:INT,
                username:string,
                name:string,
                shippingaddress:struct<
                                address1:string,
                                address2:string,
                                city:string,
                                province:string
                                >,
                orders:array<
                        struct<
                            itemid:INT,
                            orderdate:string
                        >
                >
        >
    )
    STORED AS JSON
    LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';
    

    对该表进行查询:

    select * from json_table_1;
    
    +---------+----------------------------------------------------------------------------------------------------------------+
    | docid   | user_1                                                                                                         |
    +---------+----------------------------------------------------------------------------------------------------------------+
    | Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] |
    +---------+----------------------------------------------------------------------------------------------------------------+
    

    对于struct定义的嵌套结构,可以通过“.”进行层次对象引用,对于array定义的数组结构,可以通过“[数组下标]”(注意:数组下标从1开始)进行对象引用。

    select DocId,
           User_1.Id,
           User_1.ShippingAddress.Address1,
           User_1.Orders[1].ItemId
    from json_table_1
    where User_1.Username = 'bob1234'
      and User_1.Orders[2].OrderDate = '12/12/2017';
    
    +---------+------+--------------------+-------+
    | DocId   | id   | address1           | _col3 |
    +---------+------+--------------------+-------+
    | Alibaba | 1234 | 969 Wenyi West St. |  6789 |
    +---------+------+--------------------+-------+
    

    使用JSON函数处理数据

    例如,把“value_string”的嵌套JSON值作为字符串存储:

    {"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}
    

    使用在线JSON格式化工具格式化后,数据内容如下:

    {
        "data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com", 
        "ts": 1524550275112, 
        "value_string": "{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"
    }
    

    建表语句为

    CREATE external TABLE json_table_2 (
       data_key string,
       ts bigint,
       value_string string
    )
    STORED AS JSON
    LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';
    

    表建好后,可进行查询:

    select * from json_table_2;
    
    +---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | data_key                                          | ts            | value_string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
    +---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{"DEFAULT":{"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","张三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false}       |
    +---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    

    下面SQL示例json_parse,json_extract_scalar,json_extract等常用JSON函数的使用方式:

    mysql> select json_extract_scalar(json_parse(value), '$.owners[1]') from json_table_2;
    
    +--------+
    | _col0  |
    +--------+
    | 张三    |
    +--------+
    
    mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask') 
    from (
      select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2
    ) json_obj
    where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
    
    +-----------+
    | _col0     |
    +-----------+
    | 0.0.0.0/0 |
    +-----------+
    
    mysql> with json_obj as (select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2)
    select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
    from json_obj 
    where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
    
    +-----------+
    | _col0     |
    +-----------+
    | 0.0.0.0/0 |
    +-----------+
    

    2.5 ORC文件

    Optimized Row Columnar(ORC)是Apache开源项目Hive支持的一种优化的列存储文件格式。与CSV文件相比,不仅可以节省存储空间,还可以得到更好的查询性能。

    对于ORC文件,只需要在建表时指定 STORED AS ORC。
    例如,

    CREATE EXTERNAL TABLE orders_orc_date (
        O_ORDERKEY INT, 
        O_CUSTKEY INT, 
        O_ORDERSTATUS STRING, 
        O_TOTALPRICE DOUBLE, 
        O_ORDERDATE DATE, 
        O_ORDERPRIORITY STRING, 
        O_CLERK STRING, 
        O_SHIPPRIORITY INT, 
        O_COMMENT STRING
    ) 
    STORED AS ORC 
    LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';
    

    2.6 PARQUET文件

    Parquet是Apache开源项目Hadoop支持的一种列存储的文件格式。
    使用DLA建表时,需要指定STORED AS PARQUET即可。
    例如,

    CREATE EXTERNAL TABLE orders_parquet_date (
        O_ORDERKEY INT, 
        O_CUSTKEY INT, 
        O_ORDERSTATUS STRING, 
        O_TOTALPRICE DOUBLE, 
        O_ORDERDATE DATE, 
        O_ORDERPRIORITY STRING, 
        O_CLERK STRING, 
        O_SHIPPRIORITY INT, 
        O_COMMENT STRING
    ) 
    STORED AS PARQUET 
    LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';
    

    2.7 RCFILE文件

    Record Columnar File (RCFile), 列存储文件,可以有效地将关系型表结构存储在分布式系统中,并且可以被高效地读取和处理。
    DLA在建表时,需要指定STORED AS RCFILE。
    例如,

    CREATE EXTERNAL TABLE lineitem_rcfile_date (
        L_ORDERKEY INT, 
        L_PARTKEY INT, 
        L_SUPPKEY INT, 
        L_LINENUMBER INT, 
        L_QUANTITY DOUBLE, 
        L_EXTENDEDPRICE DOUBLE, 
        L_DISCOUNT DOUBLE, 
        L_TAX DOUBLE, 
        L_RETURNFLAG STRING, 
        L_LINESTATUS STRING, 
        L_SHIPDATE DATE, 
        L_COMMITDATE DATE, 
        L_RECEIPTDATE DATE, 
        L_SHIPINSTRUCT STRING, 
        L_SHIPMODE STRING, 
        L_COMMENT STRING
    ) 
    STORED AS RCFILE
    LOCATION 'oss://bucke-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile'
    

    2.8 AVRO文件

    DLA针对AVRO文件建表时,需要指定STORED AS AVRO,并且定义的字段需要符合AVRO文件的schema。

    如果不确定可以通过使用Avro提供的工具,获得schema,并根据schema建表。
    Apache Avro官网下载avro-tools-.jar到本地,执行下面的命令获得Avro文件的schema:

    java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro
    {
      "type" : "record",
      "name" : "doctors",
      "namespace" : "testing.hive.avro.serde",
      "fields" : [ {
        "name" : "number",
        "type" : "int",
        "doc" : "Order of playing the role"
      }, {
        "name" : "first_name",
        "type" : "string",
        "doc" : "first name of actor playing role"
      }, {
        "name" : "last_name",
        "type" : "string",
        "doc" : "last name of actor playing role"
      } ]
    }
    

    建表语句如下,其中fields中的name对应表中的列名,type需要参考本文档中的表格转成hive支持的类型

    CREATE EXTERNAL TABLE doctors(
    number int,
    first_name string,
    last_name string)
    STORED AS AVRO
    LOCATION 'oss://mybucket-for-testing/directory/to/doctors';
    

    大多数情况下,Avro的类型可以直接转换成Hive中对应的类型。如果该类型在Hive不支持,则会转换成接近的类型。具体请参照下表:

    2.9 可以用正则表达式匹配的文件

    通常此类型的文件是以纯文本格式存储在OSS上的,每一行代表表中的一条记录,并且每行可以用正则表达式匹配。
    例如,Apache WebServer日志文件就是这种类型的文件。

    某日志文件的内容为:

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
    127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"
    

    每行文件可以用下面的正则表达式表示,列之间使用空格分隔:

    ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?
    

    针对上面的文件格式,建表语句可以表示为:

    CREATE EXTERNAL TABLE serde_regex(
      host STRING,
      identity STRING,
      userName STRING,
      time STRING,
      request STRING,
      status STRING,
      size INT,
      referer STRING,
      agent STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
    )
    STORED AS TEXTFILE
    LOCATION 'oss://bucket-for-testing/datasets/serde/regex';
    

    查询结果

    mysql> select * from serde_regex;
    +-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
    | host      | identity | userName | time                         | request                                     | status | size | referer | agent                                                                                                                    |
    +-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
    | 127.0.0.1 | -        | frank | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0"               | 200    | 2326 | NULL    | NULL                                                                                                                     |
    | 127.0.0.1 | -        | -     | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200    | 5864 | -       | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" |
    +-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
    

    2.10 Esri ArcGIS的地理JSON数据文件

    DLA支持Esri ArcGIS的地理JSON数据文件的SerDe处理,关于这种地理JSON数据格式说明,可以参考:https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats

    示例:

    CREATE EXTERNAL TABLE IF NOT EXISTS california_counties
    (
        Name string,
        BoundaryShape binary
    )
    ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
    STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/'
    

    3. 总结

    通过以上例子可以看出,DLA可以支持大部分开源存储格式的文件。对于同一份数据,使用不同的存储格式,在OSS中存储文件的大小,DLA的查询分析速度上会有较大的差别。推荐使用ORC格式进行文件的存储和查询。

    为了获得更快的查询速度,DLA还在不断的优化中,后续也会支持更多的数据源,为用户带来更好的大数据分析体验。



    本文作者:金络

    阅读原文

    本文为云栖社区原创内容,未经允许不得转载。

    相关文章

      网友评论

        本文标题:Data Lake Analytics + OSS数据文件格式处

        本文链接:https://www.haomeiwen.com/subject/jacjqqtx.html