Common Parquet Operations

Author: YG_9013 | Published 2018-10-16 14:25

    1. Creating Parquet Tables

    1.1 Creating an internal table

    CREATE TABLE parquet_test (
     id int,
     str string,
     mp MAP<STRING,STRING>,
     lst ARRAY<STRING>,
     strct STRUCT<A:STRING,B:STRING>) 
    PARTITIONED BY (part string)
    STORED AS PARQUET;
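
    As a quick illustration, here is a minimal sketch of writing one row into this table. The values and the partition name 'p1' are made up; Hive builds the complex columns with map(), array(), and named_struct(), and recent Hive versions accept SELECT without a FROM clause:

    INSERT INTO TABLE parquet_test PARTITION (part = 'p1')
    SELECT 1,                                 -- id
           'hello',                           -- str
           map('k1', 'v1'),                   -- mp
           array('a', 'b'),                   -- lst
           named_struct('A', 'x', 'B', 'y');  -- strct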
    

    A second creation method can be found online:

    CREATE TABLE parquet_test (
     id int,
     str string,
     mp MAP<STRING,STRING>,
     lst ARRAY<STRING>,
     strct STRUCT<A:STRING,B:STRING>) 
    PARTITIONED BY (part string)
    ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
     STORED AS
     INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
     OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
    

    The first form is for Hive 0.13 and later; the second is for versions before 0.13. The first form is what is generally used today. See https://cwiki.apache.org/confluence/display/Hive/Parquet for details.

    1.2 Creating an external table

    CREATE EXTERNAL TABLE tmp.guo_parquet_test (age string, name string, `desc` string)
      STORED AS PARQUET
      LOCATION '/tmp/jonyguo/streaming_parquet_test';
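
    Once Parquet files exist under the LOCATION, the table can be queried in place. Note that desc collides with the DESC keyword, hence the backticks; a minimal sketch:

    -- Backticks keep `desc` from being parsed as the DESC keyword
    SELECT age, name, `desc`
    FROM tmp.guo_parquet_test
    LIMIT 10;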
    

    1.3 Specifying a compression codec

    CREATE TABLE parquet_test (
     id int,
     str string,
     mp MAP<STRING,STRING>,
     lst ARRAY<STRING>,
     strct STRUCT<A:STRING,B:STRING>) 
    PARTITIONED BY (part string)
    STORED AS PARQUET
    TBLPROPERTIES('parquet.compression'='SNAPPY');
    

    Notes:
    1) The two codecs commonly used here are SNAPPY and GZIP; GZIP does well on both storage size and query performance.
    2) For ORC the property is TBLPROPERTIES('orc.compress'='ZLIB'); for Parquet it is TBLPROPERTIES('parquet.compression'='SNAPPY'). A session-level alternative is sketched below.
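
    Depending on the Hive version, the codec can also be set per session instead of per table; a minimal sketch (parquet.compression is the same property used in the TBLPROPERTIES clause above):

    -- Session-level override: applies to subsequent writes into Parquet tables
    SET parquet.compression=GZIP;

    -- Inspect what the table itself declares (TBLPROPERTIES shows up here)
    DESCRIBE FORMATTED parquet_test;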

    2. Reading and Writing Parquet with Hadoop Streaming

    2.1 Hadoop Streaming limitations

    Hadoop Streaming has two relevant limitations:
    1) It reads and writes data in text format. Parquet files cannot be read directly; a converter has to translate between Parquet records and text lines.
    2) Its read/write APIs are all the old API (the mapred package). It cannot work with the new API (the mapreduce package).

    Examples of reading and writing Parquet with both the old and new MR APIs can be found at https://blog.csdn.net/woloqun/article/details/76068147.

    2.2 Reading and writing Parquet with Hadoop Streaming

    A community library makes it possible to read and write Parquet directly from Hadoop Streaming:
    https://github.com/whale2/iow-hadoop-streaming

    Example:

    hadoop jar /usr/local/hadoop-2.7.0/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar \
    -libjars parquet_test.jar,hadoop2-iow-lib.jar,/usr/local/spark-2.1.0-bin-hadoop2.7/jars/parquet-column-1.8.1.jar,/usr/local/spark-2.1.0-bin-hadoop2.7/jars/parquet-common-1.8.1.jar,/usr/local/spark-2.1.0-bin-hadoop2.7/jars/parquet-encoding-1.8.1.jar,/usr/local/spark-2.1.0-bin-hadoop2.7/jars/parquet-hadoop-1.8.1.jar,/usr/local/spark-2.1.0-bin-hadoop2.7/jars/parquet-format-2.3.0-incubating.jar \
    -D mapred.job.name="test_streaming" \
    -D iow.streaming.output.schema="message example {required binary age;required binary name;required binary desc;}" \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D parquet.compression=gzip \
    -D parquet.read.support.class=net.iponweb.hadoop.streaming.parquet.GroupReadSupport \
    -D parquet.write.support.class=net.iponweb.hadoop.streaming.parquet.GroupWriteSupport \
    -inputformat net.iponweb.hadoop.streaming.parquet.ParquetAsTextInputFormat \
    -outputformat net.iponweb.hadoop.streaming.parquet.ParquetAsTextOutputFormat \
    -input "/tmp/jonyguo/parquet_test" \
    -output "/tmp/jonyguo/streaming_parquet_test" \
    -mapper /bin/cat \
    -reducer /bin/cat
    

    Notes:

    1) If the output is Parquet, the output schema must be configured (iow.streaming.output.schema).
    2) The read and write support classes must be configured whenever Parquet is read or written.
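
    Since the external table from section 1.2 points at the same directory the streaming job writes to (/tmp/jonyguo/streaming_parquet_test), the output can be sanity-checked straight from Hive; a minimal sketch:

    -- Row count and a peek at the first few records of the streaming output
    SELECT COUNT(*) FROM tmp.guo_parquet_test;
    SELECT age, name, `desc` FROM tmp.guo_parquet_test LIMIT 5;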

    3. Other Ways to Read and Write Parquet

    Approaches for reading and writing Parquet from Spark and MapReduce are covered in https://blog.csdn.net/woloqun/article/details/76068147.
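
    Staying in SQL, Spark SQL can also query a Parquet directory in place, without any table definition (the path reuses the example output directory from section 2.2):

    -- Spark SQL: read Parquet files directly by path
    SELECT age, name, `desc`
    FROM parquet.`/tmp/jonyguo/streaming_parquet_test`;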
