Viewing Parquet Files from the Command Line

Author: AlienPaul | Published 2024-08-15 09:02

    Introduction

    Parquet files are usually read with Spark or Flink. For troubleshooting or learning purposes, however, writing a dedicated Spark/Flink program just to take a quick look at a Parquet file is cumbersome. This post introduces two command-line tools for reading and analyzing Parquet files. They are simple to use and require no program code.

    Using parquet-cli

    Project and download links

    Project: https://github.com/apache/parquet-java.git

    Download: https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.1/parquet-cli-1.14.1-runtime.jar

    Official usage and documentation: https://github.com/apache/parquet-java/tree/master/parquet-cli

    Usage

    Command format:

    hadoop jar parquet-cli-1.14.1-runtime.jar <command> <local-parquet-file-path>
    

    View the help:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar help
    
    Usage: parquet [options] [command] [command options]
    
      Options:
    
        -v, --verbose, --debug
            Print extra debugging information
    
      Commands:
    
        help
            Retrieves details on the functions of other commands
        meta
            Print a Parquet file's metadata
        pages
            Print page summaries for a Parquet file
        dictionary
            Print dictionaries for a Parquet column
        check-stats
            Check Parquet files for corrupt page and column stats (PARQUET-251)
        schema
            Print the Avro schema for a file
        csv-schema
            Build a schema from a CSV data sample
        convert-csv
            Create a file from CSV data
        convert
            Create a Parquet file from a data file
        to-avro
            Create an Avro file from a data file
        cat
            Print the first N records from a file
        head
            Print the first N records from a file
        column-index
            Prints the column and offset indexes of a Parquet file
        column-size
            Print the column sizes of a parquet file
        prune
            (Deprecated: will be removed in 2.0.0, use rewrite command instead) Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
        trans-compression
            (Deprecated: will be removed in 2.0.0, use rewrite command instead) Translate the compression from one to another (It doesn't support bloom filter feature yet).
        masking
            (Deprecated: will be removed in 2.0.0, use rewrite command instead) Replace columns with masked values and write to a new Parquet file
        footer
            Print the Parquet file footer in json format
        bloom-filter
            Check bloom filters for a Parquet column
        scan
            Scan all records from a file
        rewrite
            Rewrite one or more Parquet files to a new Parquet file
    
      Examples:
    
        # print information for meta
        parquet help meta
    
      See 'parquet help <command>' for more information on a specific command.
    
    
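To call these subcommands from a script instead of typing them by hand, a thin wrapper around the same `hadoop jar` invocation is enough. The sketch below is a hypothetical helper, not part of parquet-cli itself; it assumes `hadoop` is on the PATH and the jar sits in the working directory:

```python
import subprocess

# Assumption: the runtime jar downloaded above is in the current directory.
JAR = "parquet-cli-1.14.1-runtime.jar"

def build_cmd(command, *args, jar=JAR):
    """Build the `hadoop jar <jar> <command> <args...>` invocation shown above."""
    return ["hadoop", "jar", jar, command, *args]

def parquet_cli(command, *args):
    """Run a parquet-cli subcommand and return its stdout as text."""
    result = subprocess.run(build_cmd(command, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `parquet_cli("schema", "my.parquet")` returns the same Avro schema JSON that the `schema` subcommand prints.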

    Examples

    The examples below use a Parquet file from the storage layer of a Hudi table to demonstrate the parquet-cli tool.

    View the schema of a Parquet file:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar schema ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    {
      "type" : "record",
      "name" : "hudi_student_record",
      "namespace" : "hoodie.hudi_student",
      "fields" : [ {
        "name" : "_hoodie_commit_time",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_commit_seqno",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_record_key",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_partition_path",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_file_name",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "id",
        "type" : "int"
      }, {
        "name" : "name",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "tel",
        "type" : [ "null", "int" ],
        "default" : null
      } ]
    }
    
    

    View the data in a Parquet file:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar cat ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    {"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
    {"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
    {"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
    {"_hoodie_commit_time": "20240710084244349", "_hoodie_commit_seqno": "20240710084244349_0_7", "_hoodie_record_key": "2", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 2, "name": "Mary", "tel": 222222}
    {"_hoodie_commit_time": "20240710083659244", "_hoodie_commit_seqno": "20240710083659244_0_3", "_hoodie_record_key": "5", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 5, "name": "Tom", "tel": 666666}
    
    
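Because `cat` prints one JSON object per line, its output pipes cleanly into other tools. A minimal stdlib-only sketch (the field name `id` matches the records above; the summary it computes is just an illustration):

```python
import json
import sys

def summarize(lines):
    """Parse one-JSON-record-per-line output (as produced by `parquet cat`)
    and return the record count and the largest `id` value."""
    records = [json.loads(line) for line in lines if line.strip()]
    return len(records), max(r["id"] for r in records)

if __name__ == "__main__":
    count, max_id = summarize(sys.stdin)
    print(f"records={count} max_id={max_id}")
```

Saved as e.g. `summarize.py`, it can be used as `hadoop jar parquet-cli-1.14.1-runtime.jar cat file.parquet | python summarize.py`.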

    View the first 3 records of a Parquet file:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar head -n 3 ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    {"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
    {"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
    {"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
    
    

    View a Parquet file's metadata:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar meta ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    
    File path:  ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
    Properties:
      hoodie_bloom_filter_type_code: DYNAMIC_V0
        org.apache.hudi.bloomfilter: // value omitted (too long)
              hoodie_min_record_key: 1
                parquet.avro.schema: {"type":"record","name":"hudi_student_record","namespace":"hoodie.hudi_student","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"int"},{"name":"name","type":["null","string"],"default":null},{"name":"tel","type":["null","int"],"default":null}]}
                  writer.model.name: avro
              hoodie_max_record_key: 5
    Schema:
    message hoodie.hudi_student.hudi_student_record {
      optional binary _hoodie_commit_time (STRING);
      optional binary _hoodie_commit_seqno (STRING);
      optional binary _hoodie_record_key (STRING);
      optional binary _hoodie_partition_path (STRING);
      optional binary _hoodie_file_name (STRING);
      required int32 id;
      optional binary name (STRING);
      optional int32 tel;
    }
    
    
    Row group 0:  count: 5  152.20 B records  start: 4  total(compressed): 761 B total(uncompressed):702 B
    --------------------------------------------------------------------------------
                            type      encodings count     avg size   nulls   min / max
    _hoodie_commit_time     BINARY    G   _     5         19.60 B    0       "20240710083659244" / "20240710084413943"
    _hoodie_commit_seqno    BINARY    G   _     5         21.80 B    0       "20240710083659244_0_3" / "20240710084413943_0_11"
    _hoodie_record_key      BINARY    G   _     5         12.60 B    0       "1" / "5"
    _hoodie_partition_path  BINARY    G _ R     5         18.80 B    0       "" / ""
    _hoodie_file_name       BINARY    G _ R     5         31.20 B    0       "ba74ba57-d45c-43c7-9ddb-7..." / "ba74ba57-d45c-43c7-9ddb-7..."
    id                      INT32     G   _     5         11.40 B    0       "1" / "5"
    name                    BINARY    G   _     5         16.00 B    0       "Jessy" / "Tom"
    tel                     INT32     G _ R     5         20.80 B    0       "111111" / "666666"
        
    

    Using parquet-tools

    Download

    Download the jar file:

    wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.2/parquet-tools-1.11.2.jar
    

    Usage

    hadoop jar parquet-tools-1.11.2.jar <command> <parquet-file-path-on-HDFS>
    

    The commands work the same way as in the parquet-cli tool described above, so they are not repeated here.
