Viewing Parquet Files from the Command Line

Author: AlienPaul | Published 2024-08-15 09:02

    Introduction

    Parquet files are usually read with Spark or Flink. For troubleshooting or learning purposes, however, writing a dedicated Spark/Flink program just to take a quick look at a Parquet file is cumbersome. This post introduces two command-line tools for reading and analyzing Parquet files. They are simple to use and require no program code.

    Using parquet-cli

    Project and download links

    Project: https://github.com/apache/parquet-java.git

    Download: https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.1/parquet-cli-1.14.1-runtime.jar

    Official usage and documentation: https://github.com/apache/parquet-java/tree/master/parquet-cli

    Usage

    Command format:

    hadoop jar parquet-cli-1.14.1-runtime.jar <command> <local-parquet-file-path>
    

    View the help:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar help
    
    Usage: parquet [options] [command] [command options]
    
      Options:
    
        -v, --verbose, --debug
            Print extra debugging information
    
      Commands:
    
        help
            Retrieves details on the functions of other commands
        meta
            Print a Parquet file's metadata
        pages
            Print page summaries for a Parquet file
        dictionary
            Print dictionaries for a Parquet column
        check-stats
            Check Parquet files for corrupt page and column stats (PARQUET-251)
        schema
            Print the Avro schema for a file
        csv-schema
            Build a schema from a CSV data sample
        convert-csv
            Create a file from CSV data
        convert
            Create a Parquet file from a data file
        to-avro
            Create an Avro file from a data file
        cat
            Print the first N records from a file
        head
            Print the first N records from a file
        column-index
            Prints the column and offset indexes of a Parquet file
        column-size
            Print the column sizes of a parquet file
        prune
            (Deprecated: will be removed in 2.0.0, use rewrite command instead) Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
        trans-compression
            (Deprecated: will be removed in 2.0.0, use rewrite command instead) Translate the compression from one to another (It doesn't support bloom filter feature yet).
        masking
            (Deprecated: will be removed in 2.0.0, use rewrite command instead) Replace columns with masked values and write to a new Parquet file
        footer
            Print the Parquet file footer in json format
        bloom-filter
            Check bloom filters for a Parquet column
        scan
            Scan all records from a file
        rewrite
            Rewrite one or more Parquet files to a new Parquet file
    
      Examples:
    
        # print information for meta
        parquet help meta
    
      See 'parquet help <command>' for more information on a specific command.
    
    
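To call these subcommands from a script instead of typing them by hand, a thin wrapper around the same `hadoop jar` invocation is enough. The sketch below is a hypothetical helper, not part of parquet-cli itself; it assumes `hadoop` is on the PATH and the jar sits in the working directory:

```python
import subprocess

# Assumption: the runtime jar downloaded above is in the current directory.
JAR = "parquet-cli-1.14.1-runtime.jar"

def build_cmd(command, *args, jar=JAR):
    """Build the `hadoop jar <jar> <command> <args...>` invocation shown above."""
    return ["hadoop", "jar", jar, command, *args]

def parquet_cli(command, *args):
    """Run a parquet-cli subcommand and return its stdout as text."""
    result = subprocess.run(build_cmd(command, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `parquet_cli("schema", "my.parquet")` returns the same Avro schema JSON that the `schema` subcommand prints.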

    Examples

    The examples below use a Parquet file from the storage layer of a Hudi table to demonstrate the parquet-cli tool.

    View the schema of a Parquet file:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar schema ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    {
      "type" : "record",
      "name" : "hudi_student_record",
      "namespace" : "hoodie.hudi_student",
      "fields" : [ {
        "name" : "_hoodie_commit_time",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_commit_seqno",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_record_key",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_partition_path",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "_hoodie_file_name",
        "type" : [ "null", "string" ],
        "doc" : "",
        "default" : null
      }, {
        "name" : "id",
        "type" : "int"
      }, {
        "name" : "name",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "tel",
        "type" : [ "null", "int" ],
        "default" : null
      } ]
    }
    
    

    View the data in a Parquet file:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar cat ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    {"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
    {"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
    {"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
    {"_hoodie_commit_time": "20240710084244349", "_hoodie_commit_seqno": "20240710084244349_0_7", "_hoodie_record_key": "2", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 2, "name": "Mary", "tel": 222222}
    {"_hoodie_commit_time": "20240710083659244", "_hoodie_commit_seqno": "20240710083659244_0_3", "_hoodie_record_key": "5", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 5, "name": "Tom", "tel": 666666}
    
    
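Because `cat` prints one JSON object per line, its output pipes cleanly into other tools. A minimal stdlib-only sketch (the field name `id` matches the records above; the summary it computes is just an illustration):

```python
import json
import sys

def summarize(lines):
    """Parse one-JSON-record-per-line output (as produced by `parquet cat`)
    and return the record count and the largest `id` value."""
    records = [json.loads(line) for line in lines if line.strip()]
    return len(records), max(r["id"] for r in records)

if __name__ == "__main__":
    count, max_id = summarize(sys.stdin)
    print(f"records={count} max_id={max_id}")
```

Saved as e.g. `summarize.py`, it can be used as `hadoop jar parquet-cli-1.14.1-runtime.jar cat file.parquet | python summarize.py`.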

    View the first 3 records of a Parquet file:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar head -n 3 ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    {"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
    {"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
    {"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
    
    

    View a Parquet file's metadata:

    [root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar meta ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    
    File path:  ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
    Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
    Properties:
      hoodie_bloom_filter_type_code: DYNAMIC_V0
        org.apache.hudi.bloomfilter: // value omitted (too long)
              hoodie_min_record_key: 1
                parquet.avro.schema: {"type":"record","name":"hudi_student_record","namespace":"hoodie.hudi_student","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"int"},{"name":"name","type":["null","string"],"default":null},{"name":"tel","type":["null","int"],"default":null}]}
                  writer.model.name: avro
              hoodie_max_record_key: 5
    Schema:
    message hoodie.hudi_student.hudi_student_record {
      optional binary _hoodie_commit_time (STRING);
      optional binary _hoodie_commit_seqno (STRING);
      optional binary _hoodie_record_key (STRING);
      optional binary _hoodie_partition_path (STRING);
      optional binary _hoodie_file_name (STRING);
      required int32 id;
      optional binary name (STRING);
      optional int32 tel;
    }
    
    
    Row group 0:  count: 5  152.20 B records  start: 4  total(compressed): 761 B total(uncompressed):702 B
    --------------------------------------------------------------------------------
                            type      encodings count     avg size   nulls   min / max
    _hoodie_commit_time     BINARY    G   _     5         19.60 B    0       "20240710083659244" / "20240710084413943"
    _hoodie_commit_seqno    BINARY    G   _     5         21.80 B    0       "20240710083659244_0_3" / "20240710084413943_0_11"
    _hoodie_record_key      BINARY    G   _     5         12.60 B    0       "1" / "5"
    _hoodie_partition_path  BINARY    G _ R     5         18.80 B    0       "" / ""
    _hoodie_file_name       BINARY    G _ R     5         31.20 B    0       "ba74ba57-d45c-43c7-9ddb-7..." / "ba74ba57-d45c-43c7-9ddb-7..."
    id                      INT32     G   _     5         11.40 B    0       "1" / "5"
    name                    BINARY    G   _     5         16.00 B    0       "Jessy" / "Tom"
    tel                     INT32     G _ R     5         20.80 B    0       "111111" / "666666"
        
    

    Using parquet-tools

    Download

    Download the jar file:

    wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.2/parquet-tools-1.11.2.jar
    

    Usage

    hadoop jar parquet-tools-1.11.2.jar <command> <parquet-file-path-on-HDFS>
    

    The commands work the same way as in the parquet-cli tool described above, so they are not repeated here.
