Apache Parquet: How to Be a Hero

By Adoit | Published 2018-06-28 20:53

    (Original source by Thomas Spicer)

    Get All the Benefits of Apache Parquet File Format for Google Cloud, Amazon Athena and Redshift Spectrum

    You have read about Google Cloud (BigQuery, Dataproc…), Amazon Redshift Spectrum, and AWS Athena, and you are looking to take advantage of one or two of them. However, before you jump into the deep end, you will want to familiarize yourself with the opportunities of using the Apache Parquet file format instead of regular text, CSV, or TSV files. Parquet is a columnar storage format, which gives systems like Amazon Athena the ability to query information as columnar data rather than as a flat file like CSV.

    If you are not thinking about how to optimize for these new query service models, you could be throwing money out the window.

    What Is Apache Parquet?

    Apache Parquet format is a columnar storage format with the following characteristics:

    Apache Parquet is column-oriented and designed to provide efficient columnar storage of data compared to row-based files like CSV

    Apache Parquet is built from the ground up with complex nested data structures in mind

    Apache Parquet is built to support very efficient compression and encoding schemes

    Apache Parquet lowers storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, and Google Dataproc.

    Apache Parquet is a self-describing data format that embeds the schema, or structure, within the data itself. This results in a file that is optimized for query performance and minimal I/O. Parquet also supports very efficient compression and encoding schemes. The great thing is that it is an Apache Software Foundation project, available to any project under the Apache license.
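
    One nice consequence of the embedded schema is that you can inspect a Parquet file without any external metadata. Below is a minimal sketch of doing so in Python, assuming the pyarrow library; the file name "purchases.parquet" is hypothetical.

        import pyarrow.parquet as pq

        pf = pq.ParquetFile("purchases.parquet")

        # The schema travels inside the file itself; no external DDL or
        # header row is needed to interpret the data.
        print(pf.schema_arrow)

        # File-level metadata also records row counts, row groups, and the
        # compression codec used for each column chunk.
        meta = pf.metadata
        print(f"rows={meta.num_rows}, row_groups={meta.num_row_groups}")
        print("codec:", meta.row_group(0).column(0).compression)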

    Parquet and The Rise of Cloud Warehouses & Interactive Query Services

    The rise of interactive query services like AWS Athena and Amazon Redshift Spectrum makes it easy to analyze data in storage systems like Amazon S3 using standard SQL. Also, data warehouses like Google BigQuery and the Google Dataproc platform can leverage different formats for data ingest.
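
    As a concrete illustration, here is a minimal sketch of issuing one of those standard SQL queries against S3-resident data through Athena, assuming the boto3 library; the database, table, and bucket names are hypothetical.

        import boto3

        athena = boto3.client("athena", region_name="us-east-1")

        # Athena scans the S3 files backing "my_table" and bills by bytes
        # scanned, which is why the underlying file format matters.
        response = athena.start_query_execution(
            QueryString="SELECT purchase_amount FROM my_table LIMIT 10",
            QueryExecutionContext={"Database": "my_database"},
            ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
        )
        print("query id:", response["QueryExecutionId"])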

    However, the data format you select can have significant implications for performance and cost, especially if you are looking at machine learning, AI, or other complex operations. We will walk you through a few examples of those considerations.

    Parquet vs CSV

    CSV is simple and ubiquitous. Many tools like Excel, Google Sheets and a host of others can generate CSV files. You can even create them with your favorite text editing tool. We all love CSV files, but everything has a cost, even your love of CSV files, especially if CSV is your default format for data processing pipelines.

    AWS Athena and Amazon Redshift Spectrum charge you by the amount of data scanned per query. (Many other services also charge based on data queried, so this is not unique to AWS.)

    Google and Amazon charge you for the amount of data stored in Google Cloud Storage and Amazon S3.

    Google Dataproc charges are time-based

    Defaulting to CSV will have both technical and financial consequences (and not in a good way). You will learn to love Apache Parquet just as much as your trusty CSV.

    Example: A 1 TB CSV File

    The following demonstrates the efficiency and effectiveness of using a Parquet file vs CSV.

    By converting your CSV data to Parquet's columnar format, then compressing and partitioning it, you save money and reap the rewards of better performance. The numbers below compare the savings from converting data into Parquet versus keeping it as CSV.

    Think about this: if over the course of a year you stuck with uncompressed 1 TB CSV files as the foundation of your queries, the cost would be about $2,000 USD. Using Parquet files, your total cost would be about $3.65 USD. I know you love your CSV files, but do you love them THAT much?

    Also, if time is money, your analysts could be spending close to 5 minutes waiting for each query to complete simply because you use raw CSV. If you are paying someone $150 an hour and they run such a query once a day, then over a year they will spend about 30 hours simply waiting for queries to complete. That is roughly $4,500 in unproductive "wait" time. Total wait time for the Apache Parquet user? About 42 minutes, or $100.
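
    To make the conversion step concrete, here is a minimal sketch of turning a CSV file into compressed, partitioned Parquet, assuming pandas and pyarrow; the file names and the "event_date" column are hypothetical.

        import pandas as pd

        df = pd.read_csv("events.csv", parse_dates=["event_date"])

        # Derive a partition key so query engines can prune whole
        # directories instead of scanning every file.
        df["event_month"] = df["event_date"].dt.strftime("%Y-%m")

        # Write Snappy-compressed Parquet partitioned by month; an s3:// or
        # gs:// destination also works if the fsspec library is installed.
        df.to_parquet(
            "events_parquet/",
            engine="pyarrow",
            compression="snappy",
            partition_cols=["event_month"],
        )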

    Example 2: Parquet, CSV and Your Redshift Data Warehouse

    Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data in Amazon S3. This can be an effective strategy for teams that want to partition data so that some of it is resident within Redshift and other data is resident on S3. For example, let's assume you have about 4 TB of data in a historical_purchase table in Redshift. Since it is not accessed frequently, offloading it to S3 makes sense. This will free up space in Redshift while still giving your team access to the data via Spectrum. Now the big question becomes: what format should you store that 4 TB historical_purchase table in? CSV? How about Parquet?
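
    For illustration, here is a minimal sketch of exposing the offloaded Parquet data to Redshift Spectrum from Python, assuming psycopg2; the connection details, IAM role ARN, and column layout are all hypothetical.

        import psycopg2

        conn = psycopg2.connect(
            host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
            port=5439, dbname="analytics", user="admin", password="...",
        )
        conn.autocommit = True  # Redshift external DDL cannot run in a transaction
        cur = conn.cursor()

        # Register an external schema backed by the AWS Glue Data Catalog.
        cur.execute("""
            CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
            FROM DATA CATALOG DATABASE 'spectrum_db'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
            CREATE EXTERNAL DATABASE IF NOT EXISTS
        """)

        # Point an external table at the Parquet files in S3; queries against
        # it run in Spectrum and are billed by the bytes scanned.
        cur.execute("""
            CREATE EXTERNAL TABLE spectrum.historical_purchase (
                purchase_id BIGINT,
                customer_id BIGINT,
                purchase_amount DOUBLE PRECISION,
                purchased_at TIMESTAMP
            )
            STORED AS PARQUET
            LOCATION 's3://my-bucket/historical_purchase/'
        """)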

    Our historical_purchase table has four equally sized columns, stored in Amazon S3 in three formats: uncompressed CSV, GZIP-compressed CSV, and Parquet.

    Uncompressed CSV File

    The uncompressed CSV file has a total size of 4 TB. Running a query to get data from a single column of the table requires Redshift Spectrum to scan the entire 4 TB file. As a result, this query would cost $20.

    GZIP CSV File

    If you compress your CSV file using GZIP, the file size is reduced to 1 TB. Great savings! However, Redshift Spectrum still has to scan the entire file. The good news is that your CSV file is four times smaller than the uncompressed one, so you pay one-fourth of what you did before. This query would cost $5.

    Parquet File

    If you compress your file and convert it to Apache Parquet, you end up with 1 TB of data in S3. However, because Parquet is columnar, Redshift Spectrum can read only the column that is relevant for the query being run. It needs to scan just one-fourth of the data. This query would cost only $1.25.
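
    You can see the same column pruning at work locally with pyarrow; this is a minimal sketch, and the file and column names are hypothetical.

        import pyarrow.parquet as pq

        # Only the requested column's chunks are read from storage; the
        # other three columns are never touched, mirroring what Spectrum
        # scans (and bills) for a single-column query.
        table = pq.read_table(
            "historical_purchase.parquet",
            columns=["purchase_amount"],
        )
        print(table.num_rows, table.column_names)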

    If you are running this query once a day for a year, using uncompressed CSV files would cost $7,300. Even compressed CSV queries would cost over $1,800. However, using the Apache Parquet file format, it would cost about $460. Still in love with your CSV file?
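
    The arithmetic behind those figures is simple; here is a sketch, assuming the $5 per TB scanned that the $20-for-4-TB figure implies.

        PRICE_PER_TB = 5.00  # USD per TB scanned, implied by $20 for 4 TB

        scenarios = {
            "uncompressed CSV": 4.00,  # TB scanned per query
            "gzip CSV": 1.00,          # 4x smaller, but still fully scanned
            "Parquet": 0.25,           # one column out of four
        }

        # One query per day for a year.
        for name, tb_scanned in scenarios.items():
            per_query = tb_scanned * PRICE_PER_TB
            per_year = per_query * 365
            print(f"{name}: ${per_query:.2f}/query, ${per_year:.2f}/year")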

    Summary

    The trend toward "serverless", interactive query services and pre-built data processing suites is progressing rapidly, and it is providing new opportunities for teams to move faster with lower investments. Athena and Spectrum make it easy to analyze data in Amazon S3 using standard SQL. Also, Google supports loading Parquet files into BigQuery and Dataproc.

    When you only pay for the queries that you run, or for resources like CPU and storage, it is important to optimize the data those systems rely on.

    By the way, we have launched a zero-administration data processing framework for Amazon Redshift Spectrum and Amazon Athena, which includes automated database/table creation, Parquet file conversion, partitioning, and more. See the announcements for details:

    Amazon Redshift Spectrum Automated — 60 Second Setup, Zero Administration And Automatic…

    Announcing fully-managed support of zero administration Amazon Redshift Spectrum data pipeline service. (blog.openbridge.com)

    AWS Athena Automated — 60 Second Setup, Zero Administration And Automatic Optimization

    We are excited to announce the release of our zero administration AWS Athena data pipeline service. (blog.openbridge.com)

    Also, take a look at our post about AWS Redshift Spectrum and AWS Athena. Using Apache Parquet can benefit both!

    How is AWS Redshift Spectrum different than AWS Athena?

    This question has come up a few times, and most of the discussion is centered around the technical differences. Rather… (blog.openbridge.com)

    Did we miss anything? Do you have any questions about how to transform your CSV to Apache Parquet? If you want help streamlining your data to Google Cloud, AWS Athena, AWS Redshift Spectrum, or other data technologies, feel free to leave a comment or contact us at hello@openbridge.com. You can also visit us at https://www.openbridge.com to learn how we are helping other companies with their data efforts.
