hive-testbench

Author: 你的努力时光不会辜负 | Published: 2022-03-09 16:40


GitHub: https://github.com/hortonworks/hive-testbench/

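TPC-DS: provides a fair and honest business and data model, with 99 queries.
TPC-H: a decision-support benchmark oriented toward commodity retail; it defines 8 tables and 22 queries.

wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
unzip hive14.zip
cd hive-testbench-hive14/
./tpcds-build.sh
./tpcds-setup.sh 1000   # generate a 1000 GB Hive table dataset
FORMAT=parquet ./tpcds-setup.sh 10   # generate 10 GB of Hive tables in Parquet format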

[root@ip-172-31-16-68 hive-testbench]# ./tpcds-setup.sh 10 /extwarehouse/tpcds

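Parameters:

10 is the amount of data to generate, in GB.

/extwarehouse/tpcds is the directory where the table data is generated. The directory is created automatically if it does not exist; if no data directory is given, the data is generated under /tmp/tpcds by default.

After the script finishes, check Hive.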


The data has been generated and loaded.

Test:

cd sample-queries-tpcds/

hive> use tpcds_bin_partitioned_orc_10;

hive> source query1.sql;

Check the query results.

————————————————
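Copyright notice: this is an original article by CSDN blogger 「无影风Victorz」, licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/victorzzzz/article/details/88741767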

If the download fails, the tools can be downloaded from https://public-repo-1.hortonworks.com/hive-testbench/tpcds/TPCDS_Tools.zip

If compilation fails, see: https://www.jianshu.com/p/6be3e51256f4


hive-testbench

A testbench for experimenting with Apache Hive at any data scale.

Overview

The hive-testbench is a data generator and set of queries that lets you experiment with Apache Hive at scale. The testbench allows you to experience base Hive performance on large datasets, and gives an easy way to see the impact of Hive tuning parameters and advanced settings.

Prerequisites

You will need:

Hadoop 2.2 or later cluster or Sandbox.
Apache Hive.
Between 15 minutes and 2 days to generate data (depending on the Scale Factor you choose and available hardware).
If you plan to generate 1TB or more of data, using Apache Hive 13+ to generate the data is STRONGLY suggested.

Install and Setup

All of these steps should be carried out on your Hadoop cluster.

Step 1: Prepare your environment.

In addition to Hadoop and Hive, before you begin ensure gcc is installed and available on your system path. If your system does not have it, install it using yum or apt-get.
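
The prerequisite check above can be scripted. The following is a minimal sketch, not part of hive-testbench: `check_tools` is a hypothetical helper name, and it only tests whether each named executable is on the PATH.

```shell
# Report whether each named tool is on PATH.
# Returns 0 if every tool is found, 1 if at least one is missing.
# check_tools is a hypothetical helper, not part of hive-testbench.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be `check_tools gcc hadoop hive`, assuming the Hadoop and Hive launchers are installed under those names.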
Step 2: Decide which test suite(s) you want to use.

hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can choose to use either or both of these benchmarks for experimentation. More information about these benchmarks can be found at the Transaction Processing Performance Council homepage.
Step 3: Compile and package the appropriate data generator.

For TPC-DS, ./tpcds-build.sh downloads, compiles and packages the TPC-DS data generator. For TPC-H, ./tpch-build.sh downloads, compiles and packages the TPC-H data generator.
Step 4: Decide how much data you want to generate.

You need to decide on a "Scale Factor" which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, scale 1000 (1 TB) of data is a good starting point. If you have a large cluster, you may want to choose Scale 10000 (10 TB) or more. The notion of scale factor is similar between TPC-DS and TPC-H.
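
As a rough illustration of the rule of thumb above (Scale Factor 1 is about 1 GB, so 1 TB is about Scale Factor 1000), the conversion can be written as a small helper. `scale_for_tb` is a hypothetical name, and real on-disk sizes will differ with file format and compression.

```shell
# Convert a target size in whole terabytes to an approximate scale factor,
# using the rule of thumb that scale factor 1000 is about 1 TB.
# scale_for_tb is a hypothetical helper, not part of hive-testbench.
scale_for_tb() {
    local tb="$1"
    case "$tb" in
        ''|*[!0-9]*)
            echo "usage: scale_for_tb <whole terabytes>" >&2
            return 1
            ;;
    esac
    echo $(( tb * 1000 ))
}
```

For example, `scale_for_tb 10` prints `10000`, matching the 10 TB figure mentioned above.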

If you want to generate a large amount of data, you should use Hive 13 or later. Hive 13 introduced an optimization that allows far more scalable data partitioning. Hive 12 and lower will likely crash if you generate more than a few hundred GB of data and tuning around the problem is difficult. You can generate text or RCFile data in Hive 13 and use it in multiple versions of Hive.
Step 5: Generate and load the data.

The scripts tpcds-setup.sh and tpch-setup.sh generate and load data for TPC-DS and TPC-H, respectively. General usage is tpcds-setup.sh scale_factor [directory] or tpch-setup.sh scale_factor [directory]

Some examples:

Build 1 TB of TPC-DS data: ./tpcds-setup.sh 1000

Build 1 TB of TPC-H data: ./tpch-setup.sh 1000

Build 100 TB of TPC-DS data: ./tpcds-setup.sh 100000

Build 30 TB of text formatted TPC-DS data: FORMAT=textfile ./tpcds-setup.sh 30000

Build 30 TB of RCFile formatted TPC-DS data: FORMAT=rcfile ./tpcds-setup.sh 30000

Also check the other parameters in the setup scripts; an important one is BUCKET_DATA.
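
To avoid typos when launching a long-running generation job, it can help to validate the arguments and print the command before running it. The sketch below only follows the general usage shown in this step (`tpcds-setup.sh scale_factor [directory]`); `setup_command` is a hypothetical helper that prints the command and does not execute it.

```shell
# Print the setup command for a benchmark without running it.
#   $1 = benchmark: tpcds or tpch
#   $2 = scale factor (digits only)
#   $3 = optional data directory
# setup_command is a hypothetical helper, not part of hive-testbench.
setup_command() {
    local bench="$1" scale="$2" dir="${3:-}"
    case "$bench" in
        tpcds|tpch) ;;
        *) echo "unknown benchmark: $bench" >&2; return 1 ;;
    esac
    case "$scale" in
        ''|*[!0-9]*) echo "scale factor must be a whole number" >&2; return 1 ;;
    esac
    if [ -n "$dir" ]; then
        echo "./${bench}-setup.sh $scale $dir"
    else
        echo "./${bench}-setup.sh $scale"
    fi
}
```

For example, `setup_command tpcds 1000` prints `./tpcds-setup.sh 1000`, which can then be reviewed and run from the hive-testbench directory.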
Step 6: Run queries.

More than 50 sample TPC-DS queries and all TPC-H queries are included for you to try. You can use hive, beeline or the SQL tool of your choice. The testbench also includes a set of suggested settings.

This example assumes you have generated 1 TB of TPC-DS data during Step 5:
```
 cd sample-queries-tpcds
 hive -i testbench.settings
 hive> use tpcds_bin_partitioned_orc_1000;
 hive> source query55.sql;
```
Note that the database is named based on the Scale Factor chosen in Step 4. At Scale Factor 10000, your TPC-DS database will be named tpcds_bin_partitioned_orc_10000. At Scale Factor 1000, the TPC-DS database is named tpcds_bin_partitioned_orc_1000 and the TPC-H database is named tpch_flat_orc_1000. You can always run show databases to get a list of available databases.
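
The naming pattern can be captured in a small helper. This sketch covers only the two names that appear in this document (the default ORC layout); `db_name` is a hypothetical function, and names for other formats are not assumed here.

```shell
# Derive the database name from the benchmark and scale factor,
# following the ORC names shown in this document.
# db_name is a hypothetical helper, not part of hive-testbench.
db_name() {
    local bench="$1" scale="$2"
    case "$bench" in
        tpcds) echo "tpcds_bin_partitioned_orc_${scale}" ;;
        tpch)  echo "tpch_flat_orc_${scale}" ;;
        *)     echo "unknown benchmark: $bench" >&2; return 1 ;;
    esac
}
```

For example, `db_name tpcds 1000` prints `tpcds_bin_partitioned_orc_1000`.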

Similarly, if you generated 1 TB of TPC-H data during Step 5:
```
 cd sample-queries-tpch
 hive -i testbench.settings
 hive> use tpch_flat_orc_1000;
 hive> source tpch_query1.sql;
```
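
Instead of sourcing queries one at a time, the interactive sessions above can be turned into a batch loop. The following is a minimal sketch under a few assumptions: it uses the Hive CLI options `-i`, `-f` and `--database`, it expects testbench.settings to sit next to the queries, and `run_queries` and `HIVE_BIN` are hypothetical names introduced here for illustration.

```shell
# Run every .sql file in a directory against one database and print the
# elapsed wall-clock seconds for each query.
#   $1 = directory holding the queries and testbench.settings
#   $2 = database name
# HIVE_BIN is a hypothetical override so the loop can be dry-run with
# another command; it defaults to the hive CLI.
run_queries() {
    local dir="$1" db="$2" hive_bin="${HIVE_BIN:-hive}"
    local q start end
    for q in "$dir"/*.sql; do
        [ -e "$q" ] || continue   # skip when the glob matched nothing
        start=$(date +%s)
        "$hive_bin" -i "$dir/testbench.settings" --database "$db" -f "$q"
        end=$(date +%s)
        echo "$(basename "$q"): $(( end - start ))s"
    done
}
```

A call such as `run_queries sample-queries-tpch tpch_flat_orc_1000` would run each TPC-H query in turn. Timings have one-second resolution, which is coarse for small scale factors.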

Feedback

If you have questions, comments or problems, visit the Hortonworks Hive forum.

If you have improvements, pull requests are accepted.
