1. DataX Introduction (excerpted from github.com/alibaba/Dat…)
DataX
DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, Postgre, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.
Features
As a data synchronization framework, DataX abstracts synchronization into Reader plugins, which read data from a source, and Writer plugins, which write data to a target, so in principle the framework can synchronize between arbitrary data source types. The plugin system also forms an ecosystem: each newly integrated data source immediately becomes interoperable with every existing one.
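The Reader/Writer abstraction can be illustrated with a tiny sketch (illustrative only; real DataX plugins are Java classes, and the class names here are hypothetical):

```python
# Minimal sketch of the Reader/Writer plugin idea DataX is built on.
class Reader:
    def read(self):           # yield records from the source
        raise NotImplementedError

class Writer:
    def write(self, record):  # persist one record to the target
        raise NotImplementedError

class ListReader(Reader):
    """A toy 'source' backed by an in-memory list."""
    def __init__(self, rows):
        self.rows = rows
    def read(self):
        yield from self.rows

class ListWriter(Writer):
    """A toy 'target' that collects records in memory."""
    def __init__(self):
        self.sink = []
    def write(self, record):
        self.sink.append(record)

def run_job(reader, writer):
    # The framework moves records from any Reader to any Writer,
    # which is why adding one new plugin connects it to all others.
    for record in reader.read():
        writer.write(record)

w = ListWriter()
run_job(ListReader([("word", 1), ("cnt", 2)]), w)
print(w.sink)  # → [('word', 1), ('cnt', 2)]
```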
Detailed DataX introduction
See: DataX-Introduction
Quick Start
DataX download link
See: Quick Start
Support Data Channels
DataX already has a fairly complete plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big-data compute systems are all integrated. The currently supported data sources are shown in the figure below.
2. Using DataX to Transfer Tables Between Hive and MySQL
2.1 Importing a Hive Table into MySQL
1. Download DataX and upload it to the server.
2. After unpacking, run the self-check job:
python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
3. Once the self-check passes, create a your_job.json config file from the template to match your needs; here we use hive2mysql as the example.
(Note that reading a Hive table really means reading the files Hive stores on HDFS.)
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/apps/hive/warehouse/test_table/*",
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "column": [
                            {
                                "index": 0,
                                "type": "string"
                            },
                            {
                                "index": 1,
                                "type": "long"
                            }
                        ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\u0001"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "username",
                        "password": "password",
                        "column": [
                            "word",
                            "cnt"
                        ],
                        "session": [
                            "set session sql_mode='ANSI'"
                        ],
                        "preSql": [
                            "delete from test_table"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.10.116:3306/test_datax?useUnicode=true&characterEncoding=gbk",
                                "table": [
                                    "test_table"
                                ]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
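DataX only reads the job file at launch, so a malformed file or a missing reader/writer block fails late. A minimal pre-flight check in Python (the key names `job`/`content`/`reader`/`writer` are DataX's standard job schema; the inline job text here is a trimmed stand-in for the file above):

```python
import json

# Sanity-check the top-level structure of a DataX job file before running it.
job_text = """
{"job": {"setting": {"speed": {"channel": 3}},
         "content": [{"reader": {"name": "hdfsreader", "parameter": {}},
                      "writer": {"name": "mysqlwriter", "parameter": {}}}]}}
"""
job = json.loads(job_text)  # raises ValueError on malformed JSON
for item in job["job"]["content"]:
    # every content entry needs a named reader and writer plugin
    assert "name" in item["reader"] and "name" in item["writer"]
print("job file structure looks valid")
```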
4. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/hive2mysql.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2018-08-29 23:19:46.674 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-29 23:19:46.691 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.161-b12
jvmInfo: Linux amd64 2.6.32-696.23.1.el6.x86_64
cpu num: 1
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [Copy, MarkSweepCompact]
MEMORY_NAME | allocation_size | init_size
Eden Space | 273.06MB | 273.06MB
Code Cache | 240.00MB | 2.44MB
Survivor Space | 34.13MB | 34.13MB
Compressed Class Space | 1,024.00MB | 0.00MB
Metaspace | -0.00MB | 0.00MB
Tenured Gen | 682.69MB | 682.69MB
2018-08-29 23:19:46.732 [main] INFO Engine -
...
2018-08-29 23:20:00.489 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
Copy | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
MarkSweepCompact | 1 | 1 | 1 | 0.071s | 0.071s | 0.071s
2018-08-29 23:20:00.490 [job-0] INFO JobContainer - PerfTrace not enable!
2018-08-29 23:20:00.499 [job-0] INFO StandAloneJobContainerCommunicator - Total 939 records, 31267 bytes | Speed 3.05KB/s, 93 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | All Task WaitReaderTime 0.285s | Percentage 100.00%
2018-08-29 23:20:00.500 [job-0] INFO JobContainer -
任务启动时刻 : 2018-08-29 23:19:46
任务结束时刻 : 2018-08-29 23:20:00
任务总计耗时 : 13s
任务平均流量 : 3.05KB/s
记录写入速度 : 93rec/s
读出记录总数 : 939
读写失败总数 : 0
**Important: the data Hive stores on HDFS is delimited by '\001' by default. Opening such a file in vim shows the delimiter as ^A, but in the job config it must be written as "fieldDelimiter": "\u0001"!**
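This pitfall is easy to check directly: in JSON, \u0001 is an escape for the single control character U+0001 (what vim displays as ^A), whereas writing the literal text \001 produces a multi-character string that trips DataX's single-character delimiter check:

```python
import json

# "\u0001" in the job JSON is a single control character (vim shows it as ^A).
good = json.loads('{"fieldDelimiter": "\\u0001"}')
assert len(good["fieldDelimiter"]) == 1
assert "word\u0001939".split(good["fieldDelimiter"]) == ["word", "939"]

# Writing the literal text \001 instead yields a multi-character value,
# which triggers DataX's "only single-character delimiters" error.
bad = json.loads('{"fieldDelimiter": "\\\\001"}')
assert len(bad["fieldDelimiter"]) > 1
```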
2.2 Importing a MySQL Table into Hive
1. Create the table in Hive (stored as a text file):
hive> create table mysql_table(word string, cnt int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
OK
Time taken: 0.194 seconds
hive> select * from mysql_table limit 10;
OK
Time taken: 0.162 seconds
2. Create the job configuration file (here mysql2hive.json):
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "word",
                            "cnt"
                        ],
                        "splitPk": "cnt",
                        "connection": [
                            {
                                "table": [
                                    "test_table"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://192.168.10.116:3306/test_datax"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "fileType": "text",
                        "path": "/apps/hive/warehouse/mysql_table",
                        "fileName": "mysql_table",
                        "column": [
                            {
                                "name": "word",
                                "type": "string"
                            },
                            {
                                "name": "cnt",
                                "type": "int"
                            }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": ",",
                        "compress": "gzip"
                    }
                }
            }
        ]
    }
}
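The splitPk setting tells mysqlreader to split the table into per-channel ranges on a numeric key (here cnt), so multiple channels can read in parallel. The splitting idea can be sketched roughly as follows; this is an illustrative assumption about the strategy, not DataX's actual Java implementation:

```python
# Rough sketch of splitting a numeric primary-key range across channels.
def split_ranges(min_pk, max_pk, channels):
    # ceil-divide the key range so every channel gets a contiguous slice
    step = (max_pk - min_pk + channels) // channels
    ranges = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        ranges.append((lo, hi))  # each tuple becomes one WHERE pk BETWEEN lo AND hi
        lo = hi + 1
    return ranges

print(split_ranges(1, 918, 3))  # → [(1, 306), (307, 612), (613, 918)]
```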
3. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/mysql2hive.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2018-08-30 01:03:36.399 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-30 01:03:36.415 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.161-b12
jvmInfo: Linux amd64 2.6.32-696.23.1.el6.x86_64
cpu num: 1
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [Copy, MarkSweepCompact]
MEMORY_NAME | allocation_size | init_size
Eden Space | 273.06MB | 273.06MB
Code Cache | 240.00MB | 2.44MB
Survivor Space | 34.13MB | 34.13MB
Compressed Class Space | 1,024.00MB | 0.00MB
Metaspace | -0.00MB | 0.00MB
Tenured Gen | 682.69MB | 682.69MB
2018-08-30 01:03:36.469 [main] INFO Engine -
...
2018-08-30 01:03:52.094 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
Copy | 1 | 1 | 1 | 0.346s | 0.346s | 0.346s
MarkSweepCompact | 1 | 1 | 1 | 0.206s | 0.206s | 0.206s
2018-08-30 01:03:52.094 [job-0] INFO JobContainer - PerfTrace not enable!
2018-08-30 01:03:52.096 [job-0] INFO StandAloneJobContainerCommunicator - Total 939 records, 19462 bytes | Speed 1.90KB/s, 93 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.975s | All Task WaitReaderTime 0.022s | Percentage 100.00%
2018-08-30 01:03:52.101 [job-0] INFO JobContainer -
任务启动时刻 : 2018-08-30 01:03:36
任务结束时刻 : 2018-08-30 01:03:52
任务总计耗时 : 15s
任务平均流量 : 1.90KB/s
记录写入速度 : 93rec/s
读出记录总数 : 939
读写失败总数 : 0
4. Check that the Hive table now contains data:
hive> select * from mysql_table limit 10;
OK
at-spi-python-1.28.1 1
mesa-dri-drivers-9.2 1
m4-1.4.13-5.el6.x86_ 1
libXxf86vm-1.1.2-2.e 1
libuuid-2.17.2-12.14 1
httpd-2.2.15-29.el6. 1
libgweather-2.28.0-5 1
lvm2-2.02.100-8.el6. 1
xmlrpc-c-1.16.24-121 1
parted-2.1-21.el6.x8 1
Time taken: 0.148 seconds, Fetched: 10 row(s)
hive> select * from mysql_table order by cnt desc limit 10;
Query ID = hdfs_20180830010430_68021b2b-9bd8-44a0-92df-dbe0c717053a
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1535363634130_0005)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 17.63 s
--------------------------------------------------------------------------------
OK
Installing 918
warning: 2
*** 2
tcp_wrappers-libs-7. 1
m4-1.4.13-5.el6.x86_ 1
httpd-2.2.15-29.el6. 1
libgweather-2.28.0-5 1
libXxf86vm-1.1.2-2.e 1
xmlrpc-c-1.16.24-121 1
parted-2.1-21.el6.x8 1
Time taken: 33.713 seconds, Fetched: 10 row(s)
1. Errors you may encounter
2018-08-29 22:37:54.327 [job-0] ERROR JobContainer - Exception when job run com.alibaba.datax.common.exception.DataXException: Code:[DBUtilErrorCode-01], Description:[获取表字段相关信息失败.]. - 获取表:test_table 的字段的元信息时失败. 请联系 DBA 核查该库、表信息. - com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'id' in 'field list'
Cause: a column name listed in "column": [ ] does not match the MySQL table.
经DataX智能分析,该任务最可能的错误原因是:com.alibaba.datax.common.exception.DataXException: Code:[UnstructuredStorageReader-03], Description:[您填写的参数值不合法.]. - 仅仅支持单字符切分, 您配置的切分为 : ['01']
Cause: a delimiter problem; see the important note above about "fieldDelimiter": "\u0001".
2. MysqlWriter type conversion table for MySQL
3. MysqlReader type conversion table for MySQL
Note:
- Field types other than those listed above are not supported.
- tinyint(1) is treated by DataX as an integer.
- year is treated by DataX as a string.
- bit is undefined behavior in DataX.
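As a rough aid, the MysqlReader conversions can be sketched as a lookup table. This is an illustrative, partial mapping distilled from the notes above and the plugin docs, not the authoritative full table:

```python
# Illustrative (partial) MysqlReader mapping from MySQL column types
# to DataX internal types; treat entries as assumptions, not a spec.
MYSQL_TO_DATAX = {
    "int": "Long", "tinyint": "Long", "smallint": "Long", "bigint": "Long",
    "float": "Double", "double": "Double", "decimal": "Double",
    "char": "String", "varchar": "String", "text": "String", "year": "String",
    "date": "Date", "datetime": "Date", "timestamp": "Date",
}

assert MYSQL_TO_DATAX["year"] == "String"   # matches the note above
assert MYSQL_TO_DATAX["tinyint"] == "Long"  # tinyint(1) is read as an integer
```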