Using DataX to Transfer Tables Between Hive and MySQL


By 代码小当家 | Published 2020-12-28 21:07


DataX

DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Features

As a data synchronization framework, DataX abstracts synchronization between different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target; in principle the framework can therefore synchronize between arbitrary data source types. The plugin system also acts as an ecosystem: every newly added data source immediately becomes interoperable with all the existing ones.
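Concretely, a DataX job is just a JSON file that pairs one Reader with one Writer. A minimal skeleton (the plugin names here are placeholders, not real plugins) looks like this:

{
    "job": {
        "setting": {
            "speed": { "channel": 1 }
        },
        "content": [
            {
                "reader": { "name": "someReader", "parameter": {} },
                "writer": { "name": "someWriter", "parameter": {} }
            }
        ]
    }
}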

DataX Detailed Introduction

See: DataX-Introduction

Quick Start

Download DataX

See: Quick Start
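If you use the prebuilt package, downloading and unpacking is all that is needed. A minimal sketch, assuming the tarball URL published in the DataX Quick Start (verify it against the current docs):

# Download and extract the prebuilt DataX package
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar -zxvf datax.tar.gz
cd datax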

Support Data Channels

DataX already has a fairly complete plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big-data compute systems are all covered. The full list of currently supported data sources is given in the Support Data Channels table of the DataX README (shown as a figure in the original post, not reproduced here).

Transferring Tables Between Hive and MySQL Using DataX

2.1 Importing a Hive Table into MySQL

1. Download DataX and upload it to the server.
2. Extract it and run the bundled self-check job:
python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
3. Once the self-check passes, create a your_job.json config from the template to fit your needs; hive2mysql serves as the example here.
(Note that reading a Hive table really means reading the files Hive stores on HDFS; a quick way to verify the table's location follows the config below.)
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/apps/hive/warehouse/test_table/*",
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "column": [
                               {
                                "index": 0,
                                "type": "string"
                               },
                               {
                                "index": 1,
                                "type": "long"
                               }
                        ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\u0001"
                    }

                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "username",
                        "password": "password",
                        "column": [
                            "word",
                            "cnt"
                        ],
                        "session": [
                            "set session sql_mode='ANSI'"
                        ],
                        "preSql": [
                            "delete from test_table"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.10.116:3306/test_datax?useUnicode=true&characterEncoding=gbk",
                                "table": [
                                    "test_table"
                                ]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
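Before running the job, it is worth confirming that the path in hdfsreader matches where Hive actually keeps the table's files (table name and paths as in the example config above):

# Ask Hive for the table's HDFS location
hive -e "desc formatted test_table" | grep -i location
# List the files that hdfsreader will actually read
hdfs dfs -ls /apps/hive/warehouse/test_table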
4. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/hive2mysql.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2018-08-29 23:19:46.674 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-29 23:19:46.691 [main] INFO  Engine - the machine info  => 

    osInfo: Oracle Corporation 1.8 25.161-b12
    jvmInfo:    Linux amd64 2.6.32-696.23.1.el6.x86_64
    cpu num:    1

    totalPhysicalMemory:    -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names    [Copy, MarkSweepCompact]

    MEMORY_NAME                    | allocation_size                | init_size                      
    Eden Space                     | 273.06MB                       | 273.06MB                       
    Code Cache                     | 240.00MB                       | 2.44MB                         
    Survivor Space                 | 34.13MB                        | 34.13MB                        
    Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
    Metaspace                      | -0.00MB                        | 0.00MB                         
    Tenured Gen                    | 682.69MB                       | 682.69MB                       

2018-08-29 23:19:46.732 [main] INFO  Engine - 
...
2018-08-29 23:20:00.489 [job-0] INFO  JobContainer - 
     [total cpu info] => 
        averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
        -1.00%                         | -1.00%                         | -1.00%

     [total gc info] => 
         NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
         Copy                 | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
         MarkSweepCompact     | 1                  | 1                  | 1                  | 0.071s             | 0.071s             | 0.071s             

2018-08-29 23:20:00.490 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-29 23:20:00.499 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 31267 bytes | Speed 3.05KB/s, 93 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s |  All Task WaitReaderTime 0.285s | Percentage 100.00%
2018-08-29 23:20:00.500 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2018-08-29 23:19:46
任务结束时刻                    : 2018-08-29 23:20:00
任务总计耗时                    :                 13s
任务平均流量                    :            3.05KB/s
记录写入速度                    :             93rec/s
读出记录总数                    :                 939
读写失败总数                    :                   0
**Important: data that Hive stores on HDFS is delimited with '\001' by default. Opening such a file in vim shows the fields separated by ^A, but in the job config the delimiter must be written as "fieldDelimiter": "\u0001"!**
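To see the delimiter for yourself, dump a line or two with non-printing characters made visible; cat -A renders \u0001 as ^A (paths as in the example config):

hdfs dfs -cat /apps/hive/warehouse/test_table/* | head -n 2 | cat -A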

2.2 Importing a MySQL Table into Hive

1. Create the table in Hive (stored as TEXTFILE):
hive> create table mysql_table(word string, cnt int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
OK
Time taken: 0.194 seconds
hive> select * from mysql_table limit 10;
OK
Time taken: 0.162 seconds
2. Create the mysql2hive.json job config:
{
    "job": {
        "setting": {
            "speed": {
                 "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "word",
                            "cnt"
                        ],
                        "splitPk": "cnt",
                        "connection": [
                            {
                                "table": [
                                    "test_table"
                                ],
                                "jdbcUrl": [
     "jdbc:mysql://192.168.10.116:3306/test_datax"
                                ]
                            }
                        ]
                    }
                },
               "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "fileType": "text",
                        "path": "/apps/hive/warehouse/mysql_table",
                        "fileName": "mysql_table",
                        "column": [
                            {
                                "name": "word",
                                "type": "string"
                            },
                            {
                                "name": "cnt",
                                "type": "int"
                            }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": ",",
                        "compress":"gzip"
                    }
                }
            }
        ]
    }
}
3. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/mysql2hive.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2018-08-30 01:03:36.399 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-30 01:03:36.415 [main] INFO  Engine - the machine info  => 

    osInfo: Oracle Corporation 1.8 25.161-b12
    jvmInfo:    Linux amd64 2.6.32-696.23.1.el6.x86_64
    cpu num:    1

    totalPhysicalMemory:    -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names    [Copy, MarkSweepCompact]

    MEMORY_NAME                    | allocation_size                | init_size                      
    Eden Space                     | 273.06MB                       | 273.06MB                       
    Code Cache                     | 240.00MB                       | 2.44MB                         
    Survivor Space                 | 34.13MB                        | 34.13MB                        
    Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
    Metaspace                      | -0.00MB                        | 0.00MB                         
    Tenured Gen                    | 682.69MB                       | 682.69MB                       

2018-08-30 01:03:36.469 [main] INFO  Engine - 
...
2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - 
     [total cpu info] => 
        averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
        -1.00%                         | -1.00%                         | -1.00%

     [total gc info] => 
         NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
         Copy                 | 1                  | 1                  | 1                  | 0.346s             | 0.346s             | 0.346s             
         MarkSweepCompact     | 1                  | 1                  | 1                  | 0.206s             | 0.206s             | 0.206s             

2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-30 01:03:52.096 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 19462 bytes | Speed 1.90KB/s, 93 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.975s |  All Task WaitReaderTime 0.022s | Percentage 100.00%
2018-08-30 01:03:52.101 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2018-08-30 01:03:36
任务结束时刻                    : 2018-08-30 01:03:52
任务总计耗时                    :                 15s
任务平均流量                    :            1.90KB/s
记录写入速度                    :             93rec/s
读出记录总数                    :                 939
读写失败总数                    :                   0
4. Check that the Hive table now contains the data:
hive> select * from mysql_table limit 10;
OK
at-spi-python-1.28.1    1
mesa-dri-drivers-9.2    1
m4-1.4.13-5.el6.x86_    1
libXxf86vm-1.1.2-2.e    1
libuuid-2.17.2-12.14    1
httpd-2.2.15-29.el6.    1
libgweather-2.28.0-5    1
lvm2-2.02.100-8.el6.    1
xmlrpc-c-1.16.24-121    1
parted-2.1-21.el6.x8    1
Time taken: 0.148 seconds, Fetched: 10 row(s)
hive> select * from mysql_table order by cnt desc limit 10;
Query ID = hdfs_20180830010430_68021b2b-9bd8-44a0-92df-dbe0c717053a
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.

Status: Running (Executing on YARN cluster with App id application_1535363634130_0005)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.63 s    
--------------------------------------------------------------------------------
OK
Installing  918
warning:    2
*** 2
tcp_wrappers-libs-7.    1
m4-1.4.13-5.el6.x86_    1
httpd-2.2.15-29.el6.    1
libgweather-2.28.0-5    1
libXxf86vm-1.1.2-2.e    1
xmlrpc-c-1.16.24-121    1
parted-2.1-21.el6.x8    1
Time taken: 33.713 seconds, Fetched: 10 row(s)
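Because the writer was configured with "compress": "gzip", the files under the table directory are gzip-compressed text, which Hive decompresses transparently at query time. They can also be inspected at the file level; hdfs dfs -text handles the decompression:

hdfs dfs -ls /apps/hive/warehouse/mysql_table
hdfs dfs -text /apps/hive/warehouse/mysql_table/* | head -n 3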

1. Errors you may encounter

2018-08-29 22:37:54.327 [job-0] ERROR JobContainer - Exception when job run com.alibaba.datax.common.
exception.DataXException: Code:[DBUtilErrorCode-01], Description:[获取表字段相关信息失败.].  -
获取表:test_table 的字段的元信息时失败. 请联系 DBA 核查该库、表信息. -
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'id' in 'field list'

Cause: the column names defined in "column": [ ] do not match the columns in the MySQL table.
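When this happens, compare the "column" list in the job file against the actual table definition in MySQL (credentials and names as in the example config):

mysql -u root -p -e "show columns from test_table;" test_datax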

经DataX智能分析,该任务最可能的错误原因是:com.alibaba.datax.common.exception.DataXException: 
Code:[UnstructuredStorageReader-03], Description:[您填写的参数值不合法.]. - 仅仅支持单字符切分, 
您配置的切分为 : ['01']

Cause: the field delimiter; see the important note about "fieldDelimiter": "\u0001" in section 2.1 above.
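One way this error arises is escaping the delimiter incorrectly in the JSON; only the \u form yields a single character:

"fieldDelimiter": "\\001"     <- wrong: parses to the four characters \001
"fieldDelimiter": "\u0001"    <- right: the single ^A character Hive uses by default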

2. MysqlWriter type conversion list for MySQL

[Figure: MysqlWriter-to-MySQL type conversion table; not reproduced here, see the DataX mysqlwriter docs]

3. MysqlReader type conversion list for MySQL

[Figure: MysqlReader-to-MySQL type conversion table; not reproduced here, see the DataX mysqlreader docs]

Please note:

  • Types other than those listed above are not supported.
  • DataX treats tinyint(1) as an integer type.
  • DataX treats year as a string type.
  • bit is undefined behavior in DataX.
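If a table does contain bit or year columns, one workaround is to let MySQL do the conversion by using mysqlreader's querySql parameter in place of table/column. A sketch against the example table (the flag bit column is hypothetical):

"reader": {
    "name": "mysqlreader",
    "parameter": {
        "username": "root",
        "password": "123456",
        "connection": [
            {
                "querySql": [
                    "select word, cast(flag as unsigned) as flag from test_table"
                ],
                "jdbcUrl": ["jdbc:mysql://192.168.10.116:3306/test_datax"]
            }
        ]
    }
}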

