General-purpose DataX scripts for full-table data synchronization

Author: Yobhel | Published 2023-11-02 09:05

1 Data channel

DataX synchronizes full-table data directly from the MySQL business database to HDFS. The data flow is shown in the figure below.

[Figure: data flow from the MySQL business database through DataX to HDFS]

2 DataX configuration file
Each full table needs its own DataX JSON configuration file. Taking base_province as an example, the configuration file looks like this:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop101:3306/edu2077"
                                ],
                                "table": [
                                    "base_province"
                                ]
                            }
                        ],
                        "password": "000000",
                        "splitPk": "",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop101:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "${targetdir}",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

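Note: the target path contains a date level that separates data from different days, so the path parameter is not hard-coded. It must be passed in dynamically when the job is submitted, through a parameter named targetdir.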
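
To make the targetdir parameter concrete, here is a minimal Python 3 sketch that assembles the DataX command line for one table and one day. The helper name build_datax_command and its default arguments are illustrative assumptions; the install path, the path layout and the -p"-Dtargetdir=..." form are the ones used elsewhere in this article.

```python
import datetime


def build_datax_command(config_path, table, day,
                        datax_home="/opt/module/datax",
                        base_dir="/origin_data/edu/db"):
    """Return the argument list for one DataX run that fills ${targetdir}.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    # One directory per table and per day, e.g. .../base_province_full/2022-02-21
    target_dir = "{}/{}_full/{}".format(base_dir, table, day.isoformat())
    return [
        "python",
        datax_home + "/bin/datax.py",
        # Same single argument the shell produces from -p"-Dtargetdir=..."
        "-p-Dtargetdir=" + target_dir,
        config_path,
    ]


cmd = build_datax_command(
    "/opt/module/datax/job/import/edu2077.base_province.json",
    "base_province",
    datetime.date(2022, 2, 21),
)
print(" ".join(cmd))
```

The list form can be handed to subprocess.call as is, which avoids shell quoting problems if a path ever contains spaces.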

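3 DataX configuration file generation script

For convenience, a script that generates the DataX configuration files in batch is provided here. Its content and usage are as follows.
1) Create the gen_import_config.py script in the ~/bin directory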

[yobhel@hadoop101 bin]$ vim ~/bin/gen_import_config.py

The script content is as follows

# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb

# MySQL settings; adjust to your environment
mysql_host = "hadoop101"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "000000"

# HDFS NameNode settings; adjust to your environment
hdfs_nn_host = "hadoop101"
hdfs_nn_port = "8020"

# Target directory for the generated configuration files; adjust as needed
output_path = "/opt/module/datax/job/import"


def get_connection():
    return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)


def get_mysql_meta(database, table):
    connection = get_connection()
    cursor = connection.cursor()
    sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
    cursor.execute(sql, [database, table])
    fetchall = cursor.fetchall()
    cursor.close()
    connection.close()
    return fetchall


def get_mysql_columns(database, table):
    # Return a real list (not a lazy map object) so json.dump also works on Python 3
    return [row[0] for row in get_mysql_meta(database, table)]


def get_hive_columns(database, table):
    def type_mapping(mysql_type):
        mappings = {
            "bigint": "bigint",
            "int": "bigint",
            "smallint": "bigint",
            "tinyint": "bigint",
            "decimal": "string",
            "double": "double",
            "float": "float",
            "binary": "string",
            "char": "string",
            "varchar": "string",
            "datetime": "string",
            "time": "string",
            "timestamp": "string",
            "date": "string",
            "text": "string"
        }
        return mappings[mysql_type]

    meta = get_mysql_meta(database, table)
    # Return a real list (not a lazy map object) so json.dump also works on Python 3
    return [{"name": row[0], "type": type_mapping(row[1].lower())} for row in meta]


def generate_json(source_database, source_table):
    job = {
        "job": {
            "setting": {
                "speed": {
                    "channel": 3
                },
                "errorLimit": {
                    "record": 0,
                    "percentage": 0.02
                }
            },
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": mysql_user,
                        "password": mysql_passwd,
                        "column": get_mysql_columns(source_database, source_table),
                        "splitPk": "",
                        "connection": [{
                            "table": [source_table],
                            "jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database]
                        }]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
                        "fileType": "text",
                        "path": "${targetdir}",
                        "fileName": source_table,
                        "column": get_hive_columns(source_database, source_table),
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "gzip"
                    }
                }
            }]
        }
    }
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    with open(os.path.join(output_path, ".".join([source_database, source_table, "json"])), "w") as f:
        json.dump(job, f)


def main(args):
    source_database = ""
    source_table = ""

    # Short options -d and -t each take a value
    options, arguments = getopt.getopt(args, 'd:t:', ['sourcedb=', 'sourcetbl='])
    for opt_name, opt_value in options:
        if opt_name in ('-d', '--sourcedb'):
            source_database = opt_value
        if opt_name in ('-t', '--sourcetbl'):
            source_table = opt_value

    generate_json(source_database, source_table)


if __name__ == '__main__':
    main(sys.argv[1:])

(1) Install the Python MySQL driver
The script accesses the MySQL database from Python, so a driver must be installed with the following command:

[yobhel@hadoop101 bin]$ sudo yum install MySQL-python

(2) Script usage

python gen_import_config.py -d database -t table

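Pass the database name with -d and the table name with -t. Running the command above generates the DataX synchronization configuration file for that table.
2) Create the gen_import_config.sh script in the ~/bin directory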

[yobhel@hadoop101 bin]$ vim ~/bin/gen_import_config.sh

The script content is as follows

#!/bin/bash

python ~/bin/gen_import_config.py -d edu2077 -t base_category_info
python ~/bin/gen_import_config.py -d edu2077 -t base_source
python ~/bin/gen_import_config.py -d edu2077 -t base_province
python ~/bin/gen_import_config.py -d edu2077 -t base_subject_info
python ~/bin/gen_import_config.py -d edu2077 -t cart_info
python ~/bin/gen_import_config.py -d edu2077 -t chapter_info
python ~/bin/gen_import_config.py -d edu2077 -t course_info
python ~/bin/gen_import_config.py -d edu2077 -t knowledge_point
python ~/bin/gen_import_config.py -d edu2077 -t test_paper
python ~/bin/gen_import_config.py -d edu2077 -t test_paper_question
python ~/bin/gen_import_config.py -d edu2077 -t test_point_question
python ~/bin/gen_import_config.py -d edu2077 -t test_question_info
python ~/bin/gen_import_config.py -d edu2077 -t user_chapter_process
python ~/bin/gen_import_config.py -d edu2077 -t test_question_option
python ~/bin/gen_import_config.py -d edu2077 -t video_info

3) Grant execute permission to the gen_import_config.sh script

[yobhel@hadoop101 bin]$ chmod +x ~/bin/gen_import_config.sh

4) Run the gen_import_config.sh script to generate the configuration files

[yobhel@hadoop101 bin]$ gen_import_config.sh

5) Inspect the generated configuration files

[yobhel@hadoop101 bin]$ ll /opt/module/datax/job/import/
total 60
-rw-rw-r--. 1 yobhel yobhel  845 Mar  2 21:06 edu2077.base_category_info.json
-rw-rw-r--. 1 yobhel yobhel  867 Mar  2 21:06 edu2077.base_province.json
-rw-rw-r--. 1 yobhel yobhel  717 Mar  2 21:06 edu2077.base_source.json
-rw-rw-r--. 1 yobhel yobhel  899 Mar  2 21:06 edu2077.base_subject_info.json
-rw-rw-r--. 1 yobhel yobhel 1133 Mar  2 21:06 edu2077.cart_info.json
-rw-rw-r--. 1 yobhel yobhel 1047 Mar  2 21:06 edu2077.chapter_info.json
-rw-rw-r--. 1 yobhel yobhel 1431 Mar  2 21:06 edu2077.course_info.json
-rw-rw-r--. 1 yobhel yobhel 1059 Mar  2 21:06 edu2077.knowledge_point.json
-rw-rw-r--. 1 yobhel yobhel  939 Mar  2 21:06 edu2077.test_paper.json
-rw-rw-r--. 1 yobhel yobhel  943 Mar  2 21:06 edu2077.test_paper_question.json
-rw-rw-r--. 1 yobhel yobhel  897 Mar  2 21:06 edu2077.test_point_question.json
-rw-rw-r--. 1 yobhel yobhel 1075 Mar  2 21:06 edu2077.test_question_info.json
-rw-rw-r--. 1 yobhel yobhel  957 Mar  2 21:06 edu2077.test_question_option.json
-rw-rw-r--. 1 yobhel yobhel 1007 Mar  2 21:06 edu2077.user_chapter_process.json
-rw-rw-r--. 1 yobhel yobhel 1341 Mar  2 21:06 edu2077.video_info.json
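
Beyond listing the files, the generated configurations can be sanity-checked before they are run. The following Python 3 sketch is one possible check, not part of the original scripts: the function name check_datax_config is hypothetical, and the sample is a trimmed-down stand-in for a real generated file. It verifies that the reader and writer list the same columns in the same order and that the writer path is still the ${targetdir} placeholder.

```python
import json


def check_datax_config(text):
    """Return a list of problems found in one DataX job config (empty if none)."""
    content = json.loads(text)["job"]["content"][0]
    reader = content["reader"]["parameter"]
    writer = content["writer"]["parameter"]
    problems = []

    # mysqlreader columns are plain names; hdfswriter columns are name/type objects.
    reader_cols = list(reader["column"])
    writer_cols = [col["name"] for col in writer["column"]]
    if reader_cols != writer_cols:
        problems.append("reader columns %s differ from writer columns %s"
                        % (reader_cols, writer_cols))
    if writer["path"] != "${targetdir}":
        problems.append("writer path is not the ${targetdir} placeholder")
    return problems


# Trimmed-down stand-in for a generated file such as edu2077.base_province.json.
sample = json.dumps({
    "job": {"content": [{
        "reader": {"name": "mysqlreader",
                   "parameter": {"column": ["id", "name"]}},
        "writer": {"name": "hdfswriter",
                   "parameter": {"column": [{"name": "id", "type": "bigint"},
                                            {"name": "name", "type": "string"}],
                                 "path": "${targetdir}"}},
    }]}
})

print(check_datax_config(sample))  # prints []
```

To check the real files, read each *.json file under /opt/module/datax/job/import and pass its text to the function.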

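4 Testing the generated DataX configuration file

Using base_province as an example, test whether the configuration file produced by the script works.
1) Create the target path
The DataX job requires the target path to exist in advance, so it must be created manually. For the base_province table the target path is /origin_data/edu/db/base_province_full/2022-02-21.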

[yobhel@hadoop101 bin]$ hadoop fs -mkdir -p /origin_data/edu/db/base_province_full/2022-02-21

2) Run the DataX synchronization command

[yobhel@hadoop101 bin]$ python /opt/module/datax/bin/datax.py -p"-Dtargetdir=/origin_data/edu/db/base_province_full/2022-02-21" /opt/module/datax/job/import/edu2077.base_province.json

3) Check the synchronization result
Check whether data has appeared under the HDFS target path.
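
The configuration writes gzip-compressed text files with tab-separated fields, so a file copied to the local disk (for example with hadoop fs -get) can be inspected with a few lines of Python 3. This is only a sketch: the function name read_rows is hypothetical, and the sample row is made up for illustration rather than taken from the real table.

```python
import gzip
import io

# Column order of base_province, as listed in the configuration file above.
COLUMNS = ["id", "name", "region_id", "area_code", "iso_code", "iso_3166_2"]


def read_rows(raw_bytes, columns=COLUMNS):
    """Decompress gzip bytes and split each line on tabs into a dict."""
    rows = []
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8") as f:
        for line in f:
            values = line.rstrip("\n").split("\t")
            rows.append(dict(zip(columns, values)))
    return rows


# Made-up sample row standing in for the contents of a downloaded file.
sample = gzip.compress("1\tBeijing\t1\t110000\tCN-11\tCN-BJ\n".encode("utf-8"))
rows = read_rows(sample)
print(rows[0]["name"])
```

For a real file, read its bytes with open(path, "rb").read() and pass them to read_rows.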

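5 Full-table data synchronization script

For ease of use and for later job scheduling, a full-table data synchronization script is written here.
1) Create mysql_to_hdfs_full.sh in the ~/bin directory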

[yobhel@hadoop101 bin]$ vim ~/bin/mysql_to_hdfs_full.sh

The script content is as follows

#!/bin/bash

DATAX_HOME=/opt/module/datax
DATAX_DATA=/opt/module/datax/job

# Remove any leftover data, then recreate the target directory
handle_targetdir() {
  hadoop fs -rm -r "$1" >/dev/null 2>&1
  hadoop fs -mkdir -p "$1"
}

# Run one synchronization job
import_data() {
  local datax_config=$1
  local target_dir=$2

  handle_targetdir "$target_dir"
  echo "Processing $1"
  python $DATAX_HOME/bin/datax.py -p"-Dtargetdir=$target_dir" "$datax_config" >/tmp/datax_run.log 2>&1
  if [ $? -ne 0 ]
  then
    echo "Processing failed; the log follows:"
    cat /tmp/datax_run.log
  fi
  rm /tmp/datax_run.log
}

# Table name to synchronize, or "all"
tab=$1
# Use the date passed as the second argument if present, otherwise yesterday's date
if [ -n "$2" ] ;then
    do_date=$2
else
    do_date=$(date -d "-1 day" +%F)
fi


case ${tab} in
base_category_info | base_province | base_source | base_subject_info | cart_info | chapter_info | course_info | knowledge_point | test_paper | test_paper_question | test_point_question | test_question_info | test_question_option | user_chapter_process | video_info)
  import_data $DATAX_DATA/import/edu2077.${tab}.json /origin_data/edu/db/${tab}_full/$do_date
  ;;
"all")
  for tmp in base_category_info base_province base_source base_subject_info cart_info chapter_info course_info knowledge_point test_paper test_paper_question test_point_question test_question_info test_question_option user_chapter_process video_info
  do
    import_data $DATAX_DATA/import/edu2077.${tmp}.json /origin_data/edu/db/${tmp}_full/$do_date
  done
  ;;
esac

2) Grant execute permission to mysql_to_hdfs_full.sh

[yobhel@hadoop101 bin]$ chmod +x ~/bin/mysql_to_hdfs_full.sh

3) Test the synchronization script

[yobhel@hadoop101 bin]$ mysql_to_hdfs_full.sh all 2022-02-21

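4) Check the synchronization result
Check whether the full-table data has appeared under the HDFS target paths. There are 15 full tables in total.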
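
The date handling and the table dispatch in mysql_to_hdfs_full.sh can be restated in Python 3 to make the logic easier to test in isolation. This is a sketch under assumptions: the function names resolve_do_date and plan_imports are hypothetical, and the code only plans the (config file, target directory) pairs without calling DataX or HDFS.

```python
import datetime

# The 15 full tables handled by the shell script.
FULL_TABLES = [
    "base_category_info", "base_province", "base_source", "base_subject_info",
    "cart_info", "chapter_info", "course_info", "knowledge_point",
    "test_paper", "test_paper_question", "test_point_question",
    "test_question_info", "test_question_option", "user_chapter_process",
    "video_info",
]


def resolve_do_date(arg=None, today=None):
    """Use the given date string if present, otherwise yesterday's date."""
    if arg:
        return arg
    today = today or datetime.date.today()
    return (today - datetime.timedelta(days=1)).isoformat()


def plan_imports(tab, do_date):
    """Return (config file, target directory) pairs, mirroring the case statement."""
    if tab == "all":
        tables = FULL_TABLES
    elif tab in FULL_TABLES:
        tables = [tab]
    else:
        tables = []  # unknown table name: nothing to do, as in the shell script
    return [
        ("/opt/module/datax/job/import/edu2077.%s.json" % t,
         "/origin_data/edu/db/%s_full/%s" % (t, do_date))
        for t in tables
    ]


do_date = resolve_do_date(None, today=datetime.date(2022, 2, 22))
for config, target in plan_imports("base_province", do_date):
    print(config, target)
```

Passing today explicitly keeps the default-date rule deterministic in tests; in normal use it can be omitted.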

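6 Summary of full-table synchronization

The full-table synchronization logic is straightforward: run the full-table synchronization script mysql_to_hdfs_full.sh once a day.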
