Syncing MySQL Data to Hudi on S3 with Flink CDC

Author: 阿猫阿狗Hakuna | Published: 2021-09-26 13:58

    Software versions

    MySQL: 5.7
    Hadoop: 3.1.3
    Flink: 1.12.2
    Hudi: 0.9.0
    Hive: 2.3.7

    1. Create the MySQL table and enable the binlog

    create table users(
        id bigint auto_increment primary key,
        name varchar(20) null,
        birthday timestamp default CURRENT_TIMESTAMP not null,
        ts timestamp default CURRENT_TIMESTAMP not null
    );
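
The heading mentions enabling the binlog, but the post does not show how. A minimal sketch, assuming a stock MySQL 5.7 server whose my.cnf sets `server-id`, `log_bin`, and `binlog_format=ROW` (these settings are an assumption, not taken from the original post): after restarting the server, the statements below, run in the MySQL client, confirm that row-based binary logging is active, which the mysql-cdc connector relies on.

```sql
-- Sanity check, run in the MySQL client after restarting the server.
-- Assumption: my.cnf sets server-id, log_bin and binlog_format=ROW.
SHOW VARIABLES LIKE 'log_bin';           -- should report ON
SHOW VARIABLES LIKE 'binlog_format';     -- should report ROW
SHOW VARIABLES LIKE 'binlog_row_image';  -- should report FULL (the 5.7 default)
```

If any value differs, the CDC source is likely to fail at startup or miss column values in change events.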
    

    2. Install Hadoop

    (1) Unpack the Hadoop archive: tar -zxvf hadoop-3.1.3.tar.gz
    (2) Set the environment variables:

    export HADOOP_HOME=/Users/xxx/hadoop/hadoop-3.1.3
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export PATH=$HADOOP_HOME/bin:$PATH
    
    # Add the Hadoop classpath
    export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
    

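    3. Download and install Flink

    (1) Download the Flink package from the official site: https://flink.apache.org/downloads.html
    (2) Unpack it: tar -zxvf flink-1.12.2-bin-scala_2.11.tgz
    (3) Edit the Flink configuration (vim conf/flink-conf.yaml) and enable checkpointing. The Hudi writer only commits data when a checkpoint completes, so without checkpointing the Flink CDC job never produces a Hudi commit.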

    state.backend: filesystem
    execution.checkpointing.interval: 10000
    state.checkpoints.dir: file:///Users/xxx/flink/flink-1.12.2/hudi/flink-checkpoints
    state.savepoints.dir: file:///Users/xxx/flink/flink-1.12.2/hudi/flink-savepoints
    

    (4) In the same file (vim conf/flink-conf.yaml), increase the number of task slots:

    taskmanager.numberOfTaskSlots: 4
    
    vim conf/workers
      localhost
      localhost
      localhost
      localhost
    

    (5) Start Flink: bin/start-cluster.sh

    4. Build Hudi and copy the jar

    (1) Download the Hudi source: git clone https://github.com/apache/hudi.git
    (2) Switch to the 0.9.0 release: cd hudi && git checkout release-0.9.0
    (3) Build: mvn clean package -DskipTests
    (4) When the build finishes, the jar (hudi-flink-bundle_2.11-0.9.0.jar) is generated under packaging/hudi-flink-bundle/target. Copy it into Flink's lib directory:

    cp hudi-flink-bundle_2.11-0.9.0.jar ~/flink/lib
    

    5. Copy the other required jars into flink/lib

    (1) flink-sql-connector-mysql-cdc-1.2.0.jar: connects to MySQL
    (2) aws-java-sdk-bundle-1.11.874.jar and hadoop-aws-3.1.3.jar: connect to AWS S3

    6. Start sql-client

    1. bin/sql-client.sh embedded
    2. Create the MySQL mapping table
    create table mysql_users(
        id bigint primary key not enforced,
        name string,
        birthday timestamp(3),
        ts timestamp(3)
    ) with (
        'connector' = 'mysql-cdc',
        'hostname' = '127.0.0.1',
        'port' = '3306',
        'username' = 'root',
        'password' = '123456',
        'database-name' = 'test_cdc',
        'table-name' = 'users'
    );
    
    3. Create the Hudi mapping table
    create table hudi_users(
        id bigint primary key not enforced,
        name string,
        birthday timestamp(3),
        ts timestamp(3),
        `partition` varchar(20)
    ) partitioned by (`partition`) with (
        'connector' = 'hudi',
        'table.type' = 'COPY_ON_WRITE',
        'path' = 's3a://xxx/yyy/hudi_users',
        'read.streaming.enabled' = 'true',
        'read.streaming.check-interval' = '1'
    );
    
    4. Submit the job
    insert into hudi_users select *, date_format(birthday, 'yyyyMMdd') from mysql_users;
    
    Check that data files have been written to S3.
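
Besides listing the S3 path, the table can be read back from the same sql-client session. A small sketch, assuming the Flink 1.12 SQL client and the hudi_users table defined above. Because 'read.streaming.enabled' is set to 'true' in the DDL, the query is expected to run as a continuous read and has to be cancelled by hand. The comment lines are annotations for the reader; older SQL clients may not accept them, so enter only the statements.

```sql
-- Print results inline instead of opening the paged result view.
SET execution.result-mode=tableau;

-- Read the Hudi table back; new rows should appear after each
-- successful checkpoint, since that is when Hudi commits.
SELECT id, name, birthday, ts, `partition` FROM hudi_users;
```

If nothing shows up after a few checkpoint intervals, the Flink web UI is a reasonable place to check whether checkpoints are completing.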
    

    7. Create the Hive external table

    1. Connect to Hive with beeline
    !connect jdbc:hive2://ELB-DEV-Presto-hs2-s0000e2c5-06a22927ec8bb2f6.elb.us-east-1.amazonaws.com:10000/default;auth=noSasl
    
    
    CREATE EXTERNAL TABLE `hudi_user_mor`(               
       `_hoodie_commit_time` string,                    
       `_hoodie_commit_seqno` string,                   
       `_hoodie_record_key` string,                     
       `_hoodie_partition_path` string,                 
       `_hoodie_file_name` string,                      
       `id` bigint,                                     
       `name` string,                                   
       `birthday` bigint,                               
       `ts` bigint)                                     
     PARTITIONED BY (                                   
       `partition` string)                              
     ROW FORMAT SERDE                                   
       'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  
     STORED AS INPUTFORMAT                              
       'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
     OUTPUTFORMAT                                       
       'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
     LOCATION                                           
       's3a://xxx/yyy/hudi_users';
    
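    Add a partition (par1 is a placeholder; use a partition value the job actually wrote, i.e. a yyyyMMdd date derived from birthday):
    alter table hudi_user_mor add if not exists partition(`partition`='par1') location 's3a://xxx/yyy/hudi_users/par1';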
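
Once the partition is registered, a quick check from beeline can confirm that Hive sees the data. This is a sketch under two assumptions: the Hudi input format classes (the hudi-hadoop-mr bundle) are on Hive's classpath, and 'par1' stands in for a partition value that actually exists under the table path.

```sql
-- List the partitions registered in the Hive metastore for the table.
SHOW PARTITIONS hudi_user_mor;

-- Peek at a few rows together with Hudi's commit metadata.
SELECT `_hoodie_commit_time`, `id`, `name`
FROM hudi_user_mor
WHERE `partition` = 'par1'
LIMIT 5;
```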
    

    8. Query the data with Presto

    1. Start the Presto CLI
    ./presto-cli-0.248-executable.jar --server ELB-DEV-Presto-master-s0000eca1-efaff1be86b6ffa3.elb.us-east-1.amazonaws.com:9106 --catalog db
    
    2. Query the data
    select * from hudi_user_mor where partition = 'par1' limit 5;
    

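    9. Test the synchronization

    Run insert, update, and delete statements in MySQL, then query from Hive or Presto; the changes become visible in near real time.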
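
The test can be sketched with statements like the following, run in the MySQL client. The row values ('alice', 'alice_v2') are made up for illustration; the database and table names come from the DDL earlier in the post.

```sql
-- Run in the MySQL client against the source database.
USE test_cdc;

-- Insert: birthday and ts fall back to CURRENT_TIMESTAMP.
INSERT INTO users (name) VALUES ('alice');

-- Update: change a non-partition column so the row stays in the
-- same Hudi partition (the partition is derived from birthday).
UPDATE users SET name = 'alice_v2' WHERE name = 'alice';

-- Delete: the change stream should carry the delete to the Hudi table.
DELETE FROM users WHERE name = 'alice_v2';
```

Leaving a pause of at least one checkpoint interval between statements makes each change easier to observe on the query side.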
