Sqoop Data Processing

Author: 大龄程序员在帝都 | Published 2017-05-15 11:50

    Reference article 001: Sqoop operations

    1. Data Import

    Sqoop, short for SQL-to-Hadoop, moves data between relational databases and Hadoop's HDFS. This tutorial demonstrates importing and exporting data between MySQL and HDFS.

    1. Install Sqoop
    Download the latest package from the official site,
    then extract it to the target directory:

    tar -xzvf sqoop-1.4.6.bin__hadoop-0.23.tar.gz -C /home
    # the tarball extracts to a directory matching its name; rename it so SQOOP_HOME below is /home/sqoop
    mv /home/sqoop-1.4.6.bin__hadoop-0.23 /home/sqoop
    

    2. Configure the environment variables:
    vim /etc/profile

    export SQOOP_HOME=/home/sqoop
    export PATH=$SQOOP_HOME/bin:$PATH
    

    Apply the changes with source /etc/profile, then check that the variables are in effect, as shown below.
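    A quick way to confirm, assuming sqoop is now on the PATH as configured above:

    echo $SQOOP_HOME
    sqoop version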

    3. Configure Sqoop's environment
    In ${SQOOP_HOME}/conf, copy the template:

    cp sqoop-env-template.sh  sqoop-env.sh
    

    Edit sqoop-env.sh:

    #Set path to where bin/hadoop is available
    export HADOOP_COMMON_HOME=/home/hadoop/
    
    #Set path to where hadoop-*-core.jar is available
    export HADOOP_MAPRED_HOME=/home/hadoop
    
    #set the path to where bin/hbase is available
    #export HBASE_HOME=
    
    #Set the path to where bin/hive is available
    export HIVE_HOME=/home/apache-hive-1.2.2-bin
    
    #Set the path for where zookeper config dir is
    #export ZOOCFGDIR=
    

    HBASE_HOME and ZOOCFGDIR are left unset because HBase is not used here; they can be configured later if the data should also be imported into HBase.

    4. Create test data in MySQL to import into HDFS with Sqoop

    # create the test database
    create database test;
    # allow root to connect to the test database from any host
    grant all privileges on test.* to 'root'@'%' identified by 'xxxx';
    
    # create the log table with two columns, id and content
    create table log(
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `content` varchar(500) NOT NULL,
        PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;
    
    # stored procedure that inserts 10000 rows into the log table
    drop procedure if exists add_log;
    delimiter //
    CREATE  PROCEDURE `add_log`()
    BEGIN
        DECLARE v_max INT UNSIGNED DEFAULT 10000;
        DECLARE v_counter INT UNSIGNED DEFAULT 0;
            WHILE v_counter < v_max DO
               insert into log values(NULL,'test data');
                SET v_counter=v_counter+1;
            END WHILE;
     END;//
    delimiter ;
    

    Finally, run it:

    call  add_log();
    

    The call takes a while to run; when it finishes, test.log contains 10000 rows.
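    A quick way to confirm the row count from the shell (assuming the mysql client is available on the node):

    mysql -u root -p -e "select count(*) from test.log"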


    5. Before running the import, copy the MySQL JDBC driver jar into ${SQOOP_HOME}/lib.
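    A sketch of that copy; the driver jar name and location are assumptions and will differ per installation:

    cp mysql-connector-java-5.1.38-bin.jar ${SQOOP_HOME}/lib/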


    6. Run the import:

    sqoop import --connect jdbc:mysql://node3:3306/test  --username root --password xxxxx --table log --fields-terminated-by '\t'
    

    Explanation of the parameters:

    sqoop                                    ## the sqoop command

    import                                   ## perform an import

    --connect jdbc:mysql://ip:3306/test      ## JDBC URL of the MySQL database

    --username root                          ## MySQL user name

    --password xxx                           ## MySQL password

    --table log                              ## MySQL table to import from

    --fields-terminated-by '\t'              ## field separator used in the output files

    -m 1                                     ## use a single map task for the copy (see the sketch after this list)

    --hive-import                            ## copy the MySQL table into Hive; without this option the data goes to HDFS
    
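    For example, a single output file instead of several part files can be obtained by adding -m 1 to the same import (a sketch; host and password are placeholders as above):

    sqoop import --connect jdbc:mysql://node3:3306/test --username root --password xxxx --table log --fields-terminated-by '\t' -m 1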

    7. Verify
    The import is simply a MapReduce job that writes the data to HDFS; the full execution log is reproduced in the notes at the end of this post, and it shows 10000 records being retrieved.
    Let's check whether the data is on HDFS: the corresponding files are there, and why they are split into four parts is explained below.


    The output is split into four parts because Sqoop uses four map tasks by default, each importing a range of the primary key, so each part holds 2500 rows (each line is an id and the content column separated by the tab delimiter). Downloading part-m-00000 and opening it confirms the 2500 rows; the files can also be inspected in place, as shown below.
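    A quick look at the output directory, assuming the import wrote to the default target /user/root/log (the table name under the user's HDFS home directory, the same path the export command later in this post uses):

    hadoop fs -ls /user/root/log
    hadoop fs -cat /user/root/log/part-m-00000 | head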


    The execution result is shown in the full log in the notes at the end.

    Importing MySQL data into Hive

    First delete the data already imported into HDFS:

    hadoop fs -rmr /user/root/*
    

    Then run the import into Hive:

    sqoop import --connect jdbc:mysql://node3:3306/test  --username root --password xxxx --table log --fields-terminated-by '\t'   --hive-import
    

    After it finishes, the corresponding table and data are visible in Hive.
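    A quick check from the shell, assuming the default Hive database (by default --hive-import creates a Hive table named after the MySQL table):

    hive -e "select count(*) from log"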


    2. Data Export

    sqoop

    export                                    ## copy data from Hive/HDFS back into MySQL

    --connect jdbc:mysql://ip:3306/test       ## JDBC URL of the MySQL database

    --username root                           ## MySQL user name

    --password xxxx                           ## MySQL password

    --table log_2                             ## target MySQL table to load the data into

    --export-dir '/user/root/warehouse/log'   ## HDFS/Hive directory containing the files to export

    --fields-terminated-by '\t'               ## field separator of the exported files

    Note: the target table must already exist in MySQL.
    

    First create the table log_2 in MySQL; the 10000 rows imported earlier will be exported into it:

    CREATE TABLE `log_2` (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `content` varchar(500) NOT NULL,
        PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;
    

    Run the export:

    sqoop export --connect jdbc:mysql://node3:3306/test  --username root --password xxxx --table log_2 --fields-terminated-by '\t' --export-dir '/user/root/log'  --input-fields-terminated-by '\0000'
    

    Here --input-fields-terminated-by specifies the field separator Sqoop uses when parsing the files being exported (it is not a filename pattern); it should normally match the delimiter the data was written with, '\t' in this case.

    Verification:
    Query the log_2 table in MySQL: it contains 10000 rows, the same data that was just exported.
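    A quick count from the shell:

    mysql -u root -p -e "select count(*) from test.log_2"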


    Notes
    1. Log of importing MySQL data into Hadoop HDFS

    Execution log:

    [root@node1 home]# sqoop import --connect jdbc:mysql://106.75.62.162:3306/test  --username root --password xxxxx --table log --fields-terminated-by '\t'
    Warning: /home/sqoop/../hbase does not exist! HBase imports will fail.
    Please set $HBASE_HOME to the root of your HBase installation.
    Warning: /home/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /home/sqoop/../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    Warning: /home/sqoop/../zookeeper does not exist! Accumulo imports will fail.
    Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
    17/05/15 11:23:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
    17/05/15 11:23:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    17/05/15 11:23:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    17/05/15 11:23:05 INFO tool.CodeGenTool: Beginning code generation
    17/05/15 11:23:12 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `log` AS t LIMIT 1
    17/05/15 11:23:12 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `log` AS t LIMIT 1
    17/05/15 11:23:12 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop
    Note: /tmp/sqoop-root/compile/3d774b871b1067fdc69f6e08612e3e64/log.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    17/05/15 11:23:15 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/3d774b871b1067fdc69f6e08612e3e64/log.jar
    17/05/15 11:23:15 WARN manager.MySQLManager: It looks like you are importing from mysql.
    17/05/15 11:23:15 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
    17/05/15 11:23:15 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
    17/05/15 11:23:15 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
    17/05/15 11:23:15 INFO mapreduce.ImportJobBase: Beginning import of log
    17/05/15 11:23:16 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
    17/05/15 11:23:17 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
    17/05/15 11:23:17 INFO client.RMProxy: Connecting to ResourceManager at node1/10.9.167.99:8032
    17/05/15 11:23:19 WARN hdfs.DFSClient: Caught exception
    java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1249)
        at java.lang.Thread.join(Thread.java:1323)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
    17/05/15 11:23:19 WARN hdfs.DFSClient: Caught exception
    java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1249)
        at java.lang.Thread.join(Thread.java:1323)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
    17/05/15 11:23:19 WARN hdfs.DFSClient: Caught exception
    java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1249)
        at java.lang.Thread.join(Thread.java:1323)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
    17/05/15 11:23:26 INFO db.DBInputFormat: Using read commited transaction isolation
    17/05/15 11:23:26 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `log`
    17/05/15 11:23:26 INFO mapreduce.JobSubmitter: number of splits:4
    17/05/15 11:23:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1492235323762_0002
    17/05/15 11:23:27 INFO impl.YarnClientImpl: Submitted application application_1492235323762_0002
    17/05/15 11:23:27 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1492235323762_0002/
    17/05/15 11:23:27 INFO mapreduce.Job: Running job: job_1492235323762_0002
    17/05/15 11:23:37 INFO mapreduce.Job: Job job_1492235323762_0002 running in uber mode : false
    17/05/15 11:23:38 INFO mapreduce.Job:  map 0% reduce 0%
    17/05/15 11:23:53 INFO mapreduce.Job:  map 50% reduce 0%
    17/05/15 11:23:59 INFO mapreduce.Job:  map 100% reduce 0%
    17/05/15 11:23:59 INFO mapreduce.Job: Job job_1492235323762_0002 completed successfully
    17/05/15 11:23:59 INFO mapreduce.Job: Counters: 31
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=542612
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=415
            HDFS: Number of bytes written=148894
            HDFS: Number of read operations=16
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=8
        Job Counters
            Killed map tasks=1
            Launched map tasks=4
            Other local map tasks=4
            Total time spent by all maps in occupied slots (ms)=60923
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=60923
            Total vcore-milliseconds taken by all map tasks=60923
            Total megabyte-milliseconds taken by all map tasks=62385152
        Map-Reduce Framework
            Map input records=10000
            Map output records=10000
            Input split bytes=415
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=591
            CPU time spent (ms)=4550
            Physical memory (bytes) snapshot=475136000
            Virtual memory (bytes) snapshot=8461393920
            Total committed heap usage (bytes)=243531776
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=148894
    17/05/15 11:23:59 INFO mapreduce.ImportJobBase: Transferred 145.4043 KB in 42.3541 seconds (3.4331 KB/sec)
    17/05/15 11:23:59 INFO mapreduce.ImportJobBase: Retrieved 10000 records.
    [root@node1 home]#
    
