oracle大数据量python导入实践-1w/s

作者: HBLong | 来源:发表于2017-05-28 11:02 被阅读238次

    在项目中需要将一个80w+的csv数据(200+m)导入到oracle库,一开始使用的是Navicat for Oracle的导入工具。跑了五六分钟之后绝望了,因为才跑了2%,按这样的速度跑半天都跑不完。
    回想了一下老本行mysql,觉得Oracle应该也会有一个批量插入的高效率语句的写法。于是翻看了oracle的官方文档精通 Oracle+Python,第 1 部分:查询最佳应践

    一次多行
    
    大型的插入操作不需求多次的单独插入,这是因为 Python 通过 cx_Oracle.Cursor.executemany 方法完全支持一次插入多行。
    限制执行操作的数量极大地改善了程序性能,因此在编写存在大量插入操作的应用程序时应首先考虑这一功能。 
    

    呲!这就是我想要的。
    迅速仿造官方的demo写了一个简易的导入脚本,跑了一下,138.7s也就是将近2分多钟就跑完了这200多m,差不多88w行的数据。附上性能测试

    总行数: 885640
    2017-05-17 14:11:24
    ===begin===
    2017-05-17 14:11:24
    prepare end
    2017-05-17 14:11:26 => 1 - 10000 ( 9999 ) finish
    2017-05-17 14:11:27 => 10001 - 20000 ( 9999 ) finish
    2017-05-17 14:11:28 => 20001 - 30000 ( 9999 ) finish
    2017-05-17 14:11:30 => 30001 - 40000 ( 9999 ) finish
    2017-05-17 14:11:31 => 40001 - 50000 ( 9999 ) finish
    2017-05-17 14:11:33 => 50001 - 60000 ( 9999 ) finish
    2017-05-17 14:11:34 => 60001 - 70000 ( 9999 ) finish
    2017-05-17 14:11:35 => 70001 - 80000 ( 9999 ) finish
    2017-05-17 14:11:37 => 80001 - 90000 ( 9999 ) finish
    2017-05-17 14:11:38 => 90001 - 100000 ( 9999 ) finish
    2017-05-17 14:11:39 => 100001 - 110000 ( 9999 ) finish
    2017-05-17 14:11:42 => 110001 - 120000 ( 9999 ) finish
    2017-05-17 14:11:44 => 120001 - 130000 ( 9999 ) finish
    2017-05-17 14:11:45 => 130001 - 140000 ( 9999 ) finish
    2017-05-17 14:11:46 => 140001 - 150000 ( 9999 ) finish
    2017-05-17 14:11:47 => 150001 - 160000 ( 9999 ) finish
    2017-05-17 14:11:49 => 160001 - 170000 ( 9999 ) finish
    2017-05-17 14:11:50 => 170001 - 180000 ( 9999 ) finish
    2017-05-17 14:11:51 => 180001 - 190000 ( 9999 ) finish
    2017-05-17 14:11:53 => 190001 - 200000 ( 9999 ) finish
    2017-05-17 14:11:56 => 200001 - 210000 ( 9999 ) finish
    2017-05-17 14:11:58 => 210001 - 220000 ( 9999 ) finish
    2017-05-17 14:12:00 => 220001 - 230000 ( 9999 ) finish
    2017-05-17 14:12:02 => 230001 - 240000 ( 9999 ) finish
    2017-05-17 14:12:04 => 240001 - 250000 ( 9999 ) finish
    2017-05-17 14:12:06 => 250001 - 260000 ( 9999 ) finish
    2017-05-17 14:12:10 => 260001 - 270000 ( 9999 ) finish
    2017-05-17 14:12:14 => 270001 - 280000 ( 9999 ) finish
    2017-05-17 14:12:16 => 280001 - 290000 ( 9999 ) finish
    2017-05-17 14:12:17 => 290001 - 300000 ( 9999 ) finish
    2017-05-17 14:12:19 => 300001 - 310000 ( 9999 ) finish
    2017-05-17 14:12:25 => 310001 - 320000 ( 9999 ) finish
    2017-05-17 14:12:26 => 320001 - 330000 ( 9999 ) finish
    2017-05-17 14:12:28 => 330001 - 340000 ( 9999 ) finish
    2017-05-17 14:12:29 => 340001 - 350000 ( 9999 ) finish
    2017-05-17 14:12:30 => 350001 - 360000 ( 9999 ) finish
    2017-05-17 14:12:51 => 360001 - 370000 ( 9999 ) finish
    2017-05-17 14:12:53 => 370001 - 380000 ( 9999 ) finish
    2017-05-17 14:12:54 => 380001 - 390000 ( 9999 ) finish
    2017-05-17 14:12:56 => 390001 - 400000 ( 9999 ) finish
    2017-05-17 14:12:58 => 400001 - 410000 ( 9999 ) finish
    2017-05-17 14:13:36 => 410001 - 420000 ( 9999 ) finish
    2017-05-17 14:13:37 => 420001 - 430000 ( 9999 ) finish
    2017-05-17 14:13:39 => 430001 - 440000 ( 9999 ) finish
    2017-05-17 14:13:40 => 440001 - 450000 ( 9999 ) finish
    2017-05-17 14:13:42 => 450001 - 460000 ( 9999 ) finish
    2017-05-17 14:13:43 => 460001 - 470000 ( 9999 ) finish
    2017-05-17 14:13:45 => 470001 - 480000 ( 9999 ) finish
    2017-05-17 14:13:47 => 480001 - 490000 ( 9999 ) finish
    2017-05-17 14:13:48 => 490001 - 500000 ( 9999 ) finish
    2017-05-17 14:13:50 => 500001 - 510000 ( 9999 ) finish
    2017-05-17 14:13:51 => 510001 - 520000 ( 9999 ) finish
    2017-05-17 14:13:52 => 520001 - 530000 ( 9999 ) finish
    2017-05-17 14:13:54 => 530001 - 540000 ( 9999 ) finish
    2017-05-17 14:13:56 => 540001 - 550000 ( 9999 ) finish
    2017-05-17 14:13:57 => 550001 - 560000 ( 9999 ) finish
    2017-05-17 14:13:59 => 560001 - 570000 ( 9999 ) finish
    2017-05-17 14:14:01 => 570001 - 580000 ( 9999 ) finish
    2017-05-17 14:14:03 => 580001 - 590000 ( 9999 ) finish
    2017-05-17 14:14:05 => 590001 - 600000 ( 9999 ) finish
    2017-05-17 14:14:06 => 600001 - 610000 ( 9999 ) finish
    2017-05-17 14:14:07 => 610001 - 620000 ( 9999 ) finish
    2017-05-17 14:14:09 => 620001 - 630000 ( 9999 ) finish
    2017-05-17 14:14:10 => 630001 - 640000 ( 9999 ) finish
    2017-05-17 14:14:11 => 640001 - 650000 ( 9999 ) finish
    2017-05-17 14:14:13 => 650001 - 660000 ( 9999 ) finish
    2017-05-17 14:14:14 => 660001 - 670000 ( 9999 ) finish
    2017-05-17 14:14:16 => 670001 - 680000 ( 9999 ) finish
    2017-05-17 14:14:17 => 680001 - 690000 ( 9999 ) finish
    2017-05-17 14:14:19 => 690001 - 700000 ( 9999 ) finish
    2017-05-17 14:14:20 => 700001 - 710000 ( 9999 ) finish
    2017-05-17 14:14:23 => 710001 - 720000 ( 9999 ) finish
    2017-05-17 14:14:28 => 720001 - 730000 ( 9999 ) finish
    2017-05-17 14:14:34 => 730001 - 740000 ( 9999 ) finish
    2017-05-17 14:14:37 => 740001 - 750000 ( 9999 ) finish
    2017-05-17 14:14:40 => 750001 - 760000 ( 9999 ) finish
    2017-05-17 14:14:43 => 760001 - 770000 ( 9999 ) finish
    2017-05-17 14:14:46 => 770001 - 780000 ( 9999 ) finish
    2017-05-17 14:14:49 => 780001 - 790000 ( 9999 ) finish
    2017-05-17 14:14:51 => 790001 - 800000 ( 9999 ) finish
    2017-05-17 14:14:55 => 800001 - 810000 ( 9999 ) finish
    2017-05-17 14:15:15 => 810001 - 820000 ( 9999 ) finish
    2017-05-17 14:15:17 => 820001 - 830000 ( 9999 ) finish
    2017-05-17 14:15:19 => 830001 - 840000 ( 9999 ) finish
    2017-05-17 14:15:20 => 840001 - 850000 ( 9999 ) finish
    2017-05-17 14:15:22 => 850001 - 860000 ( 9999 ) finish
    2017-05-17 14:15:24 => 860001 - 870000 ( 9999 ) finish
    2017-05-17 14:15:27 => 870001 - 880000 ( 9999 ) finish
    2017-05-17 14:15:28 => 880001 - 890000 ( 5639 ) finish
    2017-05-17 14:15:28
    execute end
    
    请按任意键继续. . .
    [Finished in 138.7s]
    

    从这里可以看出,1w行的数据1s就可以搞定了,这效率杠杠的。文末献上粗糙的python导入脚本。

    困难回顾

    在实际操作时,可能会遇到以下问题

    1.ORA-12899: value too large for column

    大意就是值越界的,原来的数据库是没有这个问题的,于是猜测是编码集的原因,由于我用的是utf8,与库的编码集不一致,于是乎一个汉字被转成\uxxxx之类的。

    解决方法:在脚本加入os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8',让数据库自行转换编码。

    2. ORA-03135: connection lost contact

    意思就是说与数据库的连接断开的,原因是我的执行时间太长,超过了Oracle设置允许的最大时间

    解决方法:在Oracle安装目录下找到/network/admin/sqlnet.ora,如博主的目录/app/oracle/product/11.2.0/dbhome_1/network/admin/sqlnet.ora。在该文件追加
    sqlnet.expire_time = 60
    时间单位:分钟
    取值范围:大于0
    默认取值:无


    以下是python demo

    # -*- coding: utf8 -*-
    
    import cx_Oracle
    import csv
    import time
    import os
    
    # 编码转换
    os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'
    
    M = []
    csvfile = file('C:\Users\Administrator\Desktop\Table.csv', 'rb')
    reader = csv.reader(csvfile)
    
    
    for line in reader:
        try:
            M.append((line[0],line[1],line[2],line[3]))
        except AttributeError:
            pass
    
    csvfile.close() 
    
    
    
    #创建数据库连接
    conn = cx_Oracle.connect('oracle', '123456', '192.168.102.126:1521/ORCL')
    
    #获取操作游标
    cursor = conn.cursor()
    
    print len(M)
    
    print time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))
    print '===begin==='
    
    cursor.prepare("INSERT INTO MY_TABLE (ID, COMPANY, DEPARTMENT, NAME) VALUES (:1,:2,:3,:4)")
    
    print time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))
    print 'prepare end'
    
    for i in range(1, 31):
        begin = (i - 1) * 30000
        end = i * 30000
        cursor.executemany(None, M[begin:end])
        print time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time())), '=>', begin, '-', end, '(',  len(M[begin:end]), ')','finish'
    
    print time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))
    print 'execute end'
    
    conn.commit()
    #885640
    print time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))
    print 'end'
    
    r = cursor.execute("SELECT COUNT(*) FROM MY_TABLE")
    print cursor.fetchone()
    
    #关闭连接,释放资源
    cursor.close()
    conn.close()
    

    听说使用数据泵的导入导出效率更高,有机会尝试一下。

    相关文章

      网友评论

      • RHEL6:你好,问下,在插入数据的时候,value 怎么引用进去的。
      • AlicFeng:赞一个(^3^)
      • hoobyy:腻害
        HBLong:过奖了:grin:

      本文标题:oracle大数据量python导入实践-1w/s

      本文链接:https://www.haomeiwen.com/subject/izgcfxtx.html