大数据量excel导入

作者: _渐秋 | 来源:发表于2019-07-27 16:16 被阅读0次

数据导入
大数据量excel导入
Web优雅的实现大数据量Excel导入
多线程批量导入数据
mysql导入excel数据php代码
java excel导入获取实时进度
2022-01-23
js文件上传、导入报表excel
使用python处理excel
java实现Excel导入（迭代一）

之前采用的是InputStream文件流的方式处理对内存依赖极大，所以实际应用时，如果只能获取文件流的话，建议先将文件通过流拷贝到本地，然后再使用解析工具类;

        DiskFileItemFactory factory = new DiskFileItemFactory();
        ServletFileUpload upload = new ServletFileUpload(factory);
        List items = upload.parseRequest(request);
        List<FileItem> fileList = new ArrayList<>();
        Iterator iterator = items.iterator();

         while (iterator.hasNext()) {
            FileItem item = (FileItem) iterator.next();

            //文件
            if (item.getFieldName().contains("file")) {
                fileList.add(item);
            }
        }

       for (FileItem fileItem : fileList) {
           String fileName = fileItem.getName();

           File file3 = new File(this.getClass().getResource("/").getPath());
           String filePath = file3.getPath() + "/" +fileName;
           File file = new File(filePath);
           fileItem.write(file);
           ...
           //file.delete(); //删除文件
        }

前端部分代码

        <form id="uploadFiles" enctype="multipart/form-data" method="post">
        <input type="file" name="file" id="importExcelFile" style="padding-top:7px" multiple>
        </form>

        var fileObjList = document.getElementById('importExcelFile').files;
        for(var i = 0 ; i < fileObjList.length ; i ++){
                formFile.append("file" + i, fileObjList[i]); //加入文件对象
            }

数据解析采用POI事件驱动方式

【原理】
poi先将excel转为xml，而后是使用SAXParser解析器，解析xml文件得到excel的数据

xml解析一般是先转dom树，然后操作，【方便随意遍历】，但是这需要将全部xml加载处理，适合小的xml，或者配置类xml

xml文件到数百M或上G的量，全部加载效率低，无法生成完整dom树操作，所以SAX解析器是循环读取一定长度处理，读到一个标签就会回调一个用户方法处理，这样减小内存。【适合大量数据导入，不能回头遍历以前的xml，需要自己实现处理xml内读取的数据关系】

excel转换后的完整xml例子,test.xml

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
           xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
           xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac"
           xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
    <dimension ref="A1:AB7"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0">
            <selection activeCell="L3" sqref="L3"/>
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultRowHeight="13.8" x14ac:dyDescent="0.25"/>
    <cols>
        <col min="12" max="12" width="9.109375" bestFit="1" customWidth="1"/>
    </cols>
    <sheetData>
        <row r="1" spans="1:28" x14ac:dyDescent="0.25">
            <c r="A1">
                <v>1</v>
            </c>
            <c r="B1">
                <v>2</v>
            </c>
            <c r="D1">
                <v>4</v>
            </c>
            <c r="G1">
                <v>7</v>
            </c>
            <c r="H1" t="s">
                <v>0</v>
            </c>
            <c r="I1" t="s">
                <v>4</v>
            </c>
            <c r="K1">
                <v>32423</v>
            </c>
            <c r="U1">
                <v>78979</v>
            </c>
            <c r="Y1" t="s">
                <v>3</v>
            </c>
        </row>
        <row r="2" spans="1:28" x14ac:dyDescent="0.25">
            <c r="B2">
                <v>22</v>
            </c>
            <c r="C2">
                <v>33</v>
            </c>
            <c r="E2">
                <v>55</v>
            </c>
            <c r="F2" t="s">
                <v>1</v>
            </c>
            <c r="Q2" t="s">
                <v>2</v>
            </c>
        </row>
        <row r="3" spans="1:28" x14ac:dyDescent="0.25">
            <c r="L3" s="1">
                <v>201287</v>
            </c>
        </row>
        <row r="7" spans="1:28" x14ac:dyDescent="0.25">
            <c r="AB7">
                <v>123131</v>
            </c>
        </row>
    </sheetData>
    <phoneticPr fontId="1" type="noConversion"/>
    <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
    <pageSetup paperSize="9" orientation="portrait" horizontalDpi="1200" verticalDpi="1200" r:id="rId1"/>
</worksheet>

【XSSFReader空单元格，空行问题，从上面的xml可以看出】
poi转换excel为xml会忽略空单元格（不是单元格内容为空格，是单元格没有内容，可能是excel格式压缩存储原因）和空行，导致转换后的【数据错位问题】，需要自己实现判断空单元格和空行处理 (根据excel的行列号，比如B1, D1则表明C1是空单元格，行row的行列号由L3（L是列号，3是行号），到AB7，表明4，5，6是空行)

【SAXParser解析器DefaultHandler】
从上面的介绍可以大致了解poi处理excel的过程，我们要做的就是覆盖实现解析的方法，来达到自己的需求
自己的Handler继承DefaultHandler，覆盖一些方法
xml标签是成对出现的，有开始，有结束
startDocument是?xml标签的回调处理方法
startElement方法是读到一个xml开始标签时的回调处理方法，比如上面的<worksheet， <dimension ，<sheetViews，
endElement是标签结束的回调处理方法，比如</sheetView，</c单元格结束，</row行结束
characters方法是处理xml中的v标签中间的内容的回调处理方法

【注意xml中的c与v标签】
c就是cell单元格，c的属性r是行列号，t是类型，当t是s,表示是SST（SharedStringsTable）的索引，其他类型很多，不一一列举，打开调试看看完整xml内容，注意在自己的Handler中处理，比如单元格是日期格式等等
v是单元内容【或SST索引】，注意SST索引的取值方式

【SharedStringsTable–SST索引】
excel中多个单元格共享一个相同字符串，压缩存储, 打开sst.writeTo(System.out)注释，看看生成的xml，v表示的sst索引就是si节点的序号

以下是一个基本的处理类，可以很好理解poi解析excel，可以根据需要完善一下，【包含空单元格处理，空行处理】

package com.hand.hdg.col.collect.service.utils;

import com.hand.hdg.col.collect.service.impl.CollectDataServiceImpl;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.springframework.beans.factory.annotation.Autowired;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

/**
 * description
 *
 * @author jianqiu.zhao 2019/07/23 4:21 PM
 */
public class POIReader extends DefaultHandler {

    @Autowired
    private CollectDataServiceImpl collectDataServiceImpl;

    //取SST 的索引对应的值
    private SharedStringsTable sst;

    public void setSst(SharedStringsTable sst) {
        this.sst = sst;
    }

    //解析结果保存
    private List<List<String>> container;

    public POIReader(SharedStringsTable sst, List<List<String>> container) {
        this.sst = sst;
        this.container = container;
    }

    /**
     * 存储cell标签下v标签包裹的字符文本内容
     * 在v标签开始后，解析器自动调用characters()保存到 lastContents
     * 【但】当cell标签的属性 s是 t时, 表示取到的lastContents是 SharedStringsTable 的index值
     * 需要在v标签结束时根据 index(lastContents)获取一次真正的值
     */
    private String lastContents;

    //有效数据矩形区域,A1:Y2
    private String dimension;

    //根据dimension得出每行的数据长度
    private int longest;

    //上个有内容的单元格id，判断空单元格
    private String lastCellid;

    //上一行id, 判断空行
    private String lastRowid;

    //行数据保存
    private List<String> currentRow;

    //单元格内容是SST 的索引
    private boolean isSSTIndex = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//        System.out.println("startElement:"+qName);

        lastContents = "";
        if (qName.equals("dimension")) {
            dimension = attributes.getValue("ref");
            longest = covertRowIdtoInt(dimension.substring(dimension.indexOf(":") + 1));
        }
        //行开始
        if (qName.equals("row")) {
            String rowNum = attributes.getValue("r");
            //判断空行
            if (lastRowid != null) {
                //与上一行相差2, 说明中间有空行
                int gap = Integer.parseInt(rowNum) - Integer.parseInt(lastRowid);
                if (gap > 1) {
                    gap -= 1;
                    while (gap > 0) {
                        container.add(new ArrayList<>());
                        gap--;
                    }
                }
            }

            lastRowid = attributes.getValue("r");
            currentRow = new ArrayList<>();
        }
        if (qName.equals("c")) {
            String rowId = attributes.getValue("r");

            //空单元判断，添加空字符到list
            if (lastCellid != null) {
                int gap = covertRowIdtoInt(rowId) - covertRowIdtoInt(lastCellid);
                for (int i = 0; i < gap - 1; i++) {
                    currentRow.add("");
                }
            } else {
                //第一个单元格可能不是在第一列
                if (!"A1".equals(rowId)) {
                    for (int i = 0; i < covertRowIdtoInt(rowId) - 1; i++) {
                        currentRow.add("");
                    }
                }
            }
            lastCellid = rowId;


            //判断单元格的值是SST 的索引，不能直接characters方法取值
            if (attributes.getValue("t") != null && attributes.getValue("t").equals("s")) {
                isSSTIndex = true;
            } else {
                isSSTIndex = false;
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
//        System.out.println("endElement:"+qName);

        //行结束,存储一行数据
        if (qName.equals("row")) {

            //判断最后一个单元格是否在最后，补齐列数
            if (covertRowIdtoInt(lastCellid) < longest) {
                for (int i = 0; i < longest - covertRowIdtoInt(lastCellid); i++) {
                    currentRow.add("");
                }
            }
            container.add(currentRow);
            lastCellid = null;
        }
        //单元格内容标签结束，characters方法会被调用处理内容
        if (qName.equals("v")) {
            //单元格的值是SST 的索引
            if (isSSTIndex) {
                String sstIndex = lastContents.toString();
                try {
                    int idx = Integer.parseInt(sstIndex);
                    XSSFRichTextString rtss = new XSSFRichTextString(
                            sst.getEntryAt(idx));
                    lastContents = rtss.toString();
                    if (rtss.equals("error")){
                        String a = "";
                    }
                    currentRow.add(lastContents);
                } catch (NumberFormatException ex) {
                    System.out.println(lastContents);
                }
            } else {
                currentRow.add(lastContents);
            }

        }

    }


    /**
     * 获取element的文本数据
     *
     * @see org.xml.sax.ContentHandler#characters
     */
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        lastContents += new String(ch, start, length);
    }

    /**
     * 列号转数字   AB7-->28 第28列
     *
     * @param rowId
     * @return
     */
    public static int covertRowIdtoInt(String rowId) {
        int firstDigit = -1;
        for (int c = 0; c < rowId.length(); ++c) {
            if (Character.isDigit(rowId.charAt(c))) {
                firstDigit = c;
                break;
            }
        }
        //AB7-->AB
        //AB是列号, 7是行号
        String newRowId = rowId.substring(0, firstDigit);
        int num = 0;
        int result = 0;
        int length = newRowId.length();
        for (int i = 0; i < length; i++) {
            //先取最低位，B
            char ch = newRowId.charAt(length - i - 1);
            //B表示的十进制2，ascii码相减，以A的ascii码为基准，A表示1，B表示2
            num = (int) (ch - 'A' + 1);
            //列号转换相当于26进制数转10进制
            num *= Math.pow(26, i);
            result += num;
        }
        return result;

    }

    public static void main(String[] args) {
        System.out.println(POIReader.covertRowIdtoInt("AB7"));

    }
}

需要用到的类，pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.dddd</groupId>
    <artifactId>poisaxxls</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>xerces</groupId>
            <artifactId>xercesImpl</artifactId>
            <version>2.11.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

数据库导入采用Jdbc批量操作+事务的方式
首先JDBC连接的url中要加rewriteBatchedStatements参数设为true是批量操作的前提,还可以增加useServerPrepStmts=false.

获取tomcat的jdbc配置

       Class.forName("com.mysql.jdbc.Driver");
       Context cxt = new InitialContext();
       //获取与逻辑名相关联的数据源对象
       DataSource ds=(DataSource)cxt.lookup("java:comp/env/jdbc/hdg_dev");
       conn=ds.getConnection();

例子

package com.hand.hdg.col.collect.service.utils;

import com.hand.hdg.col.verify.constants.HdgVerifyConstants;

import javax.naming.Context;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Map;

/**
 * description
 *
 * @author jianqiu.zhao 2019/07/24 4:13 PM
 */
public class JdbcImport {

    public static void doBatchInsertBasic(String sql, List<List<String>> batchs, List<String> dataTypeList,Long userId) {

        Connection conn = null;
        PreparedStatement pstm =null;
        try {
            Class.forName("com.mysql.jdbc.Driver");
            Context cxt = new InitialContext();
            //获取与逻辑名相关联的数据源对象
            DataSource ds=(DataSource)cxt.lookup("java:comp/env/jdbc/hdg_dev");
            conn=ds.getConnection();

            pstm = conn.prepareStatement(sql);
            conn.setAutoCommit(false);
            Long startTime = System.currentTimeMillis();
            for (List<String> data : batchs) {
                for (int k = 0; k < dataTypeList.size(); k++) {
                    if (k < data.size()) {
                        switch (dataTypeList.get(k)) {
                            case HdgVerifyConstants.VERIFY_DATA_TYPE_DATE:
                                DateFormat format = new SimpleDateFormat("yyyy-MM-dd");
                                try {
                                    if (data.get(k).equals("")) {
                                        pstm.setString(k + 1, data.get(k));
                                    } else {
                                        pstm.setDate(k + 1, (Date) format.parse(data.get(k)));
                                    }
                                } catch (ParseException e) {
                                    e.printStackTrace();
                                }
                                break;
                            case HdgVerifyConstants.VERIFY_DATA_TYPE_DATETIME:
                                DateFormat format1 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                                try {
                                    if (data.get(k).equals("")) {
                                        pstm.setString(k + 1, data.get(k));
                                    } else {
                                        pstm.setDate(k + 1, (Date) format1.parse(data.get(k)));
                                    }
                                } catch (ParseException e) {
                                    e.printStackTrace();
                                }
                                break;
                            case HdgVerifyConstants.VERIFY_DATA_TYPE_DECIMAL:
                                if (data.get(k).equals("")) {
                                    pstm.setString(k + 1, data.get(k));
                                } else {
                                    pstm.setDouble(k + 1, Double.valueOf(data.get(k)));
                                }
                                break;
                            case HdgVerifyConstants.VERIFY_DATA_TYPE_INTEGER:
                                if (data.get(k).equals("")) {
                                    pstm.setString(k + 1, data.get(k));
                                } else {
                                    pstm.setLong(k + 1, Long.valueOf(data.get(k)));
                                }
                                break;
                            case HdgVerifyConstants.VERIFY_DATA_TYPE_VARCHAR:
                                pstm.setString(k + 1, data.get(k));
                                break;
                            default:
                        }
                    }else{
                        pstm.setString(k+1,"");
                    }
                }
                pstm.setString(dataTypeList.size() + 1, "S01");
                pstm.setString(dataTypeList.size() + 2, userId.toString());
                pstm.setString(dataTypeList.size() + 3, userId.toString());
                pstm.setLong(dataTypeList.size() + 4, -1L);
                pstm.addBatch();
            }
            pstm.executeBatch();
            conn.commit();
            Long endTime = System.currentTimeMillis();
            System.out.println("OK,用时：" + (endTime - startTime));
        } catch (Exception e) {
            e.printStackTrace();
            throw new RuntimeException(e);
        }finally{
            if(pstm!=null){
                try {
                    pstm.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                    throw new RuntimeException(e);
                }
            }
            if(conn!=null){
                try {
                    conn.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                    throw new RuntimeException(e);
                }
            }
        }
    }
}