Java POI操作Excel(Event Model)

作者: mmlmml | 来源:发表于2017-07-21 21:34 被阅读439次

Java POI操作Excel(Event Model)
Java POI操作Excel(User Model)
Java POI操作Excel(Event User Model
使用POI封装一个轻量级Excel解析框架
Java Apache POI 操作 Excel 导出
Java Apache POI 操作 Excel 导入
java操作excel
Java 错误提示org.apache.poi.POIXMLEx
java操作Excel时java.io.IOException:
Java简易Excel导入导出工具

该文章为本系列的第二篇
第一篇为 : Java POI操作Excel(User Model)
第三篇为 : Java POI操作Excel(Event User Model)
第四篇为 : 使用POI封装一个轻量级Excel解析框架

前言

在上一篇文章中,我们不仅对POI有了基本的认识,也学会了使用POI的User model(Dom)的方式进行Excel解析.

但是这种方式在进行大数据量Excel解析的时候,要一次性的解析完整个Excel.而在我们的系统中,是存在多个用户同时上传的情况,这种情况对内存的压力是很大的.轻则Full GC.严重的话甚至会OOM.

所以,POI还提供了抽象程度相比User Model更低,但是也更节约内存的方式来解析Excel.也就是今天要整理的Event Model(Sax)的方式.

xls & xlsx的存储形式

xls采用的是一种名为BIFF8(BinaryInterchangeFileFormat)的文件格式.而xlsx则是采用OOXML(Office open Xml)的格式存储数据.

xls格式Excel的数据组织由于使用的是二进制,没找到太直观的方式展示.故不做特殊说明(实际上是没搞明白怎么看)

xlsx格式Excel则采用Xml来组织数据.所以我们可以很轻易的看到.如下图:

excel.png

xml.png

上面的xml只是sheet1对应的xml.而sheet2则对应另一个xml.

Event Model节约内存原理

User Model的缺点是一次性将文件读入内存,构建一颗Dom树.并且在POI对Excel的抽象中,每一行,每一个单元格都是一个对象.当文件大,数据量多的时候对内存的占用可想而知.

Event Model使用的方式是边读取边解析,并且不会将这些数据封装成Row,Cell这样的对象.而都只是普通的数字或者是字符串.并且这些解析出来的对象是不需要一直驻留在内存中,而是解析完使用后就可以回收.

所以相比于User Model,Event Model更节省内存.效率也更高.但是作为代价,相比User Model功能更少.门槛也要高一些.

解析

关于解析代码,使用POI官网提供的代码.并做一些局部的修改来达到学习的目的.

http://poi.apache.org/spreadsheet/how-to.html

开发环境

Java版本：1.8.0_40
Maven版本：3.3.9
POI版本：3.15

解析xls

Excel内容如下图:

excel_2003.png

解析代码:

public class ReadExcel2003 implements HSSFListener {

    private SSTRecord sstrec;

    /**
     * This method listens for incoming records and handles them as required.
     *
     * @param record The record that was found while reading.
     */
    public void processRecord(Record record) {
        switch (record.getSid()) {
            // the BOFRecord can represent either the beginning of a sheet or the workbook
            case BOFRecord.sid:
                BOFRecord bof = (BOFRecord) record;
                if (bof.getType() == bof.TYPE_WORKBOOK) {
                    System.out.println("Encountered workbook");
                    // assigned to the class level member
                } else if (bof.getType() == bof.TYPE_WORKSHEET) {
                    System.out.println("Encountered sheet reference");
                }
                break;
            case BoundSheetRecord.sid:
                BoundSheetRecord bsr = (BoundSheetRecord) record;
                System.out.println("New sheet named: " + bsr.getSheetname());
                break;
            case RowRecord.sid:
                RowRecord rowrec = (RowRecord) record;
                System.out.println("Row found, first column at "
                        + rowrec.getFirstCol() + " last column at " + rowrec.getLastCol());
                break;
            case NumberRecord.sid:
                NumberRecord numrec = (NumberRecord) record;
                System.out.println("Cell found with value " + numrec.getValue()
                        + " at row " + numrec.getRow() + " and column " + numrec.getColumn());
                break;
            // SSTRecords store a array of unique strings used in Excel.
            case SSTRecord.sid:
                sstrec = (SSTRecord) record;
                for (int k = 0; k < sstrec.getNumUniqueStrings(); k++) {
                    System.out.println("String table value " + k + " = " + sstrec.getString(k));
                }
                break;
            case LabelSSTRecord.sid:
                LabelSSTRecord lrec = (LabelSSTRecord) record;
                System.out.println("String cell found with value "
                        + sstrec.getString(lrec.getSSTIndex()));
                break;
        }
    }

    /**
     * Read an excel file and spit out what we find.
     *
     * @param args Expect one argument that is the file to read.
     * @throws IOException When there is an error processing the file.
     */
    public static void main(String[] args) throws IOException {
        // create a new file input stream with the input file specified
        // at the command line
        FileInputStream fin = new FileInputStream("workbook05.xls");
        // create a new org.apache.poi.poifs.filesystem.Filesystem
        POIFSFileSystem poifs = new POIFSFileSystem(fin);
        // get the Workbook (excel part) stream in a InputStream
        InputStream din = poifs.createDocumentInputStream("Workbook");
        // construct out HSSFRequest object
        HSSFRequest req = new HSSFRequest();
        // lazy listen for ALL records with the listener shown above
        req.addListenerForAllRecords(new ReadExcel2003());
        // create our event factory
        HSSFEventFactory factory = new HSSFEventFactory();
        // process our events based on the document input stream
        factory.processEvents(req, din);
        // once all the events are processed close our file input stream
        fin.close();
        // and our document input stream (don't want to leak these!)
        din.close();
        System.out.println("done.");
    }
}

输出

Encountered workbook
New sheet named: sheet01
String table value 0 = This is a String
Encountered sheet reference
Row found, first column at 0 last column at 5
Row found, first column at 0 last column at 5
Row found, first column at 0 last column at 5
Cell found with value 1.0 at row 0 and column 0
Cell found with value 42932.0 at row 0 and column 1
String cell found with value This is a String
Cell found with value 42932.0 at row 0 and column 3
Cell found with value 2.0 at row 1 and column 1
Cell found with value 42932.0 at row 1 and column 2
String cell found with value This is a String
Cell found with value 42932.0 at row 1 and column 4
Cell found with value 3.0 at row 2 and column 2
Cell found with value 42932.0 at row 2 and column 3
String cell found with value This is a String
done.

我们实现了HSSFListener接口的processRecord方法来自定义了当我们遇到不同Record之后的操作.

现在我们的需求就是Event Model要像User Model那种解析出指定Sheet,指定Row,指定Cell的数据.

我们先来对代码一下小观察：

只能一次性的解析所有的sheet数据.不能分页解析.(这个功能在Dom解析的时候是有的)
解析Row中数据的时候,并不会解析空Cell
先解析了所有的Row,然后才解析了Cell.而不是Row,Cell的嵌套关系.
但是我们可以获取到Cell对应的行号.

通过上面的分析,我们就具备了满足前面所说需求的能力.

具体来说,换页的时候,Cell对应的行号会变成0.我们可以通过这个方案,来判断到了第几个Sheet.

关于指定的Row数据,我们可以根据判断Cell的行号来判断是否需要解析.

而具体的Cell数据,我们可以在一开始传入列数.将每一行的数据解析成一个指定大小的List.并且用空字符串填满这个List.在解析过程中,使用Cell的列号去覆盖List指定位置的值,这样.最终的List中就有这一行中包括空值的数据.想要第一列就获取第几列即可.

解析xlsx

excel的话,我们仍旧使用上面的,仅仅把格式转换为xlsx.
关于代码我们依旧使用官网的demo代码.

解析代码:

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

import java.io.InputStream;
import java.util.Iterator;

public class ReadExcel2007 extends DefaultHandler{
    public void processOneSheet(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();

        XMLReader parser = fetchSheetParser(sst);

        // To look up the Sheet Name / Sheet Order / rID,
        //  you need to process the core Workbook stream.
        // Normally it's of the form rId# or rSheet#
        InputStream sheet1 = r.getSheet("rId1");
        InputSource sheetSource = new InputSource(sheet1);
        parser.parse(sheetSource);
        sheet1.close();
    }

    public void processAllSheets(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();

        XMLReader parser = fetchSheetParser(sst);

        Iterator<InputStream> sheets = r.getSheetsData();
        while(sheets.hasNext()) {
            System.out.println("\nProcessing new sheet:");
            InputStream sheet = sheets.next();
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource);
            sheet.close();
            System.out.println("");
        }
    }

    public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
        XMLReader parser =
                XMLReaderFactory.createXMLReader(
                        "org.apache.xerces.parsers.SAXParser"
                );
        ContentHandler handler = new SheetHandler(sst);
        parser.setContentHandler(handler);
        return parser;
    }

    /**
     * See org.xml.sax.helpers.DefaultHandler javadocs
     */
    private static class SheetHandler extends DefaultHandler {
        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;

        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }

        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            // c => cell
            if(name.equals("c")) {
                // Print the cell reference
                System.out.print(attributes.getValue("r") + " - ");
                // Figure out if the value is an index in the SST
                String cellType = attributes.getValue("t");
                if(cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            }
            // Clear contents cache
            lastContents = "";
        }

        public void endElement(String uri, String localName, String name)
                throws SAXException {
            // Process the last contents as required.
            // Do now, as characters() may be called more than once
            if(nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
                nextIsString = false;
            }

            // v => contents of a cell
            // Output after we've seen the string contents
            if(name.equals("v")) {
                System.out.println(lastContents);
            }
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            lastContents += new String(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        ReadExcel2007 example = new ReadExcel2007();
        example.processOneSheet("workbook07.xlsx");
        example.processAllSheets("workbook07.xlsx");
    }

输出结果

A1 - 1
B1 - 42932
C1 - This is a String
D1 - 42932
C2 - D3 - 
Processing new sheet:
A1 - 1
B1 - 42932
C1 - This is a String
D1 - 42932
C2 - D3 - 

Processing new sheet:
A1 - 111

学过xml解析的肯定不会陌生,这段代码完全就是解析Xml的Sax解析代码.如果你只看到了官网的这段demo代码,你一定是一脸懵逼的.你并不知道xml的格式.而我已经在前面为你提供了xml的格式,相信你就不会因此而迷茫了.

几点说明:

xlsx的Event Model是可以指定sheet来解析的.并且是以1为第一个sheet.
如果单元格里没有值,则不会有v标签.在真正解析的时候要考虑到这个因素.
在做User Model的时候也说过,Excel中使用SST来存储字符串常量.所以无论是Xls,Xlsx都有SST.具体的单元格的真是存储实际是SST中的下标.

总结

本文介绍了使用Event Model解析两种格式Excel的Demo代码.官网中也没有对这些代码如何应用于生产提供具体的指导.所以在一开始学习和应用于项目中,都会遇到一些不可避免的问题.

但是在遇到的问题的时候,我们一定要记住,User Model也是基于Event Model.只不过是抽象层级更高.所以理论上的Event Model是可以实现所有User Model的功能.只不过需要我们做更多的思考和探索.但是我相信通过不停的思考与实践.没有问题是解决不掉的.

参考资料

The New Halloween Document

Java POI操作Excel(Event Model)
该文章为本系列的第二篇第一篇为 : Java POI操作Excel(User Model)第三篇为 : Java ...
Java POI操作Excel(User Model)
该文章为本系列的第一篇第二篇为 : Java POI操作Excel(Event Model)第三篇为 : Java...
Java POI操作Excel(Event User Model
该文章为本系列的第三篇第一篇为 : Java POI操作Excel(User Model)第二篇为 : Java ...
使用POI封装一个轻量级Excel解析框架
该文章为本系列的第四篇第一篇为 : Java POI操作Excel(User Model)第二篇为 : Java ...
Java Apache POI 操作 Excel 导出
Java 可以通过 Apache POI 操作 Excel 的导入导出Apache POI 是一套操作 Micro...
Java Apache POI 操作 Excel 导入
Java 可以通过 Apache POI 操作 Excel 的导入导出Apache POI 是一套操作 Micro...
java操作excel
java操作excel 导入POI.jar包，下载地址： http://poi.apache.org/downlo...
Java 错误提示org.apache.poi.POIXMLEx
java 操作excel文件发布后报错 org.apache.poi.POIXMLException: java...
java操作Excel时java.io.IOException:
Java通过POI操作Excel时有没有遇到过java.io.IOException: Failed to rea...
Java简易Excel导入导出工具
Java对Excel的处理最主流估计是POI了，POI几乎涵盖了Excel所有的操作，但是很多时候并不是很需要那么...