Java POI操作Excel(Event Model)

作者: 李龙杰 | 来源:发表于2017-07-21 21:34 被阅读439次

    该文章为本系列的第二篇
    第一篇为 : Java POI操作Excel(User Model)
    第三篇为 : Java POI操作Excel(Event User Model)
    第四篇为 : 使用POI封装一个轻量级Excel解析框架

    前言

    在上一篇文章中,我们不仅对POI有了基本的认识,也学会了使用POI的User model(Dom)的方式进行Excel解析.

    但是这种方式在进行大数据量Excel解析的时候,要一次性的解析完整个Excel.而在我们的系统中,是存在多个用户同时上传的情况,这种情况对内存的压力是很大的.轻则Full GC.严重的话甚至会OOM.

    所以,POI还提供了抽象程度相比User Model更低,但是也更节约内存的方式来解析Excel.也就是今天要整理的Event Model(Sax)的方式.

    xls & xlsx的存储形式

    xls采用的是一种名为BIFF8(BinaryInterchangeFileFormat)的文件格式.而xlsx则是采用OOXML(Office open Xml)的格式存储数据.

    xls格式Excel的数据组织由于使用的是二进制,没找到太直观的方式展示.故不做特殊说明(实际上是没搞明白怎么看)

    xlsx格式Excel则采用Xml来组织数据.所以我们可以很轻易的看到.如下图:

    excel.png xml.png

    上面的xml只是sheet1对应的xml.而sheet2则对应另一个xml.

    Event Model节约内存原理

    User Model的缺点是一次性将文件读入内存,构建一颗Dom树.并且在POI对Excel的抽象中,每一行,每一个单元格都是一个对象.当文件大,数据量多的时候对内存的占用可想而知.

    Event Model使用的方式是边读取边解析,并且不会将这些数据封装成Row,Cell这样的对象.而都只是普通的数字或者是字符串.并且这些解析出来的对象是不需要一直驻留在内存中,而是解析完使用后就可以回收.

    所以相比于User Model,Event Model更节省内存.效率也更高.但是作为代价,相比User Model功能更少.门槛也要高一些.

    解析

    关于解析代码,使用POI官网提供的代码.并做一些局部的修改来达到学习的目的.

    http://poi.apache.org/spreadsheet/how-to.html

    开发环境

    Java版本 :1.8.0_40
    Maven版本:3.3.9
    POI版本 :3.15

    解析xls

    Excel内容如下图:


    excel_2003.png

    解析代码:

    public class ReadExcel2003 implements HSSFListener {
    
        private SSTRecord sstrec;
    
        /**
         * This method listens for incoming records and handles them as required.
         *
         * @param record The record that was found while reading.
         */
        public void processRecord(Record record) {
            switch (record.getSid()) {
                // the BOFRecord can represent either the beginning of a sheet or the workbook
                case BOFRecord.sid:
                    BOFRecord bof = (BOFRecord) record;
                    if (bof.getType() == bof.TYPE_WORKBOOK) {
                        System.out.println("Encountered workbook");
                        // assigned to the class level member
                    } else if (bof.getType() == bof.TYPE_WORKSHEET) {
                        System.out.println("Encountered sheet reference");
                    }
                    break;
                case BoundSheetRecord.sid:
                    BoundSheetRecord bsr = (BoundSheetRecord) record;
                    System.out.println("New sheet named: " + bsr.getSheetname());
                    break;
                case RowRecord.sid:
                    RowRecord rowrec = (RowRecord) record;
                    System.out.println("Row found, first column at "
                            + rowrec.getFirstCol() + " last column at " + rowrec.getLastCol());
                    break;
                case NumberRecord.sid:
                    NumberRecord numrec = (NumberRecord) record;
                    System.out.println("Cell found with value " + numrec.getValue()
                            + " at row " + numrec.getRow() + " and column " + numrec.getColumn());
                    break;
                // SSTRecords store a array of unique strings used in Excel.
                case SSTRecord.sid:
                    sstrec = (SSTRecord) record;
                    for (int k = 0; k < sstrec.getNumUniqueStrings(); k++) {
                        System.out.println("String table value " + k + " = " + sstrec.getString(k));
                    }
                    break;
                case LabelSSTRecord.sid:
                    LabelSSTRecord lrec = (LabelSSTRecord) record;
                    System.out.println("String cell found with value "
                            + sstrec.getString(lrec.getSSTIndex()));
                    break;
            }
        }
    
        /**
         * Read an excel file and spit out what we find.
         *
         * @param args Expect one argument that is the file to read.
         * @throws IOException When there is an error processing the file.
         */
        public static void main(String[] args) throws IOException {
            // create a new file input stream with the input file specified
            // at the command line
            FileInputStream fin = new FileInputStream("workbook05.xls");
            // create a new org.apache.poi.poifs.filesystem.Filesystem
            POIFSFileSystem poifs = new POIFSFileSystem(fin);
            // get the Workbook (excel part) stream in a InputStream
            InputStream din = poifs.createDocumentInputStream("Workbook");
            // construct out HSSFRequest object
            HSSFRequest req = new HSSFRequest();
            // lazy listen for ALL records with the listener shown above
            req.addListenerForAllRecords(new ReadExcel2003());
            // create our event factory
            HSSFEventFactory factory = new HSSFEventFactory();
            // process our events based on the document input stream
            factory.processEvents(req, din);
            // once all the events are processed close our file input stream
            fin.close();
            // and our document input stream (don't want to leak these!)
            din.close();
            System.out.println("done.");
        }
    }
    

    输出

    Encountered workbook
    New sheet named: sheet01
    String table value 0 = This is a String
    Encountered sheet reference
    Row found, first column at 0 last column at 5
    Row found, first column at 0 last column at 5
    Row found, first column at 0 last column at 5
    Cell found with value 1.0 at row 0 and column 0
    Cell found with value 42932.0 at row 0 and column 1
    String cell found with value This is a String
    Cell found with value 42932.0 at row 0 and column 3
    Cell found with value 2.0 at row 1 and column 1
    Cell found with value 42932.0 at row 1 and column 2
    String cell found with value This is a String
    Cell found with value 42932.0 at row 1 and column 4
    Cell found with value 3.0 at row 2 and column 2
    Cell found with value 42932.0 at row 2 and column 3
    String cell found with value This is a String
    done.
    

    我们实现了HSSFListener接口的processRecord方法来自定义了当我们遇到不同Record之后的操作.

    现在我们的需求就是Event Model要像User Model那种解析出指定Sheet,指定Row,指定Cell的数据.

    我们先来对代码一下小观察:

    1. 只能一次性的解析所有的sheet数据.不能分页解析.(这个功能在Dom解析的时候是有的)
    2. 解析Row中数据的时候,并不会解析空Cell
    3. 先解析了所有的Row,然后才解析了Cell.而不是Row,Cell的嵌套关系.
    4. 但是我们可以获取到Cell对应的行号.

    通过上面的分析,我们就具备了满足前面所说需求的能力.

    具体来说,换页的时候,Cell对应的行号会变成0.我们可以通过这个方案,来判断到了第几个Sheet.

    关于指定的Row数据,我们可以根据判断Cell的行号来判断是否需要解析.

    而具体的Cell数据,我们可以在一开始传入列数.将每一行的数据解析成一个指定大小的List.并且用空字符串填满这个List.在解析过程中,使用Cell的列号去覆盖List指定位置的值,这样.最终的List中就有这一行中包括空值的数据.想要第一列就获取第几列即可.

    解析xlsx

    excel的话,我们仍旧使用上面的,仅仅把格式转换为xlsx.
    关于代码我们依旧使用官网的demo代码.

    解析代码:

    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xssf.eventusermodel.XSSFReader;
    import org.apache.poi.xssf.model.SharedStringsTable;
    import org.apache.poi.xssf.usermodel.XSSFRichTextString;
    import org.xml.sax.*;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;
    
    import java.io.InputStream;
    import java.util.Iterator;
    
    public class ReadExcel2007 extends DefaultHandler{
        public void processOneSheet(String filename) throws Exception {
            OPCPackage pkg = OPCPackage.open(filename);
            XSSFReader r = new XSSFReader( pkg );
            SharedStringsTable sst = r.getSharedStringsTable();
    
            XMLReader parser = fetchSheetParser(sst);
    
            // To look up the Sheet Name / Sheet Order / rID,
            //  you need to process the core Workbook stream.
            // Normally it's of the form rId# or rSheet#
            InputStream sheet1 = r.getSheet("rId1");
            InputSource sheetSource = new InputSource(sheet1);
            parser.parse(sheetSource);
            sheet1.close();
        }
    
        public void processAllSheets(String filename) throws Exception {
            OPCPackage pkg = OPCPackage.open(filename);
            XSSFReader r = new XSSFReader( pkg );
            SharedStringsTable sst = r.getSharedStringsTable();
    
            XMLReader parser = fetchSheetParser(sst);
    
            Iterator<InputStream> sheets = r.getSheetsData();
            while(sheets.hasNext()) {
                System.out.println("\nProcessing new sheet:");
                InputStream sheet = sheets.next();
                InputSource sheetSource = new InputSource(sheet);
                parser.parse(sheetSource);
                sheet.close();
                System.out.println("");
            }
        }
    
        public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
            XMLReader parser =
                    XMLReaderFactory.createXMLReader(
                            "org.apache.xerces.parsers.SAXParser"
                    );
            ContentHandler handler = new SheetHandler(sst);
            parser.setContentHandler(handler);
            return parser;
        }
    
        /**
         * See org.xml.sax.helpers.DefaultHandler javadocs
         */
        private static class SheetHandler extends DefaultHandler {
            private SharedStringsTable sst;
            private String lastContents;
            private boolean nextIsString;
    
            private SheetHandler(SharedStringsTable sst) {
                this.sst = sst;
            }
    
            public void startElement(String uri, String localName, String name,
                                     Attributes attributes) throws SAXException {
                // c => cell
                if(name.equals("c")) {
                    // Print the cell reference
                    System.out.print(attributes.getValue("r") + " - ");
                    // Figure out if the value is an index in the SST
                    String cellType = attributes.getValue("t");
                    if(cellType != null && cellType.equals("s")) {
                        nextIsString = true;
                    } else {
                        nextIsString = false;
                    }
                }
                // Clear contents cache
                lastContents = "";
            }
    
            public void endElement(String uri, String localName, String name)
                    throws SAXException {
                // Process the last contents as required.
                // Do now, as characters() may be called more than once
                if(nextIsString) {
                    int idx = Integer.parseInt(lastContents);
                    lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
                    nextIsString = false;
                }
    
                // v => contents of a cell
                // Output after we've seen the string contents
                if(name.equals("v")) {
                    System.out.println(lastContents);
                }
            }
    
            public void characters(char[] ch, int start, int length)
                    throws SAXException {
                lastContents += new String(ch, start, length);
            }
        }
    
        public static void main(String[] args) throws Exception {
            ReadExcel2007 example = new ReadExcel2007();
            example.processOneSheet("workbook07.xlsx");
            example.processAllSheets("workbook07.xlsx");
        }
    

    输出结果

    A1 - 1
    B1 - 42932
    C1 - This is a String
    D1 - 42932
    C2 - D3 - 
    Processing new sheet:
    A1 - 1
    B1 - 42932
    C1 - This is a String
    D1 - 42932
    C2 - D3 - 
    
    Processing new sheet:
    A1 - 111
    

    学过xml解析的肯定不会陌生,这段代码完全就是解析Xml的Sax解析代码.如果你只看到了官网的这段demo代码,你一定是一脸懵逼的.你并不知道xml的格式.而我已经在前面为你提供了xml的格式,相信你就不会因此而迷茫了.

    几点说明:

    • xlsx的Event Model是可以指定sheet来解析的.并且是以1为第一个sheet.
    • 如果单元格里没有值,则不会有v标签.在真正解析的时候要考虑到这个因素.
    • 在做User Model的时候也说过,Excel中使用SST来存储字符串常量.所以无论是Xls,Xlsx都有SST.具体的单元格的真是存储实际是SST中的下标.

    总结

    本文介绍了使用Event Model解析两种格式Excel的Demo代码.官网中也没有对这些代码如何应用于生产提供具体的指导.所以在一开始学习和应用于项目中,都会遇到一些不可避免的问题.

    但是在遇到的问题的时候,我们一定要记住,User Model也是基于Event Model.只不过是抽象层级更高.所以理论上的Event Model是可以实现所有User Model的功能.只不过需要我们做更多的思考和探索.但是我相信通过不停的思考与实践.没有问题是解决不掉的.

    参考资料

    The New Halloween Document

    相关文章

      网友评论

        本文标题:Java POI操作Excel(Event Model)

        本文链接:https://www.haomeiwen.com/subject/qiqykxtx.html