When parsing large excel files using XSSF and the event usermodel, when the needed data referred to by start and length variables refer to the end of ch[], only part of the cell value is returned. This will either give an incorrect numeric value of a cell, or refer an incorrect position in the sharedStrings file. This can be verified by using the poi.xssf.eventusermodel.examples to parse a very large sheet (several thousand rows with a couple dozen columns). Put a conditional breakpoint on the value of the variable start when start + length is 2048 in the method "public void characters(char[] ch, int start, int length) throws SAXException". This will eventually give you a value that is truncated. The value of ch needs to go to 2048 characters, then back up to the last complete set of cell data rather than always cutting at exactly 2048 characters and possibly truncating cell data being referred to. This is a showstopper because the results of using the XSSF eventusermodel can not be trusted when parsing large excel files. Thanks. Paul Dobson
Any chance you could upload a suitably large file that triggers this bug? I have a feeling it's actually a bug in the example, rather than in xssf itself, but I'll need a sample file to test this hunch
Created attachment 22550 [details] Large excel file for testing. Use this file with the attached sample code to duplicate the problem. The file has values in each cell corresponding the location of the cell. The sample code will only work with this file or a similarly formatted excel file.
Created attachment 22551 [details] Java program to test using SampleBigWorkbook.xlsx This program is based on the sample code. It outputs lines when it finds that the cell value from the shared strings table does not match what it should be. It also outputs the value of the cell that was clipped to return the incorrect string from the shared strings table. Let me know if you need anything else. Thanks, Paul
I have attached a sample .xlsx file as well as a java program to test the sample file. The program will only work on the attached .xlsx file or one with cell content laid out in a similar way. Let me know if this is what you needed. Paul
I did some more testing. You are right. It seems to be a problem in the example code. Consider the following change to endElement, and characters methods. I don't know if this is best way to fix the problem but it seems to work in my case. public void endElement(String uri, String localName, String name) throws SAXException { // v => contents of a cell // Output after we've seen the string contents if(nextIsString) { idx = Integer.parseInt(lastContents); lastContents = sst.getSharedStringAt(idx); } if(name.equals("v")) { System.out.println(lastContents); } lastContents = ""; } public void characters(char[] ch, int start, int length) throws SAXException { lastContents += new String(ch, start, length); }
Yup, your fix looks like it's the right one, thanks for figuring that out. I've fixed the documentation and example in svn, so they'll be included in the next beta