Bug 57456

Summary: RecordFormatExpection (Not enough data) with External Data Source
Product: POI Reporter: Joseph F. <jpfourny>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: jpfourny
Priority: P2    
Version: 3.10-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: File to reproduce the issue.

Description Joseph F. 2015-01-18 03:18:38 UTC
I know there have been resolved cases of RecordFormatException involving SSTRecord in the past, but I think I found a new case. The stack trace looks like this:

org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (2) bytes
	at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
	at org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
	at org.apache.poi.hssf.record.common.UnicodeString.<init>(UnicodeString.java:405)
	at org.apache.poi.hssf.record.SSTDeserializer.manufactureStrings(SSTDeserializer.java:57)
	at org.apache.poi.hssf.record.SSTRecord.<init>(SSTRecord.java:247)

SSTDeserializer.manufactureStrings(..) is told there are strings, but the RecordInputStream has 0 bytes remaining (although there are more records that follow). I find it interesting that the code attempts to read a Unicode string when there are no bytes remaining ... it will always fail, no? 

General Steps to Reproduce:

* Create blank workbook in Excel 2010 (or later).
* Add a web query (or any other data source) and insert into the sheet. This creates an external sheet ref.
* In the "Usage" tab of the connection properties, specify "Remove data from external data range before saving the workbook".
* Save workbook as XLS file.
* Open XLS file in POI. *Boom*.

Upon saving, Excel is writing the SSTRecord that claims to have strings, but it is actually empty. Odd, but it is what it is. I know POI cannot handle external data sources, but I expect to be able to open the file with POI without blowing up. There may be other sheets with useful data that I want to scrape.
Comment 1 Dominik Stadler 2015-01-18 06:56:12 UTC
Can you please add such a file so it becomes easier to reproduce this. Additional points if you can include a unit test as well.
Comment 2 Joseph F. 2015-01-18 22:43:00 UTC
Created attachment 32376 [details]
File to reproduce the issue.

The file contains no data - just a Web Query that is populated when opened in Excel and the data is removed by Excel on save. Note: The Web Query is not reachable outside of a private intranet, so do not expect it to populate in Excel. The point is to demonstrate that Excel is saving an SSTRecord that chokes up POI. Interestingly, this file opens without error in OpenOffice, which I thought was built on POI, no?
Comment 3 Joseph F. 2015-01-27 18:16:35 UTC
I am setting the status back to "NEW", since I have attached an input file to reproduce the issue, as requested. Unfortunately, I do not have time to build a unit test :(.
Comment 4 Dominik Stadler 2015-02-13 19:15:53 UTC
The problem is caused by a SSTRecord that is not built correctly by Excel. 

According to the spec it should contain the number of strings overall in the file and the number of unique strings that then follow in the record. However in this case the number of strings overall is zero (which itself is not allowed per the spec) and the number of unique strings seems to be randomly set. So we try to read a random number of strings, which predictably fails.

I have worked around this in r1659650 so that if the number of string overall is zero, the SST record is read as empty and thus parsing this file works.