Bug 47685 - extracting text from xls files fails
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF
Version: 3.2-FINAL
Hardware: PC Windows Vista
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2009-08-12 03:14 UTC by Christiaan Fluit
Modified: 2010-04-27 05:30 UTC
Description Christiaan Fluit 2009-08-12 03:14:31 UTC
I have a couple of xls files that result in exceptions when I try to extract their text. POI 3.2-FINAL gives the following stacktrace:

org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
	at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:186)
	at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:328)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:271)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:196)
	at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:178)
	at [proprietary code trace]
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:142)
	at org.apache.poi.hssf.record.RecordInputStream.readByte(RecordInputStream.java:151)
	at org.apache.poi.hssf.record.MMSRecord.<init>(MMSRecord.java:46)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:184)
	... 25 common frames omitted

POI 3.5-beta5 gives this stacktrace:

org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
	at org.apache.poi.hssf.record.RecordFactory$ReflectionRecordCreator.create(RecordFactory.java:71)
	at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:269)
	at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:248)
	at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.getNextRecord(HSSFRecordStream.java:162)
	at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.nextRecord(HSSFRecordStream.java:93)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:141)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:98)
	at [proprietary code trace]
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (1) bytes
	at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:185)
	at org.apache.poi.hssf.record.RecordInputStream.readByte(RecordInputStream.java:193)
	at org.apache.poi.hssf.record.MMSRecord.<init>(MMSRecord.java:46)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.poi.hssf.record.RecordFactory$ReflectionRecordCreator.create(RecordFactory.java:63)
	... 12 more

Due to the nature of these files, I cannot post them here, but I am willing to share them with developers looking into this bug.
Comment 1 Nick Burch 2009-08-12 06:50:08 UTC
Without the file, I can only suggest that you dig into the problematic record code (MMSRecord), compare it to the published Microsoft docs, and see if you can spot the issue.

Also, it's worth opening the file in a recent copy of Office and doing a "Save As". If the re-saved file then opens in POI without issue, the software that wrote your original file probably didn't quite follow the spec, and POI would need a workaround for it. If re-saving doesn't help, it looks more like a record bug in POI.
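
To illustrate the kind of check worth looking for in the record code, here is a minimal, self-contained sketch. It does not use POI and is not POI's implementation: RecordBody, parseStrict and parseLenient are hypothetical names, and the two-single-byte-field layout is an assumption made only for illustration. It reproduces the "Not enough data (0) to read requested (1) bytes" failure from the 3.5-beta5 trace when a record body is empty, and shows one possible tolerant alternative (not necessarily what any actual fix does).

```java
public class RecordBoundsSketch {

    /** Hypothetical stand-in for a record body with a read cursor (not a POI class). */
    static final class RecordBody {
        private final byte[] data;
        private int pos = 0;

        RecordBody(byte[] data) {
            this.data = data;
        }

        int remaining() {
            return data.length - pos;
        }

        /** Reads one byte, failing loudly if the body is already exhausted. */
        byte readByte() {
            if (remaining() < 1) {
                throw new IllegalStateException(
                        "Not enough data (" + remaining() + ") to read requested (1) bytes");
            }
            return data[pos++];
        }
    }

    /** Strict parse: assumes the body always holds two single-byte fields. */
    static int[] parseStrict(byte[] body) {
        RecordBody in = new RecordBody(body);
        int first = in.readByte();
        int second = in.readByte();
        return new int[] {first, second};
    }

    /** Lenient parse: treats missing bytes as zero instead of throwing. */
    static int[] parseLenient(byte[] body) {
        RecordBody in = new RecordBody(body);
        int first = in.remaining() > 0 ? in.readByte() : 0;
        int second = in.remaining() > 0 ? in.readByte() : 0;
        return new int[] {first, second};
    }

    public static void main(String[] args) {
        // A well-formed body parses the same way under both strategies.
        int[] ok = parseStrict(new byte[] {3, 4});
        System.out.println(ok[0] + ", " + ok[1]);

        // An empty body makes the strict parser fail on its first read.
        try {
            parseStrict(new byte[0]);
        } catch (IllegalStateException e) {
            System.out.println("strict parse failed: " + e.getMessage());
        }

        // The lenient parser falls back to zeros for the missing fields.
        int[] fallback = parseLenient(new byte[0]);
        System.out.println(fallback[0] + ", " + fallback[1]);
    }
}
```

Whether a lenient fallback is appropriate depends on what the published record documentation says about short or empty records; if the spec forbids them, the better fix may be on the side of the software that wrote the file.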
Comment 2 Andreas 2009-10-23 02:04:22 UTC
I had the same problem with a file created in MS Excel. I was able to work around it by removing an image that was embedded over two cells.
Comment 3 Maxim Valyanskiy 2010-04-27 05:30:47 UTC
Fixed in r938372