I have a couple of xls files that result in exceptions when I try to extract their text.

POI 3.2-FINAL gives the following stacktrace:

org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:186)
    at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:328)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:271)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:196)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:178)
    at [proprietary code trace]
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:142)
    at org.apache.poi.hssf.record.RecordInputStream.readByte(RecordInputStream.java:151)
    at org.apache.poi.hssf.record.MMSRecord.<init>(MMSRecord.java:46)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:184)
    ... 25 common frames omitted

POI 3.5-beta5 gives this stacktrace:

org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionRecordCreator.create(RecordFactory.java:71)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:269)
    at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:248)
    at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.getNextRecord(HSSFRecordStream.java:162)
    at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.nextRecord(HSSFRecordStream.java:93)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:141)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:98)
    at [proprietary code trace]
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (1) bytes
    at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:185)
    at org.apache.poi.hssf.record.RecordInputStream.readByte(RecordInputStream.java:193)
    at org.apache.poi.hssf.record.MMSRecord.<init>(MMSRecord.java:46)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.poi.hssf.record.RecordFactory$ReflectionRecordCreator.create(RecordFactory.java:63)
    ... 12 more

Due to the nature of these files, I cannot post them here, but I am willing to share them with developers looking into this bug.
Without the file, I can only suggest that you dig into the problematic record code (MMSRecord), compare it to the published Microsoft docs, and see if you can spot the issue.

Also, it's worth opening the file in a new copy of Office and doing a "Save As". If the re-saved file opens without issue, then a workaround is probably needed for whatever software wrote your file not quite according to the spec. If that doesn't help, then it looks more like a record bug in POI.
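Both traces fail on the first readByte in the MMSRecord constructor, and the 3.5-beta5 message reports 0 bytes available, which suggests (but does not prove) that the file contains an MMS record with no payload. If you want to check this without POI, here is a rough sketch that walks a raw BIFF record stream and reports the offsets of short MMS records. It rests on several assumptions: the Workbook stream has already been extracted from the OLE2 container into a byte array (not shown), each record starts with a 2-byte little-endian id followed by a 2-byte little-endian length, the MMS record id is 0x00C1, and a well-formed MMS record carries 2 bytes. The class and method names are made up for illustration and are not part of POI.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of POI: scans a raw BIFF record stream. */
final class BiffRecordScan {
    // Assumed record id of the MMS record (matches MMSRecord in the traces above).
    static final int MMS_SID = 0x00C1;
    // Assumed payload size of a well-formed MMS record (two single-byte counts).
    static final int MMS_EXPECTED_LEN = 2;

    /**
     * Returns the byte offsets of MMS records whose declared length is
     * shorter than expected. The input is assumed to be the already
     * extracted Workbook stream, not the whole .xls file.
     */
    static List<Integer> findShortMmsRecords(byte[] stream) {
        List<Integer> offsets = new ArrayList<Integer>();
        int pos = 0;
        // Each record header is 4 bytes: id (u16 LE) then length (u16 LE).
        while (pos + 4 <= stream.length) {
            int sid = (stream[pos] & 0xFF) | ((stream[pos + 1] & 0xFF) << 8);
            int len = (stream[pos + 2] & 0xFF) | ((stream[pos + 3] & 0xFF) << 8);
            if (sid == MMS_SID && len < MMS_EXPECTED_LEN) {
                offsets.add(pos);
            }
            // Skip the header and the declared payload to reach the next record.
            pos += 4 + len;
        }
        return offsets;
    }
}
```

If this reports any offsets for your file but none for the copy re-saved from Office, that would point at the writer producing a non-standard MMS record rather than at the rest of the file.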
I had the same problem with a file created in MS Excel. I was able to solve it by removing an image that was embedded over two cells.
Fixed in r938372