Bug 47448 - [PATCH] org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: Initialisation of record 0x0 left 10 bytes remaining still to be read
Product: POI
Classification: Unclassified
Component: HSSF
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assigned To: POI Developers List
Reported: 2009-06-29 07:37 UTC by Maxim Valyanskiy
Modified: 2009-07-05 07:14 UTC (History)

file that causes exception (377.50 KB, application/octet-stream)
2009-06-29 07:37 UTC, Maxim Valyanskiy
Solution (29.18 KB, patch)
2009-06-30 03:30 UTC, Maxim Valyanskiy

Description Maxim Valyanskiy 2009-06-29 07:37:05 UTC
Exception on Excel file parsing (file attached)

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1cbfe9d
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:116)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)
Caused by: org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: Initialisation of record 0x0 left 10 bytes remaining still to be read.
	at org.apache.poi.hssf.record.RecordInputStream.hasNextRecord(RecordInputStream.java:124)
	at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.getNextRecord(HSSFRecordStream.java:126)
	at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.nextRecord(HSSFRecordStream.java:93)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:141)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:98)
	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:145)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:106)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
	... 3 more
Comment 1 Maxim Valyanskiy 2009-06-29 07:37:55 UTC
Created attachment 23899
file that causes exception
Comment 2 Josh Micich 2009-06-29 12:30:49 UTC
I am pretty sure that your file has non-zero padding bytes, which were the cause of bug 46987.  Unfortunately, the fix for that bug was made in a different class (RecordFactory), and in your case the problem occurs in HSSFEventFactory/HSSFRecordStream.  The problem is in the boundary-checking logic for record iteration, and the ideal solution would be to refactor the existing code in RecordFactory so that HSSFEventFactory can use it too.

You can re-save the file in Excel and that should correct the problem.  This may be a work-around if you don't have too many files that are affected.
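
For readers unfamiliar with the boundary problem described above, here is a minimal, self-contained sketch of one way a record iterator can tolerate non-zero padding. It is not POI's actual code. It assumes the usual BIFF8 layout (2-byte little-endian sid, 2-byte little-endian length, then the record body) and the standard BOF (0x0809) and EOF (0x000A) sids. The class name RecordBoundarySketch, the method readRecordSids, and the rule "after a top-level EOF, anything that is not a BOF is padding" are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of POI.
class RecordBoundarySketch {
    static final int BOF_SID = 0x0809; // beginning of a workbook/sheet substream
    static final int EOF_SID = 0x000A; // end of a substream

    /** Returns the sids of all records read before the stream logically ends. */
    static List<Integer> readRecordSids(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> sids = new ArrayList<>();
        int bofDepth = 0;
        boolean lastWasTopLevelEof = false;

        while (buf.remaining() >= 4) {
            int sid = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF;

            // Assumption: once the outermost substream is closed, only a new
            // BOF can start more data; anything else is treated as padding.
            if (lastWasTopLevelEof && sid != BOF_SID) {
                break;
            }
            if (length > buf.remaining()) {
                throw new IllegalStateException("Record 0x" + Integer.toHexString(sid)
                        + " claims " + length + " bytes but only "
                        + buf.remaining() + " remain");
            }
            buf.position(buf.position() + length); // skip the record body
            sids.add(sid);

            if (sid == BOF_SID) {
                bofDepth++;
                lastWasTopLevelEof = false;
            } else if (sid == EOF_SID) {
                bofDepth = Math.max(0, bofDepth - 1);
                lastWasTopLevelEof = (bofDepth == 0);
            } else {
                lastWasTopLevelEof = false;
            }
        }
        return sids;
    }

    public static void main(String[] args) {
        byte[] data = {
            0x09, 0x08, 0x02, 0x00, 0x00, 0x06,        // BOF, 2 bytes of body
            0x42, 0x00, 0x02, 0x00, (byte) 0xE4, 0x04, // an ordinary record
            0x0A, 0x00, 0x00, 0x00,                    // EOF, empty body
            0x00, 0x00, 0x0A, 0x00, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 // non-zero padding
        };
        System.out.println(readRecordSids(data));
    }
}
```

In this sketch the trailing bytes are never parsed as a record, so a reader built this way would not complain about unread bytes in a bogus "record 0x0". A stricter reader might instead log or validate the padding rather than silently discard it.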
Comment 3 Maxim Valyanskiy 2009-06-30 03:28:38 UTC
Thanks, now it works fine (patch included :-)
Comment 4 Maxim Valyanskiy 2009-06-30 03:30:40 UTC
Created attachment 23909
Comment 5 Maxim Valyanskiy 2009-06-30 03:43:56 UTC
Please copy "src/java/org/apache/poi/hssf/eventusermodel/HSSFRecordStream.java" to "src/java/org/apache/poi/hssf/record/RecordFactoryInputStream.java" before applying the patch, and remove "src/java/org/apache/poi/hssf/eventusermodel/HSSFRecordStream.java" after applying it.
Comment 6 Yegor Kozlov 2009-07-05 07:14:35 UTC
Applied in r791251

There is no JUnit test, but your refactoring is well covered by the existing tests, so I think it's OK for this fix to be checked in.