Bug 47448

Summary: [PATCH] org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: Initialisation of record 0x0 left 10 bytes remaining still to be read
Product: POI
Reporter: Maxim Valyanskiy <max.valjanski>
Component: HSSF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED
Severity: normal
CC: xyang200
Priority: P2    
Version: 3.5-dev   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments:
  file that causes exception
  Solution

Description Maxim Valyanskiy 2009-06-29 07:37:05 UTC
Exception on Excel file parsing (file attached)

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1cbfe9d
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:116)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)
Caused by: org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: Initialisation of record 0x0 left 10 bytes remaining still to be read.
	at org.apache.poi.hssf.record.RecordInputStream.hasNextRecord(RecordInputStream.java:124)
	at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.getNextRecord(HSSFRecordStream.java:126)
	at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.nextRecord(HSSFRecordStream.java:93)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:141)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:98)
	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:145)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:106)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
	... 3 more
Comment 1 Maxim Valyanskiy 2009-06-29 07:37:55 UTC
Created attachment 23899 [details]
file that causes exception
Comment 2 Josh Micich 2009-06-29 12:30:49 UTC
I am pretty sure that the file you have has non-zero padding bytes, which were the cause of bug 46987.  Unfortunately, the fix for that bug was in a different class (RecordFactory), and in your case the problem occurs in HSSFEventFactory/HSSFRecordStream.  The problem is with the boundary-checking logic used when iterating over records, and the ideal solution would be to refactor the existing code in RecordFactory so that HSSFEventFactory can use it too.

You can re-save the file in Excel and that should correct the problem.  This may be a work-around if you don't have too many files that are affected.
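
For illustration only, here is a minimal sketch of the kind of boundary check involved. It is not the code in RecordFactory or HSSFRecordStream. It assumes the usual BIFF record layout (a 2-byte little-endian record id, a 2-byte little-endian length, then the payload) and, as a further assumption, treats a record id of 0 as the start of trailing padding. The names RecordScanSketch, RawRecord and scan are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of POI. */
final class RecordScanSketch {

    /** One raw record: a 16-bit record id plus its payload bytes. */
    static final class RawRecord {
        final int sid;
        final byte[] data;

        RawRecord(int sid, byte[] data) {
            this.sid = sid;
            this.data = data;
        }
    }

    /**
     * Walks a byte array of back-to-back records and returns them in order.
     * Iteration stops quietly at trailing padding instead of failing.
     */
    static List<RawRecord> scan(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        List<RawRecord> records = new ArrayList<>();

        // A record header needs 4 bytes; anything shorter is left-over scrap.
        while (buf.remaining() >= 4) {
            int sid = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF;

            if (sid == 0) {
                // Assumption: id 0 is never a real record, so everything from
                // here on is padding, even if the padding bytes are non-zero.
                break;
            }
            if (length > buf.remaining()) {
                throw new IllegalStateException("Record 0x" + Integer.toHexString(sid)
                        + " claims " + length + " bytes but only "
                        + buf.remaining() + " remain");
            }

            byte[] data = new byte[length];
            buf.get(data);
            records.add(new RawRecord(sid, data));
        }
        return records;
    }
}
```

With a check like this in a single shared place, both the usermodel and the eventusermodel code paths would skip the padding in the same way, which is the point of the refactoring suggested above.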
Comment 3 Maxim Valyanskiy 2009-06-30 03:28:38 UTC
Thanks, now it works fine (patch included :-)
Comment 4 Maxim Valyanskiy 2009-06-30 03:30:40 UTC
Created attachment 23909 [details]
Solution
Comment 5 Maxim Valyanskiy 2009-06-30 03:43:56 UTC
Please copy "src/java/org/apache/poi/hssf/eventusermodel/HSSFRecordStream.java" to "src/java/org/apache/poi/hssf/record/RecordFactoryInputStream.java" before applying the patch, and remove "src/java/org/apache/poi/hssf/eventusermodel/HSSFRecordStream.java" after applying it.
Comment 6 Yegor Kozlov 2009-07-05 07:14:35 UTC
Applied in r791251

There is no JUnit test, but your refactoring is well covered by the existing tests, so I think it's OK for this fix to be checked in.

Regards,
Yegor