Exception on Excel file parsing (file attached)

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1cbfe9d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:116)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)
Caused by: org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: Initialisation of record 0x0 left 10 bytes remaining still to be read.
    at org.apache.poi.hssf.record.RecordInputStream.hasNextRecord(RecordInputStream.java:124)
    at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.getNextRecord(HSSFRecordStream.java:126)
    at org.apache.poi.hssf.eventusermodel.HSSFRecordStream.nextRecord(HSSFRecordStream.java:93)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:141)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:98)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:145)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:106)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    ... 3 more
Created attachment 23899 [details] file that causes exception
I am pretty sure that your file has non-zero padding bytes, which was the cause of bug 46987. Unfortunately, the fix for that bug was made in a different class (RecordFactory), whereas in your case the problem occurs in HSSFEventFactory/HSSFRecordStream. The problem lies in the boundary-checking logic for record iteration, and the ideal solution would be to refactor the existing code in RecordFactory so that HSSFEventFactory can use it too.

As a workaround, you can re-save the file in Excel, which should correct the problem. This is practical if you don't have too many affected files.
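To illustrate the kind of boundary check being discussed, here is a minimal, self-contained sketch. It is not POI's RecordFactory or HSSFRecordStream code: the class and method names are hypothetical, and it assumes the usual record layout of a 2-byte little-endian sid followed by a 2-byte little-endian length. It also assumes that a sid of 0 after the last real record marks trailing padding, which is what the "record 0x0 left 10 bytes remaining" message in the stack trace suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in POI.
final class RecordBoundarySketch {

    // Reads an unsigned 16-bit little-endian value at the given offset.
    static int readUShortLE(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    // Returns the sids of all records found, stopping at trailing padding.
    static List<Integer> readSids(byte[] data) {
        List<Integer> sids = new ArrayList<>();
        int pos = 0;
        while (data.length - pos >= 4) { // a record header needs 4 bytes
            int sid = readUShortLE(data, pos);
            int length = readUShortLE(data, pos + 2);
            if (sid == 0) {
                // Assumption: sid 0 is never a real record, so whatever
                // follows (zero or non-zero bytes) is treated as padding.
                break;
            }
            if (pos + 4 + length > data.length) {
                throw new IllegalStateException(
                        "Record 0x" + Integer.toHexString(sid) + " runs past the end of the stream");
            }
            sids.add(sid);
            pos += 4 + length; // skip the header and the record body
        }
        return sids;
    }

    public static void main(String[] args) {
        byte[] stream = {
            0x09, 0x08, 0x02, 0x00, 0x00, 0x06,  // sid 0x0809 with 2 data bytes
            0x0A, 0x00, 0x00, 0x00,              // sid 0x000A with no data
            0x00, 0x00, 0x0A, 0x00,              // padding that looks like "record 0x0", length 10
            1, 2, 3, 4, 5, 6, 7, 8, 9, 10        // non-zero padding bytes
        };
        System.out.println(readSids(stream));
    }
}
```

A reader without the sid check would try to parse the padding as a record with sid 0x0 and 10 bytes of data, which is roughly the situation the exception reports.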
Thanks, now it works fine (patch included :-)
Created attachment 23909 [details] Solution
Please copy "src/java/org/apache/poi/hssf/eventusermodel/HSSFRecordStream.java" to "src/java/org/apache/poi/hssf/record/RecordFactoryInputStream.java" before applying the patch, and remove "src/java/org/apache/poi/hssf/eventusermodel/HSSFRecordStream.java" after applying it.
Applied in r791251.

There is no JUnit test, but your refactoring is well covered by the existing tests, so I think it's OK for this fix to be checked in.

Regards,
Yegor