Bug 54213 - Exception parsing XLS embedded in PPT file
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: 3.8-FINAL
Hardware: PC Mac OS X 10.4
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-27 16:20 UTC by mikemccand
Modified: 2012-11-28 11:34 UTC



Attachments
Extracted 1.xls file (9.50 KB, application/vnd.ms-excel)
2012-11-27 16:20 UTC, mikemccand
emb.ppt that contains embedded chart/excel document (90.00 KB, application/vnd.ms-powerpoint)
2012-11-27 16:20 UTC, mikemccand

Description mikemccand 2012-11-27 16:20:06 UTC
Created attachment 29644
Extracted 1.xls file

This is a spinoff from https://issues.apache.org/jira/browse/TIKA-1033

I used Tika to extract the embedded documents from the attached emb.ppt.  One of those documents is a chart; Tika detects it as an Excel document, and TikaCLI -z saves it as 1.xls (attached).

But when I try to parse 1.xls with Tika, it hits an exception:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4eaf6cb1
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
	at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
	at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
	at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:292)
	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:144)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 5 more
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (2) bytes
	at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
	at org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
	at org.apache.poi.hssf.record.WindowOneRecord.<init>(WindowOneRecord.java:71)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
	... 15 more

However, Excel 2007 also cannot open 1.xls ... so I'm not sure where the bug really is (Tika's extraction of 1.xls from emb.ppt, or Tika/POI's parsing of 1.xls).
Comment 1 mikemccand 2012-11-27 16:20:47 UTC
Created attachment 29645
emb.ppt that contains embedded chart/excel document
Comment 2 Yegor Kozlov 2012-11-28 08:47:34 UTC
The raw object is an MSGraph.Chart, not an Excel workbook. Don't be misled by the stream name "Workbook" - it is just a format convention.

The MSGraph.Chart format is derived from BIFF8. The content stream consists of records, but the structure and length of those records *CAN* be totally different from their counterparts in the binary .xls format.

For example, the POI-HSSF parser identifies a record with sid=0x3d as a WindowOneRecord and expects it to consist of nine shorts, 18 bytes in total (9 fields of 2 bytes each).

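The MSGraph.Chart format is different: depending on the position of the WindowOneRecord in the stream, it can be either 18 bytes (nine two-byte fields) or 10 bytes (five two-byte fields); see section 2.4.104 in [MS-OGRAPH].pdf. That would explain the "Not enough data (0) to read requested (2) bytes" failure in WindowOneRecord.<init> in the trace above: HSSF keeps reading shorts after a shorter record has run out.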
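
To make the size difference concrete, here is a minimal sketch of a length-tolerant reader for this record body. It is not POI code: the class name Window1Sketch and the method readFields are hypothetical, and the sketch assumes the 4-byte record header (sid and length) has already been consumed, so that only the body is passed in. It accepts either of the two sizes mentioned above and returns the raw two-byte fields without interpreting them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of POI or Tika. */
final class Window1Sketch {
    /** Record id that HSSF maps to WindowOneRecord. */
    static final int SID_WINDOW1 = 0x3D;

    /**
     * Reads the record body as a sequence of little-endian two-byte fields.
     * Accepts the 18-byte (nine fields) and the 10-byte (five fields) layouts.
     */
    static short[] readFields(byte[] body) {
        if (body.length != 18 && body.length != 10) {
            throw new IllegalArgumentException(
                "Unexpected body length for sid 0x3D: " + body.length);
        }
        // BIFF records store integers in little-endian byte order.
        ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
        short[] fields = new short[body.length / 2];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = buf.getShort();
        }
        return fields;
    }
}
```

The point of the sketch is only that the reader sizes itself from the record's declared length instead of assuming nine fields; what each field means in the short layout is defined in [MS-OGRAPH] and is not modelled here.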

I found similar discrepancies for SelectionRecord (0x001D) and LinkedDataRecord (0x1051).  

All this means that using HSSF to parse MSGraph.Chart is not quite correct. It is a special case, and you need a special parser to handle it.

What information do you need to extract from embedded charts? Series text and data labels? What else?

I'm thinking of a special record factory and an event-driven parser that will read only specific bits of data. We may need to extend the current API to support it.

Regards,
Yegor
Comment 3 Nick Burch 2012-11-28 09:58:59 UTC
Interesting, all news to me!

Is there an easy way that you know to tell if a file containing a Workbook entry is really an Excel file, or instead an MSGraph.Chart? We'll need that logic for Tika.
Comment 4 Yegor Kozlov 2012-11-28 11:17:27 UTC
I don't know an easy way to tell MSGraph.Chart from a real Excel file.  For embedded documents, Tika should always check the ProgID; this property is stored in the host container.

In this particular case you are reading embedded data from a .ppt file, so you should check OLEShape#getProgID(). For Excel it should return "Worksheet", for Word - "Document", for MSGraph - "MSGraph.Chart", etc. One problem is that the ProgID can contain a suffix, e.g. "MSGraph.Chart.8", so it should be a regex check or "startsWith" logic.
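
A minimal sketch of the "startsWith" check might look like the following. The class ProgIdCheck and the method isMsGraphChart are hypothetical names, not part of POI or Tika; the sketch assumes the caller has already obtained the ProgID string from the host container (for a .ppt, from OLEShape#getProgID()). It treats "MSGraph.Chart" and any dotted suffix such as "MSGraph.Chart.8" as a match.

```java
/** Hypothetical helper for illustration; not part of POI or Tika. */
final class ProgIdCheck {
    private static final String MSGRAPH_PREFIX = "MSGraph.Chart";

    /**
     * Returns true if the ProgID names an MSGraph chart, with or without
     * a version suffix such as ".8".
     */
    static boolean isMsGraphChart(String progId) {
        if (progId == null) {
            return false;
        }
        String p = progId.trim();
        if (!p.startsWith(MSGRAPH_PREFIX)) {
            return false;
        }
        // Accept an exact match or a dotted suffix; reject e.g. "MSGraph.ChartX".
        return p.length() == MSGRAPH_PREFIX.length()
            || p.charAt(MSGRAPH_PREFIX.length()) == '.';
    }
}
```

A caller could use this to route the embedded stream away from the HSSF-based Excel parser when the check returns true.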



(In reply to comment #3)
> Interesting, all news to me!
> 
> Is there an easy way that you know to tell if a file containing a Workbook
> entry is really an Excel file, or instead an MSGraph.Chart? We'll need that
> logic for Tika.
Comment 5 mikemccand 2012-11-28 11:34:21 UTC
Thanks Yegor!

> What information do you need to extract from embedded charts? Series text and data labels? What else?

I think series text and data labels would be awesome ... maybe also the data values themselves if possible ... I'm not sure what other textual elements an MSGraph.Chart can have (title?).