Created attachment 29644
Extracted 1.xls file

This is a spinoff from https://issues.apache.org/jira/browse/TIKA-1033

I used Tika to extract embedded documents from the attached emb.ppt. One of those documents is a chart, and Tika detects it as an Excel document and TikaCLI -z saves it as 1.xls (attached).

But when I try to parse the 1.xls with Tika it hits an exception:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4eaf6cb1
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:292)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (2) bytes
    at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
    at org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
    at org.apache.poi.hssf.record.WindowOneRecord.<init>(WindowOneRecord.java:71)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
    ... 15 more

However, Excel 2007 also cannot open 1.xls, so I'm not sure where the bug really is (Tika's extraction of 1.xls from emb.ppt, or Tika/POI's parsing of 1.xls).
Created attachment 29645
emb.ppt that contains embedded chart/excel document
The raw object is an MSGraph.Chart, not an Excel workbook. Don't be misled by the stream name "Workbook"; it is just a naming convention.

The MSGraph.Chart format is derived from BIFF8. The content stream consists of records, but the structure and length of those records *CAN* be totally different from their analogues in the binary .xls format. For example, the POI-HSSF parser identifies a record with sid=0x3d as WindowOneRecord and expects it to consist of nine shorts, 18 bytes in total (9 fields of 2 bytes each). The MSGraph.Chart format is different: depending on the position of WindowOneRecord in the stream, it can be either 18 bytes (nine two-byte fields) or 10 bytes (five two-byte fields); see section 2.4.104 in [MS-OGRAPH].pdf. I found similar discrepancies for SelectionRecord (0x001D) and LinkedDataRecord (0x1051).

All this means that using HSSF to parse MSGraph.Chart is not quite correct. It is a special case, and you need a special parser to handle it.

What information do you need to extract from embedded charts? Series text and data labels? What else? I'm thinking of a special record factory and an event-driven parser that will read only specific bits of data. We may need to extend the current API to support it.

Regards,
Yegor
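To illustrate the idea of an event-driven reader that does not assume fixed record sizes, here is a minimal sketch in plain Java. It is not POI code: the class ChartRecordScanner, its RecordListener callback, and readWindow1Fields are hypothetical names made up for this example. It assumes only the usual BIFF record framing (a 2-byte little-endian sid, a 2-byte little-endian length, then the payload) and the two WindowOneRecord sizes mentioned above. The fields are returned as raw 16-bit values; their meaning would have to come from [MS-OGRAPH].

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch, not part of POI: walks BIFF-style records and reports them to a listener. */
final class ChartRecordScanner {
    /** sid of WindowOneRecord, as quoted in the comment above. */
    static final int SID_WINDOW1 = 0x003D;

    /** Hypothetical callback that receives every record found in the stream. */
    interface RecordListener {
        void onRecord(int sid, byte[] data);
    }

    /** Reads (sid, length, payload) triples; the payload is passed on without interpretation. */
    static void scan(byte[] stream, RecordListener listener) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int sid = buf.getShort() & 0xFFFF;
            int len = buf.getShort() & 0xFFFF;
            if (len > buf.remaining()) {
                throw new IllegalArgumentException(
                        "Truncated record 0x" + Integer.toHexString(sid));
            }
            byte[] data = new byte[len];
            buf.get(data);
            listener.onRecord(sid, data);
        }
    }

    /**
     * Accepts both payload sizes described in the thread: 18 bytes (nine fields)
     * or 10 bytes (five fields). Returns the raw unsigned 16-bit values.
     */
    static int[] readWindow1Fields(byte[] data) {
        if (data.length != 18 && data.length != 10) {
            throw new IllegalArgumentException(
                    "Unexpected WindowOne payload size: " + data.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int[] fields = new int[data.length / 2];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = buf.getShort() & 0xFFFF;
        }
        return fields;
    }
}
```

A listener would typically switch on the sid, call readWindow1Fields only for SID_WINDOW1, and ignore records it does not care about. Which records carry series text and data labels is not covered by this sketch.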
Interesting, all news to me!

Is there an easy way that you know of to tell if a file containing a Workbook entry is really an Excel file, or instead an MSGraph.Chart? We'll need that logic for Tika.
I don't know an easy way to tell MSGraph.Chart from a real Excel file. For embedded documents Tika should always check the ProgID; this property is stored in the host container. In this particular case you are reading embedded data from a .ppt file, so you should check OLEShape#getProgID(). For Excel it should return "Worksheet", for Word "Document", for MSGraph "MSGraph.Chart", etc. One problem is that the ProgID can contain a suffix, e.g. "MSGraph.Chart.8", so it should be a regex check or "startsWith" logic.

(In reply to comment #3)
> Interesting, all news to me!
>
> Is there an easy way that you know to tell if a file containing a Workbook
> entry is really an Excel file, or instead a MSGraph.Chart? We'll need that
> logic for Tika
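The ProgID check suggested in this comment could look roughly like the sketch below. The class EmbeddedProgId and its Kind enum are hypothetical names for illustration, not Tika or POI API. The prefixes are the three values quoted in the comment; real files may well carry other ProgIDs, which this sketch simply reports as UNKNOWN. The string would come from OLEShape#getProgID() in the .ppt case.

```java
/** Hypothetical helper, not part of Tika or POI: maps a ProgID to a coarse document kind. */
final class EmbeddedProgId {
    enum Kind { EXCEL, WORD, MSGRAPH_CHART, UNKNOWN }

    /**
     * Prefix match, so that versioned ProgIDs such as "MSGraph.Chart.8" are recognised.
     * The prefixes are taken from the comment above and are not an exhaustive list.
     */
    static Kind classify(String progId) {
        if (progId == null) {
            return Kind.UNKNOWN;
        }
        String id = progId.trim();
        if (id.startsWith("MSGraph.Chart")) {
            return Kind.MSGRAPH_CHART;
        }
        if (id.startsWith("Worksheet")) {
            return Kind.EXCEL;
        }
        if (id.startsWith("Document")) {
            return Kind.WORD;
        }
        return Kind.UNKNOWN;
    }
}
```

With something like this in place, the caller could skip the HSSF-based Excel parser whenever classify returns MSGRAPH_CHART, instead of relying on the "Workbook" stream name.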
Thanks Yegor!

> What information do you need to extract from embedded charts? Series text and data labels? What else ?

I think series text and data labels would be awesome ... maybe also the data values themselves if possible ... I'm not sure what other textual elements an MSGraph.Chart can have (title?).