Summary: | Extract text from Microsoft Write document | ||
---|---|---|---|
Product: | POI | Reporter: | gaurav.chd3 |
Component: | POIFS | Assignee: | POI Developers List <dev> |
Status: | RESOLVED WONTFIX | ||
Severity: | enhancement | CC: | gaurav.chd3 |
Priority: | P2 | ||
Version: | 3.16-dev | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | DOC file |
Same comment as bug 61265 and bug 61257: please provide a better bug title and include the version of POI that you're using. You can remove the javax.swing, java.awt, and sun calls in the stack trace.

Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

Google Docs reported your file as corrupt as well. Are you sure this is a valid doc file and not encrypted?

This is a valid DOC file. It is an old file (from 1991). When we open this file (with the text encoding set to the Windows default) and resave it in docx format, the resulting docx file is parsed successfully.

If this is a 1991 Word file, then perhaps HWPFOldDocument (for Word 6 and Word 95) should be used instead of HWPFDocument (Word 97 and later). It's possible that this file format predates Word 6. Not sure if POI or Tika should be specifying a different file handler, though it's possible POI (and therefore Tika) can't read this ancient format. The o.a.p.poifs.storage.HeaderBlock constructor recognizes that this file is not a BIFF2, 3, or 4 document. It looks like POI doesn't currently support reading this file format.

Opening the binary file in a text editor reveals that most of the document contents are saved as ASCII, with a few special characters to embed figures and designate the start of sections. This doesn't look like any OLE2 file I have seen before. Presumably, if all that is needed is text extraction, you could use `strings` on this document.

Changing this to an enhancement request in case someone is interested in figuring out what archaic file format this is and writing a primitive parser that can extract text from the document.

Since it starts with 0x31be, the provided file is presumably a Microsoft Write file. Such files are typically found with a .wri extension, though later versions were saved with a .doc extension and *optionally* in an OLE2 container (this file isn't). The format dates back to the Windows 1.0 days (1985).

http://www.filesignatures.net/index.php?page=search&search=31BE&mode=SIG
https://en.wikipedia.org/wiki/Microsoft_Write

Strictly speaking, Write is not part of the Microsoft Office suite.

I've added a more helpful exception for these files in r1801376, based on the mime magic that Apache Tika uses for them.

I think we won't pursue full support for such ancient file formats. It is better to convert them to something newer, as most tools will likely stop handling these files before long. Detection was improved, so we now at least state that we found a Write document which we cannot read.
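For anyone who only needs the text, the `strings` suggestion above can be approximated in plain Java. The following is a rough sketch, not POI code: the class and method names (`WriteTextSketch`, `looksLikeWrite`, `printableRuns`) are made up for illustration, the magic check only covers the 0x31 0xBE leading bytes mentioned in this thread, and treating the text as printable ASCII is an assumption based on the text-editor observation above. Formatting and non-ASCII characters are simply dropped.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI or Tika.
class WriteTextSketch {

    // True if the data starts with the bytes 0x31 0xBE discussed in this bug.
    static boolean looksLikeWrite(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0x31
                && (data[1] & 0xFF) == 0xBE;
    }

    // Collects runs of printable ASCII (0x20..0x7E) that are at least
    // minLen characters long, similar in spirit to the `strings` tool.
    static List<String> printableRuns(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                current.append((char) c);
            } else {
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java WriteTextSketch <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        if (!looksLikeWrite(data)) {
            System.out.println("Does not start with 0x31 0xBE; extracting anyway.");
        }
        for (String run : printableRuns(data, 4)) {
            System.out.println(run);
        }
    }
}
```

The minimum run length of 4 mirrors the usual default of `strings`; raising it reduces noise from binary sections at the cost of losing very short lines.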
Created attachment 35105: DOC file

The full exception stack trace is included below:

    org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@547eb45
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
        at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:357)
        at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:308)
        at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
        at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
        at javax.swing.TransferHandler.importData(Unknown Source)
        at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
        at java.awt.dnd.DropTarget.drop(Unknown Source)
        at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
        at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
        at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
        at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
        at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
        at java.awt.Component.dispatchEventImpl(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
        at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
        at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Window.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
        at java.awt.EventQueue.access$500(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
        at java.awt.EventQueue$4.run(Unknown Source)
        at java.awt.EventQueue$4.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.run(Unknown Source)
    Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
        at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:181)
        at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:124)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 43 more
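The two 64-bit values in the `Caused by` message above are easier to compare with a hex dump once they are turned back into file bytes. The sketch below assumes the header signature is read as a little-endian long, which matches the 0x31be leading bytes identified in this bug; `SignatureSketch`, `leadingBytes`, and `hex` are illustrative names, not POI API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI.
class SignatureSketch {

    // Values copied from the exception message.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;
    static final long REPORTED_SIGNATURE = 0x0000AB000000BE31L;

    // Returns the 8 bytes as they would appear at the start of the file,
    // assuming the long was read in little-endian order.
    static byte[] leadingBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: expected D0 CF 11 E0 A1 B1 1A E1
        System.out.println("expected " + hex(leadingBytes(OLE2_SIGNATURE)));
        // prints: read     31 BE 00 00 00 AB 00 00
        System.out.println("read     " + hex(leadingBytes(REPORTED_SIGNATURE)));
    }
}
```

Under that assumption the attached file begins with 31 BE rather than the D0 CF 11 E0 prefix of an OLE2 container, which is why the OLE2 reader rejects it before any Word-specific code runs.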