Bug 61266 - Extract text from Microsoft Write document
Summary: Extract text from Microsoft Write document
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: POIFS
Version: 3.16-dev
Hardware: PC All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-09 10:50 UTC by gaurav.chd3
Modified: 2017-07-09 16:27 UTC



Attachments
DOC file (73.00 KB, application/msword)
2017-07-09 10:50 UTC, gaurav.chd3

Description gaurav.chd3 2017-07-09 10:50:54 UTC
Created attachment 35105
DOC file

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@547eb45
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
	at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
	at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:357)
	at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:308)
	at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
	at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
	at javax.swing.TransferHandler.importData(Unknown Source)
	at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
	at java.awt.dnd.DropTarget.drop(Unknown Source)
	at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
	at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
	at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
	at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
	at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
	at java.awt.Component.dispatchEventImpl(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
	at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
	at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
	at java.awt.Container.dispatchEventImpl(Unknown Source)
	at java.awt.Window.dispatchEventImpl(Unknown Source)
	at java.awt.Component.dispatchEvent(Unknown Source)
	at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
	at java.awt.EventQueue.access$500(Unknown Source)
	at java.awt.EventQueue$3.run(Unknown Source)
	at java.awt.EventQueue$3.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
	at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
	at java.awt.EventQueue$4.run(Unknown Source)
	at java.awt.EventQueue$4.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
	at java.awt.EventQueue.dispatchEvent(Unknown Source)
	at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
	at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
	at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:181)
	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:124)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 43 more
Comment 1 Javen O'Neal 2017-07-09 11:35:44 UTC
Same comment as on bug 61265 and bug 61257: please provide a better bug title and include the version of POI that you're using.

You can remove the javax.swing, java.awt, and sun calls in the stack trace.
Comment 2 Javen O'Neal 2017-07-09 11:41:53 UTC
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0000AB000000BE31, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

Google Docs reported your file as corrupt as well. Are you sure this is a valid doc file and not encrypted?
Comment 3 gaurav.chd3 2017-07-09 11:58:15 UTC
This is a valid DOC file, though an old one (from 1991). When we open this file (with the text encoding set to the Windows default) and resave it in docx format, the resulting docx file gets parsed successfully.
Comment 4 Javen O'Neal 2017-07-09 12:38:42 UTC
If this is a 1991 Word file, then perhaps HWPFOldDocument (for Word 6 and Word 95) should be used instead of HWPFDocument (for Word 97 and later). It's possible that this file format predates Word 6.

Not sure if POI or Tika should be specifying a different file handler, though it's possible POI (and therefore Tika) can't read this ancient format.

The o.a.p.poifs.storage.HeaderBlock constructor recognizes that this file is not a BIFF2, 3, or 4 document.
Comment 5 Javen O'Neal 2017-07-09 13:16:08 UTC
Looks like POI doesn't currently support reading this file format.

Opening the binary file in a text editor reveals that most of the document contents are saved as ASCII, with a few special characters to embed figures and designate the start of sections. This doesn't look like any OLE2 file I have seen before.

Presumably if all that is needed is text extraction, you could use `strings` on this document.

Changing this to an enhancement request in case someone is interested in figuring out what archaic file format this is and writing a primitive parser that can extract text from the document.
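
For anyone who wants the `strings` approach without leaving Java, here is a minimal sketch of the same idea: collect runs of printable ASCII bytes and discard everything else. The class name `AsciiStringsSketch` and the `minLength` cutoff are made up for illustration and are not part of POI or Tika. This is not a parser for the format; it knows nothing about sections or embedded figures, so some binary noise may come through as short "text" runs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a POI or Tika class): a rough stand-in for `strings`. */
final class AsciiStringsSketch {

    /**
     * Returns every run of printable ASCII (0x20..0x7E, plus tab) in data
     * that is at least minLength characters long, in file order.
     */
    static List<String> printableRuns(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if ((c >= 0x20 && c <= 0x7E) || c == '\t') {
                current.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the data.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

It could be called as `AsciiStringsSketch.printableRuns(Files.readAllBytes(path), 4)`, where `path` points at the attached file; 4 is the default minimum length used by GNU `strings`.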
Comment 6 Javen O'Neal 2017-07-09 13:26:31 UTC
The provided file starts with 0x31be, so it is presumably a Microsoft Write file. Such files are typically found with a .wri extension, though they were later saved with a .doc extension and *optionally* in an OLE2 container (this file isn't). This format dates back to the Windows 1.0 days (1985).

http://www.filesignatures.net/index.php?page=search&search=31BE&mode=SIG
https://en.wikipedia.org/wiki/Microsoft_Write

Strictly speaking, Write is not part of the Microsoft Office suite.
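
To make the signature comparison concrete, here is a small sketch that sniffs the leading bytes of a file. It relies only on the two values in the exception message, which are the first eight bytes read as a little-endian long, and on the 0x31be prefix mentioned above. The class `WriteSignatureSketch` and its method names are hypothetical and not POI or Tika API, and treating two bytes as enough to identify a Write file is an assumption; a real detector would likely check more of the header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical helper (not a POI or Tika class) for sniffing leading magic bytes. */
final class WriteSignatureSketch {

    // OLE2 magic in file order; as a little-endian long this is 0xE11AB1A1E011CFD0.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Assumption: the 0x31 0xBE prefix alone marks a Microsoft Write file.
    static final byte[] WRITE_MAGIC = {0x31, (byte) 0xBE};

    /** Reads the first 8 bytes as a little-endian long, the form shown in the exception. */
    static long littleEndianLong(byte[] header) {
        return ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Returns a coarse label for the header: "OLE2", "Microsoft Write (presumed)" or "unknown". */
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, WRITE_MAGIC)) {
            return "Microsoft Write (presumed)";
        }
        return "unknown";
    }
}
```

Feeding it the bytes 31 BE 00 00 00 AB 00 00 reproduces the 0x0000AB000000BE31 value from the stack trace and yields the Write label rather than OLE2.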
Comment 7 Nick Burch 2017-07-09 16:27:27 UTC
I've added a more helpful exception for these files in r1801376, based on the MIME magic that Apache Tika uses for them.