Bug 61267 - Extract text from Microsoft Word 2.0 (pre-OLE2) document
Status: RESOLVED WONTFIX
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: 3.16-dev
Hardware: PC All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-09 10:54 UTC by gaurav.chd3
Modified: 2018-04-02 17:18 UTC



Attachments
Meta data of attached word file gets parsed. However, content of file is not parsed and is blank (14.32 KB, application/msword)
2017-07-09 10:54 UTC, gaurav.chd3

Description gaurav.chd3 2017-07-09 10:54:28 UTC
Created attachment 35106

Metadata of the attached Word file gets parsed. However, the content of the file is not parsed and comes back blank.
Comment 1 Javen O'Neal 2017-07-10 01:52:50 UTC
The file begins with the following bytes:
> 00000000  db a5 2d 00 31 40 09 04  00 00 00 00 2d 00 00 00  |..-.1@......-...|

It also has quite a bit of ASCII embedded in it. This doesn't look like an OLE2 Microsoft Word .doc file or an OOXML Word .docx file. It looks more like a Microsoft Write .wri file, though with a different magic number.

> 00000180  09 4d 65 6d 62 65 72 20  6f 66 20 33 47 50 50 20  |.Member of 3GPP |
> 00000190  28 41 52 49 42 29 0d 0a  4d 72 2e 20 42 65 6e 6e  |(ARIB)..Mr. Benn|

Furthermore, I cannot open this file with Google Docs.

Are you sure this is a Microsoft Word file?
I wasn't able to find any common uses of this magic number.
Comment 2 Javen O'Neal 2017-07-10 01:56:16 UTC
Never mind. Looks like this claims to be a Word 2.0 file.

http://www.filesignatures.net/index.php?page=search&search=DBA52D00&mode=SIG
> DB A5 2D 00   Word 2.0 file, ASCII
Comment 3 Javen O'Neal 2017-07-10 02:12:43 UTC
There are several entry points into POI. We should figure out what class should be responsible for checking the first few bytes (magic number) of a file to figure out what file format it is (Tika style).

We could continue adding known magic numbers to o.a.p.poifs.HeaderBlock, but we may want to reuse that code elsewhere, such as WorkbookFactory/DocumentFactory/SlideshowFactory, the Extractor classes for Tika, etc.
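
A minimal sketch of what such a shared magic-number check could look like, assuming a standalone helper. The class name MagicSniffer, its Format enum, and both detect methods are hypothetical names chosen for illustration and are not POI's actual API. Only the Word 2.0 signature comes from this report; the OLE2 and ZIP signatures are the well-known ones.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper, not part of POI: classifies a file by its leading bytes. */
final class MagicSniffer {
    /** Formats this sketch can tell apart. */
    public enum Format { OLE2, ZIP, WORD2, UNKNOWN }

    // OLE2 compound document signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header "PK\3\4" (the container used by OOXML files).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // Word 2.0 signature seen in the attached file (DB A5 2D 00).
    private static final byte[] WORD2_MAGIC = {(byte) 0xDB, (byte) 0xA5, 0x2D, 0x00};

    private MagicSniffer() {}

    /** Classifies the given leading bytes; returns UNKNOWN if nothing matches. */
    public static Format detect(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Format.ZIP;
        if (startsWith(head, WORD2_MAGIC)) return Format.WORD2;
        return Format.UNKNOWN;
    }

    /** Reads up to 8 bytes from the stream and classifies them. Consumes those bytes. */
    public static Format detect(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 8 bytes were available
            n += r;
        }
        return detect(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With a helper along these lines, each entry point (factories, extractors) could call one detect method and throw a descriptive exception for WORD2 instead of failing silently or returning blank text.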
Comment 4 Nick Burch 2017-07-10 10:27:30 UTC
Tip for next time: run the Tika App jar in --detect mode to see if the file magic is known. In this case, Tika knows it's application/msword2.

Pre-OLE2 Word 2 has two magic numbers and Word 5 has one (at least that Tika knows about). Do people think it's worth adding helpful exceptions in POI for those too?
Comment 5 Dominik Stadler 2018-04-02 17:18:23 UTC
In r1828176 we added detection for Word 2 files, which makes it easier to spot that Apache POI does not support this type of file.

I think there are currently no plans to fully support this very old format. If you are interested in this feature and can work on implementing and maintaining it, please reopen this bug with initial patches for review.