61267 – Extract text from Microsoft Word 2.0 (pre-OLE2) document

Status:	RESOLVED WONTFIX

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	POI Overall
Version:	3.16-dev
Hardware:	PC All

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-07-09 10:54 UTC by gaurav.chd3
Modified:	2018-04-02 17:18 UTC
CC List:	1 user

Attachments
Metadata of the attached Word file is parsed, but the file content is not parsed and comes back blank (14.32 KB, application/msword) 2017-07-09 10:54 UTC, gaurav.chd3

Description gaurav.chd3 2017-07-09 10:54:28 UTC

Created attachment 35106

The metadata of the attached Word file is parsed, but the content of the file is not parsed: the extracted text is blank.

Comment 1 Javen O'Neal 2017-07-10 01:52:50 UTC

The file begins with the following bytes:
> 00000000  db a5 2d 00 31 40 09 04  00 00 00 00 2d 00 00 00  |..-.1@......-...|

The file also has quite a bit of ASCII embedded in it. This doesn't look like an OLE2 Microsoft Word .doc file or an OOXML Word .docx file. It looks more like a Microsoft Write .wri file, though it has a different magic number.

> 00000180  09 4d 65 6d 62 65 72 20  6f 66 20 33 47 50 50 20  |.Member of 3GPP |
> 00000190  28 41 52 49 42 29 0d 0a  4d 72 2e 20 42 65 6e 6e  |(ARIB)..Mr. Benn|

Furthermore, I cannot open this file with Google Docs.

Are you sure this is a Microsoft Word file?
I wasn't able to find any common uses of this magic number.

Comment 2 Javen O'Neal 2017-07-10 01:56:16 UTC

Never mind. It looks like this claims to be a Word 2.0 file.

http://www.filesignatures.net/index.php?page=search&search=DBA52D00&mode=SIG
> DB A5 2D 00   Word 2.0 file, ASCII

Comment 3 Javen O'Neal 2017-07-10 02:12:43 UTC

There are several entry points into POI. We should figure out what class should be responsible for checking the first few bytes (magic number) of a file to figure out what file format it is (Tika style).

We could continue adding known magic numbers to o.a.p.poifs.HeaderBlock, but we may want to reuse that code elsewhere, such as WorkbookFactory/DocumentFactory/SlideshowFactory, the Extractor classes for Tika, etc.
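
As a rough illustration of the kind of shared magic-number check discussed here, the following is a minimal sketch using only the JDK. MagicSniffer and its Kind enum are hypothetical names, not POI classes. The Word 2.0 signature is the one quoted in comment 2; the OLE2 and ZIP signatures are the well-known ones for those containers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Apache POI. */
final class MagicSniffer {

    /** File kinds recognised by this sketch, each with its leading bytes. */
    enum Kind {
        OLE2(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1),
        ZIP_OOXML(0x50, 0x4B, 0x03, 0x04),
        WORD2(0xDB, 0xA5, 0x2D, 0x00), // signature quoted in this bug
        UNKNOWN();

        private final byte[] magic;

        Kind(int... bytes) {
            magic = new byte[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                magic[i] = (byte) bytes[i];
            }
        }

        /** True if the header starts with this kind's magic bytes. */
        boolean matches(byte[] header) {
            if (magic.length == 0 || header.length < magic.length) {
                return false;
            }
            for (int i = 0; i < magic.length; i++) {
                if (header[i] != magic[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Classifies a buffer holding the first bytes of a file. */
    static Kind detect(byte[] header) {
        for (Kind kind : Kind.values()) {
            if (kind.matches(header)) {
                return kind;
            }
        }
        return Kind.UNKNOWN;
    }

    /** Peeks at up to 8 bytes of a mark-capable stream, then rewinds it. */
    static Kind detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] header = new byte[8];
        in.mark(header.length);
        int n = 0;
        while (n < header.length) {
            int r = in.read(header, n, header.length - n);
            if (r < 0) {
                break; // end of stream before 8 bytes were available
            }
            n += r;
        }
        in.reset();
        return detect(Arrays.copyOf(header, n));
    }
}
```

Keeping the check in one small class like this is what would let the factories and the extractors share it; callers with a plain stream could wrap it in a BufferedInputStream to get mark/reset support.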

Comment 4 Nick Burch 2017-07-10 10:27:30 UTC

Tip for next time - run the Tika App jar in --detect mode to see if the file magic is known. In this case, Tika knows it's application/msword2

Pre-OLE2 Word 2 has two magic numbers and Word 5 has one (at least that Tika knows about). Do people think it's worth adding helpful exceptions in POI for those too?
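
To make the "helpful exception" idea concrete, here is a minimal sketch of what such a check could look like. Word2Guard and UnsupportedLegacyFormatException are hypothetical names, not existing POI classes, and only the one Word 2.0 signature quoted in this bug (DB A5 2D 00) is covered; the other signatures mentioned above are not included because their bytes are not given in this thread.

```java
/** Hypothetical guard for illustration only; not part of Apache POI. */
final class Word2Guard {

    /** Word 2.0 signature as quoted in this bug report. */
    private static final byte[] WORD2_MAGIC = {(byte) 0xDB, (byte) 0xA5, 0x2D, 0x00};

    /** Hypothetical exception type that names the unsupported format. */
    static final class UnsupportedLegacyFormatException extends IllegalArgumentException {
        UnsupportedLegacyFormatException(String message) {
            super(message);
        }
    }

    /**
     * Throws a descriptive exception if the header starts with the
     * Word 2.0 signature; returns normally for anything else.
     */
    static void rejectIfWord2(byte[] header) {
        if (header.length < WORD2_MAGIC.length) {
            return; // too short to carry the signature
        }
        for (int i = 0; i < WORD2_MAGIC.length; i++) {
            if (header[i] != WORD2_MAGIC[i]) {
                return; // some other format; let the normal code path decide
            }
        }
        throw new UnsupportedLegacyFormatException(
                "The supplied data appears to be a pre-OLE2 Word 2.0 document "
                + "(magic DB A5 2D 00), which is not supported.");
    }
}
```

Calling rejectIfWord2 on the first bytes of a file before handing it to a parser would turn a silently blank extraction into an explicit error that tells the user why no text came back.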

Comment 5 Dominik Stadler 2018-04-02 17:18:23 UTC

In r1828176 we added detection for Word 2 files, which makes it easier to spot that Apache POI does not support this type of file.

I think there are currently no plans to fully support this very old format. If you are interested in this feature and can work on implementing and maintaining it, please reopen this bug with initial patches for review.