Summary: | Extract text from Microsoft Word 2.0 (pre-OLE2) document | ||
---|---|---|---|
Product: | POI | Reporter: | gaurav.chd3 |
Component: | POI Overall | Assignee: | POI Developers List <dev> |
Status: | RESOLVED WONTFIX | ||
Severity: | enhancement | CC: | gaurav.chd3 |
Priority: | P2 | ||
Version: | 3.16-dev | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | Meta data of attached word file gets parsed. However, content of file is not parsed and is blank |
The file begins with the following bytes:

> 00000000 db a5 2d 00 31 40 09 04 00 00 00 00 2d 00 00 00 |..-.1@......-...|

And has quite a bit of ASCII embedded in it. This doesn't look like an OLE2 Microsoft Word .doc file nor an OOXML Word .docx file. This looks more like a Microsoft Write .wri file, though it has a different magic number.

> 00000180 09 4d 65 6d 62 65 72 20 6f 66 20 33 47 50 50 20 |.Member of 3GPP |
> 00000190 28 41 52 49 42 29 0d 0a 4d 72 2e 20 42 65 6e 6e |(ARIB)..Mr. Benn|

Furthermore, I cannot open this file with Google Docs. Are you sure this is a Microsoft Word file? I wasn't able to find any common uses of this magic number.

Never mind. Looks like this claims to be a Word 2.0 file.

http://www.filesignatures.net/index.php?page=search&search=DBA52D00&mode=SIG

> DB A5 2D 00 Word 2.0 file, ASCII

There are several entry points into POI. We should figure out which class should be responsible for checking the first few bytes (magic number) of a file to determine its format (Tika style). We could continue adding known magic numbers to o.a.p.poifs.HeaderBlock, but we may want to reuse that code elsewhere, such as WorkbookFactory/DocumentFactory/SlideshowFactory, the Extractor classes for Tika, etc.

Tip for next time - run the Tika App jar in --detect mode to see if the file magic is known. In this case, Tika knows it's application/msword2.

Pre-OLE2 word2 has 2 magics, and word5 has 1 (at least that Tika knows about). Do people think it's worth adding helpful exceptions in POI for those too?

In r1828176 we added detection for word2 files, which makes it easier to spot that Apache POI does not support this type of file. I think there are currently no plans to fully support this very old format. Please reopen this with initial patches for review if you are interested in this feature and can work on implementing and maintaining it.
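For readers who want to see what such a magic-number check amounts to, here is a minimal sketch in plain Java. It is not the code committed in r1828176 and does not use any POI API; the class name `Word2MagicSniffer` and its methods are hypothetical, and it only knows the single signature quoted in this report (`DB A5 2D 00`), not the second word2 magic or the word5 magic mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper, for illustration only; not part of Apache POI. */
final class Word2MagicSniffer {
    // Signature quoted in this report: DB A5 2D 00 ("Word 2.0 file").
    private static final byte[] WORD2_MAGIC = {(byte) 0xDB, (byte) 0xA5, 0x2D, 0x00};

    /** Returns true if the given bytes start with the Word 2.0 signature. */
    static boolean looksLikeWord2(byte[] header) {
        if (header == null || header.length < WORD2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, WORD2_MAGIC.length), WORD2_MAGIC);
    }

    /** Peeks at the start of a mark-supporting stream and rewinds it afterwards. */
    static boolean looksLikeWord2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(WORD2_MAGIC.length);
        try {
            byte[] header = new byte[WORD2_MAGIC.length];
            int off = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (off < header.length) {
                int n = in.read(header, off, header.length - off);
                if (n < 0) {
                    break;
                }
                off += n;
            }
            return off == header.length && looksLikeWord2(header);
        } finally {
            in.reset(); // leave the stream positioned where the caller handed it over
        }
    }

    public static void main(String[] args) throws IOException {
        // First 16 bytes of the attached file, as shown in the hex dump above.
        byte[] attached = {(byte) 0xDB, (byte) 0xA5, 0x2D, 0x00, 0x31, 0x40, 0x09, 0x04,
                           0x00, 0x00, 0x00, 0x00, 0x2D, 0x00, 0x00, 0x00};
        // Standard OLE2 compound-document signature, for contrast.
        byte[] ole2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        System.out.println(looksLikeWord2(new ByteArrayInputStream(attached))); // true
        System.out.println(looksLikeWord2(ole2));                               // false
    }
}
```

Keeping the check in one small, stream-rewinding helper is one way the logic could be shared between the various factory and extractor entry points discussed above, so that each of them can fail early with a clear "unsupported format" message.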
Created attachment 35106: Meta data of attached word file gets parsed. However, content of file is not parsed and is blank