Bug 51320 - Determine whether parts other than QuillContents may contain useful text to extract and if so, support extraction from those
Status: RESOLVED LATER
Alias: None
Product: POI
Classification: Unclassified
Component: HPBF
Version: 3.2-FINAL
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on: 51317
Blocks:
Reported: 2011-06-03 17:58 UTC by Dmitry Goldenberg
Modified: 2015-03-22 19:30 UTC



Description Dmitry Goldenberg 2011-06-03 17:58:14 UTC
Right now, only QuillContents is taken into account when extracting text.

It seems worth researching whether any useful text can be extracted from the Main and the Escher parts.

This is related to bug 51317 - Need ability to stream and chunk data out of MS Publisher documents. If any extra parts get exposed, we'd ideally want streaming available on them too.
Comment 1 Nick Burch 2011-06-03 19:47:46 UTC
The Escher parts are already being parsed by DDF, so it should be fairly easy to walk through them in some sample files and see if there's any useful text in there. If there is, extending the text extractor to look for what we've identified should be fairly straightforward. Any chance you could take a look at some files you have to hand?

As for the main part, I seem to recall the issue is having no idea what on earth is stored in it or the format... First up you'd want to look at hex dumps, and see if there is handy text in there. If there is, then look at several files to see if it's in the same place. If not, look for what might be offsets to where the text lives, and if the offsets are in a predictable place then we're ok.
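For a first pass over the hex dumps, something like the following could help. It is only a rough sketch, not part of POI: `TextRunScanner` and its methods are hypothetical names, and it assumes any text in the part is stored as plain ASCII or as UTF-16LE, which has not been confirmed for the Main part. It reports each run of printable characters together with its byte offset, so the offsets from several files can be compared by hand.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a POI class): finds runs of printable text in raw bytes. */
final class TextRunScanner {

    /** One run of text and the byte offset at which it starts. */
    static final class Run {
        final int offset;
        final String text;
        final boolean utf16;

        Run(int offset, String text, boolean utf16) {
            this.offset = offset;
            this.text = text;
            this.utf16 = utf16;
        }
    }

    /** True for printable 7-bit ASCII (space through tilde). */
    private static boolean isPrintable(byte b) {
        return b >= 0x20 && b < 0x7f;
    }

    /** Number of consecutive printable ASCII bytes starting at start. */
    private static int asciiLength(byte[] data, int start) {
        int i = start;
        while (i < data.length && isPrintable(data[i])) {
            i++;
        }
        return i - start;
    }

    /** Number of consecutive (printable byte, 0x00) pairs starting at start. */
    private static int utf16Length(byte[] data, int start) {
        int i = start;
        while (i + 1 < data.length && isPrintable(data[i]) && data[i + 1] == 0) {
            i += 2;
        }
        return (i - start) / 2;
    }

    /** Returns every ASCII or UTF-16LE run of at least minChars characters. */
    static List<Run> scan(byte[] data, int minChars) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            int ascii = asciiLength(data, i);
            if (ascii >= minChars) {
                runs.add(new Run(i, new String(data, i, ascii, StandardCharsets.US_ASCII), false));
                i += ascii;
                continue;
            }
            int wide = utf16Length(data, i);
            if (wide >= minChars) {
                runs.add(new Run(i, new String(data, i, 2 * wide, StandardCharsets.UTF_16LE), true));
                i += 2 * wide;
                continue;
            }
            i++;
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TextRunScanner <file-with-raw-part-bytes>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (Run r : scan(data, 4)) {
            System.out.printf("0x%08x %-8s %s%n", r.offset, r.utf16 ? "UTF-16LE" : "ASCII", r.text);
        }
    }
}
```

The input here is assumed to be the raw bytes of a single part saved to a file. If the same strings show up at the same offsets across several documents, that suggests a fixed layout; if they move around, the bytes just before each run are the first place to look for offsets or lengths.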

Needs some investigations, sorry!
Comment 2 Dmitry Goldenberg 2011-06-03 23:41:42 UTC
Nick,

Sorry, I am swamped at the moment. This is not as critical, since QuillContents seems to get one most of the content...
Comment 3 Dominik Stadler 2015-03-22 19:30:36 UTC
Resolving this for now as there has not been any activity for years.