Bug 51317 - Need ability to stream and chunk data out of MS Publisher documents
Alias: None
Product: POI
Classification: Unclassified
Component: HPBF
Version: 3.2-FINAL
Hardware: All All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
Depends on: 45602
Blocks: 51320
Reported: 2011-06-03 17:33 UTC by Dmitry Goldenberg
Modified: 2015-03-22 19:29 UTC (History)
Description Dmitry Goldenberg 2011-06-03 17:33:27 UTC
This is a follow-up to 45602 (Add Java API for MS Publisher .pub files).

Basically, we need to be able to stream text data out of pub files and have enough API hooks to control its chunking.

Right now, HPBFDocument doesn't support the NIO version of the POI file system, so it loads the whole document into memory.

Text extraction is done from the QuillContents object (it probably also needs to examine the other parts, such as Main and Escher, but that is the subject of another ticket). QuillContents currently reads the whole document input stream into a single byte buffer, parses it and splits it into bits, then picks out the text and hyperlink bits.

For streaming, we'd want a way to avoid loading everything at once and instead:
a. emit bits as they're encountered
b. make their contents streamable/chunkable, since a single bit may contain a lot of text data
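
To make point (b) concrete, here is a minimal sketch of what chunked text emission could look like. It uses only the JDK and is not part of POI: the class name ChunkedTextStreamer, the method streamText and the callback shape are all hypothetical, and it assumes the text bit's payload is UTF-16LE and is available as an InputStream rather than as a fully loaded byte array.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper, not a POI class: illustrates chunked emission only.
final class ChunkedTextStreamer {

    private ChunkedTextStreamer() {}

    /**
     * Decodes the stream as UTF-16LE (an assumption about the text bit's
     * encoding) and passes the text to the sink in pieces of at most
     * chunkChars characters, so the full text is never held in memory.
     */
    static void streamText(InputStream in, int chunkChars, Consumer<String> sink)
            throws IOException {
        if (chunkChars <= 0) {
            throw new IllegalArgumentException("chunkChars must be positive");
        }
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE);
        char[] buf = new char[chunkChars];
        int filled = 0;
        int n;
        // Keep filling the buffer until it is full, then emit it and start over.
        while ((n = reader.read(buf, filled, chunkChars - filled)) != -1) {
            filled += n;
            if (filled == chunkChars) {
                sink.accept(new String(buf, 0, filled));
                filled = 0;
            }
        }
        // Emit whatever is left once the stream is exhausted.
        if (filled > 0) {
            sink.accept(new String(buf, 0, filled));
        }
        // Note: a surrogate pair may be split across two chunks; a real
        // implementation would need to decide how to handle that.
    }
}
```

A caller would pass the chunk size and a callback, for example streamText(bitStream, 4096, chunk -> indexer.add(chunk)), where bitStream and indexer are placeholders for whatever the caller has. Point (a) would additionally need the parser to invoke such a callback per bit as it walks the stream, which this sketch does not attempt.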

I've attempted to implement this but ran into exceptions in NDocumentInputStream (subject of another ticket).

Additionally, this functionality would ideally cover Publisher 2010 files, which I don't believe it currently does (subject of another ticket).
Comment 1 Dominik Stadler 2015-03-22 19:27:18 UTC
Basic text extraction is provided via PublisherTextExtractor. Further support such as chunking, streaming, ... is unlikely to be added unless someone provides patches for it.