Bug 51317

Summary: Need ability to stream and chunk data out of MS Publisher documents
Product: POI
Reporter: Dmitry Goldenberg <dgoldenberg>
Component: HPBF
Assignee: POI Developers List <dev>
Status: RESOLVED LATER    
Severity: enhancement    
Priority: P2    
Version: 3.2-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Bug Depends on: 45602    
Bug Blocks: 51320    

Description Dmitry Goldenberg 2011-06-03 17:33:27 UTC
This is a follow-up to 45602 (Add Java API for MS Publisher .pub files).

Basically, we need to be able to stream text data out of pub files and have enough API hooks to control its chunking.

Right now, HPBFDocument doesn't support the NIO version of the POI file system, which means the whole document gets loaded into memory.

Text extraction is done from the QuillContents object (it probably needs to examine the other parts as well, such as Main and Escher, but that is the subject of another ticket). QuillContents currently reads the whole document input stream into a single byte buffer, parses it and splits it into bits, then picks out the text and hyperlink bits.

For streaming, we'd want a way to not load everything at once but:
a. emit bits as they're encountered
b. make their contents streamable/chunkable, since a single bit may contain a lot of text data
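To illustrate (b), here is a minimal sketch of chunked text emission using only the JDK. It is not existing POI code: the class name ChunkedTextEmitter, the emit method and the chunkChars parameter are hypothetical, and it assumes the text payload of a bit is UTF-16LE and can be read as a plain InputStream. The consumer callback stands in for the "emit as encountered" hook from (a).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Hypothetical helper, not part of POI: streams text out in bounded chunks. */
public final class ChunkedTextEmitter {
    private ChunkedTextEmitter() {}

    /**
     * Decodes text from {@code in} (assumed to be UTF-16LE) and passes it to
     * {@code sink} in pieces of at most {@code chunkChars} characters, so only
     * one chunk is held in memory at a time.
     * Note: a surrogate pair may be split across two chunks.
     */
    public static void emit(InputStream in, int chunkChars, Consumer<String> sink)
            throws IOException {
        if (chunkChars <= 0) {
            throw new IllegalArgumentException("chunkChars must be positive");
        }
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE);
        char[] buf = new char[chunkChars];
        int filled = 0;
        int n;
        // Keep filling the buffer; hand it off each time it becomes full.
        while ((n = reader.read(buf, filled, buf.length - filled)) != -1) {
            filled += n;
            if (filled == buf.length) {
                sink.accept(new String(buf, 0, filled));
                filled = 0;
            }
        }
        // Emit whatever is left over as a final, shorter chunk.
        if (filled > 0) {
            sink.accept(new String(buf, 0, filled));
        }
    }
}
```

A caller would pass something like `chunks::add` or a writer callback as the sink. Whether the chunk size should be counted in characters or bytes is an open design choice; characters are used here only to keep the sketch short.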

I've attempted to implement this but came across exceptions in NDocumentInputStream - subject of another ticket.

Additionally, this functionality would ideally cover Publisher 2010 files, which I don't believe it currently does; that is the subject of another ticket.
Comment 1 Dominik Stadler 2015-03-22 19:27:18 UTC
Basic text extraction is provided via PublisherTextExtractor. Further support, such as chunking or streaming, is unlikely to be added unless someone provides patches for it.