|Summary:||Need ability to stream and chunk data out of MS Publisher documents|
|Product:||POI||Reporter:||Dmitry Goldenberg <dgoldenberg>|
|Component:||HPBF||Assignee:||POI Developers List <dev>|
|Bug Depends on:||45602|
Description Dmitry Goldenberg 2011-06-03 17:33:27 UTC
This is a follow-up to 45602 (Add Java API for MS Publisher .pub files). Basically, we need to be able to stream text data out of .pub files and have enough API hooks to control its chunking.

Right now, HPBFDocument doesn't support the NIO version of the POI file system, which makes it load the whole document into memory. Text extraction is done from the QuillContents object (it probably needs to examine the other parts too, like Main, Escher etc.; subject of another ticket). QuillContents currently reads the whole document input stream into a single byte buffer, then makes sense of it and splits it into bits, then picks out the text and hyperlink bits.

For streaming, we'd want a way to not load everything at once but:
a. emit bits as they're encountered
b. make their contents streamable/chunkable, since a single bit may contain a lot of text data

I've attempted to implement this but came across exceptions in NDocumentInputStream (subject of another ticket). Additionally, this functionality would ideally cover Publisher 2010 files, which I don't believe it currently does (subject of another ticket).
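To illustrate point (b), here is a rough sketch of what chunked delivery of one bit's text could look like. It is not POI code: the class ChunkedTextSketch, the method streamText and the sink callback are hypothetical names, and it assumes the text bytes are UTF-16LE. It uses only the JDK and reads from a plain InputStream, so the payload is never held in memory as a whole.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not part of the POI API.
class ChunkedTextSketch {

    /**
     * Decodes the stream as UTF-16LE (an assumption about the text encoding)
     * and hands the text to the sink in chunks of at most chunkChars chars.
     * Only one chunk-sized buffer is held at a time. A surrogate pair may be
     * split across two chunks; a real implementation would need to handle that.
     */
    static void streamText(InputStream in, int chunkChars, Consumer<String> sink)
            throws IOException {
        if (chunkChars <= 0) {
            throw new IllegalArgumentException("chunkChars must be positive");
        }
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE);
        char[] buf = new char[chunkChars];
        int filled = 0;
        int n;
        // Keep filling the buffer until it is full or the stream ends.
        while ((n = reader.read(buf, filled, chunkChars - filled)) != -1) {
            filled += n;
            if (filled == chunkChars) {
                sink.accept(new String(buf, 0, filled));
                filled = 0;
            }
        }
        // Emit whatever is left as a final, shorter chunk.
        if (filled > 0) {
            sink.accept(new String(buf, 0, filled));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "Hello, Publisher!".getBytes(StandardCharsets.UTF_16LE);
        List<String> chunks = new ArrayList<>();
        streamText(new ByteArrayInputStream(data), 5, chunks::add);
        for (String chunk : chunks) {
            System.out.println("chunk: \"" + chunk + "\"");
        }
    }
}
```

The callback keeps the caller in control of what happens to each chunk (index it, write it out, stop early by throwing). Something similar one level up could cover point (a), emitting each bit as soon as it is parsed.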
Comment 1 Dominik Stadler 2015-03-22 19:27:18 UTC
Basic text extraction is provided via PublisherTextExtractor. Further support, such as chunking and streaming, is unlikely to be added unless someone provides patches for it.