Related to 51317 - we need the ability to stream and chunk data out of MS Publisher documents. I attempted to implement streaming and chunking of data out of .pub files and got the errors below. Essentially, I tried to read from DocumentInputStream in successive chunks, rather than reading the whole stream into one large preallocated byte array:

    byte[] filler = new byte[25];
    byte[] bytes = new byte[8];
    int read = dis.read(bytes, 0, 8);
    if (read <= 0) {
        //
    } else {
        String f8 = new String(bytes);
        if (!f8.equals("CHNKINK ")) {
            throw new IllegalArgumentException("Expecting 'CHNKINK ' but was '" + f8 + "'");
        }
        // Ignore the next 24, for now at least
        dis.read(filler, 8, 24);
        for (int i = 0; i < 20; i++) {
            int offset = 0x20 + i * 24;
            bytes = new byte[25];
            read = dis.read(bytes, offset, bytes.length);

Note the line that attempts to read the 24 filler bytes so we can get to the bits we want. I had to try it that way because I was getting an error simply doing read(bytes, offset, bytes.length). The errors all begin like this:

    Exception in thread "main" java.lang.IndexOutOfBoundsException: can't read past buffer boundaries
        at org.apache.poi.poifs.filesystem.NDocumentInputStream.read(NDocumentInputStream.java:142)
        at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:118)

Now, if we examine NDocumentInputStream.read(byte[], int, int), there is a conditional there:

    if (off < 0 || len < 0 || b.length < off + len) {

This assumes that the byte array is large and that you are reading in sequence. If you want to jump around, you would presumably want to check b.length < len instead. I tried that.
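For context, note that in the standard java.io.InputStream contract, the off argument of read(b, off, len) is an index into the destination buffer, not a position in the stream; advancing past bytes in the stream is done with skip(). A minimal sketch of that contract, using a plain ByteArrayInputStream for illustration (not POI API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadOffsetDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        InputStream in = new ByteArrayInputStream(data);

        // 'off' indexes into the destination array, not the stream:
        byte[] buf = new byte[8];
        int read = in.read(buf, 0, 8);  // copies stream bytes 0..7 into buf[0..7]
        System.out.println(read);       // 8

        // To advance past bytes in the stream, use skip() rather than
        // passing a stream offset to read():
        in.skip(24);                    // now positioned at stream byte 32 (0x20)
        in.read(buf, 0, 8);             // fills buf with stream bytes 32..39
        System.out.println(buf[0]);     // 32
    }
}
```

Under this contract, read(bytes, 0x20, bytes.length) on a 25-byte array is out of bounds regardless of how many bytes the stream still holds.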
With that change, I got the next error:

    Exception in thread "main" java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:125)
        at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully(NDocumentInputStream.java:250)
        at org.apache.poi.poifs.filesystem.NDocumentInputStream.read(NDocumentInputStream.java:151)
        at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:118)
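For reference, the check quoted above matches what the stock java.io.InputStream contract mandates: off and len are validated against the destination buffer. A sketch of that check, plus a chunked-read loop that stays within it by always filling the buffer from offset 0 (all names here are illustrative, not POI API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedReadDemo {
    // The bounds check mandated by the java.io.InputStream contract.
    // Because off and len index into the destination buffer b, this
    // rejects read(buf, 0x20, 25) when buf.length == 25.
    static void checkBounds(byte[] b, int off, int len) {
        if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        }
    }

    // Chunked reading that stays inside the contract: always fill the
    // buffer from offset 0 and let the stream advance its own position.
    static int readChunks(InputStream in, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        int total = 0, read;
        while ((read = in.read(chunk, 0, chunk.length)) > 0) {
            total += read;      // process chunk[0..read-1] here
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[100]);
        System.out.println(readChunks(in, 24)); // 100
    }
}
```

With this pattern the buffer can be small and reused, which achieves the chunking without ever passing an offset larger than the buffer itself.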
The attachment is too big to attach, even in a zip. Please let me know if you want the file. I suspect this will happen on many, or most, .pub files.
Does this happen even on small Publisher files? I'm guessing it may affect anything where an entry in a POIFS is more than one big block. If you can reproduce it with one of the small sample files we already have, that would mean we could use them and make life easy :)
Created attachment 27107 [details] Smaller file to repro on
Nick, Yes, it's reproducible on smaller files (please see attached). Thanks
Can you try with svn trunk, and see if it helps? I fixed a few bits over the weekend, and I updated most of the DocumentInputStream tests to check NPOIFS too. There is a mark/reset issue, though; I need to fix that before I can write a test for your specific case.
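For context on the mark/reset issue mentioned above, a minimal sketch of the java.io mark/reset contract (shown with a plain ByteArrayInputStream; a DocumentInputStream would be expected to honour the same semantics):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30, 40});

        // markSupported() must be true for mark()/reset() to be usable.
        System.out.println(in.markSupported()); // true

        in.read();                      // consume the first byte (10)
        in.mark(16);                    // remember the current position
        System.out.println(in.read());  // 20
        System.out.println(in.read());  // 30
        in.reset();                     // rewind to the marked position
        System.out.println(in.read());  // 20 again
    }
}
```

A test for the chunked-read case would rely on reset() returning the stream to the marked position exactly, which is the behaviour that needed fixing.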
I can certainly try. So is all this stuff going into trunk? We were actually on the NIO 3.2 branch... I would also ideally like your other fix :) Do you think you'll be looking into streaming APIs for HPBF? Then perhaps we could write the chunking on top of that...
The NIO 3.2 branch hasn't been worked on for quite some time, and isn't likely to receive any new work, so I'm closing this as "In a Later Version", sorry. If you can still reproduce this problem on 3.11, please let us know. As it stands, a very similar unit test passes on trunk, and has done for some time, so I think your problem is solved!