Bug 51318

Summary: Exceptions in NDocumentInputStream preventing streaming of data out of MS Publisher files
Product: POI
Reporter: Dmitry Goldenberg <dgoldenberg>
Component: HPBF
Assignee: POI Developers List <dev>
Status: RESOLVED LATER    
Severity: critical    
Priority: P2    
Version: 3.2-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Smaller file to repro on

Description Dmitry Goldenberg 2011-06-03 17:44:01 UTC
Related to 51317 - Need ability to stream and chunk data out of MS Publisher documents.

I attempted to implement streaming and chunking of data out of pub files and got errors as below.

Basically I attempted to read from DocumentInputStream in chunks, in succession, rather than read in the whole stream into a large preallocated byte array.

    byte[] filler = new byte[25]; 
    
    byte[] bytes = new byte[8];
    int read = dis.read(bytes, 0, 8);
    
    if (read <= 0) {
      // 
    } else {
      String f8 = new String(bytes);
      if (!f8.equals("CHNKINK ")) {
        throw new IllegalArgumentException("Expecting 'CHNKINK ' but was '" + f8 + "'");
      }
      // Ignore the next 24, for now at least
    
      dis.read(filler, 8, 24);
      
      for (int i = 0; i < 20; i++) {
        int offset = 0x20 + i * 24;
        
        bytes = new byte[25];
        read = dis.read(bytes, offset, bytes.length);

Note the line which attempts to read the 24 filler bytes so we can get to the bits. I had to try it there because I was getting an error simply trying to do read(bytes, offset, bytes.length).

The errors all look like this at first:
Exception in thread "main" java.lang.IndexOutOfBoundsException: can't read past buffer boundaries
	at org.apache.poi.poifs.filesystem.NDocumentInputStream.read(NDocumentInputStream.java:142)
	at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:118)

Now, if we examine NDocumentInputStream.read(byte[], int, int), there is a conditional there:
if (off < 0 || len < 0 || b.length < off + len) {

This assumes that the byte array is large and you're going in sequence. If you want to jump around you'd presumably want to check b.length < len.

Tried that. Got the next error as follows:
Exception in thread "main" java.lang.IndexOutOfBoundsException
	at java.nio.Buffer.checkBounds(Buffer.java:530)
	at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:125)
	at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully(NDocumentInputStream.java:250)
	at org.apache.poi.poifs.filesystem.NDocumentInputStream.read(NDocumentInputStream.java:151)
	at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:118)
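
A note on the two traces above. Under the general java.io.InputStream.read(byte[] b, int off, int len) contract, off is an index into the destination array b, not a position in the stream, so a check of the form b.length < off + len is the usual bounds check. Relaxing it only moves the failure into the underlying buffer copy, which matches the second trace. Below is a minimal sketch of reading in successive chunks under that contract. It uses java.io.ByteArrayInputStream as a stand-in for DocumentInputStream (an assumption, not tested against POI). The helper names readChunk, skipFully and readHeaderAndChunks are hypothetical, and the 24-byte entry size is taken from the stride in the snippet above rather than from any format documentation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkedReadSketch {

    /** Reads up to len bytes into a fresh array; returns a shorter array at end of stream. */
    static byte[] readChunk(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int total = 0;
        while (total < len) {
            // 'total' is an offset into buf, not a position in the stream
            int n = in.read(buf, total, len - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total == len ? buf : Arrays.copyOf(buf, total);
    }

    /** Advances the stream by count bytes; the stream tracks its own position. */
    static void skipFully(InputStream in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                throw new IOException("Could not skip " + remaining + " more bytes");
            }
            remaining -= skipped;
        }
    }

    /** Hypothetical helper: checks the 8-byte magic, skips 24 bytes, reads fixed-size entries. */
    static List<byte[]> readHeaderAndChunks(InputStream in, int maxChunks, int chunkLen)
            throws IOException {
        String f8 = new String(readChunk(in, 8), StandardCharsets.US_ASCII);
        if (!f8.equals("CHNKINK ")) {
            throw new IllegalArgumentException("Expecting 'CHNKINK ' but was '" + f8 + "'");
        }
        skipFully(in, 24); // ignore the next 24 bytes, as in the snippet above

        List<byte[]> chunks = new ArrayList<>();
        for (int i = 0; i < maxChunks; i++) {
            // Each read starts at index 0 of a new array; successive reads
            // continue from wherever the previous one left the stream.
            byte[] chunk = readChunk(in, chunkLen);
            if (chunk.length < chunkLen) {
                break; // ran out of data
            }
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Synthetic data: magic + 24 filler bytes + two 24-byte entries
        byte[] data = new byte[8 + 24 + 2 * 24];
        System.arraycopy("CHNKINK ".getBytes(StandardCharsets.US_ASCII), 0, data, 0, 8);
        List<byte[]> chunks = readHeaderAndChunks(new ByteArrayInputStream(data), 20, 24);
        System.out.println("chunks read: " + chunks.size());
    }
}
```

The point of the sketch is that the array offset stays at 0 (or at the count already filled) for every read, and position in the stream is advanced only by reading or skipping. Whether this alone avoids the errors reported here on the NIO 3.2 branch is not something the sketch establishes.
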
Comment 1 Dmitry Goldenberg 2011-06-03 17:48:47 UTC
The attachment is too big to attach, even in a zip. Please let me know if you want the file. I suspect this will happen on many or most pub files.
Comment 2 Nick Burch 2011-06-03 19:35:49 UTC
Does this happen even on small publisher files? I'm guessing it may affect anything where an entry in a POIFS is more than one big block. If you can reproduce it with one of the small sample files we already have, that'd mean we could use them and make life easy :)
Comment 3 Dmitry Goldenberg 2011-06-03 23:22:41 UTC
Created attachment 27107 [details]
Smaller file to repro on
Comment 4 Dmitry Goldenberg 2011-06-03 23:23:52 UTC
Nick,

Yes, it's reproducible on smaller files (please see attached).

Thanks
Comment 5 Nick Burch 2011-06-06 14:36:52 UTC
Can you try with svn trunk, and see if it helps? I fixed a few bits on the weekend, and I updated most of the DocumentInputStream tests to check NPOIFS too

There is a mark/reset issue though, need to fix that before I can write a test for your specific case.
Comment 6 Dmitry Goldenberg 2011-06-06 14:40:49 UTC
I can certainly try. So is all this stuff going into trunk?
We were actually on the NIO 3.2 branch...  Also would ideally like your other fix too :)

Do you think you'll be looking into streaming APIs for HPBF? Then perhaps we could write the chunking on top of that...
Comment 7 Nick Burch 2014-12-20 07:25:23 UTC
The NIO 3.2 branch hasn't been worked on for quite some time, and isn't likely to receive any new work. As such, I'm closing this as "In a Later Version", sorry

If you can reproduce this problem still on 3.11, please let us know. As it stands, a very similar unit test passes on trunk, and has done for some time, so I think your problem is solved!