Related to 51317 - we need the ability to stream and chunk data out of MS Publisher documents. I attempted to implement streaming and chunking of data out of .pub files and got the errors below. Essentially, I tried to read from DocumentInputStream in successive chunks, rather than reading the whole stream into one large preallocated byte array:

    byte[] filler = new byte[25];
    byte[] bytes = new byte[8];
    int read = dis.read(bytes, 0, 8);
    if (read <= 0) {
        //
    } else {
        String f8 = new String(bytes);
        if (!f8.equals("CHNKINK ")) {
            throw new IllegalArgumentException("Expecting 'CHNKINK ' but was '" + f8 + "'");
        }
        // Ignore the next 24, for now at least
        dis.read(filler, 8, 24);
        for (int i = 0; i < 20; i++) {
            int offset = 0x20 + i * 24;
            bytes = new byte[25];
            read = dis.read(bytes, offset, bytes.length);

Note the line that attempts to read the 24 filler bytes so we can get to the bits we want. I had to try it that way because I was getting an error simply doing read(bytes, offset, bytes.length). The errors all begin like this:

    Exception in thread "main" java.lang.IndexOutOfBoundsException: can't read past buffer boundaries
        at org.apache.poi.poifs.filesystem.NDocumentInputStream.read(NDocumentInputStream.java:142)
        at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:118)

Now, if we examine NDocumentInputStream.read(byte[], int, int), there is a conditional there:

    if (off < 0 || len < 0 || b.length < off + len) {

This assumes that the byte array is large and that you are reading in sequence. If you want to jump around, you would presumably want to check b.length < len instead. I tried that.
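For context, note that in the standard java.io.InputStream contract, the off argument of read(b, off, len) is an index into the destination buffer, not a position in the stream; advancing past bytes in the stream is done with skip(). A minimal sketch of that contract, using a plain ByteArrayInputStream for illustration (not POI API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadOffsetDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        InputStream in = new ByteArrayInputStream(data);

        // 'off' indexes into the destination array, not the stream:
        byte[] buf = new byte[8];
        int read = in.read(buf, 0, 8);  // copies stream bytes 0..7 into buf[0..7]
        System.out.println(read);       // 8

        // To advance past bytes in the stream, use skip() rather than
        // passing a stream offset to read():
        in.skip(24);                    // now positioned at stream byte 32 (0x20)
        in.read(buf, 0, 8);             // fills buf with stream bytes 32..39
        System.out.println(buf[0]);     // 32
    }
}
```

Under this contract, read(bytes, 0x20, bytes.length) on a 25-byte array is out of bounds regardless of how many bytes the stream still holds.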
With that change, I got the next error:

    Exception in thread "main" java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:125)
        at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully(NDocumentInputStream.java:250)
        at org.apache.poi.poifs.filesystem.NDocumentInputStream.read(NDocumentInputStream.java:151)
        at org.apache.poi.poifs.filesystem.DocumentInputStream.read(DocumentInputStream.java:118)
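For reference, the check quoted above matches what the stock java.io.InputStream contract mandates: off and len are validated against the destination buffer. A sketch of that check, plus a chunked-read loop that stays within it by always filling the buffer from offset 0 (all names here are illustrative, not POI API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedReadDemo {
    // The bounds check mandated by the java.io.InputStream contract.
    // Because off and len index into the destination buffer b, this
    // rejects read(buf, 0x20, 25) when buf.length == 25.
    static void checkBounds(byte[] b, int off, int len) {
        if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        }
    }

    // Chunked reading that stays inside the contract: always fill the
    // buffer from offset 0 and let the stream advance its own position.
    static int readChunks(InputStream in, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        int total = 0, read;
        while ((read = in.read(chunk, 0, chunk.length)) > 0) {
            total += read;      // process chunk[0..read-1] here
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[100]);
        System.out.println(readChunks(in, 24)); // 100
    }
}
```

With this pattern the buffer can be small and reused, which achieves the chunking without ever passing an offset larger than the buffer itself.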
The attachment is too big to attach, even in a zip. Please let me know if you want the file. I suspect this will happen on many, or most, .pub files.
Does this happen even on small Publisher files? I'm guessing it may affect anything where an entry in a POIFS is more than one big block. If you can reproduce it with one of the small sample files we already have, that would mean we could use them and make life easy :)
Created attachment 27107 [details] Smaller file to repro on
Nick, Yes, it's reproducible on smaller files (please see attached). Thanks
Can you try with svn trunk, and see if it helps? I fixed a few bits over the weekend, and I updated most of the DocumentInputStream tests to check NPOIFS too. There is a mark/reset issue, though; I need to fix that before I can write a test for your specific case.
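For context on the mark/reset issue mentioned above, a minimal sketch of the java.io mark/reset contract (shown with a plain ByteArrayInputStream; a DocumentInputStream would be expected to honour the same semantics):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30, 40});

        // markSupported() must be true for mark()/reset() to be usable.
        System.out.println(in.markSupported()); // true

        in.read();                      // consume the first byte (10)
        in.mark(16);                    // remember the current position
        System.out.println(in.read());  // 20
        System.out.println(in.read());  // 30
        in.reset();                     // rewind to the marked position
        System.out.println(in.read());  // 20 again
    }
}
```

A test for the chunked-read case would rely on reset() returning the stream to the marked position exactly, which is the behaviour that needed fixing.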
I can certainly try. So is all this stuff going into trunk? We were actually on the NIO 3.2 branch... I would also ideally like your other fix :) Do you think you'll be looking into streaming APIs for HPBF? Then perhaps we could write the chunking on top of that...
The NIO 3.2 branch hasn't been worked on for quite some time, and isn't likely to receive any new work, so I'm closing this as "In a Later Version", sorry. If you can still reproduce this problem on 3.11, please let us know. As it stands, a very similar unit test passes on trunk, and has done for some time, so I think your problem is solved!