Bug 13478

Summary: [PATCH] [RFE] POIFS, RawDataBlock: Missing workaround for low performance InputStreams
Product: POI Reporter: Jens Gerhard <gerhajns>
Component: POIFSAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: blocker CC: paulk, shimonp, sys, volker.tirjan
Priority: P3    
Version: 2.0-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: There is the BlockingInputStream wrapper.
tar.bz2 with patch and RawDataBlock.java
Patch block read functions to loop until correct amount is read, or EOF
Thanks :-) Here's a new patch with a testcase.

Description Jens Gerhard 2002-10-10 07:41:15 UTC
Description:
I'm using the HSSF classes to create an index with lucene. The implementation of
my spider is using the URLConnection.getInputStream to read the excel data
through http. The POIFSFileSystem has some problems with slow InputStream over
http protocol so that the RawDataBlock class throws an exception, if the
inputstream doesn't receive continous.

Platforms:
This problem occurs on all our WindowsNT/W2K workstations and servers with Java
SDK 1.3.x/1.4.x. I'm using lucene 1.2 and POI 1.5 or POI 1.8-dev.

Thing about a solution:
Please create a wrapper around the stream.read (line 87, RawDataBlock class,
POI-version 1.8-dev) method so that the RawDataBlock constructor has now
problems with asynchronous InputStreams like HTTP-Streams.

Here is my draft proposal for a solution:

      public int read(byte[] bf)
        throws IOException
      {
          // We had to revert to byte per byte reading to keep
          // with slow network connections on one hand, without
          // missing the end-of-file.
          // THIS IS A BLOCKING BLOCK READ!!!
          int i = 0;
          int b = 4611;
          while ( i < bf.length )
          {
              b = is.read();
              if ( b == -1 ) 
                  break;
              bf[i++] = (byte) b;
          }
          if ( i == 0 && b == -1 )
              return -1;
          return i;
      }

regards
Jens
Comment 1 Andy Oliver 2002-10-10 11:02:38 UTC
Changed version to 2.0-dev as this is a feature.  Also if you're actually using
1.5....upgrade to 1.5.1.  Glad you're doing this for Lucene.  I kept meaning to
but haven't gotten around to it.

Okay, so why can't you use BufferedInputStream?  If that doesn't work create a
"BlockingInputStream" which takes an InputStream and a timeout as input.  I'll
commit it to util.
Comment 2 Jens Gerhard 2002-10-10 14:09:46 UTC
This is our the complete "BlockingInputStream", because BufferedInputStream
doesn't work with our ExcelReader:


  public class SafePOIStream
      extends InputStream
  {
      protected InputStream is;

      public SafePOIStream(InputStream is)
      {
          this.is = is;
      }

      public int available()
        throws IOException
      {
          return is.available();
      }

      public void close()
        throws IOException
      {
          is.close();
      }

      public void mark(int readLimit)
      {
          is.mark(readLimit);
      }

      public boolean markSupported()
      {
          return is.markSupported();
      }

      public int read()
        throws IOException
      {
          return is.read();
      }

      public int read(byte[] bf)
        throws IOException
      {
          // We had to revert to byte per byte reading to keep
          // with slow network connections on one hand, without
          // missing the end-of-file.
          // THIS IS A BLOCKING BLOCK READ!!!
          int i = 0;
          int b = 4611;
          while ( i < bf.length )
          {
              b = is.read();
              if ( b == -1 ) 
                  break;
              bf[i++] = (byte) b;
          }
          if ( i == 0 && b == -1 )
              return -1;
          return i;
      }

      public int read(byte[] bf, int s, int l)
        throws IOException
      {
          return is.read(bf, s, l);
      }

      public void reset()
        throws IOException
      {
          is.reset();
      }

      public long skip(long n)
        throws IOException
      {
          return is.skip(n);
      }
    }
Comment 3 Andy Oliver 2002-10-10 15:02:53 UTC
please attach it as a file (click create attachment) it is increadibly difficult
to get files out via cut/paste from here without them being total garbage. 
Please also attach a unit test for this.  I'd prefer it be called
BlockingInputStream and be attached in a directory preserving archive relative
to jakarta-poi.  the package should be org.apache.poi.util.  
Comment 4 Jens Gerhard 2002-10-11 05:30:46 UTC
Created attachment 3422 [details]
There is the BlockingInputStream wrapper.
Comment 5 Avik Sengupta 2003-03-18 18:33:39 UTC
*** Bug 18117 has been marked as a duplicate of this bug. ***
Comment 6 Avik Sengupta 2003-03-18 19:01:06 UTC
added BlockingInputStream to poi/util. 
Comment 7 Tony Chao 2003-03-18 19:27:34 UTC
Created attachment 5412 [details]
tar.bz2 with patch and RawDataBlock.java
Comment 8 Tony Chao 2003-03-18 19:28:04 UTC
I disagree with the above solution.

BufferedInputStream does not work because it is also not guaranteed to return
512 bytes on each read (what RawDataBlock excepts).

The above solution remedies the situation by providing the user with an 
alternate InputStream that that guarantees it is able to return the properly 
sized block of data. The BlockingInputStream supplied, does not time out.

I feel the real problem is that RawDataBlock should be more tolerant in it's 
use of InputStreams. RawDataBlock should repeat reads until the desired chuck 
of data base been read. If too many failed reads, then RawDataBlock should 
throw error as it does now.

I have attached above my proposed solution. Patch file and RawDataBlock.java
Comment 9 frank nestel 2003-03-19 08:11:27 UTC
Hi, I'm a collegue of Jens. Just wanted to confirm Tonies comment:
The SafePOIStream is our work arround, which is enough to make POI
work in our project. A fix to RawDataBlock and whatever else was
simply beyond our work scope. For reasons I haven't figures I cannot
download the attachment, is there a different way to obtain it?
BTW the BlockingInputStream has proven to help also in other cases
of slow input, so it might still worth to have arround.
Comment 10 Avik Sengupta 2003-03-19 09:09:16 UTC
I am not convinced from a "seperation of concerns" point of view that this
functionality belongs to RawDataBlock. I would much rather have timouts etc
outside of that class, hence my liking for a BlockingInputStream. Better to have
timeouts added there. 

I would go so far as to suggest that if this is an important issue, we should
change the signature of POIfs to accept a blocking input stream only. Tho this
is an extreme suggestion, i find it cleaner than having RawDataBlock retry stuff. 

Thoughts, folks?
Comment 11 Andy Oliver 2003-03-19 13:29:54 UTC
"A fix to RawDataBlock and whatever else was
simply beyond our work scope"  - you'll go far in opensource with this attitude
;-) -- EVERYTHING we do is beyond our work scope.
Comment 12 Andy Oliver 2003-03-19 13:31:23 UTC
Thought:  Hope BlockingInputStream times out.  I'd rather see the changes to
RawDataBlock than require someone to wrap every input stream.  If it should be
required then it should be an alteration of RawDataBlock...
Comment 13 Avik Sengupta 2003-03-24 21:44:38 UTC
Any other opinion on this? I'm still only half convinced, would like to hear
opinion of others. 
Comment 14 frank nestel 2003-03-25 08:27:02 UTC
The decision where to handle the timeout depends on how you look at
RawDataBlock. If you really see RawDataBlock as a low level communication layer
it should care itself, if not something else should handle the problem. 

Separation of concerns is an argument, since in a high speed environment one
might want not use the BlockingInputStream for efficency reasons whereas over
net-sockets it seems to be necessary all the time. One could think of tuning
parameters (like timeouts) which  could be made configurable on the actual speed
of a connection.

On the other hand the guarantee to deliver 512 bytes is entirely incompatible
with the (documented) contract of jdk InputStream since this makes no guarantees
how many bytes are returned. In that respect RawDataBlock is based on a wrong
assumption of API and should handle the problem.

See, I haven't made up my mind myself yet, just 2 Euro Cents.
Comment 15 Chris Nokleberg 2003-04-14 04:58:29 UTC
I actually just wrote about this very topic in my blog:
http://sixlegs.com/blog/java/inputstreams-suck.html

In my testing I discovered this bug in POI, which I use quite a bit. I think it
is a (severe) bug in RawDataBlock. The contract for InputStream clearly states
that the call may return less than the number of requested bytes. Any method
that needs a certain number of bytes should therefore loop until it has the
total amount.

I don't think performance considerations have any validity here, because:

a) read(byte[], int, int) blocks anyway, since it *is* guaranteed to return at
least one byte or -1 for end of file.

b) You're not going to be doing anything useful until you finish loading the
data. If you are, you should be loading in a separate thread anyway (which is
also where you could handle timeouts).

Chris
Comment 16 Tony Chao 2003-04-17 16:48:54 UTC
I've seen lots of comments regarding this bug but no decision yet.
I still believe fixing RawDataBlock is the correct thing to do.
But perhaps we could put this to rest and get a vote or an executive decision 
by someone?
Comment 17 Andy Oliver 2003-04-17 17:04:24 UTC
propose a vote on the poi-dev list..  We'll vote on it (preface with [VOTE])...
only committer votes are binding but there is no rule about calling a vote AFAIK
Comment 18 Chris Nokleberg 2003-04-17 17:42:21 UTC
Created attachment 5879 [details]
Patch block read functions to loop until correct amount is read, or EOF
Comment 19 Chris Nokleberg 2003-04-17 17:50:54 UTC
The above patch changes DocumentBlock, HeaderBlockReader, and RawDataBlock.
Instead of directly calling in.read(data), they now call readFully in a new
IOUtils class. readFully simply loops, trying to read the full block of data
(i.e. 512 bytes). I've tested the resulting jar and it solves my problems with
slow input streams.

I tried to create a testcase but I'm having trouble with the ant build. "ant
test" complains of unknown junit2 task, and "./build.bat" says it is missing
centipede-user-input.
Comment 20 Danny Mui 2003-04-17 18:09:43 UTC
run: 

build test
Comment 21 Chris Nokleberg 2003-04-17 18:28:09 UTC
Created attachment 5883 [details]
Thanks :-) Here's a new patch with a testcase.
Comment 22 Chris Nokleberg 2003-06-03 00:20:56 UTC
*** Bug 15807 has been marked as a duplicate of this bug. ***
Comment 23 Tony Chao 2003-06-13 19:08:54 UTC
why is this bug still open?
there's been lots of suggestions and at least 2 working patches

could someone with committer level access please select either mine or chris's 
patch and put this bug to rest? it really cant take more than 10 mins to do.
Comment 24 Andy Oliver 2003-06-13 19:58:59 UTC
Explain to me why you can't just pass in a BufferedInputStream (which is blocking)?
Comment 25 Chris Nokleberg 2003-06-13 20:19:27 UTC
BufferedInputStream is not blocking. From the javadoc:

"This iterated read continues until one of the following conditions becomes true:
    * The specified number of bytes have been read,
    * The read method of the underlying stream returns -1, indicating end-of-file
    * The available method of the underlying stream returns zero, indicating
that further input requests would block."

Furthermore, even if this did work, it should not be up to the user to supply a
BufferedInputStream. The POI methods are documented as taking an InputStream,
but fail when given a valid (albeit "slow") InputStream. The fix is simple--just
loop until you have the data you need (i.e. my patch).
Comment 26 Tony Chao 2003-06-13 20:30:20 UTC
are you asking about BufferedInputStream or BlockingInputStream?

BufferedInputStream is not guaranteed to return enough bytes to fill your 
buffer so it will not return 512 bytes if you're using tcp/ip as input stream.
I've tried this, it doesnt work.


BlockingInputStream as proposed by Jens does not time out, and you yourself
(Andrew) agreed that it is undesirable for each user to have to wrap their 
input streams manually for RawDataBlock to work correctly. If that's required, 
then the changes should be made to RawDataBlock


Chris's proposal actually does not time out either, but it at least 
encapsulates the problem so that the end user will not have to wrap every 
input stream.

Time out is especially important for tcp/ip streams that may be lost.
If you look at my patch (it's above as a tar.bz2 attachment and also inlined 
below), my solution attempts to read until the desired 512 bytes is available, 
but if it finds zero data on over 100 consecutive reads, it gives up and 
returns so that the application will not be stalled.



inlined patch for my fix:




Index: RawDataBlock.java
===================================================================
RCS file: /home/cvspublic/jakarta-
poi/src/java/org/apache/poi/poifs/storage/RawDataBlock.java,v
retrieving revision 1.2
diff -u -r1.2 RawDataBlock.java
--- RawDataBlock.java	15 Mar 2002 02:47:56 -0000	1.2
+++ RawDataBlock.java	18 Mar 2003 19:16:04 -0000
@@ -62,18 +62,39 @@
 /**
  * A big block created from an InputStream, holding the raw data
  *
- * @author Marc Johnson (mjohnson at apache dot org
+ * @author Marc Johnson (mjohnson at apache dot org)
+ * @author Tony Chao (sys at yahoo dot com)
  */
 
 public class RawDataBlock
     implements ListManagedBlock
 {
+    /**
+     * Number of consecutive failed reads before giving up.
+     * (currently set to 5)
+     */
+    public static int  RETRY_COUNT = 5;
+
+    /**
+     * Number of milliseconds to sleep before retrying a failed read.
+     * (currently set to 100)
+     */
+    public static long RETRY_WAIT  = 100;
+    
     private byte[]  _data;
     private boolean _eof;
 
     /**
-     * Constructor RawDataBlock
-     *
+     * Constructor RawDataBlock.
+     * 
+     * Reads up to POIFSConstants.BIG_BLOCK_SIZE # of bytes and creates
+     * RawDataBlock from data read. If inputstream is not ready for read
+     * (e.g. read() returns 0), sleep briefly (RETRY_WAIT ms), and retry
+     * up to RETRY_COUNT consecutive times. This permits using various
+     * types of inputstreams (ServletInputStream, ByteArrayInputStream, etc)
+     * as input, but still exits gracefully (as opposed to waiting forever)
+     * if inputstream is unavailable.
+     * 
      * @param stream the InputStream from which the data will be read
      *
      * @exception IOException on I/O errors, and if an insufficient
@@ -84,8 +105,23 @@
         throws IOException
     {
         _data = new byte[ POIFSConstants.BIG_BLOCK_SIZE ];
-        int count = stream.read(_data);
+        int read = stream.read(_data);
+        int count = read;
+        int failcount = 0;
 
+        while (read >= 0 &&
+               count < POIFSConstants.BIG_BLOCK_SIZE &&
+               failcount < RETRY_COUNT)
+        {
+            read = stream.read(_data, count, _data.length-count);
+            count = (read >= 0) ? (count+read) : -1;
+            failcount = (read > 0) ? 0 : (failcount+1);
+            if (read==0) {
+                try {
+                    Thread.sleep(RETRY_WAIT);
+                } catch (InterruptedException e) {}
+            }
+        }
         if (count == -1)
         {
             _eof = true;
Comment 27 Tony Chao 2003-06-13 20:41:19 UTC
oops
typo

i meant that my patch re-reads up to 5 connsecutive reads of zero
but waits 100 milliseconds in between each read
Comment 28 Chris Nokleberg 2003-06-13 20:42:09 UTC
Tony, I don't think your patch is correct. It assumes that InputStream.read can
return zero, and bases the retrying/timeout on that fact.

InputStream.read will actually only ever return zero of that is the number of
bytes requested. Otherwise it will return -1 (EOF), a value greater than zero (#
bytes read), or throw an exception.

I believe that in the case of an HTTP timeout your stream will return -1, or at
least throw an exception, so my patch should work for you (maybe you can try it).
Comment 29 Tony Chao 2003-06-13 21:27:57 UTC
Chris:

you're correct in that according to the javadocs, InputStream.read()
says:
"If len is zero, then no bytes are read and 0 is returned; otherwise, there is 
an attempt to read at least one byte. If no byte is available because the 
stream is at end of file, the value -1 is returned; otherwise, at least one 
byte is read and stored into b."

however, the javadoc from BufferedInputStream.read() says:
"This iterated read continues until one of the following conditions becomes 
true:
    * The specified number of bytes have been read,
    * The read method of the underlying stream returns -1, indicating end-of-
file
    * The available method of the underlying stream returns zero, indicating
that further input requests would block."


I believe InputStream.Available() will return 0 where as InputStream.read() 
will return >0

This would in turn imply BufferedInputStream.read() could return 0 even if 
len!=0

Perhaps you're right and the timeout loop is unncessary, but even so, my
code should still work at least as well as yours.
Comment 30 Tony Chao 2003-06-13 21:29:59 UTC
also..to clarify

my code does NOT throw an exception if the http stream times out.
my code will simply give up on trying to read more and return what it was able 
to read so far.

Comment 31 Chris Nokleberg 2003-06-13 21:36:25 UTC
>> I believe InputStream.Available() will return 0 where as InputStream.read() 
                       
>> will return >0                 

This is true.                                                                     
                                                                               
                    
>> This would in turn imply BufferedInputStream.read() could return 0 even if  
                       
>> len!=0

Nope, BufferedInputStream must adhere to the InputStream contract. It says it
does too: "This method implements the general contract of the corresponding read
method of the InputStream class."

The checking of available only occures after the initial read of the underlying
stream. The first read is done w/o checking available: "If the first read on the
underlying stream returns -1 to indicate end-of-file then this method returns
-1. Otherwise this method returns the number of bytes actually read."

So, it is impossible for BufferedInputStream (or any correctly implemented
InputStream) to return zero unless len == 0.

Chris
Comment 32 Tony Chao 2003-06-13 21:44:19 UTC
ok agreed.
then that make mine and chris's patch equivalent.

we just need a committer to make the patch and close this bug.
either patch will do.
having to manually wrap each inputStream sucks.

chris has nice test classes written up.
take chris's.

but please make the patch and close this bug.
this bug has been open for too long IMO.
Comment 33 Andy Oliver 2003-06-13 23:11:50 UTC
Lets move the discussion onto the list.  I suspect most of the committers are now just 
confused.  I know I sure as heck am.
Comment 34 Andy Oliver 2003-07-24 13:38:58 UTC
okay...lets apply this.  Chris convinced me.  He seems to be pretty with it.  I
still don't understand why BufferedInputStream doesn't work...but what do I know.
Comment 35 Avik Sengupta 2003-07-31 19:45:07 UTC
Applied, but i need help with the testcase here. 

There is one file src/testcases/org/apache/poi/poifs/filesystem/SlowInputStream.java

However, its not a junit testcase, nor does it seem to have a main method.
Chris, can u help me here? How's one supposed to test it. 

All existing tescases pass, so am keeping this in for the moment. 
Comment 36 Chris Nokleberg 2003-07-31 20:11:18 UTC
It's a helper class. In addition to the bug fix, the patch modifies one of the
existing testcases to use it. I'll close the bug but let me know if you need any
more info.
Comment 37 Avik Sengupta 2003-07-31 20:13:06 UTC
Sorry, just noticed that TestDocument.java has been patched at line 107 to read
from SlowInputStream rather than ByteArrayInputStream

  -        document = new POIFSDocument("foo", new ByteArrayInputStream(array));
  +        document = new POIFSDocument("foo", new SlowInputStream(new
ByteArrayInputStream(array)));


Shouldnt we be testing both, rather replacing the existing test.. slightly
confused here :( 
Comment 38 Chris Nokleberg 2003-07-31 20:17:10 UTC
If you want to duplicate the testcase that's fine, it just seemed like overkill
to me. There's already plenty of tests that read in a file via
ByteArrayInputStream. Also, this will probably all be irrelevant in 3.0...
Comment 39 Fredo 2003-09-22 19:32:23 UTC
First: My english is bad & I'm new in Java

I'm working with Poi, great thing... but i have the same error now, and found 
this bug report. I wanted to install one of this patches but i don't know how! 
I'm using:
poi-bin-2.0-pre3-20030728
eclipse
win NT

I can't open this files to patch the .jar file. Can someone help me? sorry for 
this newbi question

greets
fred
Comment 40 Chris Nokleberg 2003-10-29 10:22:38 UTC
*** Bug 24211 has been marked as a duplicate of this bug. ***
Comment 41 Paul King 2007-07-08 05:04:15 UTC
It doesn't look like the code in RawDataBlock has been changed to cater for the
error mentioned here. If you are reading from a file system that happens to
cache data in blocks of a multiple of 512, then everything will work fine - and
the error checking code should never be called. If you read the spreadsheet from
a jar, from a socket, from a web server stream etc., then the error handling
code will prematurely throw the exception. It seems the correct way to fix this
would be just to remove the erroneous error checking code:

    public RawDataBlock(InputStream stream) throws IOException
    {
        _data = new byte[512];
        int count = IOUtils.readFully(stream, _data);
        _eof = count == -1;
    }

And let IOUtils do its job. Let mme know if you want this as a patch.
Tested against 3.0.1-FINAL.

Comment 42 Paul King 2007-07-08 05:33:03 UTC
I should have added that WebTest has been shipping with a patched POI containing
the suggested fix in my last comment since April 2006. WebTest is in widespread
use, so this should have had quite a bit of testing, but I am not sure how many
WebTest users use the Excel testing capabilities - I have personally been
involved with a few but I only have exposure to a fraction of the WebTest user
base. We hope to be able to use a standard version of POI rather than a patched
one so that we can make WebTest deployment to a maven repository simpler.
Comment 43 Andy Oliver 2007-07-08 06:50:49 UTC
Please don't reopen this and/or change the release.  Instead create a new bug,
preface it with patch and target it.  Don't tease "if you want my patch and you
love me", just attach :-).  

-Andy