Summary: | [PATCH] [RFE] POIFS, RawDataBlock: Missing workaround for low performance InputStreams | ||
---|---|---|---|
Product: | POI | Reporter: | Jens Gerhard <gerhajns> |
Component: | POIFS | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | blocker | CC: | paulk, shimonp, sys, volker.tirjan |
Priority: | P3 | ||
Version: | 2.0-dev | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: |
There is the BlockingInputStream wrapper.
tar.bz2 with patch and RawDataBlock.java Patch block read functions to loop until correct amount is read, or EOF Thanks :-) Here's a new patch with a testcase. |
Description
Jens Gerhard
2002-10-10 07:41:15 UTC
Changed version to 2.0-dev as this is a feature. Also if you're actually using 1.5....upgrade to 1.5.1. Glad you're doing this for Lucene. I kept meaning to but haven't gotten around to it. Okay, so why can't you use BufferedInputStream? If that doesn't work create a "BlockingInputStream" which takes an InputStream and a timeout as input. I'll commit it to util. This is our the complete "BlockingInputStream", because BufferedInputStream doesn't work with our ExcelReader: public class SafePOIStream extends InputStream { protected InputStream is; public SafePOIStream(InputStream is) { this.is = is; } public int available() throws IOException { return is.available(); } public void close() throws IOException { is.close(); } public void mark(int readLimit) { is.mark(readLimit); } public boolean markSupported() { return is.markSupported(); } public int read() throws IOException { return is.read(); } public int read(byte[] bf) throws IOException { // We had to revert to byte per byte reading to keep // with slow network connections on one hand, without // missing the end-of-file. // THIS IS A BLOCKING BLOCK READ!!! int i = 0; int b = 4611; while ( i < bf.length ) { b = is.read(); if ( b == -1 ) break; bf[i++] = (byte) b; } if ( i == 0 && b == -1 ) return -1; return i; } public int read(byte[] bf, int s, int l) throws IOException { return is.read(bf, s, l); } public void reset() throws IOException { is.reset(); } public long skip(long n) throws IOException { return is.skip(n); } } please attach it as a file (click create attachment) it is increadibly difficult to get files out via cut/paste from here without them being total garbage. Please also attach a unit test for this. I'd prefer it be called BlockingInputStream and be attached in a directory preserving archive relative to jakarta-poi. the package should be org.apache.poi.util. Created attachment 3422 [details]
There is the BlockingInputStream wrapper.
*** Bug 18117 has been marked as a duplicate of this bug. *** added BlockingInputStream to poi/util. Created attachment 5412 [details]
tar.bz2 with patch and RawDataBlock.java
I disagree with the above solution. BufferedInputStream does not work because it is also not guaranteed to return 512 bytes on each read (what RawDataBlock excepts). The above solution remedies the situation by providing the user with an alternate InputStream that that guarantees it is able to return the properly sized block of data. The BlockingInputStream supplied, does not time out. I feel the real problem is that RawDataBlock should be more tolerant in it's use of InputStreams. RawDataBlock should repeat reads until the desired chuck of data base been read. If too many failed reads, then RawDataBlock should throw error as it does now. I have attached above my proposed solution. Patch file and RawDataBlock.java Hi, I'm a collegue of Jens. Just wanted to confirm Tonies comment: The SafePOIStream is our work arround, which is enough to make POI work in our project. A fix to RawDataBlock and whatever else was simply beyond our work scope. For reasons I haven't figures I cannot download the attachment, is there a different way to obtain it? BTW the BlockingInputStream has proven to help also in other cases of slow input, so it might still worth to have arround. I am not convinced from a "seperation of concerns" point of view that this functionality belongs to RawDataBlock. I would much rather have timouts etc outside of that class, hence my liking for a BlockingInputStream. Better to have timeouts added there. I would go so far as to suggest that if this is an important issue, we should change the signature of POIfs to accept a blocking input stream only. Tho this is an extreme suggestion, i find it cleaner than having RawDataBlock retry stuff. Thoughts, folks? "A fix to RawDataBlock and whatever else was simply beyond our work scope" - you'll go far in opensource with this attitude ;-) -- EVERYTHING we do is beyond our work scope. Thought: Hope BlockingInputStream times out. I'd rather see the changes to RawDataBlock than require someone to wrap every input stream. If it should be required then it should be an alteration of RawDataBlock... Any other opinion on this? I'm still only half convinced, would like to hear opinion of others. The decision where to handle the timeout depends on how you look at RawDataBlock. If you really see RawDataBlock as a low level communication layer it should care itself, if not something else should handle the problem. Separation of concerns is an argument, since in a high speed environment one might want not use the BlockingInputStream for efficency reasons whereas over net-sockets it seems to be necessary all the time. One could think of tuning parameters (like timeouts) which could be made configurable on the actual speed of a connection. On the other hand the guarantee to deliver 512 bytes is entirely incompatible with the (documented) contract of jdk InputStream since this makes no guarantees how many bytes are returned. In that respect RawDataBlock is based on a wrong assumption of API and should handle the problem. See, I haven't made up my mind myself yet, just 2 Euro Cents. I actually just wrote about this very topic in my blog: http://sixlegs.com/blog/java/inputstreams-suck.html In my testing I discovered this bug in POI, which I use quite a bit. I think it is a (severe) bug in RawDataBlock. The contract for InputStream clearly states that the call may return less than the number of requested bytes. Any method that needs a certain number of bytes should therefore loop until it has the total amount. I don't think performance considerations have any validity here, because: a) read(byte[], int, int) blocks anyway, since it *is* guaranteed to return at least one byte or -1 for end of file. b) You're not going to be doing anything useful until you finish loading the data. If you are, you should be loading in a separate thread anyway (which is also where you could handle timeouts). Chris I've seen lots of comments regarding this bug but no decision yet. I still believe fixing RawDataBlock is the correct thing to do. But perhaps we could put this to rest and get a vote or an executive decision by someone? propose a vote on the poi-dev list.. We'll vote on it (preface with [VOTE])... only committer votes are binding but there is no rule about calling a vote AFAIK Created attachment 5879 [details]
Patch block read functions to loop until correct amount is read, or EOF
The above patch changes DocumentBlock, HeaderBlockReader, and RawDataBlock. Instead of directly calling in.read(data), they now call readFully in a new IOUtils class. readFully simply loops, trying to read the full block of data (i.e. 512 bytes). I've tested the resulting jar and it solves my problems with slow input streams. I tried to create a testcase but I'm having trouble with the ant build. "ant test" complains of unknown junit2 task, and "./build.bat" says it is missing centipede-user-input. run: build test Created attachment 5883 [details]
Thanks :-) Here's a new patch with a testcase.
*** Bug 15807 has been marked as a duplicate of this bug. *** why is this bug still open? there's been lots of suggestions and at least 2 working patches could someone with committer level access please select either mine or chris's patch and put this bug to rest? it really cant take more than 10 mins to do. Explain to me why you can't just pass in a BufferedInputStream (which is blocking)? BufferedInputStream is not blocking. From the javadoc: "This iterated read continues until one of the following conditions becomes true: * The specified number of bytes have been read, * The read method of the underlying stream returns -1, indicating end-of-file * The available method of the underlying stream returns zero, indicating that further input requests would block." Furthermore, even if this did work, it should not be up to the user to supply a BufferedInputStream. The POI methods are documented as taking an InputStream, but fail when given a valid (albeit "slow") InputStream. The fix is simple--just loop until you have the data you need (i.e. my patch). are you asking about BufferedInputStream or BlockingInputStream? BufferedInputStream is not guaranteed to return enough bytes to fill your buffer so it will not return 512 bytes if you're using tcp/ip as input stream. I've tried this, it doesnt work. BlockingInputStream as proposed by Jens does not time out, and you yourself (Andrew) agreed that it is undesirable for each user to have to wrap their input streams manually for RawDataBlock to work correctly. If that's required, then the changes should be made to RawDataBlock Chris's proposal actually does not time out either, but it at least encapsulates the problem so that the end user will not have to wrap every input stream. Time out is especially important for tcp/ip streams that may be lost. If you look at my patch (it's above as a tar.bz2 attachment and also inlined below), my solution attempts to read until the desired 512 bytes is available, but if it finds zero data on over 100 consecutive reads, it gives up and returns so that the application will not be stalled. inlined patch for my fix: Index: RawDataBlock.java =================================================================== RCS file: /home/cvspublic/jakarta- poi/src/java/org/apache/poi/poifs/storage/RawDataBlock.java,v retrieving revision 1.2 diff -u -r1.2 RawDataBlock.java --- RawDataBlock.java 15 Mar 2002 02:47:56 -0000 1.2 +++ RawDataBlock.java 18 Mar 2003 19:16:04 -0000 @@ -62,18 +62,39 @@ /** * A big block created from an InputStream, holding the raw data * - * @author Marc Johnson (mjohnson at apache dot org + * @author Marc Johnson (mjohnson at apache dot org) + * @author Tony Chao (sys at yahoo dot com) */ public class RawDataBlock implements ListManagedBlock { + /** + * Number of consecutive failed reads before giving up. + * (currently set to 5) + */ + public static int RETRY_COUNT = 5; + + /** + * Number of milliseconds to sleep before retrying a failed read. + * (currently set to 100) + */ + public static long RETRY_WAIT = 100; + private byte[] _data; private boolean _eof; /** - * Constructor RawDataBlock - * + * Constructor RawDataBlock. + * + * Reads up to POIFSConstants.BIG_BLOCK_SIZE # of bytes and creates + * RawDataBlock from data read. If inputstream is not ready for read + * (e.g. read() returns 0), sleep briefly (RETRY_WAIT ms), and retry + * up to RETRY_COUNT consecutive times. This permits using various + * types of inputstreams (ServletInputStream, ByteArrayInputStream, etc) + * as input, but still exits gracefully (as opposed to waiting forever) + * if inputstream is unavailable. + * * @param stream the InputStream from which the data will be read * * @exception IOException on I/O errors, and if an insufficient @@ -84,8 +105,23 @@ throws IOException { _data = new byte[ POIFSConstants.BIG_BLOCK_SIZE ]; - int count = stream.read(_data); + int read = stream.read(_data); + int count = read; + int failcount = 0; + while (read >= 0 && + count < POIFSConstants.BIG_BLOCK_SIZE && + failcount < RETRY_COUNT) + { + read = stream.read(_data, count, _data.length-count); + count = (read >= 0) ? (count+read) : -1; + failcount = (read > 0) ? 0 : (failcount+1); + if (read==0) { + try { + Thread.sleep(RETRY_WAIT); + } catch (InterruptedException e) {} + } + } if (count == -1) { _eof = true; oops typo i meant that my patch re-reads up to 5 connsecutive reads of zero but waits 100 milliseconds in between each read Tony, I don't think your patch is correct. It assumes that InputStream.read can return zero, and bases the retrying/timeout on that fact. InputStream.read will actually only ever return zero of that is the number of bytes requested. Otherwise it will return -1 (EOF), a value greater than zero (# bytes read), or throw an exception. I believe that in the case of an HTTP timeout your stream will return -1, or at least throw an exception, so my patch should work for you (maybe you can try it). Chris: you're correct in that according to the javadocs, InputStream.read() says: "If len is zero, then no bytes are read and 0 is returned; otherwise, there is an attempt to read at least one byte. If no byte is available because the stream is at end of file, the value -1 is returned; otherwise, at least one byte is read and stored into b." however, the javadoc from BufferedInputStream.read() says: "This iterated read continues until one of the following conditions becomes true: * The specified number of bytes have been read, * The read method of the underlying stream returns -1, indicating end-of- file * The available method of the underlying stream returns zero, indicating that further input requests would block." I believe InputStream.Available() will return 0 where as InputStream.read() will return >0 This would in turn imply BufferedInputStream.read() could return 0 even if len!=0 Perhaps you're right and the timeout loop is unncessary, but even so, my code should still work at least as well as yours. also..to clarify my code does NOT throw an exception if the http stream times out. my code will simply give up on trying to read more and return what it was able to read so far. >> I believe InputStream.Available() will return 0 where as InputStream.read() >> will return >0 This is true. >> This would in turn imply BufferedInputStream.read() could return 0 even if >> len!=0 Nope, BufferedInputStream must adhere to the InputStream contract. It says it does too: "This method implements the general contract of the corresponding read method of the InputStream class." The checking of available only occures after the initial read of the underlying stream. The first read is done w/o checking available: "If the first read on the underlying stream returns -1 to indicate end-of-file then this method returns -1. Otherwise this method returns the number of bytes actually read." So, it is impossible for BufferedInputStream (or any correctly implemented InputStream) to return zero unless len == 0. Chris ok agreed. then that make mine and chris's patch equivalent. we just need a committer to make the patch and close this bug. either patch will do. having to manually wrap each inputStream sucks. chris has nice test classes written up. take chris's. but please make the patch and close this bug. this bug has been open for too long IMO. Lets move the discussion onto the list. I suspect most of the committers are now just confused. I know I sure as heck am. okay...lets apply this. Chris convinced me. He seems to be pretty with it. I still don't understand why BufferedInputStream doesn't work...but what do I know. Applied, but i need help with the testcase here. There is one file src/testcases/org/apache/poi/poifs/filesystem/SlowInputStream.java However, its not a junit testcase, nor does it seem to have a main method. Chris, can u help me here? How's one supposed to test it. All existing tescases pass, so am keeping this in for the moment. It's a helper class. In addition to the bug fix, the patch modifies one of the existing testcases to use it. I'll close the bug but let me know if you need any more info. Sorry, just noticed that TestDocument.java has been patched at line 107 to read from SlowInputStream rather than ByteArrayInputStream - document = new POIFSDocument("foo", new ByteArrayInputStream(array)); + document = new POIFSDocument("foo", new SlowInputStream(new ByteArrayInputStream(array))); Shouldnt we be testing both, rather replacing the existing test.. slightly confused here :( If you want to duplicate the testcase that's fine, it just seemed like overkill to me. There's already plenty of tests that read in a file via ByteArrayInputStream. Also, this will probably all be irrelevant in 3.0... First: My english is bad & I'm new in Java I'm working with Poi, great thing... but i have the same error now, and found this bug report. I wanted to install one of this patches but i don't know how! I'm using: poi-bin-2.0-pre3-20030728 eclipse win NT I can't open this files to patch the .jar file. Can someone help me? sorry for this newbi question greets fred *** Bug 24211 has been marked as a duplicate of this bug. *** It doesn't look like the code in RawDataBlock has been changed to cater for the error mentioned here. If you are reading from a file system that happens to cache data in blocks of a multiple of 512, then everything will work fine - and the error checking code should never be called. If you read the spreadsheet from a jar, from a socket, from a web server stream etc., then the error handling code will prematurely throw the exception. It seems the correct way to fix this would be just to remove the erroneous error checking code: public RawDataBlock(InputStream stream) throws IOException { _data = new byte[512]; int count = IOUtils.readFully(stream, _data); _eof = count == -1; } And let IOUtils do its job. Let mme know if you want this as a patch. Tested against 3.0.1-FINAL. I should have added that WebTest has been shipping with a patched POI containing the suggested fix in my last comment since April 2006. WebTest is in widespread use, so this should have had quite a bit of testing, but I am not sure how many WebTest users use the Excel testing capabilities - I have personally been involved with a few but I only have exposure to a fraction of the WebTest user base. We hope to be able to use a standard version of POI rather than a patched one so that we can make WebTest deployment to a maven repository simpler. Please don't reopen this and/or change the release. Instead create a new bug, preface it with patch and target it. Don't tease "if you want my patch and you love me", just attach :-). -Andy |