I have come across some files generated by scientific instruments whose big-block size is not 512, but rather 4096. The power-of-two exponent (12) is properly stored in the header, but POIFS ignores it entirely and falls back on a built-in constant. I'd post some files, but they average 260 MB each. I can help develop/test this, but I'll probably need some guidance first.
Please write to the dev list. I'm unsure of this one because IIRC XLS files have a default block size of 4096 (for smaller files)... It could be that we said "heck with it" if it liked the smaller one too.
Marc confirmed this. He needs a file though. Is there any way to generate a smaller file? If not, and you have ssh (and scp in particular), email me and I can give you a spot to upload it to. (Please tar/bz2 or tar/gz it first :-) )
Created attachment 18875: Patch for current SVN tree; checks the first few bytes of a POIFS file to see if the big-block size is 512 or 4096.
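For readers following along, here is a minimal sketch of that kind of header check. It is not the attached patch and not POI's API: the class name `BlockSizeSniffer` and its method are hypothetical. It assumes the standard OLE2 compound-file layout, in which the sector shift is a 16-bit little-endian value at byte offset 0x1E of the header (9 for 512-byte blocks, 12 for 4096-byte blocks).

```java
/**
 * Hypothetical helper (not part of POI): derives the big-block size from
 * the sector shift stored in an OLE2 compound-file header.
 */
final class BlockSizeSniffer {
    // Offset of the 16-bit little-endian sector shift within the header.
    static final int SECTOR_SHIFT_OFFSET = 0x1E;

    // First eight bytes of an OLE2 compound file.
    static final int[] SIGNATURE = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    static int bigBlockSize(byte[] header) {
        if (header.length < SECTOR_SHIFT_OFFSET + 2) {
            throw new IllegalArgumentException("header too short");
        }
        // Reject anything that does not start with the OLE2 signature.
        for (int i = 0; i < SIGNATURE.length; i++) {
            if ((header[i] & 0xFF) != SIGNATURE[i]) {
                throw new IllegalArgumentException("not an OLE2 file");
            }
        }
        // Decode the little-endian 16-bit shift value.
        int shift = (header[SECTOR_SHIFT_OFFSET] & 0xFF)
                  | ((header[SECTOR_SHIFT_OFFSET + 1] & 0xFF) << 8);
        // This sketch only accepts the two sizes discussed in this bug.
        if (shift != 9 && shift != 12) {
            throw new IllegalArgumentException("unsupported sector shift: " + shift);
        }
        return 1 << shift; // 9 -> 512, 12 -> 4096
    }
}
```

A caller would read the first 512 bytes of the file into a byte array and pass them to `BlockSizeSniffer.bigBlockSize(...)` before deciding how to split the rest of the stream into blocks.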
The patch doesn't look to be threadsafe to me. If we had two files open, one with a 512 blocksize and another with a 4096 blocksize, then I think it'd fail, as it's all using a single static int on POIFSFileSystem. I think before we could apply this, we would need a sample file with the alternate block size (so we can write a unit test for all this), and the patch would need to be slightly re-worked to be threadsafe (i.e. not use a static for something that can vary between concurrently open files).
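To make the concern concrete, here is a small sketch of the difference between a static and a per-instance block size. Both classes are hypothetical stand-ins written for this comment; neither is POI's actual code nor the code in the patch.

```java
/** Hypothetical stand-in: block size kept in a static field. */
final class StaticSizeFs {
    // Shared by every instance, so the most recently opened file wins.
    static int bigBlockSize = 512;

    StaticSizeFs(int detectedSize) {
        bigBlockSize = detectedSize;
    }

    int blockSize() {
        return bigBlockSize;
    }
}

/** Hypothetical stand-in: block size kept per open file. */
final class InstanceSizeFs {
    // Each instance remembers the size detected for its own file.
    private final int bigBlockSize;

    InstanceSizeFs(int detectedSize) {
        this.bigBlockSize = detectedSize;
    }

    int blockSize() {
        return bigBlockSize;
    }
}
```

Opening a 512-byte-block file and then a 4096-byte-block file with `StaticSizeFs` leaves the first instance reporting 4096, even on a single thread; with `InstanceSizeFs` each instance keeps its own value.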
> I think before we could apply this, we would need a sample file with the alternate block size

Maybe the attached file could help. WordExtractor fails to decode it with:

java.io.IOException: Unable to read entire block; 122 bytes read before EOF; expected 512 bytes
Created attachment 21663: docfile with incorrect(?) block size
Word files should always use 512-byte blocks, so I think attachment 21663 isn't quite appropriate for this bug; it's probably just a truncated file.
But it opens OK in Word 2003 :)
There has been partial, thread-safe support for this in svn for a while now. However, without a file with a 4096 block size, we can't test that this works properly. If you do have a file with 4096 blocks, please do re-open the bug and upload it; then we can write a unit test for it. Alas, all the files we can find (Word, PowerPoint, Excel, Visio etc.) use 512-byte blocks.
This has now been properly solved, along with sample files for unit tests; see bug #49139.