Created attachment 28131
The patch with my workaround

I have a .doc file that is OK from the MS Office point of view, but opening it with NPOIFSFileSystem fails:

java.nio.BufferUnderflowException
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127)
    at org.apache.poi.poifs.property.NPropertyTable.buildProperties(NPropertyTable.java:93)
    at org.apache.poi.poifs.property.NPropertyTable.<init>(NPropertyTable.java:62)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:379)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:293)

The BFFValidator returns:

<BFFValidation path="twenty-tips.doc" datetime="01/10/12 16:07:25" result="ERROR 0x80030109. Docfile zostal uszkodzony. " reason="The Microsoft Office Binary File Format Validator encountered an error reading the file you specified.">
</BFFValidation>

In English, the Polish message reads "Docfile has been corrupted".

I came up with a workaround. In NPropertyTable.buildProperties, instead of

    data = new byte[bigBlockSize.getBigBlockSize()];

I would put:

    int dataSize = bigBlockSize.getBigBlockSize() <= bb.remaining() ? bigBlockSize.getBigBlockSize() : bb.remaining();
    data = new byte[dataSize];

So allocate a full big block only if it is less than or equal to the number of remaining bytes. Otherwise, allocate just the remaining bytes.

The file is obviously corrupted, yet it opens just fine in Word, and I can get the full text and metadata with the old POIFSFileSystem. This problem popped up in my regression tests when I switched to NPOIFSFileSystem.

It seems like a safe workaround to me. For correct files it won't change anything; for other corrupted files it will probably move the error to somewhere within PropertyFactory.convertToProperties. For my file, it's the difference between life and death.

Unfortunately I can't share the file.
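For readers who want to reproduce the failure mode without the file, here is a minimal sketch of the clamping idea from the workaround above. It uses only java.nio; the class and method names (ClampedBlockRead, readFull, readClamped) are made up for illustration and are not part of POI, and the 510-byte buffer simply stands in for a truncated last block.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical helper, not POI code: contrasts an unclamped block read
// with one clamped to the bytes actually left in the buffer.
class ClampedBlockRead {

    // Mirrors the original allocation: always asks for a full block.
    // Throws BufferUnderflowException if fewer than blockSize bytes remain.
    static byte[] readFull(ByteBuffer bb, int blockSize) {
        byte[] data = new byte[blockSize];
        bb.get(data, 0, data.length);
        return data;
    }

    // Mirrors the workaround: never asks for more than bb.remaining().
    static byte[] readClamped(ByteBuffer bb, int blockSize) {
        int dataSize = Math.min(blockSize, bb.remaining());
        byte[] data = new byte[dataSize];
        bb.get(data, 0, data.length);
        return data;
    }

    public static void main(String[] args) {
        // Stand-in for a 512-byte block that lost its last 2 bytes.
        ByteBuffer shortBlock = ByteBuffer.allocate(510);

        try {
            readFull(shortBlock.duplicate(), 512);
        } catch (BufferUnderflowException e) {
            System.out.println("unclamped read failed: " + e);
        }

        byte[] data = readClamped(shortBlock.duplicate(), 512);
        System.out.println("clamped read returned " + data.length + " bytes");
    }
}
```

For a full-size buffer both methods behave the same, which is why the workaround should be a no-op on well-formed files.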
Are you able to give us a bit more info on the property stream that's misbehaving? I'd be interested in knowing:
* How long is it, in bytes?
* How many blocks is the property stream split over?
* If you look at the bytes of the problem block, is it null padded?
I took a very close look in the debugger. POIFSViewer seems to work at a higher level, where blocks are already combined into streams. I know nothing about the POI format, but from what I understand it goes like this:

NPropertyTable is constructed with an iterator over byte buffers. Each byte buffer represents a single block. In this file the blocks are 512 bytes long. The NPropertyTable constructor goes through this stack trace twice:

    ByteArrayBackedDataSource.read(int, long) line: 48
    NPOIFSFileSystem.getBlockAt(int) line: 420
    NPOIFSStream$StreamBlockByteBufferIterator.next() line: 213
    NPOIFSStream$StreamBlockByteBufferIterator.next() line: 1
    NPropertyTable.buildProperties(Iterator<ByteBuffer>, POIFSBigBlockSize) line: 84

The first time, getBlockAt is called with 946. At offset 947*512=484864 within the file there are four UTF-16 strings: "Root Entry", "Data", "1Table" and "WordDocument". AFAIU these are the names of top-level directory entries. This block is parsed correctly by PropertyFactory.convertToProperties(data, properties);

Then comes the second block, index 956. It also ends up in ByteArrayBackedDataSource.read(int, long) line: 48. Unfortunately the end of that block (957*512 + 512) lies beyond the end of the file. The returned byte buffer is only 510 bytes long, hence the BufferUnderflowException. I don't know how many blocks there should be (there is a BAT, but I don't understand it). What I do know is that this file has been truncated somewhere in the process.

With my workaround, when the second block is parsed with 510 bytes, PropertyFactory.convertToProperties begins with

    int property_count = data.length / POIFSConstants.PROPERTY_SIZE;

In my case this evaluates to 3. The last 126 bytes are not taken into account, hence no errors. The second block, when viewed in XVI, shows the UTF-16 strings "SummaryInformation", "DocumentSummaryInformation", and "\u0001CompObj" (the three "correct" properties). 
The fourth, truncated property contains only zeros: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 FF FF FF FF FF FF FF FF FF FF FF FF 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Therefore no information is lost. I think that my workaround is actually correct.
Just to check - is your file size a multiple of 512? (It's supposed to be, but based on what you're saying I think it might be 2 bytes short)
It's 490,494 bytes.
490494 div 512: 957
490494 mod 512: 510
It's 2 bytes short.
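The figures quoted in this thread can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are hypothetical, the (index + 1) offset rule is taken from the 947*512 and 957*512 calculations above (one 512-byte header in front of block 0), and 128 is the property size implied by 510 bytes yielding 3 properties with 126 bytes left over.

```java
// Hypothetical helper, not POI code: recomputes the numbers from this report.
class BlockMath {
    static final int BLOCK_SIZE = 512;     // block size of the file in question
    static final int PROPERTY_SIZE = 128;  // bytes per property entry

    // File offset of a block: block 0 starts right after the 512-byte header.
    static long blockOffset(int blockIndex) {
        return (long) (blockIndex + 1) * BLOCK_SIZE;
    }

    // How many bytes of the given block are actually present in the file.
    static int bytesAvailable(long fileSize, int blockIndex) {
        long available = fileSize - blockOffset(blockIndex);
        return (int) Math.max(0, Math.min(BLOCK_SIZE, available));
    }

    public static void main(String[] args) {
        long fileSize = 490494L;

        System.out.println(blockOffset(946));              // 484864
        System.out.println(blockOffset(956));              // 489984
        System.out.println(fileSize / BLOCK_SIZE);         // 957
        System.out.println(fileSize % BLOCK_SIZE);         // 510

        int available = bytesAvailable(fileSize, 956);
        System.out.println(available);                     // 510
        System.out.println(available / PROPERTY_SIZE);     // 3
        System.out.println(available % PROPERTY_SIZE);     // 126
    }
}
```

So the last block holds 510 of its 512 bytes, which is three whole properties plus a 126-byte remainder of the fourth.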
I think this should be fixed in r1229963. I've taken a slightly different approach, where we log the situation and pad the byte array with zeros (rather than passing a short byte array). Can you see if that solves it for your file, and close the bug if so?
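To illustrate the difference from the clamping workaround, here is a rough sketch of the "log and pad with zeros" idea. This is not the code committed in r1229963; the class and method names are invented, and System.err stands in for whatever logging POI actually uses.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not the actual POI change: always returns a full-size
// block, zero-padding the tail when the buffer is short.
class PaddedBlockRead {

    static byte[] readPadded(ByteBuffer bb, int blockSize) {
        // Java zero-initialises new arrays, so any unread tail stays 0x00.
        byte[] data = new byte[blockSize];
        int toRead = Math.min(blockSize, bb.remaining());
        if (toRead < blockSize) {
            // Stand-in for a real log call.
            System.err.println("Short block: expected " + blockSize
                    + " bytes but only " + toRead + " remain; padding with zeros");
        }
        bb.get(data, 0, toRead);
        return data;
    }

    public static void main(String[] args) {
        ByteBuffer shortBlock = ByteBuffer.allocate(510);
        byte[] data = readPadded(shortBlock, 512);
        System.out.println("returned " + data.length + " bytes");
    }
}
```

The design difference is that downstream code always sees a full block, so it does not have to cope with a short array, at the cost of treating the missing bytes as zeros.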
Yup, works. Thanks a lot.