Some of our own test cases have been failing since an update from POI 3.1 to 3.2. The embedded images now come out incorrectly or are not found at all.
Test file 1 (too big to attach. :-( )
http://dl.getdropbox.com/u/50201/nonconsecutive-images.doc

This one now gets 3 images instead of 4. Correct MD5 digests for each image (first three do match up so the images it does pick up must be okay.)

851be142bce6d01848e730cb6903f39e
7fc6d8fb58b09ababd036d10a0e8c039
a7dc644c40bc2fbf17b2b62d07f99248
72d07b8db5fad7099d90bc4c304b4666 <-- this is the missing one
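For anyone wanting to repeat the digest comparison, here is a minimal sketch of how it could be done with the JDK alone. The class and method names (`ImageDigestCheck`, `md5Hex`, `missingDigests`) are made up for illustration and are not part of POI; the sketch assumes the extracted image bytes have already been obtained from the document.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of POI: compares extracted image bytes
// against a list of expected MD5 digests.
class ImageDigestCheck {

    /** Returns the lowercase hex MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns the expected digests that no extracted image produced, in expected order. */
    static List<String> missingDigests(List<String> expected, List<byte[]> images) {
        Set<String> found = new LinkedHashSet<>();
        for (byte[] image : images) {
            found.add(md5Hex(image));
        }
        List<String> missing = new ArrayList<>();
        for (String digest : expected) {
            if (!found.contains(digest)) {
                missing.add(digest);
            }
        }
        return missing;
    }
}
```

With hwpf, the image bytes would presumably come from something like `HWPFDocument.getPicturesTable().getAllPictures()` and `Picture.getContent()`; treat that as an assumption to check against the POI version in use.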
I wonder if this is another area of hwpf that is assuming bytes/characters but getting characters/bytes. (I tried to make it a bit more sane for 3.2, so that unicode text extraction worked more reliably, but the file format is really crazy about this sort of thing.)

One test that'd be interesting is creating a few small test files with images, and seeing how hwpf copes with them:

1. non unicode, one image near start
2. non unicode, one image near end
3. non unicode, image near start, image near end
4. unicode, one image near start
5. unicode, one image near end
6. unicode, image near start, image near end

If 5 and 6 have issues with their later images, we'll know it's another byte/character problem. If it's something different, maybe it'll help us track it down. As a bonus, the files will make a good regression test for the future :)
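To make the byte/character concern concrete, here is a small sketch. It is not hwpf code: `TextOffsets` and its methods are hypothetical names, and it assumes only that a run of text is stored either as one byte per character (non unicode) or two bytes per character (unicode).

```java
// Hypothetical illustration, not hwpf code: converting between character
// offsets and byte offsets within a single run of text.
class TextOffsets {

    /** Bytes used per character: 2 for unicode (16-bit) runs, 1 for 8-bit runs. */
    static int bytesPerChar(boolean unicode) {
        return unicode ? 2 : 1;
    }

    /** Converts a character offset within a run to a byte offset within that run. */
    static int charToByte(int charOffset, boolean unicode) {
        return charOffset * bytesPerChar(unicode);
    }

    /** Converts a byte offset within a run back to a character offset. */
    static int byteToChar(int byteOffset, boolean unicode) {
        return byteOffset / bytesPerChar(unicode);
    }
}
```

In a non unicode run the two units coincide, so code that mixes them up would still appear to work. In a unicode run, a character offset misread as a byte offset lands at half the intended distance, and the error grows with the offset, which is why images near the end of the unicode test files would be the likeliest to go wrong.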
Any suggestions on how to go about this? I tried doing what was suggested on the list, and incrementally adding images to existing documents, but even for "broken" ones, Word saved them in a form which fixed any warnings.
The fact that re-saving them in Word fixes the warning does tend to make me think these files really are slightly dodgy, and it isn't just us.

First up, could you try older versions of Word, to see if maybe there was one earlier version which produced these dodgy files?

Otherwise, if you could find a couple of different files which do trigger the warning, but not all have image issues, that'd be a big help. Especially if we could have two versions of each: the original (with warning) and the newer Word re-saved version (without warnings). We can then compare them, see the differences, and hopefully figure out what needs fixing/working around.
Same file as before, with one more image added in and the number edited from 4 to 5 in the document...

851be142bce6d01848e730cb6903f39e
7fc6d8fb58b09ababd036d10a0e8c039
a7dc644c40bc2fbf17b2b62d07f99248
5eee0af68b7856b731a7775db8a6e6e2
72d07b8db5fad7099d90bc4c304b4666 <-- missing (same image as before)

http://dl.getdropbox.com/u/50201/nonconsecutive-images-2.doc

Warnings:

A property claimed to start before zero, at -512! Resetting it to zero, and hoping for the best
A property claimed to start before zero, at -512! Resetting it to zero, and hoping for the best
Actually, for comparison, these are the same warnings I'm getting on the original file in this case, i.e. the Unicode one isn't appearing for this test case, only for the other ones (which I can't redistribute, because someone checked in IP as test cases.) :-(
Images are properly read with current trunk. I included your sample in our collection of test documents and added a unit test.

Yegor