Bug 41933 - PicturesTable#getAlllPictures() sometimes loses images
Summary: PicturesTable#getAlllPictures() sometimes loses images
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF (show other bugs)
Version: 3.0-dev
Hardware: Other other
: P2 major (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-03-22 17:31 UTC by Trejkaz (pen name)
Modified: 2009-10-05 18:08 UTC (History)
0 users



Attachments
hwpf-embedded-image-in-header.doc (40.00 KB, application/octet-stream)
2009-04-07 18:56 UTC, Trejkaz (pen name)
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Trejkaz (pen name) 2007-03-22 17:31:02 UTC
A constructed document containing four embedded images returns only two when
calling PicturesTable#getAllPictures().

On debugging, it turns out that the pos offset becomes incorrect at the third
image -- the block type at that position is not recognised as an image type, and
the skipOn value is too big to be realistic.
Comment 1 Trejkaz (pen name) 2007-03-22 17:35:23 UTC
Tried to attach the sample document but the bug tracker won't let me because
it's too big.

Here's a link:
http://trypticon.org/files/four-images.doc
Comment 2 Trejkaz (pen name) 2007-03-22 17:55:17 UTC
Here's a bit of analysis of the data stream:

Start of first block (an image):

00000000 E6 0D 07 00 44 00 64 00 00 00 00 00 00 00 08 00 ....D.d.........

Start of second block (an image):

0 + 70DE6 = 70DE6
                 `----------.
                            V
00070DE0 4E 44 AE 42 60 82 CB F3 07 00 44 00 64 00 00 00 ND.B`.....D.d...

Start of third block, which seems to be wrong:

70DE6 + 7F3CB = F01B1
             ,----'
            v
000F01B0 82 9A 00 16 24 01 17 24 01 49 66 01 00 00 00 01 ....$..$.If.....
000F01C0 96 6C 00 21 76 00 02 68 01 35 D6 05 00 01 03 A1 .l.!v..h.5......
000F01D0 11 35 D6 05 01 02 03 0D 12 23 76 00 01 A1 11 23 .5.......#v....#
000F01E0 76 01 02 0D 12 3A 56 0B 00 02 96 6C 00 07 94 CE v....:V....l....
000F01F0 0D 0A 74 00 00 A0 04 13 D6 30 00 00 00 00 04 01 ..t......0......
000F0200 00 00 00 00 00 00 04 01 00 00 00 00 00 00 04 01 ................
000F0210 00 00 00 00 00 00 04 01 00 00 00 00 00 00 04 01 ................
000F0220 00 00 00 00 00 00 04 01 00 00 14 F6 01 00 00 15 ................
000F0230 36 01 35 D6 05 00 01 03 A1 11 35 D6 05 01 02 03 6.5.......5.....

           ACTUAL IMAGE BLOCK HEADER IS HERE----.
                                                v
000F0240 0D 12 61 F6 03 6C 00 79 74 40 0F 7D 00 D1 E0 01 ..a..l.yt@.}....
000F0250 00 44 00 64 00 00 00 00 00 00 00 08 00 00 00 00 .D.d............
Comment 3 Trejkaz (pen name) 2007-04-18 18:43:20 UTC
Okay, the short answer is that the pictures table alone isn't enough.  The code
will need to read every CharacterRun, pull out the picture offsets, and then
remove duplicates (and probably sort the list too, to keep it in the same order
so that nothing which was already calling getAllPictures() breaks.)
Comment 4 Trejkaz (pen name) 2007-04-18 19:44:57 UTC
Here's a workaround for anyone else who needs to get all pictures out.  I'm not
sure if this is actually the best way to implement getAllPictures() because it
requires access to the whole document.

  HWPFDocument document = ...

  SortedMap<Integer, CharacterRun> runs = new TreeMap<Integer, CharacterRun>();

  PicturesTable picturesTable = document.getPicturesTable();
  Range wholeDocument = document.getRange();
  int runCount = wholeDocument.numCharacterRuns();
  for (int i = 0; i < runCount; i++) {
    CharacterRun run = wholeDocument.getCharacterRun(i);
    if (picturesTable.hasPicture(run)) {
      int picOffset = run.getPicOffset();
      if (runs.containsKey(picOffset)) {
        continue;
      }

      runs.put(picOffset, run);
    }
  }

  allPictures = new ArrayList<Picture>();
  for (CharacterRun run : runs.values()) {
    allPictures.add(picturesTable.extractPicture(run, false));
  }
Comment 5 Nick Burch 2007-12-04 04:23:38 UTC
In the absence of any other fixes, shall we go ahead and commit this?
Comment 6 Squeeself 2008-04-03 14:35:03 UTC
After a lot of digging into why some of my word documents are getting all their pictures, I've determined that the above changes are necessary (but still, for some reason, don't get all pictures...still investigating why). The data stream that the PictureTable gets the picture data from also can contain other object data, which is often placed at the beginning of the datastream. The only way to tell where picture data begins is to iterate through the CharacterRuns of the document. Otherwise, the current code takes one look at some arbitrary object data and skips everything.

Unfortunately, there's still a problem in some of the documents I've been looking at. For instance, I have a document with several dozen pictures...yet only 1 CharacterRun with a picoffset is in the document. I've been trying to determine what additional way these pictures are accessed, but haven't been able to find it out yet. Still, the above fix really needs to be put into the code.
Comment 7 Trejkaz (pen name) 2009-03-11 22:08:55 UTC
There is no fix above, only a workaround which might be useful from the outside.

Incidentally I have just run across a document which has one picture in the PicturesTable but which is seemingly not referenced from any CharacterRun -- despite being visible when opening the document in Word.  This may be another way to hide a pic in a character run so I'm still trying to figure it out, but it does indicate that the workaround I posted won't necessarily catch all images.  Maybe the trick is to use getAllPictures() plus my workaround, and somehow remove duplicates between the two.
Comment 8 Trejkaz (pen name) 2009-03-11 22:20:00 UTC
Looking at the current trunk version (and the slightly older version we're running) it seems that half of my workaround (rewritten) ended up becoming the current version of getAllPictures().

The only inconsistency is that the rewrite in the POI version results in it returning the same image multiple times (as it's no longer building the map to keep track of this.)

I'm not sure whether I should (a) modify POI so that it removes the duplicates and submit that up, or (b) add some method to Picture so that I can de-duplicate myself from the outside, and then submit that smaller modification up.

Obviously I'm leaning toward (a) but since it seems like the duplicate removal was intentionally removed from the code, perhaps (b) is the better option?
Comment 9 Trejkaz (pen name) 2009-03-22 15:57:24 UTC
Setting to NEW in case the NEEDINFO state was responsible for the lack of responses.




BTW, I have noticed something minor, which is that "sometimes" it will return images of length 4, which can be safely ignored.  This appears to occur when you insert a HR or some other picture which ends up being stored as Escher, although I haven't yet managed to prove this.

This might account for what some people have seen (including one who mailed me directly) where a document with only one pic in it returns more than one from this method.
Comment 10 Trejkaz (pen name) 2009-04-07 18:56:55 UTC
Created attachment 23456 [details]
hwpf-embedded-image-in-header.doc

Here's a file where HWPF still appears to not find the embedded image.  I culled all the text from it and re-saved it using Word 2003, and it still causes the same problem.
Comment 11 Trejkaz (pen name) 2009-04-07 18:58:03 UTC
Test case:

    @Test
    public void testLostEmbeddedImage() throws Exception
    {
        File file = new File("hwpf-embedded-image-in-header.doc");
        HWPFDocument doc = new HWPFDocument(new POIFSFileSystem(new FileInputStream(file)));

        List<?> pics = doc.getPicturesTable().getAllPictures();
        assertEquals("Should be 1 picture", 1, pics.size());

        // In case POI happens to have a utility like this:
        //Picture pic = (Picture) pics.get(0);
        //assertDigestEquals("Wrong content", "0f36cd9f60222d978b205d891008f4a8", pic.getContent());
    }
Comment 12 Sławomir Mycka 2009-04-09 06:44:48 UTC
I've added new bug in getPictureTable() maybe somebody could look at it by the way:
http://issues.apache.org/bugzilla/show_bug.cgi?id=46981
Comment 13 Trejkaz (pen name) 2009-10-05 18:08:19 UTC
These now work correctly as of POI 3.5 FINAL:
  hwpf-embedded-image-in-header
  four-images.doc

I will mark the bug RESOLVED FIXED for this reason.  I have a new file which fails in a similar way but I suspect it's actually an unrelated issue so I will raise a new bug for it.