Bug 58718

Summary: Master styles not initialized after hitting an AIOOBE in an earlier ppt
Product: POI
Reporter: Tim Allison <tallison>
Component: HSLF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: blocker    
Priority: P2    
Version: 3.13-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: One triggering file from govdocs1
one file that is implicated
Other two files

Description Tim Allison 2015-12-10 19:47:42 UTC
Created attachment 33337 [details]
One triggering file from govdocs1

While testing rc1 for 3.14-beta1 (running tika in multithreaded batch mode), I found that we're getting ~7000 of the following exception:

org.apache.poi.hslf.exceptions.HSLFException: Master styles not initialized
	at org.apache.poi.hslf.usermodel.HSLFSlideMaster.setSlideShow(HSLFSlideMaster.java:144)
	at org.apache.poi.hslf.usermodel.HSLFSlideShow.buildSlidesAndNotes(HSLFSlideShow.java:362)
	at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:152)
	at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:185)
	at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)

We were getting roughly this many with 3.13 (it turns out).

I'm not able to reproduce this with the single-threaded Tika-app, and I'm not able to reproduce this with JUnit running multiple threads again and again on a handful of triggering files. 

We did not have these exceptions in Tika 1.9 (POI 3.12).  I haven't tested more recent intermediate versions yet.
Comment 1 Dominik Stadler 2015-12-11 09:41:49 UTC
Are the Tika tests that reproduce this available somewhere? If the error can be reproduced consistently, it would be a good candidate for a git-bisect run to determine the exact commit that introduced this behavior...
Comment 2 Tim Allison 2015-12-11 20:39:42 UTC
Y, first step was to find minimal failing set.

I think I found that this is not a multi-threading issue, but it is an issue of file order.

This code fails in a standalone project with POI 3.13. Note that it is pure POI.

My current POI trunk isn't building so I can't go pure, pure POI. 


    // Needs: java.io.File, java.io.FileInputStream, java.io.InputStream,
    // org.junit.Test, org.apache.poi.hslf.extractor.PowerPointExtractor
    @Test
    public void testSequential() throws Exception {
        File dir = new File("C:\\data\\badppts");
        // Order matters: the first file must be parsed before the others.
        String[] fileNames = new String[]{
                "008495.ppt",
                "008524.ppt",
                "008558.ppt"
        };
        for (String fName : fileNames) {
            // try-with-resources closes the stream even when parsing fails
            try (InputStream is = new FileInputStream(new File(dir, fName))) {
                PowerPointExtractor ex = new PowerPointExtractor(is);
                ex.getText();
            } catch (Exception e) {
                // Print and continue so the following files are still parsed
                e.printStackTrace();
            }
        }
    }
Comment 3 Tim Allison 2015-12-11 20:42:09 UTC
Created attachment 33340 [details]
one file that is implicated
Comment 4 Tim Allison 2015-12-11 20:42:23 UTC
Created attachment 33341 [details]
Other two files
Comment 5 Tim Allison 2015-12-11 20:52:09 UTC
No exception in 3.13-beta or before.  Exception starts in 3.13.

Apologies for the mess of attached files (couldn't zip 3 test files together and fit onto bugzilla).

Also apologies for the non unit test...wanted to get this finding out asap.
Comment 6 Tim Allison 2015-12-11 21:20:57 UTC
Looks like we can go even more minimal, just use these two files:
                "008495.ppt",
                "008558.ppt"


The first one causes: 
java.lang.ArrayIndexOutOfBoundsException: 110
	at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:161)
	at org.apache.poi.hslf.record.TxMasterStyleAtom.init(TxMasterStyleAtom.java:157)
	at org.apache.poi.hslf.record.TxMasterStyleAtom.<init>(TxMasterStyleAtom.java:70)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
	at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:128)
	at org.apache.poi.hslf.record.MainMaster.<init>(MainMaster.java:64)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
	at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:103)
	at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:294)
	at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
	at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:179)
	at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java:117)
	at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java:98)
	at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java

And this exception leaves a residue which causes the "Master styles not initialized" exception in 008558.ppt.  However, if you just parse 008558.ppt by itself, no exception.
Comment 7 Dominik Stadler 2015-12-11 23:25:04 UTC
git-bisect came up with the following commit that changed this: r1717351

    - #47904 - Update text styles in HSLF MasterSlide
    - common sl unification for TextParagraph.setTextAlign
Comment 8 Andreas Beeker 2015-12-13 01:55:41 UTC
Fixed in r1719758

This was a cloning error in TabStopPropCollection.
Whenever a tab stop collection was not empty, the tab stops were also added to the global property template and copied over to the next parsed properties.
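
For readers unfamiliar with this class of bug, here is a minimal sketch of how a shallow copy of a shared template can leak state from one parsed file into the next. It is an illustration only, not the actual POI code: the class TemplateCloneSketch, the TabStops stand-in, and the value 720 are all hypothetical, and the real fix in r1719758 may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these are POI classes.
public class TemplateCloneSketch {

    /** Stand-in for a property that holds a mutable collection of tab stops. */
    static class TabStops {
        List<Integer> positions = new ArrayList<>();

        /** Buggy copy: the new object shares the template's list. */
        TabStops shallowCopy() {
            TabStops copy = new TabStops();
            copy.positions = this.positions;
            return copy;
        }

        /** Safe copy: the new object gets its own list. */
        TabStops deepCopy() {
            TabStops copy = new TabStops();
            copy.positions = new ArrayList<>(this.positions);
            return copy;
        }
    }

    /**
     * Simulates parsing two files in sequence against one shared template.
     * Returns how many tab stops the second file sees before it has
     * added any of its own.
     */
    static int leakedIntoSecondFile(boolean deep) {
        TabStops template = new TabStops(); // global, outlives each file

        TabStops first = deep ? template.deepCopy() : template.shallowCopy();
        first.positions.add(720); // the first file defines one tab stop

        TabStops second = deep ? template.deepCopy() : template.shallowCopy();
        return second.positions.size();
    }

    public static void main(String[] args) {
        System.out.println("shallow copy leaks: " + leakedIntoSecondFile(false));
        System.out.println("deep copy leaks:    " + leakedIntoSecondFile(true));
    }
}
```

With the shallow copy, the second file inherits the first file's tab stop through the shared list, which matches the order-dependent behavior seen here: a file that parses cleanly on its own can fail when it is parsed after another one.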
Comment 9 Tim Allison 2015-12-14 18:43:51 UTC
Thank you, Andi and Dominik!

I was wrong about the affected versions...Sorry.  This does not appear to affect 3.13-final.  It did start with 3.14-beta1 rc1, which is consistent with Dominik's finding.  This also explains why I didn't find it while we were prepping for the release of Tika 1.11 -- the problem didn't exist then.