Summary: | Master styles not initialized after hitting an AIOOBE in an earlier ppt | ||
---|---|---|---|
Product: | POI | Reporter: | Tim Allison <tallison> |
Component: | HSLF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | blocker | ||
Priority: | P2 | ||
Version: | 3.13-FINAL | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: |
One triggering file from govdocs1
one file that is implicated
Other two files |
Are the Tika tests that reproduce this available somewhere? If the error can be reproduced consistently, it would be a good match for a git-bisect run to determine the exact commit that introduced this behavior...

Y, first step was to find a minimal failing set. I think I found that this is not a multi-threading issue, but an issue of file order.

This code fails in a standalone project with POI 3.13. Note that it is pure POI. My current POI trunk isn't building, so I can't go pure, pure POI.

    @Test
    public void testSequential() throws Exception {
        File dir = new File("C:\\data\\badppts");
        // Parse the files sequentially, in this order, within a single JVM.
        String[] fileNames = new String[]{
            "008495.ppt",
            "008524.ppt",
            "008558.ppt"
        };
        for (String fName : fileNames) {
            InputStream is = null;
            try {
                is = new FileInputStream(new File(dir, fName));
                PowerPointExtractor ex = new PowerPointExtractor(is);
                ex.getText();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (is != null) {
                    try {
                        is.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

Created attachment 33340 [details]
one file that is implicated
Created attachment 33341 [details]
Other two files
No exception in 3.13-beta or before. Exception starts in 3.13.

Apologies for the mess of attached files (couldn't zip 3 test files together and fit onto bugzilla). Also apologies for the non-unit test...wanted to get this finding out asap.

Looks like we can go even more minimal, just use these two files: "008495.ppt", "008558.ppt"

The first one causes:

java.lang.ArrayIndexOutOfBoundsException: 110
    at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:161)
    at org.apache.poi.hslf.record.TxMasterStyleAtom.init(TxMasterStyleAtom.java:157)
    at org.apache.poi.hslf.record.TxMasterStyleAtom.<init>(TxMasterStyleAtom.java:70)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
    at org.apache.poi.hslf.record.Record.findChildRecords(Record.java:128)
    at org.apache.poi.hslf.record.MainMaster.<init>(MainMaster.java:64)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hslf.record.Record.createRecordForType(Record.java:181)
    at org.apache.poi.hslf.record.Record.buildRecordAtOffset(Record.java:103)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.read(HSLFSlideShowImpl.java:294)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.buildRecords(HSLFSlideShowImpl.java:275)
    at org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.<init>(HSLFSlideShowImpl.java:179)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java:117)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java:98)
    at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java

And this exception leaves a residue which causes the "Master styles not initialized" exception in 008558.ppt. However, if you just parse 008558.ppt by itself, no exception.

git-bisect came up with the following commit that changed this:

r1717351 - #47904 - Update text styles in HSLF MasterSlide - common sl unification for TextParagraph.setTextAlign

Fixed in r1719758

This was a cloning error in TabStopPropCollection. Whenever a tab stop collection was not empty, the tab stops were also added to the global property template and copied over to the next parsed properties.

Thank you, Andi and Dominik!

I was wrong about the affected versions...Sorry. This does not appear to affect 3.13-final. It did start with 3.14-beta1 rc1, which is consistent with Dominik's finding. This also explains why I didn't find it while we were prepping for the release of Tika 1.11 -- the problem didn't exist then. |
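To illustrate the kind of cloning error described in the fix comment above, here is a minimal sketch in plain Java. It does not use POI: `TabStops`, `TabStopCloneSketch`, and `parse` are hypothetical names invented for this illustration, and the actual change in r1719758 may look different. The sketch only shows how a copy that shares its list with a long-lived template lets one parsed document leak state into the next one, which would match the file-order dependence reported here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not POI classes.
class TabStopCloneSketch {
    /**
     * Simulates parsing one document: copy the template, then add the
     * document's own tab stops to the copy.
     */
    static TabStops parse(TabStops template, boolean deep, int... stops) {
        TabStops props = deep ? template.deepCopy() : template.shallowCopy();
        for (int s : stops) {
            props.positions.add(s);
        }
        return props;
    }

    public static void main(String[] args) {
        // Shallow copy: the first document's tab stops end up in the template.
        TabStops sharedTemplate = new TabStops();
        parse(sharedTemplate, false, 720, 1440);
        TabStops second = parse(sharedTemplate, false);
        System.out.println("shallow copy, second document sees: " + second.positions);

        // Deep copy: the template stays untouched between documents.
        TabStops safeTemplate = new TabStops();
        parse(safeTemplate, true, 720, 1440);
        TabStops clean = parse(safeTemplate, true);
        System.out.println("deep copy, second document sees: " + clean.positions);
    }
}

// Stand-in for a property that holds a mutable list of tab stop positions.
class TabStops {
    List<Integer> positions = new ArrayList<>();

    /** Buggy variant: the copy shares the template's list instance. */
    TabStops shallowCopy() {
        TabStops copy = new TabStops();
        copy.positions = this.positions;
        return copy;
    }

    /** Safe variant: the copy gets its own list. */
    TabStops deepCopy() {
        TabStops copy = new TabStops();
        copy.positions = new ArrayList<>(this.positions);
        return copy;
    }
}
```

With the shallow copy, the second "document" starts with the first document's tab stops even though it added none itself. With the deep copy it starts empty. A unit test that parses two files in sequence and checks the second result against a fresh parse would catch this class of bug.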
Created attachment 33337 [details]
One triggering file from govdocs1

While testing rc1 for 3.14-beta1 (running Tika in multithreaded batch mode), I found that we're getting ~7000 of the following exception:

org.apache.poi.hslf.exceptions.HSLFException: Master styles not initialized
    at org.apache.poi.hslf.usermodel.HSLFSlideMaster.setSlideShow(HSLFSlideMaster.java:144)
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.buildSlidesAndNotes(HSLFSlideShow.java:362)
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:152)
    at org.apache.poi.hslf.usermodel.HSLFSlideShow.<init>(HSLFSlideShow.java:185)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)

We were getting roughly this many with 3.13 (it turns out). I'm not able to reproduce this with the single-threaded Tika-app, and I'm not able to reproduce this with JUnit running multiple threads again and again on a handful of triggering files.

We did not have these exceptions in Tika 1.9 (POI 3.12). I haven't tested more recent intermediate versions yet.