In TIKA-712 and TIKA-1171, we tried to ignore placeholder text from master slides. Three of our unit tests for this fail after upgrading to 4.0.0-SNAPSHOT. slide.isPlaceHolder() is now returning false for some shapes for which it used to return true. In the few cases I've looked at, it appears to only be a regression for date and slidenumber placeholders. With 4.0.0 and the sl framework this code: @Test public void masterBoilerplateTest() throws Exception { try (InputStream is = getResourceAsStream("/test-documents/testPPT_masterText.ppt")) { SlideShow slideShow = SlideShowFactory.create(is); for (Object masterSheetObj : slideShow.getSlideMasters()) { MasterSheet ms = (MasterSheet)masterSheetObj; for (Object shape : ms.getShapes()) { if (shape instanceof TextShape) { TextShape ts = (TextShape)shape; System.out.println(ts.getShapeName() + " : " +ts.isPlaceholder() + " : >"+ts.getText().replaceAll("[\r\n]", "\n")+"<"); } } } } } prints out this: Title Placeholder 1 : true : >Click to edit Master title style< Text Placeholder 2 : true : >Click to edit Master text styles Second level Third level Fourth level Fifth level< Date Placeholder 3 : false : >*< Footer Placeholder 4 : false : >< Slide Number Placeholder 5 : false : >*< TextBox 6 : false : >Text that I added to the master slide< Title Placeholder 1 : true : >Click to edit Master title style< Text Placeholder 2 : true : >Click to edit Master text styles Second level Third level Fourth level Fifth level< Date Placeholder 3 : false : >*< Footer Placeholder 4 : false : >< Slide Number Placeholder 5 : false : >*< With out older HSLF-specific code, this: public void masterBoilerplateTest() throws Exception { try (InputStream is = getResourceAsStream("/test-documents/testPPT_masterText.ppt")) { HSLFSlideShow slideShow = new HSLFSlideShow(is); for (HSLFMasterSheet masterSheet : slideShow.getSlideMasters()) { for (HSLFShape shape : masterSheet.getShapes()) { if (shape instanceof TextShape) { TextShape ts = (TextShape)shape; System.out.println(shape.getShapeName()+ " : " +HSLFMasterSheet.isPlaceholder(shape) + " : >"+ts.getText().replaceAll("[\r\n]", "\n")+"<"); } } } } } prints out this: Rectangle : true : >Click to edit Master title style< Rectangle : true : >Click to edit Master text styles Second level Third level Fourth level Fifth level< Rectangle : true : >*< Rectangle : true : >< Rectangle : true : >*< TextBox : false : >Text that I added to the master slide< Rectangle : true : >Click to edit Master title style< Rectangle : true : >Click to edit Master text styles Second level Third level Fourth level Fifth level< Rectangle : true : >*< Rectangle : true : >< Rectangle : true : >*<
Please add your testPPT_masterText.ppt
Created attachment 36080 [details] testPPT_masterText.ppt
3.17: @Override public boolean isPlaceholder() { OEPlaceholderAtom oep = getPlaceholderAtom(); if (oep != null) { return true; } //special case for files saved in Office 2007 RoundTripHFPlaceholder12 hldr = getHFPlaceholderAtom(); if (hldr != null) { return true; } return false; } 4.0.0-SNAPSHOT: @Override public boolean isPlaceholder() { return ((getPlaceholderAtom() != null) || //special case for files saved in Office 2007 (getHFPlaceholderAtom() != null)) && // check for metro shape of complex placeholder (!new HSLFMetroShape<HSLFShape>(this).hasMetroBlob()) ; } It looks like an extra condition was added in r1829453 (Bug 62092) that the shape have a metroBlob. Should we revert that condition or do we need to do something else?
Created attachment 36082 [details] another example This file shows that a placeholder can have a metroblob and still be a place holder...I think...we shouldn't extract "Titelmasterformat durch Klicken bearbeiten"...right?
Actually, even in the current WithMaster.ppt file, "Footer from the master slide" is included 3 times, even though it should only appear once. When I remove the new condition, the tests pass, and the footer only appears once. In r1837742, I reverted the behavior. Please reopen and fix if I've misunderstood something about metroblobs.
When I've added the metroblob condition, I've searched for a difference between a placeholder, which is visible as-is vs. just a placeholder. Although there was no check for masterTitleText before, I'd expect it to be contained in the result - especially as setMasterByDefault(true) is set. When deciding between including or omitting a slidemaster fragment, I would go for including in the uncertain case. I'm ok with keeping it the way it is now, if Tika is also handling it like that.
Andi, It hurt me to modify anything you've done! I'm ok with reverting to include the metroblob condition, and we can run our own check within Tika for placeholder/not placeholder to match the previous behavior. If you look at ppt files in the content/content_diffs_ignore_exceptions.xlsx in http://162.242.228.174/reports/poi-4.0.0_reports.tar.gz , there's quite a bit of extra content with "click" "master "style" (columns: TOP_10_UNIQUE_TOKEN_DIFFS_B and TOP_10_MORE_IN_B). I defer to you on this.