Bug 62591

Summary: Trivial regression in trunk in isPlaceHolder in newer sl
Product: POI Reporter: Tim Allison <tallison>
Component: HSLFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: trivial    
Priority: P2    
Version: 4.0.x-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: testPPT_masterText.ppt
another example

Description Tim Allison 2018-08-01 13:38:04 UTC
In TIKA-712 and TIKA-1171, we tried to ignore placeholder text from master slides.  Three of our unit tests for this fail after upgrading to 4.0.0-SNAPSHOT.  slide.isPlaceHolder() is now returning false for some shapes for which it used to return true.  In the few cases I've looked at, it appears to only be a regression for date and slidenumber placeholders.

With 4.0.0 and the sl framework this code:

    @Test
    public void masterBoilerplateTest() throws Exception {
        try (InputStream is = getResourceAsStream("/test-documents/testPPT_masterText.ppt")) {
            SlideShow slideShow = SlideShowFactory.create(is);
            for (Object masterSheetObj : slideShow.getSlideMasters()) {
                MasterSheet ms = (MasterSheet)masterSheetObj;
                for (Object shape : ms.getShapes()) {
                    if (shape instanceof TextShape) {
                        TextShape ts = (TextShape)shape;
                        System.out.println(ts.getShapeName() + " : " +ts.isPlaceholder() + " : >"+ts.getText().replaceAll("[\r\n]", "\n")+"<");
                    }
                }
            }
        }
    }

prints out this:
Title Placeholder 1 : true : >Click to edit Master title style<
Text Placeholder 2 : true : >Click to edit Master text styles
Second level
Third level
Fourth level
Fifth level<
Date Placeholder 3 : false : >*<
Footer Placeholder 4 : false : ><
Slide Number Placeholder 5 : false : >*<
TextBox 6 : false : >Text that I added to the master slide<
Title Placeholder 1 : true : >Click to edit Master title style<
Text Placeholder 2 : true : >Click to edit Master text styles
Second level
Third level
Fourth level
Fifth level<
Date Placeholder 3 : false : >*<
Footer Placeholder 4 : false : ><
Slide Number Placeholder 5 : false : >*<


With out older HSLF-specific code, this:
    public void masterBoilerplateTest() throws Exception {
        try (InputStream is = getResourceAsStream("/test-documents/testPPT_masterText.ppt")) {
            HSLFSlideShow slideShow = new HSLFSlideShow(is);
            for (HSLFMasterSheet masterSheet : slideShow.getSlideMasters()) {
                for (HSLFShape shape : masterSheet.getShapes()) {
                    if (shape instanceof TextShape) {
                        TextShape ts = (TextShape)shape;
                        System.out.println(shape.getShapeName()+ " : " +HSLFMasterSheet.isPlaceholder(shape) + " : >"+ts.getText().replaceAll("[\r\n]", "\n")+"<");
                    }
                }
            }
        }
    }

prints out this:
Rectangle : true : >Click to edit Master title style<
Rectangle : true : >Click to edit Master text styles
Second level
Third level
Fourth level
Fifth level<
Rectangle : true : >*<
Rectangle : true : ><
Rectangle : true : >*<
TextBox : false : >Text that I added to the master slide<
Rectangle : true : >Click to edit Master title style<
Rectangle : true : >Click to edit Master text styles
Second level
Third level
Fourth level
Fifth level<
Rectangle : true : >*<
Rectangle : true : ><
Rectangle : true : >*<
Comment 1 Andreas Beeker 2018-08-07 21:07:30 UTC
Please add your testPPT_masterText.ppt
Comment 2 Tim Allison 2018-08-09 12:57:16 UTC
Created attachment 36080 [details]
testPPT_masterText.ppt
Comment 3 Tim Allison 2018-08-09 13:37:57 UTC
  
3.17:
 @Override
	    public boolean isPlaceholder() {
	        OEPlaceholderAtom oep = getPlaceholderAtom();
	        if (oep != null) {
	            return true;
	        }
	
	        //special case for files saved in Office 2007
	        RoundTripHFPlaceholder12 hldr = getHFPlaceholderAtom();
	        if (hldr != null) {
	            return true;
	        }
	
	        return false;
	    }

4.0.0-SNAPSHOT:
    @Override
    public boolean isPlaceholder() {
        return
            ((getPlaceholderAtom() != null) ||
            //special case for files saved in Office 2007
            (getHFPlaceholderAtom() != null)) &&
            // check for metro shape of complex placeholder
            (!new HSLFMetroShape<HSLFShape>(this).hasMetroBlob())
        ;
    }

It looks like an extra condition was added in r1829453 (Bug 62092) that the shape have a metroBlob.

Should we revert that condition or do we need to do something else?
Comment 4 Tim Allison 2018-08-09 14:40:18 UTC
Created attachment 36082 [details]
another example

This file shows that a placeholder can have a metroblob and still be a place holder...I think...we shouldn't extract "Titelmasterformat durch Klicken bearbeiten"...right?
Comment 5 Tim Allison 2018-08-09 15:51:17 UTC
Actually, even in the current WithMaster.ppt file, "Footer from the master slide" is included 3 times, even though it should only appear once.

When I remove the new condition, the tests pass, and the footer only appears once.

In r1837742, I reverted the behavior.  Please reopen and fix if I've misunderstood something about metroblobs.
Comment 6 Andreas Beeker 2018-08-09 23:11:35 UTC
When I've added the metroblob condition, I've searched for a difference between a placeholder, which is visible as-is vs. just a placeholder.

Although there was no check for masterTitleText before, I'd expect it to be contained in the result - especially as setMasterByDefault(true) is set.

When deciding between including or omitting a slidemaster fragment, I would go for including in the uncertain case.

I'm ok with keeping it the way it is now, if Tika is also handling it like that.
Comment 7 Tim Allison 2018-08-10 15:22:48 UTC
Andi,

  It hurt me to modify anything you've done!  I'm ok with reverting to include 
the metroblob condition, and we can run our own check within Tika for placeholder/not placeholder to match the previous behavior.

  If you look at ppt files in the content/content_diffs_ignore_exceptions.xlsx in http://162.242.228.174/reports/poi-4.0.0_reports.tar.gz , there's quite a bit of extra content with "click" "master "style" (columns: TOP_10_UNIQUE_TOKEN_DIFFS_B and TOP_10_MORE_IN_B).

  I defer to you on this.