Bug 60003 - Regression: HSLF Powerpoint text extractor from footer of master slide
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: 3.15-dev
Hardware: PC All
Importance: P2 regression
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on: 58144
Blocks:
Reported: 2016-08-15 03:26 UTC by Javen O'Neal
Modified: 2016-10-08 18:09 UTC (History)



Attachments
Failing unit test (1.09 KB, patch)
2016-08-15 04:15 UTC, Javen O'Neal
Failing unit test with logging (1.80 KB, patch)
2016-08-15 08:22 UTC, Javen O'Neal
extract custom placeholder (6.52 KB, text/x-diff)
2016-09-18 21:13 UTC, Andreas Beeker
Tikas test file (747.00 KB, application/vnd.ms-powerpoint)
2016-09-18 21:14 UTC, Andreas Beeker

Description Javen O'Neal 2016-08-15 03:26:41 UTC
When running the Tika corpus regression tests for POI 3.15 beta 3 RC1, Tim found a regression: the text "Prague" in the footer area of a master slide of an OLE2 PowerPoint document was no longer returned by the PowerPointExtractor.

This regression was introduced in r1743769 as part of bug 58144.
Comment 1 Javen O'Neal 2016-08-15 04:15:20 UTC
Created attachment 34150 [details]
Failing unit test

Test file: commoncrawl2/XX/XX6MTKAWWQSRDI56YYBZBAC4BCB4AWWK, renamed to 60003.ppt

https://issues.apache.org/jira/browse/TIKA-2013

To be fair, re-saving the slideshow in LibreOffice does not reproduce the problem (possible LibreOffice bug).
LibreOffice does not show the master slide on any of the sheets, so if the PowerPointExtractor's goal is to get the text that is visibly displayed on the sheets and not unused hidden templates, then POI may be doing the right thing in 3.15 beta 3 RC1.
Comment 2 Javen O'Neal 2016-08-15 08:22:50 UTC
Created attachment 34152 [details]
Failing unit test with logging

The omission of "Prague" from the PowerPointExtractor output appears to have been intentional [1]:

> if(HSLFMasterSheet.isPlaceholder(sh)) {
>     // don't bother about boiler plate text on master sheets
>     continue;
> }

Specifically, POI identified this master slide footer as a placeholder. Since "placeholders aren't normal shapes, they are visible only in the Edit Master mode" [2], they are omitted from the PowerPointExtractor output. If POI incorrectly identified this shape as a placeholder, or the file was incorrectly saved with this shape marked as a placeholder, then this text should be included in the PowerPointExtractor output.

> Ignoring boiler plate (placeholder) text '*' on slide master
> Ignoring boiler plate (placeholder) text 'Plan4all Kick-off Meeting, 14th May 2009, Prague' on slide master
> Ignoring boiler plate (placeholder) text '*' on slide master
> Ignoring boiler plate (placeholder) text 'Click to edit Master title style' on slide master
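The effect of the quoted check can be modeled without POI. The following is only a minimal sketch: `MasterShape` and `MasterTextSketch` are hypothetical stand-ins, not POI classes, and the `placeholder` flag stands in for whatever `HSLFMasterSheet.isPlaceholder(sh)` returns. It shows why custom footer text stored in a placeholder disappears together with the default "*" fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a text shape on a master sheet; not a POI class.
final class MasterShape {
    final String text;
    final boolean placeholder; // assumed result of the placeholder check

    MasterShape(String text, boolean placeholder) {
        this.text = text;
        this.placeholder = placeholder;
    }
}

final class MasterTextSketch {
    // Collects master sheet text, skipping every placeholder shape.
    // Custom text that lives in a placeholder is dropped as well.
    static List<String> extract(List<MasterShape> shapes) {
        List<String> out = new ArrayList<>();
        for (MasterShape sh : shapes) {
            if (sh.placeholder) {
                continue; // treated as boilerplate, like the quoted check
            }
            out.add(sh.text);
        }
        return out;
    }
}
```

With the four shapes from the log above all flagged as placeholders, this sketch returns an empty list, which matches the missing "Prague" text.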

The change in functionality is likely somewhere in HeadersFooters.java, where
> if(_newRecord) attach();
is no longer called, or in changes to the implementation of
> isVisible

Extracting from the Master slide goes back to bug 48161 in 2009.

[1] https://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hslf/extractor/PowerPointExtractor.java?revision=1748783&view=markup#l206
[2] https://poi.apache.org/apidocs/org/apache/poi/hslf/usermodel/HSLFMasterSheet.html#isPlaceholder(org.apache.poi.hslf.usermodel.HSLFShape)
Comment 3 Andreas Beeker 2016-09-18 21:13:46 UTC
Added a patch which also extracts custom placeholders. The placeholders in the example have metro blobs attached, and I use those to distinguish between default and custom placeholders. As this also applies to the slide-number and date fields, I also check for their default text "*".
Of course, this only applies when master sheet texts are requested. The tradeoff is between having no custom texts and having additional default texts in the output, so I guess this unclean handling for master sheets is OK.
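
The heuristic described in this comment might look roughly like the sketch below. It is an assumption-laden illustration, not the attached patch: `CustomPlaceholderSketch` and `shouldExtract` are hypothetical names, and the two boolean parameters stand in for whatever the patch reads from the shape (placeholder status and the presence of a metro blob).

```java
// Hypothetical helper illustrating the heuristic; not part of POI.
final class CustomPlaceholderSketch {
    // Decides whether text on a master sheet shape should be extracted.
    static boolean shouldExtract(boolean isPlaceholder, boolean hasMetroBlob, String text) {
        if (!isPlaceholder) {
            return true; // ordinary shapes are always extracted
        }
        if (!hasMetroBlob) {
            return false; // assumed: no metro blob means a default placeholder
        }
        // Slide-number and date fields also carry a metro blob,
        // so their default text "*" is filtered out separately.
        return text != null && !"*".equals(text.trim());
    }
}
```

Under these assumptions the "Prague" footer (a placeholder with a metro blob and non-default text) is kept, while a "*" field is still dropped.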
Comment 4 Andreas Beeker 2016-09-18 21:13:49 UTC
Created attachment 34265 [details]
extract custom placeholder
Comment 5 Andreas Beeker 2016-09-18 21:14:03 UTC
Created attachment 34266 [details]
Tikas test file
Comment 6 Andreas Beeker 2016-10-08 18:09:59 UTC
Applied via r1763927