Summary: | Regression: HSLF Powerpoint text extractor from footer of master slide | ||
---|---|---|---|
Product: | POI | Reporter: | Javen O'Neal <onealj> |
Component: | HSLF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | regression | ||
Priority: | P2 | ||
Version: | 3.15-dev | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Bug Depends on: | 58144 | ||
Bug Blocks: | |||
Attachments: | Failing unit test; Failing unit test with logging; extract custom placeholder; Tikas test file | ||
Description
Javen O'Neal
2016-08-15 03:26:41 UTC
Created attachment 34150
Failing unit test

commoncrawl2/XX/XX6MTKAWWQSRDI56YYBZBAC4BCB4AWWK rename to 60003.ppt
https://issues.apache.org/jira/browse/TIKA-2013

To be fair, re-saving the slideshow in LibreOffice does not reproduce the problem (possible LibreOffice bug). LibreOffice does not show the Master slide on any of the sheets, so if the PowerPointExtractor's goal is to get the text that is visibly displayed on the sheets and not unused hidden templates, then POI may be doing the right thing in 3.15 beta 3 RC 1.

Created attachment 34152
Failing unit test with logging

It looks like the omission of "Prague" from the PowerPointExtractor output was likely intentional [1]
> if(HSLFMasterSheet.isPlaceholder(sh)) {
>     // don't bother about boiler
>     // plate text on master
>     // sheets
>     continue;
> }

Specifically, POI identified this master slide footer as a placeholder. Since "placeholders aren't normal shapes, they are visible only in the Edit Master mode" [2], they are omitted from the PowerPointExtractor output. If POI incorrectly identified this as a placeholder or the file was incorrectly saved treating this as a placeholder, then this text should be included in the PowerPointExtractor output.
> Ignoring boiler plate (placeholder) text '*' on slide master
> Ignoring boiler plate (placeholder) text 'Plan4all Kick-off Meeting, 14th May 2009, Prague' on slide master
> Ignoring boiler plate (placeholder) text '*' on slide master
> Ignoring boiler plate (placeholder) text 'Click to edit Master title style' on slide master

The change in functionality is likely somewhere in HeadersFooters.java where
> if(_newRecord) attach();
was no longer called, or changes to the implementation of
> isVisible

Extracting from the Master slide goes back to bug 48161 in 2009.
[1] https://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hslf/extractor/PowerPointExtractor.java?revision=1748783&view=markup#l206
[2] https://poi.apache.org/apidocs/org/apache/poi/hslf/usermodel/HSLFMasterSheet.html#isPlaceholder(org.apache.poi.hslf.usermodel.HSLFShape)

Added a patch which also extracts custom placeholders. The placeholders in the example have metro blobs attached, and I use those to distinguish between default and custom placeholders. As this also applies to the slide-number and date fields, I also check for their default text "*".

Of course this only applies when master sheet texts are requested. The tradeoff is between having no custom texts and having additional default texts in the output, so I guess this unclean handling for master sheets is OK.

Created attachment 34265
extract custom placeholder

Created attachment 34266
Tikas test file
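
For anyone reading along without the patch open, the heuristic described above can be sketched roughly as follows. This is a self-contained model, not the attached patch and not POI code: `MasterShape`, its `hasMetroBlob` flag, and the flag values in the sample data are assumptions made up for illustration. Only the four text strings come from the log output quoted earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MasterTextFilter {

    /** Hypothetical stand-in for a text shape on a master sheet; not a POI class. */
    public static final class MasterShape {
        final String text;
        final boolean placeholder;  // models what HSLFMasterSheet.isPlaceholder(sh) reports
        final boolean hasMetroBlob; // assumed marker of a customized placeholder

        public MasterShape(String text, boolean placeholder, boolean hasMetroBlob) {
            this.text = text;
            this.placeholder = placeholder;
            this.hasMetroBlob = hasMetroBlob;
        }
    }

    /** Default text of the slide-number and date fields, as mentioned in the comment. */
    static final String FIELD_DEFAULT_TEXT = "*";

    /** Returns true if the shape should be skipped as boilerplate. */
    public static boolean isBoilerplate(MasterShape sh) {
        if (!sh.placeholder) {
            return false; // ordinary shapes are always extracted
        }
        if (!sh.hasMetroBlob) {
            return true;  // default placeholder, e.g. "Click to edit Master title style"
        }
        // Slide-number and date fields carry a metro blob too, so fall back
        // to comparing against their default text.
        return FIELD_DEFAULT_TEXT.equals(sh.text);
    }

    /** Collects the text of every master shape that is not boilerplate. */
    public static List<String> extractMasterText(List<MasterShape> shapes) {
        List<String> out = new ArrayList<>();
        for (MasterShape sh : shapes) {
            if (isBoilerplate(sh)) {
                continue;
            }
            out.add(sh.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Texts taken from the log above; the boolean flags are illustrative guesses.
        List<MasterShape> master = Arrays.asList(
            new MasterShape("*", true, true),
            new MasterShape("Plan4all Kick-off Meeting, 14th May 2009, Prague", true, true),
            new MasterShape("*", true, true),
            new MasterShape("Click to edit Master title style", true, false));
        System.out.println(extractMasterText(master));
    }
}
```

Under these assumed flags only the custom footer text survives. The cost noted in the comment still applies: a custom placeholder whose text happens to be "*" would be dropped, and a default placeholder that carries a metro blob would be kept.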