Bug 62092

Summary: Text not extracted from grouped text shapes in HSLF
Product: POI
Reporter: Tim Allison <tallison>
Component: HSLF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.17-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: SL Common SlideShowExtractor incl. fix for GroupShapes

Description Tim Allison 2018-02-09 18:35:31 UTC
On TIKA-2569, a user reported that we aren't extracting text from grouped text shapes in HSLF; everything works in pptx. I added a workaround at the Tika level for now.

Test file: https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testPPT_groups.ppt

Unit test at the Tika level:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java#L300

When the user calls getTextParagraphs() on a slide, that should include the text from grouped textshapes, right?

If not, and the current behavior is intended so that the user has to walk through the HSLFGroupShapes themselves, we can close this out.
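
To make the question concrete, here is a minimal sketch of what "walking through the groups" means: a recursive traversal that descends into nested groups and collects paragraph text in document order. The types SketchShape, SketchTextShape, and SketchGroup and the method collectText are hypothetical stand-ins invented for this illustration; they are not the HSLF classes, and this is not how POI implements getTextParagraphs().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT POI's HSLF classes.
interface SketchShape {}

// A shape that directly carries text paragraphs.
final class SketchTextShape implements SketchShape {
    final List<String> paragraphs;
    SketchTextShape(String... paragraphs) { this.paragraphs = Arrays.asList(paragraphs); }
}

// A group shape: holds child shapes, which may themselves be groups.
final class SketchGroup implements SketchShape {
    final List<SketchShape> children;
    SketchGroup(SketchShape... children) { this.children = Arrays.asList(children); }
}

final class GroupTextSketch {
    // Collects paragraph text in document order, descending into nested groups.
    static List<String> collectText(List<? extends SketchShape> shapes) {
        List<String> out = new ArrayList<>();
        for (SketchShape s : shapes) {
            if (s instanceof SketchTextShape) {
                out.addAll(((SketchTextShape) s).paragraphs);
            } else if (s instanceof SketchGroup) {
                // Recurse so that text inside groups (and groups of groups) is not skipped.
                out.addAll(collectText(((SketchGroup) s).children));
            }
            // Any other shape kind carries no text in this sketch.
        }
        return out;
    }
}
```

A traversal that only looks at the top-level shapes would miss everything inside a SketchGroup, which is the kind of gap described in this report.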
Comment 1 Andreas Beeker 2018-03-11 17:19:35 UTC
I'm currently working on an SL Common SlideShowExtractor, i.e. trying to deprecate the old extractors and move getTextParagraphs there.
Comment 2 Andreas Beeker 2018-03-28 23:20:50 UTC
Created attachment 35820
SL Common SlideShowExtractor incl. fix for GroupShapes

The patch contains the SL Common SlideShowExtractor which made quite a few changes necessary to handle Placeholders and header/footer information across X/HSLF.
This also includes handling for group shapes.

There are a few API breaks included, e.g. for XSLF comments, so I would like to have a review.

What do you think about the PlaceholderDetails helper class?
Comment 3 Javen O'Neal 2018-04-01 08:54:41 UTC
(In reply to Tim Allison from comment #0)
> When the user calls getTextParagraphs() on a slide, that should include the
> text from grouped textshapes, right?
That sounds correct to me.
Comment 4 Andreas Beeker 2018-04-01 09:15:42 UTC
X/HSLF/Slide.getTextParagraphs() doesn't return HeaderFooters consistently across the various formats, i.e. PPT (<2007) has a special HeaderFooters record which can't easily be wrapped into a TextParagraph. I guess this method only existed to give Tika easy access anyway, and that's what the extractor class is for.

How about delegating to the extractor (and changing the signature) or removing it?
Comment 5 Andreas Beeker 2018-04-18 15:04:28 UTC
Provided common slideshow extractor via r1829453