Bug 62092 - Text not extracted from grouped text shapes in HSLF
Summary: Text not extracted from grouped text shapes in HSLF
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: 3.17-FINAL
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-09 18:35 UTC by Tim Allison
Modified: 2018-04-18 15:04 UTC

Attachments
SL Common SlideShowExtractor incl. fix for GroupShapes (93.39 KB, application/tar+gzip)
2018-03-28 23:20 UTC, Andreas Beeker

Description Tim Allison 2018-02-09 18:35:31 UTC
On TIKA-2569, a user reported that we aren't extracting text from grouped text shapes in HSLF; it all works in pptx. I added a workaround at the Tika level for now.

Test file: https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testPPT_groups.ppt

Unit test at the Tika level:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java#L300

When the user calls getTextParagraphs() on a slide, that should include the text from grouped textshapes, right?

If not, and the current behavior is intended so that the user has to walk through HSLFGroupShapes, we can close this out.
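
A minimal sketch of what "walking through the groups" would involve. The types here (`Shape`, `TextBox`, `Picture`, `Group`) are hypothetical stand-ins invented for illustration, not the HSLF classes, so this only shows the recursion a caller would need if a slide-level call skipped grouped shapes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupTextWalk {
    // Hypothetical stand-in types; these are NOT the POI/HSLF classes.
    interface Shape {}

    // A leaf shape that carries text.
    static final class TextBox implements Shape {
        final String text;
        TextBox(String text) { this.text = text; }
    }

    // A leaf shape without text, e.g. a picture.
    static final class Picture implements Shape {
        final String name;
        Picture(String name) { this.name = name; }
    }

    // A container whose children may themselves be groups.
    static final class Group implements Shape {
        final List<? extends Shape> children;
        Group(List<? extends Shape> children) { this.children = children; }
    }

    // Depth-first walk that descends into groups and keeps shape order.
    static List<String> collectText(List<? extends Shape> shapes) {
        List<String> out = new ArrayList<>();
        for (Shape s : shapes) {
            if (s instanceof TextBox) {
                out.add(((TextBox) s).text);
            } else if (s instanceof Group) {
                // Recurse so that nested groups are covered as well.
                out.addAll(collectText(((Group) s).children));
            }
            // Shapes without text (Picture here) are skipped.
        }
        return out;
    }

    public static void main(String[] args) {
        List<Shape> slide = Arrays.asList(
            new TextBox("title"),
            new Group(Arrays.asList(
                new TextBox("grouped A"),
                new TextBox("grouped B"))));
        System.out.println(collectText(slide));
    }
}
```

A flat loop over only the top-level shapes would return just the title here, which is the symptom reported on TIKA-2569.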
Comment 1 Andreas Beeker 2018-03-11 17:19:35 UTC
I'm currently working on an SL Common SlideShowExtractor, i.e. trying to deprecate the old extractors and move getTextParagraphs there.
Comment 2 Andreas Beeker 2018-03-28 23:20:50 UTC
Created attachment 35820
SL Common SlideShowExtractor incl. fix for GroupShapes

The patch contains the SL Common SlideShowExtractor which made quite a few changes necessary to handle Placeholders and header/footer information across X/HSLF.
This also includes handling for group shapes.

There are a few API breaks included, e.g. for XSLF comments, therefore I would like to have a review.

What do you think about the PlaceholderDetails helper class?
Comment 3 Javen O'Neal 2018-04-01 08:54:41 UTC
(In reply to Tim Allison from comment #0)
> When the user calls getTextParagraphs() on a slide, that should include the
> text from grouped textshapes, right?
That sounds correct to me.
Comment 4 Andreas Beeker 2018-04-01 09:15:42 UTC
X/HSLF/Slide.getTextParagraphs() doesn't return HeaderFooters consistently across the various formats, i.e. PPT<2007 has a special HeaderFooters record which can't be easily wrapped into a TextParagraph. I guess this method only existed to give Tika easy access anyway, and that's what the extractor class is for.

How about delegating to the extractor (and changing the signature) or removing it?
Comment 5 Andreas Beeker 2018-04-18 15:04:28 UTC
Provided common slideshow extractor via r1829453