On TIKA-2569, a user reported that we aren't extracting text from grouped text shapes in HSLF; the same content is extracted correctly from pptx. I added a workaround at the Tika level for now.

Test file: https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testPPT_groups.ppt
Unit test at the Tika level: https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java#L300

When the user calls getTextParagraphs() on a slide, that should include the text from grouped text shapes, right? If not, i.e. if the current behavior is intended and the user has to walk through the HSLFGroupShapes, we can close this out.
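To illustrate what "walking through the group shapes" means, here is a minimal sketch of a recursive walk over a shape tree. The types `Shape`, `TextShape`, `GroupShape` and the helper `GroupTextWalk.collectText` are hypothetical stand-ins invented for this example; they are not the POI classes, and this is not necessarily how the Tika workaround is written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for slide shapes; these are NOT the POI classes.
interface Shape {}

final class TextShape implements Shape {
    final String text;
    TextShape(String text) { this.text = text; }
}

final class GroupShape implements Shape {
    final List<Shape> children;
    GroupShape(List<Shape> children) { this.children = children; }
}

final class GroupTextWalk {
    /** Collects the text of every text shape, descending into groups depth-first. */
    static List<String> collectText(List<Shape> shapes) {
        List<String> out = new ArrayList<>();
        collect(shapes, out);
        return out;
    }

    private static void collect(List<Shape> shapes, List<String> out) {
        for (Shape s : shapes) {
            if (s instanceof TextShape) {
                out.add(((TextShape) s).text);
            } else if (s instanceof GroupShape) {
                // A group may contain further groups, so recurse.
                collect(((GroupShape) s).children, out);
            }
        }
    }

    public static void main(String[] args) {
        List<Shape> slide = Arrays.asList(
                new TextShape("Title"),
                new GroupShape(Arrays.asList(
                        new TextShape("grouped 1"),
                        new GroupShape(Arrays.asList(new TextShape("nested"))))));
        System.out.println(collectText(slide));
    }
}
```

A walk that only looks at the top-level shapes would return just "Title" here, which is the kind of omission described in the report.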
I'm currently working on an SL Common SlideShowExtractor, i.e. I'm trying to deprecate the old extractors and move getTextParagraphs there.
Created attachment 35820
SL Common SlideShowExtractor incl. fix for GroupShapes

The patch contains the SL Common SlideShowExtractor, which required quite a few changes to handle placeholders and header/footer information across X/HSLF. It also includes handling for group shapes. There are a few API breaks included, e.g. for XSLF comments, so I would like to have a review. What do you think about the PlaceholderDetails helper class?
(In reply to Tim Allison from comment #0)
> When the user calls getTextParagraphs() on a slide, that should include the
> text from grouped textshapes, right?

That sounds correct to me.
X/HSLF/Slide.getTextParagraphs() doesn't return HeaderFooters consistently across the various formats, i.e. PPT<2007 has a special HeaderFooters record which can't easily be wrapped into a TextParagraph. I guess this method was only ever meant as easy access for Tika anyway, and that is what the extractor class is for. How about delegating to the extractor (and changing the signature), or removing the method?
Provided common slideshow extractor via r1829453