Bug 62319 - Decommission XSLF-/PowerPointExtractor
Summary: Decommission XSLF-/PowerPointExtractor
Alias: None
Product: POI
Classification: Unclassified
Component: SL Common
Version: 4.0.x-dev
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Blocks: 59548
Reported: 2018-04-20 12:48 UTC by Andreas Beeker
Modified: 2018-04-20 20:04 UTC (History)
Description Andreas Beeker 2018-04-20 12:48:33 UTC
The following commit includes the refactorings to use SlideShowExtractor instead of the format-specific XSLF/PowerPointExtractor classes. As SlideShowExtractor extends POITextExtractor directly, the OLE2ExtractorFactory can't always return a POIOLE2TextExtractor. I've tried to minimize/hide the effects of this by using generics, so user code probably just needs to be recompiled ... but it will throw an exception for slideshows if it assigns the result to a POIOLE2TextExtractor reference.
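
A minimal sketch of how a generic factory method can hide such a change at compile time and still fail at run time. All class names here are hypothetical stand-ins, not the real POI classes, and the ClassCastException is what this sketch produces; the description above only says "an exception".

```java
// Hypothetical stand-ins; these are NOT the real POI classes.
abstract class TextExtractor {
    abstract String getText();
}

// Models an OLE2-specific extractor base class.
abstract class Ole2TextExtractor extends TextExtractor { }

// Models an extractor that still extends the OLE2-specific base class.
class DocExtractor extends Ole2TextExtractor {
    @Override String getText() { return "doc text"; }
}

// Models a slideshow extractor that extends the common base class directly.
class ShowExtractor extends TextExtractor {
    @Override String getText() { return "slide text"; }
}

final class GenericFactorySketch {
    private GenericFactorySketch() { }

    // The caller picks T. Because of type erasure the cast below is unchecked;
    // the real check happens where the result is assigned.
    @SuppressWarnings("unchecked")
    static <T extends TextExtractor> T createExtractor(String kind) {
        TextExtractor ex = "show".equals(kind) ? new ShowExtractor() : new DocExtractor();
        return (T) ex;
    }

    // Returns true if assigning the result to the OLE2-specific type fails.
    static boolean failsAsOle2(String kind) {
        try {
            Ole2TextExtractor ex = createExtractor(kind);
            return ex == null;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

In this sketch, code that assigns the result to the common base type keeps working for every format, while the assignment to the OLE2-specific type compiles but fails for the slideshow case.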

I think the abstract classes POIOLE2TextExtractor and POIXMLTextExtractor should be deprecated anyway, as using the extractor to determine the format and then to access the document and its OLE2/OOXML-specific properties is not what the extractors are intended for.

We have WorkbookFactory and SlideShowFactory (and maybe sometime also a factory for H/XWPF), whose job is to create a document from different sources. That's also the reason why SlideShowExtractor only accepts a SlideShow and not any other low-level source, i.e. to keep the concerns of determining the format and extracting the text separate.
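
The separation described above could be sketched roughly as follows. Deck, DeckFactory and DeckTextExtractor are hypothetical stand-ins rather than POI classes, and the format check is reduced to the first two bytes of the source.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for a format-neutral slideshow model; not POI classes.
interface Deck {
    List<String> slideTexts();
}

final class BinaryDeck implements Deck {
    public List<String> slideTexts() { return Arrays.asList("binary slide"); }
}

final class XmlDeck implements Deck {
    public List<String> slideTexts() { return Arrays.asList("xml slide 1", "xml slide 2"); }
}

// Concern 1: decide which implementation matches the source.
final class DeckFactory {
    private DeckFactory() { }

    static Deck open(byte[] header) {
        // OLE2 containers start with the bytes D0 CF, ZIP-based packages with "PK".
        if (header.length >= 2 && (header[0] & 0xFF) == 0xD0 && (header[1] & 0xFF) == 0xCF) {
            return new BinaryDeck();
        }
        if (header.length >= 2 && header[0] == 'P' && header[1] == 'K') {
            return new XmlDeck();
        }
        throw new IllegalArgumentException("unknown format");
    }
}

// Concern 2: extract text from a document that is already open.
final class DeckTextExtractor {
    private final Deck deck;

    // Only the format-neutral model is accepted, never a file or stream.
    DeckTextExtractor(Deck deck) { this.deck = deck; }

    String getText() { return String.join("\n", deck.slideTexts()); }
}
```

With this split, the extractor needs no knowledge of file formats, and a caller that already holds an open document does not have to go through format detection again.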

As a compromise I've introduced getDocument() in POITextExtractor, but user code needs to know what kind of document is returned and cast it accordingly.
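
A rough sketch of what "cast it accordingly" can look like in user code. The classes below are hypothetical stand-ins, and getDocument() is assumed here to return a plain Object; check the actual POITextExtractor signature before relying on this.

```java
// Hypothetical stand-ins; not the real POI document types.
class BinaryShow {
    String summaryTitle() { return "title from OLE2 properties"; }
}

class XmlShow {
    String coreTitle() { return "title from OOXML core properties"; }
}

class SketchExtractor {
    private final Object document;

    SketchExtractor(Object document) { this.document = document; }

    // Assumption: the accessor is typed as Object, so it reveals nothing
    // about the underlying format.
    Object getDocument() { return document; }
}

final class DocumentAccessSketch {
    private DocumentAccessSketch() { }

    // The caller inspects the runtime type before using format-specific API.
    static String title(SketchExtractor extractor) {
        Object doc = extractor.getDocument();
        if (doc instanceof BinaryShow) {
            return ((BinaryShow) doc).summaryTitle();
        }
        if (doc instanceof XmlShow) {
            return ((XmlShow) doc).coreTitle();
        }
        throw new IllegalStateException("unexpected document type: " + doc);
    }
}
```

The instanceof checks keep the format knowledge in user code instead of in the extractor's type hierarchy.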

What's currently missing is the extraction of SlideLayout shapes (see TestXSLFPowerPointExtractor.testGetMasterText()), which I want to provide in a separate commit for this issue.
Comment 1 Andreas Beeker 2018-04-20 12:53:52 UTC
First part applied via r1829653
Comment 2 Andreas Beeker 2018-04-20 20:04:51 UTC
Added slide layout extraction via r1829677