Bug 65721 - Extracting embedded files not possible from non-standard ppt
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: 5.0.x-dev
Hardware: PC Linux
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2021-12-03 19:33 UTC by Tim Allison
Modified: 2021-12-26 15:34 UTC
Description Tim Allison 2021-12-03 19:33:46 UTC
Over on https://issues.apache.org/jira/browse/TIKA-3526, matcha007 shared a ppt file created by WPS 表格 that handles embedded files slightly differently than standard ppt.

I tried some basic stuff with 5.1.0 and still had little luck.

The file is: https://issues.apache.org/jira/secure/attachment/13032100/13032100_embedded+attachment.ppt

With the usual approach (iterate through the slides, then through each slide's shapes looking for HSLFObjectShape), objectShape.getObjectData() returns null because, as matcha007 pointed out, the _exEmbed is never found in HSLFObjectShape's

private ExEmbed getExEmbed(boolean create) {...

matcha007 found that if he added 3 to the objectId, in getExEmbed, it seemed to work on this file, but there's no motivation for that (that I know of), and it looks like it would break everything else.
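
One way to check whether a fixed shift like that "+3" really explains the mismatch is to dump the ids seen on the shape side and on the record side and test whether a single offset lines them all up. The sketch below is only an illustration of that check: IdOffsetCheck and constantOffset are hypothetical names, not part of POI, and it assumes the two id lists have already been collected by other means and that sorting both lists pairs the right ids with each other.

```java
import java.util.Arrays;
import java.util.OptionalInt;

/** Hypothetical diagnostic helper; not part of Apache POI. */
final class IdOffsetCheck {

    private IdOffsetCheck() {}

    /**
     * Returns the single offset d such that, after sorting both arrays,
     * recordIds[i] - shapeIds[i] == d for every i. Returns empty if the
     * arrays are empty, differ in length, or no single offset fits.
     */
    static OptionalInt constantOffset(int[] shapeIds, int[] recordIds) {
        if (shapeIds.length == 0 || shapeIds.length != recordIds.length) {
            return OptionalInt.empty();
        }
        // Work on sorted copies so the caller's arrays are left untouched.
        int[] shapes = shapeIds.clone();
        int[] records = recordIds.clone();
        Arrays.sort(shapes);
        Arrays.sort(records);

        int offset = records[0] - shapes[0];
        for (int i = 1; i < shapes.length; i++) {
            if (records[i] - shapes[i] != offset) {
                return OptionalInt.empty(); // the ids do not shift uniformly
            }
        }
        return OptionalInt.of(offset);
    }
}
```

With made-up ids, constantOffset(new int[]{1, 2}, new int[]{4, 5}) yields an offset of 3, while {1, 2} against {4, 7} yields empty. A file with only one embedded object trivially fits some offset, so a result from a single object says little about whether the shift is systematic.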

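I can extract the embedded files if I iterate through the HSLFObjectData at the slideshow level:
        // p is the java.nio.file.Path of the .ppt file
        try (POIFSFileSystem pfs = new POIFSFileSystem(p.toFile());
             HSLFSlideShow ss = new HSLFSlideShow(pfs.getRoot())) {
            // All embedded object data in the file, independent of any slide or shape
            HSLFObjectData[] objectData = ss.getEmbeddedObjects();
            // each entry's getInputStream() yields the embedded bytes
        }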

However, I can't then link those back to the ids in the shapes for this particular file.

What can we do with this file?