Bug 65320 - XWPFDocument cannot read data: embedded images
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 5.0.0-FINAL
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2021-05-21 06:45 UTC by Keith Paterson
Modified: 2021-12-26 21:55 UTC

Example document (11.85 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2021-05-21 06:46 UTC, Keith Paterson

Description Keith Paterson 2021-05-21 06:45:36 UTC
Loading a document that contains images declared with the content type 'image/png;base64' fails.

Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: The specified content type 'image/png;base64' is not compliant with RFC 2616: malformed content type.
	at org.apache.poi.openxml4j.opc.internal.ContentType.<init>(ContentType.java:154)
	at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:83)
	at org.apache.poi.openxml4j.opc.ZipPackage$EntryTriple.register(ZipPackage.java:334)
	at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:291)
	at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:742)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:315)
	at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:47)

There is a stripped-down example at https://github.com/Portree-Kid/testdoc
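
For readers wondering why the parser rejects this string: RFC 2616 defines a media type as type "/" subtype followed by optional parameters of the form attribute "=" value, and a bare ";base64" has no "=" part. The following is a minimal sketch of that grammar check, not POI's actual implementation. The class name MediaTypeCheck and the simplified pattern (quoted-string values are left out) are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache POI.
final class MediaTypeCheck {
    // Simplified RFC 2616 "token": letters, digits and a few punctuation characters.
    private static final String TOKEN = "[\\p{Alnum}!#$%&'*+.^_`|~-]+";

    // type "/" subtype *( ";" attribute "=" value ); quoted-string values are omitted here.
    private static final Pattern MEDIA_TYPE = Pattern.compile(
            "^" + TOKEN + "/" + TOKEN + "(\\s*;\\s*" + TOKEN + "=" + TOKEN + ")*$");

    static boolean isValid(String contentType) {
        return MEDIA_TYPE.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("image/png"));               // true
        System.out.println(isValid("text/xml;charset=utf-8"));  // true
        System.out.println(isValid("image/png;base64"));        // false: parameter has no "=value"
    }
}
```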
Comment 1 Keith Paterson 2021-05-21 06:46:26 UTC
Created attachment 37872
Example document
Comment 2 PJ Fanning 2021-10-08 17:55:08 UTC
No one else has reported a similar issue, and the docx [Content_Types].xml just seems wrong.

<Override PartName="/word/media/rId21.png" ContentType="image/png;base64" />

rId21.png is not base64-encoded; it is a valid PNG file.

If this were a common issue, I would agree with hacking POI to handle it, but so far this seems like a bug in whatever app produced the attached docx.
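
For anyone who has to read such files anyway, one possible workaround is to repair the package before handing it to POI. The sketch below uses only java.util.zip: it copies every entry of the docx and removes the stray ";base64" suffix from ContentType values in [Content_Types].xml. The class name ContentTypeSanitizer is hypothetical, and the code assumes the XML is UTF-8 and uses double-quoted attributes, as in the Override element quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Apache POI.
final class ContentTypeSanitizer {
    /** Returns a copy of the package with ";base64" stripped from content type declarations. */
    static byte[] sanitize(byte[] docx) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] data = zin.readAllBytes(); // reads up to the end of the current entry
                if ("[Content_Types].xml".equals(entry.getName())) {
                    // Assumption: UTF-8 XML with double-quoted attribute values.
                    String xml = new String(data, StandardCharsets.UTF_8);
                    xml = xml.replace(";base64\"", "\"");
                    data = xml.getBytes(StandardCharsets.UTF_8);
                }
                // Fresh ZipEntry so sizes and CRC are recomputed for the rewritten data.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        }
        return out.toByteArray();
    }
}
```

The sanitized bytes could then be wrapped in a ByteArrayInputStream and passed to the XWPFDocument(InputStream) constructor. This has not been tested against the attached document.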
Comment 3 Dominik Stadler 2021-12-26 21:55:14 UTC
Sounds like a problem with the application which produces these files. Unless it happens more often, we do not plan to introduce more graceful parsing/handling of such files in Apache POI.