Bug 45552

Summary:	Content from a document (docx, xlsx, or pptx) linked to a 2007 pptx document is not extracted.
Product:	POI	Reporter:	xtrim <grizolle_benedicte>
Component:	POI Overall	Assignee:	POI Developers List <dev>
Status:	RESOLVED WONTFIX
Severity:	normal
Priority:	P2
Version:	unspecified
Target Milestone:	---
Hardware:	PC
OS:	Windows Server 2003
Attachments:	Contains JUnit test class and documents used for testing.

Description xtrim 2008-08-05 05:33:43 UTC

Created attachment 22375 [details]
Contains JUnit test class and documents used for testing.

Text contained in a document that is linked from a PowerPoint 2007 (pptx) document is not extracted.
The attachment contains the JUnit test class and the documents used for testing.
We expected the word "testdoc" to be extracted.

Notes on the attached documents:

- "ContentLinkedObject_word.pptx" links to a docx document that contains the word "testdoc".
- "ContentLinkedObject_excel.pptx" links to an xlsx document that contains the word "testdoc".
- "ContentLinkedObject_ppt.pptx" links to a pptx document that contains the word "testdoc".

"TestUnitPoi35Filter.java" is the JUnit test class.

Comment 1 Dominik Stadler 2016-04-10 11:43:00 UTC

As far as I can see with LibreOffice, these are actually hyperlinks to local files, not embedded documents, so I don't think POI should try to extract text from them by default. At most this could be an advanced option to enable, but even then it is likely better done in user space: you can iterate the shapes, check whether they have hyperlinks, and then try to open the documents those hyperlinks point to.

For now I don't think we will work on this in POI itself unless someone proposes patches together with proper unit-test coverage.
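
For anyone who wants to try the user-space route from Comment 1, here is a minimal sketch. It does not use the POI shape API. As a dependency-free variation on the same idea, it reads the slide relationship parts straight out of the pptx zip package and lists every target marked TargetMode="External", which is how OPC packages record links to files outside the package. The class name ExternalLinkLister and both method names are hypothetical. The "ppt/slides/_rels/" prefix is an assumption based on the usual PowerPoint package layout. InputStream.readAllBytes() needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not part of POI. */
public class ExternalLinkLister {

    /** Collects the Target of every relationship marked TargetMode="External" in one .rels part. */
    static List<String> externalTargets(InputStream relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the presentation may come from an untrusted source.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(relsXml);

        List<String> targets = new ArrayList<>();
        NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if ("External".equals(rel.getAttribute("TargetMode"))) {
                targets.add(rel.getAttribute("Target"));
            }
        }
        return targets;
    }

    /** Scans the slide relationship parts of a pptx package (assumed layout) for external targets. */
    static List<String> externalTargetsInPptx(InputStream pptx) throws Exception {
        List<String> targets = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("ppt/slides/_rels/") && name.endsWith(".rels")) {
                    // Copy the entry first so the XML parser cannot close the zip stream.
                    byte[] xml = zip.readAllBytes();
                    targets.addAll(externalTargets(new ByteArrayInputStream(xml)));
                }
            }
        }
        return targets;
    }
}
```

Calling externalTargetsInPptx(new FileInputStream("ContentLinkedObject_word.pptx")) should return the link targets as strings. The caller then decides whether each target can be resolved to a readable local file and, if so, opens it with whatever text extractor is already in use. Keeping that decision in the caller avoids POI silently opening arbitrary files referenced by a document.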