Bug 54722 - HSLF: Text cannot be read from tables
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: 3.9-FINAL
Hardware: PC Linux
Importance: P2 regression
Target Milestone: ---
Assignee: POI Developers List
Duplicates: 54736
Depends on:
Reported: 2013-03-18 22:28 UTC by Phil Persad
Modified: 2013-09-27 15:46 UTC

Contains text in table (76.00 KB, application/vnd.ms-powerpoint)
2013-03-18 22:28 UTC, Phil Persad
Patch for reading text from tables (1.49 KB, patch)
2013-04-25 16:59 UTC, Phil Persad

Description Phil Persad 2013-03-18 22:28:12 UTC
Created attachment 30077
Contains text in table

When calling PowerPointExtractor.getText on a .ppt file, none of the text contained within a table is extracted.  I've created and attached a simple document that demonstrates the behaviour in question.

I upgraded from 3.5 to 3.9.  I cannot guarantee that this issue did not exist in 3.5, as the version I upgraded from was an internal branch with a mostly undocumented patch history.  I've assumed that the issue did not exist in the 3.5 POI release and have therefore classified this as a regression.
Comment 1 Phil Persad 2013-04-25 16:59:17 UTC
Created attachment 30226
Patch for reading text from tables

This appears to be caused during the construction of PPDrawing objects, specifically in the portion where the textboxWrappers field is populated.

The first EscherContainerRecord of type 0xf003 (SpgrContainer) is found within the container of type 0xf002 (DgContainer).  Then all containers of type 0xf004 (SpContainer) are found directly within that SpgrContainer.  However, the SpgrContainer may also hold other containers of type 0xf003, which in turn contain 0xf004s which eventually contain text.  This is the case when a slide contains a table which contains text.

I've written a patch which traverses an additional layer by looking for 0xf004s within 0xf003s within the SpgrContainer.  I was torn between specifically going one layer deeper and performing a recursive search.  While I suspect that the recursive search may be more correct, the former approach is safer (particularly as I'm not an expert on the data format).  Some feedback on which approach is preferable would be appreciated.
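
To make the trade-off concrete, here is a minimal sketch of the three lookups (the original direct-children search, the one-layer-deeper search, and the recursive search) over a simplified container tree.  This is not the attached patch and does not use the POI API: the ShapeTreeSketch and Node classes, the method names, and the sample tree are all hypothetical, and text is attached directly to the 0xf004 node purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for an Escher container tree. Hypothetical; NOT the POI API. */
final class ShapeTreeSketch {
    // Record type IDs quoted in the comment above.
    static final int SPGR_CONTAINER = 0xF003;
    static final int SP_CONTAINER = 0xF004;

    /** Hypothetical node: a record ID, optional text, and child containers. */
    static final class Node {
        final int id;
        final String text; // null when the container carries no text
        final List<Node> children = new ArrayList<>();

        Node(int id, String text, Node... kids) {
            this.id = id;
            this.text = text;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    /** Behaviour described in the report: only 0xF004s directly under the group. */
    static List<String> collectDirect(Node spgr) {
        List<String> out = new ArrayList<>();
        for (Node child : spgr.children) {
            if (child.id == SP_CONTAINER && child.text != null) {
                out.add(child.text);
            }
        }
        return out;
    }

    /** One extra layer: also look for 0xF004s inside 0xF003 children of the group. */
    static List<String> collectOneLevelDeeper(Node spgr) {
        List<String> out = collectDirect(spgr);
        for (Node child : spgr.children) {
            if (child.id == SPGR_CONTAINER) {
                out.addAll(collectDirect(child));
            }
        }
        return out;
    }

    /** Recursive search: descend through nested 0xF003 groups at any depth. */
    static List<String> collectRecursive(Node spgr) {
        List<String> out = new ArrayList<>();
        for (Node child : spgr.children) {
            if (child.id == SP_CONTAINER && child.text != null) {
                out.add(child.text);
            } else if (child.id == SPGR_CONTAINER) {
                out.addAll(collectRecursive(child));
            }
        }
        return out;
    }

    /** Sample tree: a plain text box, a table-like group, and a group nested inside it. */
    static Node sample() {
        Node deepGroup = new Node(SPGR_CONTAINER, null, new Node(SP_CONTAINER, "deep"));
        Node table = new Node(SPGR_CONTAINER, null, new Node(SP_CONTAINER, "cell"), deepGroup);
        return new Node(SPGR_CONTAINER, null, new Node(SP_CONTAINER, "title"), table);
    }
}
```

On the sample tree, the direct search finds only "title", the one-layer-deeper search also finds "cell", and only the recursive search reaches "deep".  Whether real .ppt files ever nest groups more than one level below the top SpgrContainer is exactly the open question above; the sketch only shows what each approach would miss if they do.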
Comment 2 Trejkaz (pen name) 2013-04-28 23:53:51 UTC
*** Bug 54736 has been marked as a duplicate of this bug. ***
Comment 3 Tim Allison 2013-09-27 15:46:58 UTC
Fixed in r1526960.

Added the attached file as a test case.

Thank you!