Bug 46568 - PPTX text extraction works incorrectly: spaces and line breaks are removed in some cases
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: 3.5-dev
Hardware: PC Windows XP
Importance: P2 critical
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2009-01-20 10:48 UTC by sreeni
Modified: 2009-04-20 11:06 UTC
CC List: 0 users

PPTX file to be extracted (322.70 KB, application/vnd.openxmlformats-officedocument.presentationml.presentation)
2009-01-20 10:53 UTC, sreeni

Description sreeni 2009-01-20 10:48:10 UTC
The PPTX issue manifests itself when a document is decomposed and its text is
searched for a string.  For some reason, some whitespace and line breaks are
deleted during extraction.

If you try to match a word such as "Friday" that has been concatenated with
another string (such as "otherFriday"), the match will fail.  Note that a
regular expression match will work, however.  This behavior has been observed
in 3 of 8 randomly selected PPTX files downloaded from the internet.  However,
document identification seems to work just fine, so the only users of the new
POI engine who would be affected are those decomposing attachments and
searching them for a simple string (and they would only be affected on
PowerPoint 2007 documents).  As noted above, regular expression matching is a
workaround that could be employed.
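
To illustrate one plausible reading of the failure described above, here is a minimal, self-contained Java sketch. It does not use POI; the sample strings and the helper names `containsWord` and `regexFind` are assumptions made up for this illustration. It shows how a lost line break can defeat a whole-word search while an unanchored regular expression still finds the text.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class WhitespaceLossDemo {
    // Whole-word search: split on whitespace and compare tokens exactly.
    static boolean containsWord(String text, String word) {
        return Arrays.asList(text.trim().split("\\s+")).contains(word);
    }

    // Regex search: finds the pattern anywhere, ignoring word boundaries.
    static boolean regexFind(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Hypothetical slide text, not taken from the attached file.
        String expected = "See you on the other\nFriday";
        // The same text as it might look once the line break is dropped.
        String extracted = "See you on the otherFriday";

        System.out.println(containsWord(expected, "Friday"));
        System.out.println(containsWord(extracted, "Friday"));
        System.out.println(regexFind(extracted, "Friday"));
    }
}
```

Under these assumptions the whole-word search succeeds on the intended text and fails on the extracted text, while the regular expression matches in both cases.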
Comment 1 sreeni 2009-01-20 10:53:22 UTC
Created attachment 23143
PPTX file to be extracted

Please use this PPTX to extract the text.  The spaces and carriage returns are removed.
Comment 2 Yegor Kozlov 2009-04-20 11:06:44 UTC
Fixed in r766775
CTTextLineBreak elements were not properly processed, resulting in missing line breaks.
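
For readers unfamiliar with the format: in a PPTX paragraph, a line break is stored as its own element between text runs, so an extractor that only concatenates run text silently drops it. The sketch below shows the idea in plain Java. It is not the code from r766775 and does not use POI; the `Item` class and both method names are hypothetical stand-ins for the paragraph children.

```java
import java.util.Arrays;
import java.util.List;

public class LineBreakSketch {
    // Hypothetical stand-in for one paragraph child: a text run or a line break.
    static final class Item {
        final String text; // null marks a line break

        private Item(String text) { this.text = text; }

        static Item run(String text) { return new Item(text); }
        static Item lineBreak() { return new Item(null); }

        boolean isBreak() { return text == null; }
    }

    // Models the reported behavior: breaks are skipped, so runs are glued together.
    static String extractIgnoringBreaks(List<Item> items) {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {
            if (!item.isBreak()) {
                sb.append(item.text);
            }
        }
        return sb.toString();
    }

    // Models the intended behavior: each break contributes a newline.
    static String extractWithBreaks(List<Item> items) {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {
            sb.append(item.isBreak() ? "\n" : item.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Item> paragraph = Arrays.asList(
                Item.run("other"), Item.lineBreak(), Item.run("Friday"));
        System.out.println(extractIgnoringBreaks(paragraph));
        System.out.println(extractWithBreaks(paragraph));
    }
}
```

With the break skipped, the two runs come out as the single token "otherFriday", which matches the symptom in the description.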