Created attachment 22388
Contains the JUnit test class and the XLS document used for testing.

Text contained in a TextBox inserted into an Excel 2003 document is not extracted. The attachment contains the JUnit test class and the document used for testing. We expected the words "testdoc" and "test phrase" to be extracted.

Notes on the attached files:
- "classic.TextInTextBox.xls" contains the words "testdoc" and "test phrase" in a TextBox inserted in the document.
- "TestUnitPoi35Filter.java" is the JUnit test class.
Created attachment 23191
Contains a JUnit test class and a DOC document used for testing.
Versions 3.2-FINAL through 3.5-beta1 also fail to extract the contents of text boxes in Word 97 documents. As with the previous comment, we have uploaded a JUnit test that reproduces the error with both WordExtractor and ExtractorFactory.
I get the same problem with the event-based parsers, for both the 97-2003 formats and the 2007 (xlsx) formats. If anyone can give me an idea of what code to add, I may be able to put it in, at least in the event-based parser, and post the code. I would also like to get hidden text and revision marks as settable options, and can write the code for that if someone can point me in the right direction.
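For the xlsx side of the event-based question, here is a minimal sketch that uses only the JDK's SAX parser and none of POI's API. It assumes that text box text in an xlsx package is stored in the drawing parts (typically xl/drawings/drawingN.xml) as DrawingML a:t runs inside a:p paragraphs. The class name TextBoxTextSketch, the method extractParagraphs and the SAMPLE markup are hypothetical and hand-written for illustration. SAMPLE is not taken from the attached files, and the sketch does not cover the binary 97-2003 formats.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: collects text box paragraphs
// from the XML of one drawing part using plain SAX events.
final class TextBoxTextSketch {

    // DrawingML main namespace (the "a:" prefix in drawing parts).
    static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Hand-written sample resembling a drawing part; an assumption for
    // illustration only, not extracted from the attached documents.
    static final String SAMPLE =
            "<xdr:wsDr xmlns:xdr=\"http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing\""
            + " xmlns:a=\"" + DRAWINGML_NS + "\">"
            + "<xdr:sp><xdr:txBody><a:bodyPr/>"
            + "<a:p><a:r><a:t>testdoc</a:t></a:r></a:p>"
            + "<a:p><a:r><a:t>test </a:t></a:r><a:r><a:t>phrase</a:t></a:r></a:p>"
            + "</xdr:txBody></xdr:sp></xdr:wsDr>";

    // Returns one string per non-empty a:p paragraph, joining its a:t runs.
    static List<String> extractParagraphs(String drawingXml) {
        final List<String> paragraphs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder paragraph = new StringBuilder();
            private boolean inText;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (!DRAWINGML_NS.equals(uri)) {
                    return;
                }
                if ("p".equals(localName)) {
                    paragraph.setLength(0);   // a new paragraph begins
                } else if ("t".equals(localName)) {
                    inText = true;            // character data is now text run content
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    paragraph.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!DRAWINGML_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName) && paragraph.length() > 0) {
                    paragraphs.add(paragraph.toString());
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // required to match on namespace URI
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(drawingXml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse drawing XML", e);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        for (String paragraph : extractParagraphs(SAMPLE)) {
            System.out.println(paragraph);
        }
    }
}
```

In a real extractor the drawing XML would have to be read from the package rather than from a string, and the result merged with the cell text. How that should hook into POI's event-based extractors is left open here.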
This is still failing in POI 3.15 final. Neither ExcelExtractor nor WordExtractor currently checks TextBox objects for text. Patches to add this functionality are welcome! Added a failing unit test in r1761841.