Created attachment 22381: contains the JUnit test class and the documents used for testing.
Text contained in a TextBox inserted or created in a Word 2007 document is not extracted. The JUnit test class and the documents used for testing are attached. We expected to extract the words "testdoc" and "test phrase".
Notes on the attached documents:
- "classic_TextInTextBox.docx" and "form_TextInTextBox.docx" contain the word "testdoc" in a TextBox inserted in the document.
- "TestUnitPoi35Filter.java" is the JUnit class.
Versions 3.2-FINAL through 3.5-beta1 also fail to extract the contents of text boxes in Word 97 documents. As in the previous comment, we have uploaded a JUnit test that reproduces the error with WordExtractor and ExtractorFactory.
I just looked into this. The general issue was fixed in 3.9. The test document does bring out a formatting issue, though: a newline is incorrectly inserted between runs, so "testdoc" is extracted as "test\ndoc". I am closing this issue and opening a new one for the newline problem.
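To tell the two symptoms in this thread apart (text box content missing entirely versus a word split by a stray line break), a test could compare a strict match with a line-break-tolerant one. The following is a minimal sketch in plain Java and is not taken from the attached JUnit class; the class name `RunBreakCheck` and both method names are hypothetical, and the input string only stands in for whatever the extractor returns.

```java
public class RunBreakCheck {

    // Strict check: the word must appear contiguously in the extracted text.
    public static boolean containsExactly(String extracted, String word) {
        return extracted.contains(word);
    }

    // Lenient check: remove CR and LF characters first, so a word that was
    // split by a line break between runs still matches.
    public static boolean containsIgnoringLineBreaks(String extracted, String word) {
        String joined = extracted.replace("\r", "").replace("\n", "");
        return joined.contains(word);
    }

    public static void main(String[] args) {
        // Stand-in for extractor output with the shape reported in this thread.
        String extracted = "test\ndoc";

        System.out.println(containsExactly(extracted, "testdoc"));            // prints "false"
        System.out.println(containsIgnoringLineBreaks(extracted, "testdoc")); // prints "true"
    }
}
```

If the strict check fails while the lenient one passes, the text box content is being extracted and only the line break between runs is wrong. The lenient check also joins words that are legitimately on separate lines, so it is only useful as a diagnostic and not as the final assertion.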