Bug 63576 doesn't completely fix the issue with capitalized text. The method WordExtractor.getParagraphText() still returns the text in lowercase. You can use the example .doc file from Bug 63576.

---------
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.extractor.WordExtractor;

public class WordTextExtractorDoc {
    public static void main(String[] args) {
        try {
            WordExtractor wordExtDoc = new WordExtractor(new FileInputStream("capitalized.doc"));
            System.out.println(wordExtDoc.getText());
            wordExtDoc.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
---------

Output is:

---------
The following word is: CAPITALIZED.
--
The following word is: capitalized.
---------

I expect the last line to contain "CAPITALIZED" as well. Tested with version 5.3.2.
Created attachment 38462: example
Sorry, I copied old source code into the comment above. I've uploaded the correct example as a Java file.
See https://bz.apache.org/bugzilla/show_bug.cgi?id=63576 for the related issue.
The sample in the original description appears incorrect in that the sample code does not use getParagraphText(). I can confirm that getParagraphText() does not capitalize the text. It works in a completely different way from getText(): getParagraphText() ignores the character runs.

I don't know much about the HWPF code, but the H is for Horrible (check the history of POI and the HWPF API). Someone else might have a look, but in 2023 I no longer care about the .doc format and the POI support for it. It is an anachronism as far as I am concerned.

XWPFWordExtractor does not expose a getParagraphText() method, so this issue affects only the HWPF WordExtractor.
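
To illustrate why ignoring the character runs loses the capitalization, here is a minimal, self-contained sketch. It does not call POI at all: the Run class, its capitalized flag, and the two helper methods are hypothetical stand-ins, loosely modelled on the idea that a run stores its text as typed plus an "all caps" formatting flag (in HWPF the corresponding flag is reported by CharacterRun.isCapitalized()). It is not the actual WordExtractor implementation.

```java
import java.util.List;
import java.util.Locale;

public class RunCapsSketch {

    /** Hypothetical stand-in for a character run: text as stored, plus an "all caps" flag. */
    static final class Run {
        final String text;
        final boolean capitalized;

        Run(String text, boolean capitalized) {
            this.text = text;
            this.capitalized = capitalized;
        }
    }

    /** Concatenates the stored text and ignores run formatting (the behaviour reported here). */
    static String rawText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    /** Applies the "all caps" flag run by run before concatenating (the expected behaviour). */
    static String renderedText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.capitalized ? r.text.toUpperCase(Locale.ROOT) : r.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed run layout for the sentence in the example document.
        List<Run> paragraph = List.of(
                new Run("The following word is: ", false),
                new Run("capitalized", true),
                new Run(".", false));

        System.out.println(rawText(paragraph));      // The following word is: capitalized.
        System.out.println(renderedText(paragraph)); // The following word is: CAPITALIZED.
    }
}
```

A workaround along these lines would presumably walk the paragraphs and character runs of the document's Range directly instead of calling getParagraphText(), but I have not verified that against the example file.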