Bug 66422 - WordExtractor.getParagraphText() - capitalized text
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2023-01-14 17:16 UTC by Franz Seidl
Modified: 2023-02-10 21:15 UTC

example (533 bytes, text/plain)
2023-01-14 17:18 UTC, Franz Seidl

Description Franz Seidl 2023-01-14 17:16:23 UTC
The fix for Bug 63576 doesn't completely resolve the issue with capitalized text.

The method WordExtractor.getParagraphText() still returns the text in lowercase.

You can use the example doc file from Bug 63576.

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.extractor.WordExtractor;

public class WordTextExtractorDoc {

	public static void main(String[] args) {
		try {
			WordExtractor wordExtDoc = new WordExtractor(new FileInputStream("capitalized.doc"));
			// ... the rest of the try block is missing from this report
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Output is:
The following word is: CAPITALIZED.

The following word is: capitalized.

I expect the last line to contain "CAPITALIZED" as well.

Tested with version 5.2.3.
Comment 1 Franz Seidl 2023-01-14 17:18:51 UTC
Created attachment 38462
Comment 2 Franz Seidl 2023-01-14 17:19:53 UTC
Sorry, I copied old source code into the description above.

I've uploaded the correct example as a Java file.
Comment 3 PJ Fanning 2023-02-10 13:07:39 UTC
See https://bz.apache.org/bugzilla/show_bug.cgi?id=63576 for a related issue.
Comment 4 PJ Fanning 2023-02-10 21:15:50 UTC
The sample in the original description appears incorrect in that the sample code does not use getParagraphText().

I can confirm that getParagraphText() does not capitalize the text. It works in a completely different way from getText(): getParagraphText() ignores the character runs. I don't know much about the HWPF code, but the H is for Horrible (check the history of POI and the HWPF API). Someone else might have a look, but in 2023 I no longer care about the .doc format or the POI support for it. It is an anachronism as far as I am concerned.

XWPFWordExtractor does not expose a getParagraphText() method so this issue affects only the HWPF WordExtractor.
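
To illustrate why ignoring the character runs loses the capitalization, here is a minimal, self-contained sketch. It does not use POI at all: the `Run` class and both methods are hypothetical stand-ins, assuming that a .doc file stores the characters as typed and records "all caps" as a formatting flag on the run. It is a model of the behavior described above, not the HWPF implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class RunAwareText {

    /** Hypothetical stand-in for a character run: stored text plus an "all caps" flag. */
    public static final class Run {
        final String text;
        final boolean allCaps;

        public Run(String text, boolean allCaps) {
            this.text = text;
            this.allCaps = allCaps;
        }
    }

    /** Models an extractor that ignores run formatting: stored characters are returned unchanged. */
    public static String plainText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    /** Models a run-aware extractor: runs flagged as all caps are upper-cased. */
    public static String renderedText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.allCaps ? r.text.toUpperCase(Locale.ROOT) : r.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed layout of the sample paragraph: only the middle run carries the flag.
        List<Run> paragraph = Arrays.asList(
                new Run("The following word is: ", false),
                new Run("capitalized", true),
                new Run(".", false));

        System.out.println(plainText(paragraph));    // The following word is: capitalized.
        System.out.println(renderedText(paragraph)); // The following word is: CAPITALIZED.
    }
}
```

In HWPF itself, the corresponding information would presumably come from walking each paragraph's character runs and checking CharacterRun.isCapitalized() before appending CharacterRun.text(); whether that is the right place for a fix is left to someone who knows the HWPF code.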