Bug 66422 - WordExtractor.getParagraphText() - capitalized text
Summary: WordExtractor.getParagraphText() - capitalized text
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-01-14 17:16 UTC by Franz Seidl
Modified: 2023-02-10 21:15 UTC
CC List: 0 users



Attachments
example (533 bytes, text/plain)
2023-01-14 17:18 UTC, Franz Seidl
Description Franz Seidl 2023-01-14 17:16:23 UTC
The fix for Bug 63576 doesn't completely resolve the issue with capitalized text.

The method WordExtractor.getParagraphText() still returns the text in lowercase.

You can reproduce this with the example .doc file attached to Bug 63576.


---------
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.extractor.WordExtractor;

public class WordTextExtractorDoc {

	public static void main(String[] args) {
		try {
			WordExtractor wordExtDoc = new WordExtractor(new FileInputStream("capitalized.doc"));
			System.out.println(wordExtDoc.getText());
			wordExtDoc.close();
		} catch (IOException e) {
			e.printStackTrace();
		}

	}

}
---------

Output is:
---------
The following word is: CAPITALIZED.


--
The following word is: capitalized.
---------

I expect the last line to contain "CAPITALIZED" as well.

Tested with version 5.3.2.
Comment 1 Franz Seidl 2023-01-14 17:18:51 UTC
Created attachment 38462 [details]
example
Comment 2 Franz Seidl 2023-01-14 17:19:53 UTC
Sorry, I copied old source code into the description above.

I've uploaded the correct example as a Java file.
Comment 3 PJ Fanning 2023-02-10 13:07:39 UTC
See https://bz.apache.org/bugzilla/show_bug.cgi?id=63576 for a related issue.
Comment 4 PJ Fanning 2023-02-10 21:15:50 UTC
The sample in the original description appears to be incorrect: the sample code does not call getParagraphText().

I can confirm that getParagraphText() does not capitalize the text. It works in a completely different way from getText(): getParagraphText() ignores the character runs. I don't know much about the HWPF code, but the H is for Horrible (check the history of POI and the HWPF API). Someone else might have a look, but in 2023 I no longer care about the .doc format and the POI support for it. It is an anachronism as far as I am concerned.

XWPFWordExtractor does not expose a getParagraphText() method so this issue affects only the HWPF WordExtractor.
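
To illustrate why ignoring the character runs loses the capitalization, here is a minimal self-contained sketch. It does not use the POI API at all: the Run class and the two render methods are hypothetical names invented for this illustration, and it assumes the "all caps" formatting is stored as a flag on the run while the underlying text stays lowercase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CapsRunSketch {

    // Hypothetical stand-in for a character run: a slice of text plus an "all caps" flag.
    static final class Run {
        final String text;
        final boolean allCaps;

        Run(String text, boolean allCaps) {
            this.text = text;
            this.allCaps = allCaps;
        }
    }

    // Run-aware rendering: applies each run's formatting flag to its text.
    static String renderWithRuns(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(run.allCaps ? run.text.toUpperCase(Locale.ROOT) : run.text);
        }
        return sb.toString();
    }

    // Run-agnostic rendering: concatenates the stored text and ignores the flags.
    static String renderRaw(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(run.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed layout of the sample paragraph: only the middle run carries the caps flag.
        List<Run> paragraph = Arrays.asList(
                new Run("The following word is: ", false),
                new Run("capitalized", true),
                new Run(".", false));

        System.out.println(renderWithRuns(paragraph));
        System.out.println("--");
        System.out.println(renderRaw(paragraph));
    }
}
```

In this model the run-aware path yields the uppercase word and the raw path yields the lowercase one, which matches the two lines of output in the description. Whether a fix in HWPF would take the same shape is an open question; this only models the symptom.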