Bug 58858 - hidden characters not removed
Summary: hidden characters not removed
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: PC All
Importance: P2 critical
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2016-01-14 09:15 UTC by sebastian.a.aguirre
Modified: 2023-08-07 15:02 UTC

sample doc file to test (30.00 KB, application/msword)
2016-01-14 09:15 UTC, sebastian.a.aguirre

Description sebastian.a.aguirre 2016-01-14 09:15:35 UTC
Created attachment 33442
sample doc file to test

After reading the file and converting it to a String, the hidden characters are still present in the extracted text; I expected them to be removed.
This happens with XWPF as well.

To read the file I'm using very simple code:

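import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Open the binary .doc file and extract all of its text.
File file = new File("file.doc");
FileInputStream fis = new FileInputStream(file);
HWPFDocument doc = new HWPFDocument(fis);
WordExtractor ex = new WordExtractor(doc);
String toReturn = ex.getText(); // still contains the hidden characters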

The same thing happens with XWPF, using equally simple code:

// Here fis must be opened on a .docx file rather than the .doc above.
XWPFDocument doc = new XWPFDocument(fis);
XWPFWordExtractor ex = new XWPFWordExtractor(doc);
String toReturn = ex.getText();

I'm attaching a file you can use as a sample.
In Word, you can show or hide the hidden characters with Ctrl+Shift+8.
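
As a possible workaround until the extractors handle this, the hidden runs could be filtered out by hand. The following is only a minimal sketch of that idea, not POI code: `HiddenTextFilter`, `visibleText` and the stand-in `Run` class are hypothetical names made up for illustration, so that the example runs without POI on the classpath.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical helper, not part of POI: joins the text of every run
// that the supplied predicate does not report as hidden.
final class HiddenTextFilter {

    static <R> String visibleText(List<R> runs,
                                  Predicate<R> isHidden,
                                  Function<R, String> textOf) {
        StringBuilder out = new StringBuilder();
        for (R run : runs) {
            if (!isHidden.test(run)) {
                out.append(textOf.apply(run));
            }
        }
        return out.toString();
    }

    // Stand-in for a character run (assumption: a real run exposes
    // its text and a "hidden" flag in a similar way).
    static final class Run {
        final String text;
        final boolean hidden;

        Run(String text, boolean hidden) {
            this.text = text;
            this.hidden = hidden;
        }
    }

    public static void main(String[] args) {
        List<Run> runs = List.of(
                new Run("Visible ", false),
                new Run("secret ", true),
                new Run("text", false));
        // Prints "Visible text": the hidden middle run is skipped.
        System.out.println(visibleText(runs, r -> r.hidden, r -> r.text));
    }
}
```

With HWPF, the runs could presumably be collected from `doc.getRange()` using `numCharacterRuns()` and `getCharacterRun(i)`, and then passed in with `CharacterRun::isVanished` as the predicate and `CharacterRun::text` as the text function. I have not checked this against the attached file. The raw run text may also contain control characters that `WordExtractor` normally strips, so the result may need extra cleanup.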