Bug 44431

Summary: HWPFDocument.write destroys fields
Product: POI
Reporter: dnapoletano <domenico.napoletano>
Component: HWPF
Assignee: POI Developers List <dev>
Severity: normal
CC: domenico.napoletano
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: other   
Attachments: Sample document for testing writing
Test doc after reading and rewriting

Description dnapoletano 2008-02-15 04:11:42 UTC
When I open and resave a Word document with

InputStream is = new FileInputStream("/home/esempio.doc");
HWPFDocument docInput = new HWPFDocument(is);
OutputStream os = new FileOutputStream("/home/TEST_POI.doc");
docInput.write(os);  // resave the document unchanged
os.close();

all fields in the document (TOC items, STYLEREF and so on) are destroyed and
converted to plain text; for example, a FILENAME field becomes "STYLEREF
TitoloDocumento \* MERGEFORMAT esempio.doc".

The problem may lie in the handling of control characters: fields in MS Word
are stored inline in the normal text, as a sequence like

0x13 <field info> 0x14 <field value> 0x15

and the text in the document saved by POI becomes

<field info> <field value>
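
To make the marker layout concrete, here is a minimal Java sketch that splits the first field found in a run of text into its <field info> and <field value> parts. The names FieldSplitSketch and splitFirstField are hypothetical and not part of POI; the sketch assumes a flat (non-nested) field and works on a plain String, not on the HWPF data structures.

```java
final class FieldSplitSketch {
    // Control characters that delimit a field in the Word text stream
    static final char FIELD_BEGIN = 0x13;
    static final char FIELD_SEPARATOR = 0x14;
    static final char FIELD_END = 0x15;

    /**
     * Returns {field info, field value} for the first field in text,
     * or null when no complete field is present.
     * Assumption: the field is not nested inside another field.
     */
    static String[] splitFirstField(String text) {
        int begin = text.indexOf(FIELD_BEGIN);
        if (begin < 0) {
            return null;
        }
        int end = text.indexOf(FIELD_END, begin + 1);
        if (end < 0) {
            return null;
        }
        int sep = text.indexOf(FIELD_SEPARATOR, begin + 1);
        if (sep < 0 || sep > end) {
            // No separator before the end mark: the field has no value part
            return new String[] { text.substring(begin + 1, end).trim(), "" };
        }
        return new String[] {
            text.substring(begin + 1, sep).trim(),
            text.substring(sep + 1, end).trim()
        };
    }

    public static void main(String[] args) {
        String text = "File name is " + FIELD_BEGIN + " FILENAME "
                + FIELD_SEPARATOR + "esempio.doc" + FIELD_END;
        String[] parts = splitFirstField(text);
        System.out.println("info:  " + parts[0]);
        System.out.println("value: " + parts[1]);
    }
}
```

A document that is written back correctly would have to keep all three markers in place; the saved output described here has lost them, which is why the field info shows up as visible text.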

The same problem also affects text extraction: a text portion like

File name is [esempio.doc]

in which "[esempio.doc]" represents a filename field, becomes

File name is STYLEREF TitoloDocumento \* MERGEFORMAT esempio.doc

in extracted text.
I've partially solved this latter issue with the following Java method (s is
the text portion to clean)

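private static String rimuoviCampi(String s) {
	// drop the field info: everything from the 0x13 begin mark up to the 0x14 separator
	s = s.replaceAll("\\x13[^\\x13\\x14]*\\x14", "");
	// drop the 0x15 end mark, keeping the field value
	s = s.replaceAll("\\x15", "");
	s = s.trim();
	return s;
}
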
but the problem remains unsolved when saving the document.
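
For extraction only, one case a regex-based cleanup may miss is a field nested inside another field's info part (this is an assumption about what "partially" covers, not something tested against the attached documents). A minimal sketch of a stack-based alternative follows; FieldTextSketch and keepFieldValues are hypothetical names, not POI API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FieldTextSketch {
    static final char FIELD_BEGIN = 0x13;
    static final char FIELD_SEPARATOR = 0x14;
    static final char FIELD_END = 0x15;

    /**
     * Removes field markers and field info, keeping only field values.
     * Text is emitted only while no enclosing field is still in its
     * info part, so nested fields are handled as well.
     */
    static String keepFieldValues(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // One entry per open field: TRUE while we are inside its info part
        Deque<Boolean> inInfo = new ArrayDeque<>();
        int hidden = 0; // number of open fields still in their info part
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == FIELD_BEGIN) {
                inInfo.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                if (!inInfo.isEmpty() && inInfo.peek()) {
                    // Switch the innermost field from info to value
                    inInfo.pop();
                    inInfo.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field closed without a separator was still hidden
                if (!inInfo.isEmpty() && inInfo.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                out.append(c);
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String text = "File name is " + FIELD_BEGIN + " FILENAME "
                + FIELD_SEPARATOR + "esempio.doc" + FIELD_END;
        System.out.println(keepFieldValues(text));
    }
}
```

Unbalanced markers are simply dropped in this sketch. It only cleans extracted text and does nothing for the write path.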

Thanks in advance

Comment 1 Nick Burch 2008-11-12 06:46:15 UTC
There has been some hwpf work on fields that is in 3.2. Any chance you could re-test and see if it's now fixed?

Comment 2 dnapoletano 2008-11-14 08:02:37 UTC
Created attachment 22872
Sample document for testing writing

Comment 3 dnapoletano 2008-11-14 08:03:11 UTC
Created attachment 22873
Test doc after reading and rewriting

Comment 4 dnapoletano 2008-11-14 08:07:05 UTC
Sorry, but it's still destroying fields...

I've tested the latest POI source code (trunk revision) with the attached document, reading and writing the document as is with these two lines

HWPFDocument doc = new HWPFDocument(new FileInputStream("/home/jars/Desktop/FieldsTest.doc"));
doc.write(new FileOutputStream("/home/jars/Desktop/FieldsTest after.doc"));

where the document (contained in the FIRST attachment; the SECOND attachment contains the resaved document) contains:

1) a "num page" field, rendered *correctly*

2) a "num pages" field, rendered *correctly*

3) a "style ref" field, RENDERED AS TEXT: the original text 

   STYLEREF test

   with style "TitoloDocumento", becomes

   "TitoloDocumento"STYLEREF test

4) a "file name" field, RENDERED AS TEXT: the original text (the bare file name with extension) becomes

FILENAME FieldsTest.doc

5) a "TOC" field, RENDERED AS TEXT: the original TOC content

Heading paragraph in next page	2
Another heading paragraph in further page	3

becomes

TOC \f \o "1-9" \t "Intestazione 1;1" Heading paragraph in next page	2
Another heading paragraph in further page	3§

Comment 5 Yegor Kozlov 2011-06-24 08:17:41 UTC
The problem is still reproducible in trunk (as of r1138799).

Comment 6 Sergey Vladimirov 2011-07-24 18:41:59 UTC
Seems to be fixed in trunk. Some formatting is still missing, but the fields are in place, including in headers and footers.