Bug 51604 - replace text corrupts doc file
Status: REOPENED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.8-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
 
Reported: 2011-08-03 02:43 UTC by sanmoy
Modified: 2017-01-21 19:07 UTC

Attachments
test input file for hwpf (26.00 KB, application/msword)
2011-08-04 00:30 UTC, sanmoy
test output file for hwpf (20.00 KB, application/msword)
2011-08-11 05:26 UTC, sanmoy
comparison result (40.00 KB, image/png)
2011-08-11 05:27 UTC, sanmoy
error-snap-shot (22.17 KB, image/png)
2011-08-11 09:25 UTC, sanmoy

Description sanmoy 2011-08-03 02:43:00 UTC
I wrote this simple piece of code:

FileInputStream fileInputStream = new FileInputStream(new File("C:\\in.doc"));
FileOutputStream fileOutputStream = new FileOutputStream(new File("C:\\out.doc"));
HWPFDocument hwpfDocument = new HWPFDocument(fileInputStream);
Range range = hwpfDocument.getRange();
int numParagraph = range.numParagraphs();
for (int i = 0; i < numParagraph; i++) {
    Paragraph paragraph = range.getParagraph(i);
    int numCharRuns = paragraph.numCharacterRuns();
    for (int j = 0; j < numCharRuns; j++) {
        CharacterRun charRun = paragraph.getCharacterRun(j);
        String text = charRun.text();
        charRun.replaceText(text, "added");
    }
}
hwpfDocument.write(fileOutputStream);

After execution, the output file is corrupted (I tried to open it with Office 2007), while the input file opens properly. The input file is very basic, with 4-5 simple lines.

No exception is thrown.

I wrote similar code using XWPFDocument, and it works fine.
Comment 1 sanmoy 2011-08-03 02:54:44 UTC
One addition:

fileOutputStream.close() is present in the original code; I left it out by mistake when copying the code into this report.

Please note that the following code works perfectly:

XWPFDocument xwpfDocument = new XWPFDocument(inputStream);

List<XWPFParagraph> paragraphs = xwpfDocument.getParagraphs();
for (XWPFParagraph xParagraph : paragraphs) {
    for (XWPFRun xwpfRun : xParagraph.getRuns()) {
        xwpfRun.setText("replace", 0);
    }
}
xwpfDocument.write(outputStream);
outputStream.close();
Comment 2 Yegor Kozlov 2011-08-03 06:58:05 UTC
Which version of POI? Please try with the latest build from trunk, there have been quite a lot of updates recently.

If the problem is still there, please attach the problematic file. Without the input .doc file we can't do much to help you.

Yegor
Comment 3 sanmoy 2011-08-04 00:30:42 UTC
Created attachment 27346
test input file for hwpf

I have tested with the 6-Jul-2011 nightly build (poi-bin-3.8-beta3-20110606.tar).
Comment 4 sanmoy 2011-08-04 00:40:19 UTC
Now I have tested with yesterday's nightly build poi-3.8-beta4-20110803 from http://encore.torchbox.com/poi-svn-build/
Comment 5 Jacob 2011-08-05 16:15:49 UTC
Very interested in this bug.  Having the same issue. Can't use XWPF since contract requires us to maintain original version.
Comment 6 Sergey Vladimirov 2011-08-09 05:26:05 UTC
Fixed in r1155211 (3.8-beta4-20110810 or later). Please test.
Comment 7 sanmoy 2011-08-11 05:26:24 UTC
Created attachment 27369
test output file for hwpf

I have tested with 3.8-beta4-20110810, but the defect still exists.

I have now added a check so that the text is replaced only when the run contains the string "Header" (please see the input doc):

if (text.contains("Header"))
    charRun.replaceText(text, "added");

I compared the output file with the input using Beyond Compare (a file comparison tool). It shows that the text has been replaced properly, but the document format is corrupted. I don't know much about the doc format, so I cannot comment further. I have attached the output file; I hope it helps.
Comment 8 sanmoy 2011-08-11 05:27:38 UTC
Created attachment 27370
comparison result

comparison result
Comment 9 Sergey Vladimirov 2011-08-11 08:23:29 UTC
Sanmoy,

I don't see any difference between in.doc and out.doc, except for the "Header" -> "added" change. What exactly is broken?

Sergey
Comment 10 sanmoy 2011-08-11 09:25:46 UTC
Created attachment 27371
error-snap-shot

Please try to open the output file with MS Office 2003 or 2007; it reports that the file is corrupted.
Comment 11 Sergey Vladimirov 2011-08-11 18:43:51 UTC
Sanmoy,

I fixed a couple of issues (FIB and stylesheet processing) that may be the reason why the file could not be opened by Microsoft Office. Please try with the next nightly build or the trunk version.

The resulting file still does not pass the binary file validation tool, though :(

Sergey
Comment 12 Sergey Vladimirov 2011-08-16 10:20:13 UTC
Sanmoy,

Please check the latest trunk version, or 3.8-beta4 or later.
Comment 13 sanmoy 2011-08-20 10:50:33 UTC
The defect has been fixed. Thanks.
Comment 14 sanmoy 2011-09-04 16:49:18 UTC
/**
* Replace (all instances of) a piece of text with another...
*
* @param pPlaceHolder
*            The text to be replaced (e.g., "${organization}")
* @param pValue
*            The replacement text (e.g., "Apache Software Foundation")
*/
public void replaceText(String pPlaceHolder, String pValue) 

The replaceText API does not work if the String pValue contains the String pPlaceHolder.

For example, if pPlaceHolder="abcd" and pValue="abcd", "abcdef", or "12abcdef", this code goes into an infinite loop.

To reproduce, change the original test code to call charRun.replaceText(text, text), that is, try to replace the original value with itself. It does not work; it falls into an infinite loop. For your convenience, I am copying the original code again. Please test it with the attached files.

FileInputStream fileInputStream = new FileInputStream(new File("C:\\in.doc"));
FileOutputStream fileOutputStream = new FileOutputStream(new File("C:\\out.doc"));
HWPFDocument hwpfDocument = new HWPFDocument(fileInputStream);
Range range = hwpfDocument.getRange();
int numParagraph = range.numParagraphs();
for (int i = 0; i < numParagraph; i++) {
    Paragraph paragraph = range.getParagraph(i);
    int numCharRuns = paragraph.numCharacterRuns();
    for (int j = 0; j < numCharRuns; j++) {
        CharacterRun charRun = paragraph.getCharacterRun(j);
        String text = charRun.text();
        charRun.replaceText(text, text);
    }
}
hwpfDocument.write(fileOutputStream);
fileOutputStream.close();

I have tested with the latest nightly build, 3.8-beta5-20110904.

I have debugged the POI code and found the problem in the following logic:
String text = text();
int offset = text.indexOf(pPlaceHolder);
text() returns the already-replaced value, so if the replacement value contains the original String, offset is always >= 0 and keeps increasing, and the loop never exits.

public void replaceText(String pPlaceHolder, String pValue) {
    boolean keepLooking = true;
    while (keepLooking) {
        String text = text();
        int offset = text.indexOf(pPlaceHolder);
        if (offset >= 0)
            replaceText(pPlaceHolder, pValue, offset);
        else
            keepLooking = false;
    }
}
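
One possible way to avoid the loop described above is to resume the search after the text that was just inserted, instead of searching from the start every time. The following is only a minimal sketch on a plain String, not the POI code and not necessarily how POI should fix it: the class name ReplaceAllSketch and the static replaceAll helper are hypothetical, and a StringBuilder stands in for the HWPF Range.

```java
class ReplaceAllSketch {

    /**
     * Replaces every occurrence of placeHolder in text with value.
     * Hypothetical helper: it works on a plain String, not on an HWPF Range.
     */
    static String replaceAll(String text, String placeHolder, String value) {
        if (placeHolder.isEmpty()) {
            return text; // an empty placeholder would match at every position
        }
        StringBuilder buffer = new StringBuilder(text);
        int offset = buffer.indexOf(placeHolder);
        while (offset >= 0) {
            buffer.replace(offset, offset + placeHolder.length(), value);
            // Resume the search after the inserted value, so a value that
            // contains the placeholder is not matched again.
            offset = buffer.indexOf(placeHolder, offset + value.length());
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // The cases from this comment: value equal to or containing the placeholder.
        System.out.println(replaceAll("abcd and abcd", "abcd", "12abcdef"));
        System.out.println(replaceAll("Header", "Header", "Header"));
    }
}
```

With this search strategy, the three example values from this comment ("abcd", "abcdef", "12abcdef") each terminate after one pass over the text, because the replaced region is never scanned twice.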
Comment 15 applanc 2012-02-14 17:13:38 UTC
Hi, I have tested poi-bin-3.8-beta5-20111217.tar.gz with an MS Office 2003 input file using charRun.replaceText("XYZ", "ABC"), and the output file is corrupted.
Without the replacement the output is OK, so I guess the problem comes from the replacement logic.

It also worked with poi-bin-3.8-beta4 when the length of pValue was equal to the length of the placeholder; otherwise the output file was also corrupted.
Any ideas?
Thanks