Bug 46610

Summary: [PATCH] Problems accessing documents containing unicode
Product: POI
Reporter: Benjamin Engele <benjamin.engele>
Component: HWPF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED
Severity: normal
CC: max.valjanski
Priority: P2
Version: 3.5-dev   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Attachments: A word file that triggers the exception.
Patch for Exception triggered by utf.doc
Triggers a different cause of the Exception
Patch for Exception triggered by utf2.doc
Patch that fixes all problems with paragraph positions I had
Unicode patch
Unicode patch v.2
MSWord file that shows broken paragraph problem
unit test case

Description Benjamin Engele 2009-01-27 02:07:45 UTC
The problem is caused by Unicode text in the Word document.
Documents that reproduce the problem are attached.

Code to reproduce:
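import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// args[0] is the path to one of the attached .doc files
HWPFDocument doc = new HWPFDocument(new FileInputStream(args[0]));
Range globalRange = doc.getRange();
for (int i = 0; i < globalRange.numParagraphs(); i++) {
	Paragraph p = globalRange.getParagraph(i);
	System.out.println(p.text());
	for (int j = 0; j < p.numCharacterRuns(); j++) {
		CharacterRun characterRun = p.getCharacterRun(j);
		// Result is discarded; the call alone exercises the character run offsets
		characterRun.text();
	}
}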
Comment 1 Benjamin Engele 2009-01-27 02:10:53 UTC
Created attachment 23178
A word file that triggers the exception.
Comment 2 Benjamin Engele 2009-01-27 02:32:18 UTC
Created attachment 23179
Patch for Exception triggered by utf.doc
Comment 3 Benjamin Engele 2009-01-27 02:34:45 UTC
Created attachment 23180
Triggers a different cause of the Exception
Comment 4 Benjamin Engele 2009-01-27 05:57:11 UTC
Created attachment 23181
Patch for Exception triggered by utf2.doc

The logic in BytePropertyNode that calculates the char index from the byte index has been rewritten.
The old approach, which checked whether the start index is in a unicode text piece and divided the indexes by 2 in that case, is wrong.
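
To illustrate why dividing the whole index by 2 cannot work, here is a minimal sketch of a per-piece translation. It is not the code from the patch. The Piece class, its fields and byteToChar() are hypothetical names invented for this example, and it assumes each text piece stores 1 byte per char when it is not unicode and 2 bytes per char when it is. The byte index is first made relative to the start of the piece that contains it, and only that remainder is divided.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: Piece is a hypothetical stand-in, not POI's TextPiece.
class ByteToCharSketch {

    static final class Piece {
        final int startChar;   // first character position covered by the piece
        final int charCount;   // number of characters in the piece
        final int startByte;   // file offset of the first byte of the piece
        final boolean unicode; // assumed: true = 2 bytes per char, false = 1

        Piece(int startChar, int charCount, int startByte, boolean unicode) {
            this.startChar = startChar;
            this.charCount = charCount;
            this.startByte = startByte;
            this.unicode = unicode;
        }

        int bytesPerChar() {
            return unicode ? 2 : 1;
        }

        int endByte() {
            return startByte + charCount * bytesPerChar();
        }
    }

    // Translate a file byte offset into a character position by locating
    // the piece that contains it and scaling only the offset inside it.
    static int byteToChar(List<Piece> pieces, int byteIndex) {
        for (Piece p : pieces) {
            if (byteIndex >= p.startByte && byteIndex <= p.endByte()) {
                return p.startChar + (byteIndex - p.startByte) / p.bytesPerChar();
            }
        }
        throw new IllegalArgumentException("No text piece covers byte offset " + byteIndex);
    }

    public static void main(String[] args) {
        List<Piece> pieces = Arrays.asList(
                new Piece(0, 10, 100, false),  // 10 one-byte chars at bytes 100..110
                new Piece(10, 5, 200, true));  // 5 two-byte chars at bytes 200..210
        System.out.println(byteToChar(pieces, 105)); // 5
        System.out.println(byteToChar(pieces, 204)); // 12
    }
}
```

In this made-up layout, byte offset 204 lies in the unicode piece, yet halving it gives 102, far beyond the 15 characters the two pieces hold. Subtracting the piece start first gives 10 + (204 - 200) / 2 = 12.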
Comment 5 Benjamin Engele 2009-01-27 05:59:19 UTC
The root cause of this defect also produces other symptoms, such as paragraphs and character runs at wrong positions.
Comment 6 Benjamin Engele 2009-01-27 13:13:54 UTC
Patch for Exception triggered by utf2.doc doesn't resolve all problems with utf2.doc: the last paragraph is misplaced. This happens because of another error in translating byte positions from FormattedDiskPage to char positions in the TextPiece.

Some more notes:
Writing was neither tested nor changed. It is probably more broken now than it was before. BytePropertyNode.getStartBytes() and getEndBytes() definitely need to be fixed; they still use the wrong approach to calculate the byte index from the char index.

IMHO BytePropertyNode.isUnicode() should be removed as soon as get[Start/End]Bytes() has been fixed. I don't think the information that the start of the node is in a unicode text piece is useful.
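
For the opposite direction (byte index from char index), a rough sketch under the same assumptions might look like the following. It is not the fix that was later applied. Piece and charToByte() are hypothetical names, and the isEnd flag is a design choice made for this example only. It is there because a char position on the border between two pieces belongs to the earlier piece when it ends a range and to the later piece when it starts one.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: Piece is a hypothetical stand-in, not POI's TextPiece.
class CharToByteSketch {

    static final class Piece {
        final int startChar;   // first character position covered by the piece
        final int charCount;   // number of characters in the piece
        final int startByte;   // file offset of the first byte of the piece
        final boolean unicode; // assumed: true = 2 bytes per char, false = 1

        Piece(int startChar, int charCount, int startByte, boolean unicode) {
            this.startChar = startChar;
            this.charCount = charCount;
            this.startByte = startByte;
            this.unicode = unicode;
        }

        int bytesPerChar() {
            return unicode ? 2 : 1;
        }

        int endChar() {
            return startChar + charCount;
        }
    }

    // Translate a character position into a file byte offset.
    // isEnd = false: the position starts a range, match [startChar, endChar).
    // isEnd = true:  the position ends a range, match (startChar, endChar].
    // An empty range ending at the very first char is not handled here.
    static int charToByte(List<Piece> pieces, int charIndex, boolean isEnd) {
        for (Piece p : pieces) {
            boolean inside = isEnd
                    ? charIndex > p.startChar && charIndex <= p.endChar()
                    : charIndex >= p.startChar && charIndex < p.endChar();
            if (inside) {
                return p.startByte + (charIndex - p.startChar) * p.bytesPerChar();
            }
        }
        throw new IllegalArgumentException("No text piece covers char position " + charIndex);
    }

    public static void main(String[] args) {
        List<Piece> pieces = Arrays.asList(
                new Piece(0, 10, 100, false),  // 10 one-byte chars at bytes 100..110
                new Piece(10, 5, 200, true));  // 5 two-byte chars at bytes 200..210
        System.out.println(charToByte(pieces, 10, false)); // 200
        System.out.println(charToByte(pieces, 10, true));  // 110
        System.out.println(charToByte(pieces, 12, false)); // 204
    }
}
```

A single "is the start in a unicode piece" flag cannot express this, since a node may start in one piece and end in another with a different byte width.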
Comment 7 Benjamin Engele 2009-01-27 13:26:44 UTC
Created attachment 23184
Patch that fixes all problems with paragraph positions I had
Comment 8 Maxim Valyanskiy 2009-06-16 05:32:40 UTC
This patch greatly improves text extraction for Cyrillic documents on 3.5beta5. Unfortunately, it breaks a few test cases (TestRangeDelete, TestRangeInsertion, TestRangeProperties and TestSectionTable).

Also, the patch fails to apply on 3.5beta6 and on current trunk.
Comment 9 Maxim Valyanskiy 2009-06-18 07:30:35 UTC
I modified Benjamin Engele's patch:

1) Patch ported to current svn trunk (trivial)

2) Corrected getStartBytes()/getEndBytes() methods in BytePropertyNode. This fixes TestRangeDelete, TestRangeInsertion and TestSectionTable tests.

One test is still broken - TestRangeProperties
Comment 10 Maxim Valyanskiy 2009-06-18 07:32:41 UTC
Created attachment 23829
Unicode patch
Comment 11 Benjamin Engele 2009-06-18 08:09:50 UTC
Actually I didn't look at the test cases so I am no big help finding out why they fail... Happy to see that you managed to solve most test failures.
Comment 12 Maxim Valyanskiy 2009-06-19 04:57:16 UTC
New version:

Fixed a bug in the CPtoFC method and removed the FCtoCP method of SectionTable. All unit tests now pass.
Comment 13 Maxim Valyanskiy 2009-06-19 04:58:24 UTC
Created attachment 23833
Unicode patch v.2
Comment 14 Maxim Valyanskiy 2009-06-19 04:59:37 UTC
Created attachment 23834
MSWord file that shows broken paragraph problem
Comment 15 Yegor Kozlov 2009-06-19 05:44:59 UTC
Thanks for researching it. Is the patch ready to be committed?

Yegor
Comment 16 Maxim Valyanskiy 2009-06-19 05:53:30 UTC
Created attachment 23835
unit test case

src/scratchpad/testcases/org/apache/poi/hwpf/TestBug46610.java
Comment 17 Maxim Valyanskiy 2009-06-19 05:56:48 UTC
Yes, it is ready. This patch does not break existing unit tests and fixes a few problems in text extraction. I do not have a real-world application to test writing.

Please add the attached unit test and put the test files into src/scratchpad/testcases/org/apache/poi/hwpf/data/

utf.doc as Bug46610_1.doc
utf2.doc as Bug46610_2.doc
perl_o_fytbole_.doc as Bug46610_3.doc
Comment 18 Yegor Kozlov 2009-06-19 06:51:06 UTC
Benjamin and Maxim,

Thanks for researching this issue and providing the fix. The patch was applied in r786505

Yegor