Bug 47496

Summary: Strange MS Word file reading behavior
Product: POI Reporter: Andremoniy <andremoniy>
Component: HWPF Assignee: POI Developers List <dev>
Status: RESOLVED LATER    
Severity: critical CC: andremoniy
Priority: P1    
Version: 3.5-dev   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Attachments: number of paragraph is 16

Description Andremoniy 2009-07-08 02:53:36 UTC
HWPF behaves very strangely on some .doc files in Russian encoding that consist almost entirely of tables. When reading them, HWPF returns the first half of the document correctly (each logical element in its own paragraph and character run), but the second half of the document comes back as a single paragraph containing a single character run. One effect of this is that TableIterator returns only half of the document's tables.
Debugging shows strange jumps in the start/end values when the PlexOfCps structures are read. Here is a printout of the debug info (produced by code lines manually injected into a recompiled PlexOfCps class):
1. Creating TextPieceTable (while parsing ComplexFileTable):
-----------------------------
start = 16474 size=448 sizeOfStruct=8
-----------------------------
Start -> 0 to end <-256
Start -> 256 to end <-1280
Start -> 1280 to end <-2048
Start -> 2048 to end <-3072
Start -> 3072 to end <-3840
Start -> 3840 to end <-4864
...
Start -> 25856 to end <-26368
Start -> 26368 to end <-27136
Start -> 27136 to end <-27648
Start -> 27648 to end <-28928
Start -> 28928 to end <-29184
Start -> 29184 to end <-58063 <--- !!! HERE !!!

2. Creating PAPBinTable:
-----------------------------
start = 7117 size=5020 sizeOfStruct=4
-----------------------------
Start -> 2048 to end <-2338
Start -> 2338 to end <-2546
Start -> 2546 to end <-2556
...
Start -> 59556 to end <-59694
Start -> 59694 to end <-59708
Start -> 59708 to end <-60402
Start -> 60402 to end <-264814  <--- !!! HERE !!!
Start -> 264814 to end <-264828
Start -> 264828 to end <-265600
Start -> 265600 to end <-265604
...
Start -> 320214 to end <-321000
Start -> 321000 to end <-321936
Start -> 321936 to end <-321950
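
Jumps like the two marked above can be picked out mechanically once the start/end pairs are collected. A minimal sketch, assuming the pairs have been copied into an array; the class name PlexJumpFinder and the 4096 threshold are hypothetical and not part of POI:

```java
import java.util.ArrayList;
import java.util.List;

final class PlexJumpFinder {
    /** Returns the indexes of entries whose length (end - start) exceeds maxLen. */
    static List<Integer> findJumps(int[][] entries, int maxLen) {
        List<Integer> jumps = new ArrayList<>();
        for (int i = 0; i < entries.length; i++) {
            int start = entries[i][0];
            int end = entries[i][1];
            if (end - start > maxLen) {
                jumps.add(i);
            }
        }
        return jumps;
    }

    public static void main(String[] args) {
        // PAPBinTable entries around the jump, copied from the printout above.
        int[][] pap = {
            {59694, 59708}, {59708, 60402}, {60402, 264814}, {264814, 264828}
        };
        // The threshold is an arbitrary assumption; tune it to the document.
        System.out.println(findJumps(pap, 4096)); // prints [2]
    }
}
```

The threshold only needs to sit between the normal entry lengths (a few hundred to about a thousand in the printout) and the size of the jump.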


Unfortunately, I can't attach the document files because they contain private information.
Comment 1 weizi 2009-07-24 02:02:54 UTC
Created attachment 24031
number of paragraph is 16

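The number of paragraphs in this doc is 16, but HWPF reports 14:
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;

HWPFDocument daDoc = new HWPFDocument(new FileInputStream("test.doc"));
Range wordRange = daDoc.getRange();
int count = wordRange.numParagraphs(); // returns 14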
A property claimed to start before zero, at -256! Resetting it to zero, and hoping for the best
papformatteddiskpag_70=-256 -> 54 = true
A property claimed to start before zero, at -256! Resetting it to zero, and hoping for the best
papformatteddiskpag_70=54 -> 66 = true
papformatteddiskpag_70=66 -> 67 = true
papformatteddiskpag_70=67 -> 77 = true
papformatteddiskpag_70=77 -> 81 = true
papformatteddiskpag_70=81 -> 95 = true
papformatteddiskpag_70=95 -> 113 = true
papformatteddiskpag_70=113 -> 145 = true
papformatteddiskpag_70=145 -> 173 = true
papformatteddiskpag_70=173 -> 198 = true
papformatteddiskpag_70=198 -> 217 = true
papformatteddiskpag_70=217 -> 230 = true
papformatteddiskpag_70=230 -> 243 = true
papformatteddiskpag_70=243 -> 1052 = true   --here
papformatteddiskpag_70=1052 -> 1078 = true
papformatteddiskpag_70=1078 -> 1117 = true
In Range.findRange, the List parameter rpl has size 16. rpl[13]._cpStart = 243 and rpl[13]._cpEnd = 1052, while range._end = 336. Because rpl[13]._cpEnd (1052) is greater than range._end (336), the result is 14.
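
The effect can be reproduced with a simplified model of the counting. This is only a sketch of the reasoning above, not POI's actual findRange code; ParagraphCountModel and countUpTo are hypothetical names. It counts the properties that start before the end of the range:

```java
final class ParagraphCountModel {
    /** Counts properties whose start lies before rangeEnd (simplified model, not POI code). */
    static int countUpTo(int[][] props, int rangeEnd) {
        int count = 0;
        for (int[] p : props) {
            if (p[0] >= rangeEnd) {
                break; // this property starts beyond the range, so stop counting
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // {cpStart, cpEnd} pairs copied from the log above; the first start is
        // reset from -256 to 0, as the warning message says.
        int[][] rpl = {
            {0, 54}, {54, 66}, {66, 67}, {67, 77}, {77, 81}, {81, 95},
            {95, 113}, {113, 145}, {145, 173}, {173, 198}, {198, 217},
            {217, 230}, {230, 243}, {243, 1052}, {1052, 1078}, {1078, 1117}
        };
        System.out.println(rpl.length);          // prints 16
        System.out.println(countUpTo(rpl, 336)); // prints 14
    }
}
```

In this model the entry 243 -> 1052 swallows the rest of the range, so the last two properties are never reached.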
Comment 2 weizi 2009-08-27 01:18:26 UTC
The reason is that the addresses of the TextPieces in the TextPieceTable are in logical sequence, but the addresses in the PAPBinTable and CHPBinTable are physical addresses, and these are not always sequential. The addresses in the PAPBinTable and CHPBinTable therefore do not correspond to the positions of the TextPieces.
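
One way to line the two up is to translate each physical address into a logical position through the text pieces. A minimal sketch, assuming each piece records its logical start/end, its physical start and its bytes per character; the Piece class and fcToCp are hypothetical, not POI's TextPiece API, and the numbers are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

final class FcToCpSketch {
    /** Hypothetical stand-in for a text piece: logical range plus physical start. */
    static final class Piece {
        final int cpStart, cpEnd, fcStart, bytesPerChar;

        Piece(int cpStart, int cpEnd, int fcStart, int bytesPerChar) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fcStart = fcStart;
            this.bytesPerChar = bytesPerChar;
        }

        int fcEnd() {
            return fcStart + (cpEnd - cpStart) * bytesPerChar;
        }
    }

    /** Maps a physical byte offset to a logical character position, or -1 if no piece holds it. */
    static int fcToCp(List<Piece> pieces, int fc) {
        for (Piece p : pieces) {
            if (fc >= p.fcStart && fc < p.fcEnd()) {
                return p.cpStart + (fc - p.fcStart) / p.bytesPerChar;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative values only: the second piece is stored before the first in the file.
        List<Piece> pieces = Arrays.asList(
            new Piece(0, 100, 5000, 1),   // characters 0..99 at bytes 5000..5099
            new Piece(100, 150, 2048, 2)  // characters 100..149 at bytes 2048..2147
        );
        System.out.println(fcToCp(pieces, 5010)); // prints 10
        System.out.println(fcToCp(pieces, 2058)); // prints 105
    }
}
```

Here a lower physical address (2058) maps to a later logical position (105) than a higher one (5010 -> 10), which is the kind of mismatch described above.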
Comment 3 inthendsun 2009-09-20 17:56:04 UTC
Does anybody know how to fix the number of paragraphs?
Comment 4 Nick Burch 2011-02-25 16:53:49 UTC
I believe that the HWPF Unicode-related fixes in the last 18 months should have fixed these problems. Please re-open the bug if you're still hitting the issues with a recent nightly / 3.8 beta 1 (which is due out soon).