HWPF shows very strange behavior on some .doc files in a Russian encoding that consist almost entirely of tables. When reading them, HWPF returns the first half of the document correctly (each logical element in its own paragraph and character run), but the second half comes back as a single paragraph containing a single character run. One effect is that TableIterator returns only about half of the document's tables.

Debugging shows strange jumps in the start/end values read from the Plex of CPs. Here is the debug printout (produced by code lines injected manually into a recompiled PlexOfCps class):

1. Creating TextPieceTable (while analyzing ComplexFileTable):

-----------------------------
start = 16474 size=448 sizeOfStruct=8
-----------------------------
Start -> 0 to end <-256
Start -> 256 to end <-1280
Start -> 1280 to end <-2048
Start -> 2048 to end <-3072
Start -> 3072 to end <-3840
Start -> 3840 to end <-4864
...
Start -> 25856 to end <-26368
Start -> 26368 to end <-27136
Start -> 27136 to end <-27648
Start -> 27648 to end <-28928
Start -> 28928 to end <-29184
Start -> 29184 to end <-58063 <--- !!! HERE !!!

2. Creating PAPBinTable:

-----------------------------
start = 7117 size=5020 sizeOfStruct=4
-----------------------------
Start -> 2048 to end <-2338
Start -> 2338 to end <-2546
Start -> 2546 to end <-2556
...
Start -> 59556 to end <-59694
Start -> 59694 to end <-59708
Start -> 59708 to end <-60402
Start -> 60402 to end <-264814 <--- !!! HERE !!!
Start -> 264814 to end <-264828
Start -> 264828 to end <-265600
Start -> 265600 to end <-265604
...
Start -> 320214 to end <-321000
Start -> 321000 to end <-321936
Start -> 321936 to end <-321950

Unfortunately, I can't attach the document files because they contain private information.
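For anyone who wants to reproduce this kind of printout without recompiling POI, here is a minimal sketch in plain Java. It assumes the usual PLC layout of (n + 1) little-endian 4-byte positions followed by n fixed-size structs, which gives n = (size - 4) / (4 + sizeOfStruct). The class PlcDump and its methods are hypothetical helpers written for illustration, not POI API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not part of POI): lists the position ranges stored in a PLC. */
final class PlcDump {

    /** Entry count for a PLC of `size` bytes: (n + 1) positions of 4 bytes, then n structs. */
    static int entryCount(int size, int sizeOfStruct) {
        return (size - 4) / (4 + sizeOfStruct);
    }

    /** Reads {start, end} pairs from the little-endian position array beginning at `start`. */
    static int[][] ranges(byte[] table, int start, int size, int sizeOfStruct) {
        int n = entryCount(size, sizeOfStruct);
        ByteBuffer buf = ByteBuffer.wrap(table).order(ByteOrder.LITTLE_ENDIAN);
        int[][] out = new int[n][2];
        for (int i = 0; i < n; i++) {
            out[i][0] = buf.getInt(start + 4 * i);       // position i
            out[i][1] = buf.getInt(start + 4 * (i + 1)); // position i + 1
        }
        return out;
    }

    /** Index of the widest range, a cheap way to spot a jump like 29184 -> 58063. */
    static int widest(int[][] ranges) {
        int best = -1;
        long bestLen = -1;
        for (int i = 0; i < ranges.length; i++) {
            long len = (long) ranges[i][1] - ranges[i][0];
            if (len > bestLen) {
                bestLen = len;
                best = i;
            }
        }
        return best;
    }
}
```

With the values from the printout, entryCount(448, 8) gives 37 entries for the text piece table and entryCount(5020, 4) gives 627 for the PAP table. A single very wide range is not necessarily an error in itself, so widest() only points at where to look.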
Created attachment 24031
number of paragraph is 16

The attached document has 16 paragraphs, but HWPF reports only 14:

    HWPFDocument daDoc = new HWPFDocument(new FileInputStream("test.doc"));
    Range wordRange = daDoc.getRange();
    wordRange.numParagraphs() // is 14

Debug output:

A property claimed to start before zero, at -256! Resetting it to zero, and hoping for the best
papformatteddiskpag_70=-256 -> 54 = true
A property claimed to start before zero, at -256! Resetting it to zero, and hoping for the best
papformatteddiskpag_70=54 -> 66 = true
papformatteddiskpag_70=66 -> 67 = true
papformatteddiskpag_70=67 -> 77 = true
papformatteddiskpag_70=77 -> 81 = true
papformatteddiskpag_70=81 -> 95 = true
papformatteddiskpag_70=95 -> 113 = true
papformatteddiskpag_70=113 -> 145 = true
papformatteddiskpag_70=145 -> 173 = true
papformatteddiskpag_70=173 -> 198 = true
papformatteddiskpag_70=198 -> 217 = true
papformatteddiskpag_70=217 -> 230 = true
papformatteddiskpag_70=230 -> 243 = true
papformatteddiskpag_70=243 -> 1052 = true --here
papformatteddiskpag_70=1052 -> 1078 = true
papformatteddiskpag_70=1078 -> 1117 = true

In the findRange method of Range, the List rpl parameter has size 16, with rpl[13]._cpStart=243 and rpl[13]._cpEnd=1052, while range._end=336. Because rpl[13]._cpEnd=1052 > range._end=336, the method returns 14.
The reason is that the TextPiece addresses in the TextPieceTable are logical and sequential, whereas the addresses in PAPBinTable and CHPBinTable are physical addresses, which are not always sequential. As a result, the PAPBinTable and CHPBinTable addresses do not correspond to the TextPiece positions.
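To make the logical/physical mismatch concrete, here is a minimal sketch of translating a physical file offset (FC) into a logical character position (CP) through a list of text pieces. It is plain Java and not the actual POI code. The class FcToCp, the Piece fields, and the sample offsets in the note below are all assumptions chosen for illustration.

```java
import java.util.List;

/** Hypothetical helper (not POI API): maps physical file offsets to logical character positions. */
final class FcToCp {

    /** One text piece: a run of characters that is contiguous both logically and in the file. */
    static final class Piece {
        final int cpStart;      // first logical character position (inclusive)
        final int cpEnd;        // last logical character position (exclusive)
        final int fcStart;      // file offset of the first character
        final int bytesPerChar; // 1 for 8-bit text, 2 for 16-bit Unicode text

        Piece(int cpStart, int cpEnd, int fcStart, int bytesPerChar) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fcStart = fcStart;
            this.bytesPerChar = bytesPerChar;
        }

        boolean containsFc(int fc) {
            int fcEnd = fcStart + (cpEnd - cpStart) * bytesPerChar;
            return fc >= fcStart && fc < fcEnd;
        }
    }

    /** Returns the CP for a file offset, or -1 when no piece covers that offset. */
    static int toCp(List<Piece> pieces, int fc) {
        for (Piece p : pieces) {
            if (p.containsFc(fc)) {
                return p.cpStart + (fc - p.fcStart) / p.bytesPerChar;
            }
        }
        return -1;
    }
}
```

For example, with the made-up pieces (CP 0..256 at offset 2048, 1 byte per char), (CP 256..1280 at offset 4096, 2 bytes per char) and (CP 1280..1536 at offset 1024, 1 byte per char), the offset 1024 lies before 2048 in the file but maps to CP 1280, after everything else logically. Treating paragraph or character run offsets as if they were CPs therefore yields ranges in the wrong place once the pieces are out of order or mix 8-bit and 16-bit text. An exclusive end offset that lands exactly on a piece boundary is not covered by containsFc and would need separate handling.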
Does anybody know how to fix the number of paragraphs?
I believe that the HWPF Unicode-related fixes made in the last 18 months should have fixed these problems. Please re-open the bug if you're still hitting the issues with a recent nightly build or 3.8 beta 1 (which is due out soon).