WordExtractor.getParagraphText() extracts incomplete and broken text from the attached document. However, WordExtractor.getTextFromPieces() extracts the complete, correct text (the same as shown in MS Office). It seems that there is a problem in the paragraph-to-text mapping. The problem occurs in a few documents from the same source; text extraction from many other documents works fine. POI version: poi-3.6-beta1-20091002 (svn trunk).
Created attachment 24433 document
The paragraph offsets (FC) in the PAPX in this file are 2048 bytes larger than the actual offsets of the character data in the text pieces. Hm.
Fixed by workaround in r982238
This file seems very wrong to me. OpenOffice and LibreOffice can't even display it correctly. In more detail, it has 2 TextPieces:
TextPiece from 0 to 1199 (PieceDescriptor (pos: 2048; unicode))
TextPiece from 1199 to 2377 (PieceDescriptor (pos: 4608; unicode))
but all CHPX refer to the second text piece:
* CHPX from 1024 to 1037 (in bytes 4096 to 4122)
* CHPX from 1037 to 1038 (in bytes 4122 to 4124)
* ...
* CHPX from 2142 to 2377 (in bytes 6494 to 11776)
as well as PAPX:
* PAPX from 1185 to 1199 (in bytes 4418 to 4478)
* PAPX from 2142 to 2377 (in bytes 6494 to 12102)
So it is just a bad file, AFAIK. Apart from that, there is a table without a single row or cell, i.e. there is a PAPX with inTable=true, but no end-of-cell marks.
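To make the numbers above easier to check, here is a minimal sketch in plain Java of how a byte offset (FC) can be mapped to a character position through the text pieces. It is an illustration only, not POI's actual code: the FcToCp class, the Piece class and the fcToCp method are hypothetical names, and it assumes 2 bytes per character for a unicode piece.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: maps a file byte offset (FC) to a character position via text pieces. */
final class FcToCp {

    /** One text piece: characters [cpStart, cpEnd) stored starting at byte fcStart. */
    static final class Piece {
        final int cpStart, cpEnd, fcStart, bytesPerChar;

        Piece(int cpStart, int cpEnd, int fcStart, boolean unicode) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fcStart = fcStart;
            // Assumption: unicode pieces use 2 bytes per character, others 1.
            this.bytesPerChar = unicode ? 2 : 1;
        }

        /** First byte offset past the end of this piece. */
        int fcEnd() {
            return fcStart + (cpEnd - cpStart) * bytesPerChar;
        }
    }

    /** The two pieces quoted in the comment above. */
    static List<Piece> samplePieces() {
        return Arrays.asList(
            new Piece(0, 1199, 2048, true),
            new Piece(1199, 2377, 4608, true));
    }

    /** Returns the character position for byte offset fc, or -1 if fc is outside every piece. */
    static int fcToCp(List<Piece> pieces, int fc) {
        for (Piece p : pieces) {
            if (fc >= p.fcStart && fc < p.fcEnd()) {
                return p.cpStart + (fc - p.fcStart) / p.bytesPerChar;
            }
        }
        return -1;
    }
}
```

With the sample pieces, fcToCp(pieces, 4096) gives 1024 and fcToCp(pieces, 6494) gives 2142, which matches the first CHPX and the last PAPX listed above. Byte 4478 falls in the gap between the two pieces and byte 11776 lies past the end of the second piece, so both return -1. Subtracting the 2048-byte shift from 4096 gives byte 2048, which maps to character 0.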
Sergey, could it be an "autosaved" file? I have seen some strange format violations in such files.
Maxim, no, it doesn't look quick-saved: [FIB] ... .fComplex = false ... [/FIB] Although it was quick-saved 15 times, it is currently marked as a fully saved file. Also, there are no additional grpprl(s) in the CPL section, i.e. there is no SPRM(s) quick-save data.