|Summary:||Broken paragraph to text mapping in some documents|
|Product:||POI||Reporter:||Maxim Valyanskiy <max.valjanski>|
|Component:||HWPF||Assignee:||POI Developers List <dev>|
Description Maxim Valyanskiy 2009-10-28 07:00:02 UTC
WordExtractor.getParagraphText() extracts incomplete and broken text from the attached document. However, WordExtractor.getTextFromPieces() extracts the complete, correct text (the same as in MS Office). It seems that there is a problem in the paragraph-to-text mapping. The problem occurs in a few documents from the same source; text extraction from many other documents works fine. POI version: poi-3.6-beta1-20091002 (svn trunk)
Comment 2 Maxim Valyanskiy 2010-08-04 07:52:38 UTC
The paragraph offsets (FC) in the PAPX in this file are 2048 bytes larger than the offsets of the real character data in the text pieces. Hm.
Comment 4 Sergey Vladimirov 2011-07-11 16:58:17 UTC
This file seems very wrong to me. OpenOffice and LibreOffice can't even show it correctly. In more detail, it has 2 TextPieces:

* TextPiece from 0 to 1199 (PieceDescriptor (pos: 2048; unicode))
* TextPiece from 1199 to 2377 (PieceDescriptor (pos: 4608; unicode))

but all CHPX refer to the second text piece:

* CHPX from 1024 to 1037 (in bytes 4096 to 4122)
* CHPX from 1037 to 1038 (in bytes 4122 to 4124)
* ...
* CHPX from 2142 to 2377 (in bytes 6494 to 11776)

as well as the PAPX:

* PAPX from 1185 to 1199 (in bytes 4418 to 4478)
* PAPX from 2142 to 2377 (in bytes 6494 to 12102)

so it is just a bad file, AFAIK. Apart from that, there is a table without a single row or cell, i.e. there is a PAPX with inTable=true, but no end-of-cell marks.
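
For anyone who wants to re-check the numbers above, here is a minimal sketch of the byte-offset (FC) to character-position (CP) arithmetic. It is not POI code: the class FcMapper and its methods are hypothetical names made up for illustration, and it assumes that a unicode piece stores 2 bytes per character and a non-unicode piece 1 byte per character.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of POI): maps file byte offsets (FC) to character positions (CP). */
final class FcMapper {

    /** One text piece: characters [cpStart, cpEnd) stored in the file starting at byte fcStart. */
    static final class Piece {
        final int cpStart;
        final int cpEnd;
        final int fcStart;
        final boolean unicode;

        Piece(int cpStart, int cpEnd, int fcStart, boolean unicode) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fcStart = fcStart;
            this.unicode = unicode;
        }

        /** Assumption: 2 bytes per character for unicode pieces, 1 otherwise. */
        int bytesPerChar() {
            return unicode ? 2 : 1;
        }

        /** Exclusive end of the piece in the file, in bytes. */
        int fcEnd() {
            return fcStart + (cpEnd - cpStart) * bytesPerChar();
        }
    }

    private final List<Piece> pieces = new ArrayList<>();

    FcMapper add(int cpStart, int cpEnd, int fcStart, boolean unicode) {
        pieces.add(new Piece(cpStart, cpEnd, fcStart, unicode));
        return this;
    }

    /** Returns the CP for a byte offset, or -1 if the offset lies outside every piece. */
    int fcToCp(int fc) {
        for (Piece p : pieces) {
            if (fc >= p.fcStart && fc < p.fcEnd()) {
                return p.cpStart + (fc - p.fcStart) / p.bytesPerChar();
            }
        }
        return -1;
    }
}
```

With the two pieces registered as `new FcMapper().add(0, 1199, 2048, true).add(1199, 2377, 4608, true)`, `fcToCp(4096)` gives 1024, `fcToCp(4418)` gives 1185 and `fcToCp(6494)` gives 2142, which matches the start positions listed above. Under the same assumption the pieces cover bytes 2048 to 4446 and 4608 to 6964, so end offsets such as 4478, 11776 and 12102 fall outside both pieces and this sketch returns -1 for them.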
Comment 5 Maxim Valyanskiy 2011-07-11 19:43:03 UTC
Sergey, could it be an "autosaved" file? I have seen some strange format violations in such files.
Comment 6 Sergey Vladimirov 2011-07-12 10:40:03 UTC
Maxim,

No, it doesn't look quick-saved:

[FIB]
...
.fComplex = false
...
[/FIB]

Although it was quick-saved 15 times, it is currently marked as a fully saved file. Also, there are no additional grpprl(s) in the CPL section, i.e. there is no SPRM quick-save data.
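
In case it helps to check these two values without a full dump, below is a rough sketch of how the flags could be decoded by hand. FibFlags is a hypothetical class, not a POI API. The layout is an assumption taken from my reading of the FIB base structure: a little-endian 16-bit flags word at byte offset 10, with fComplex in bit 2 (mask 0x0004) and the quick-save counter in bits 4 to 7 (mask 0x00F0).

```java
/** Hypothetical decoder (not part of POI) for two fields of the FIB flags word. */
final class FibFlags {
    // Assumed layout: flags word at byte offset 10 of the FIB, little-endian.
    static final int FLAGS_OFFSET = 10;
    static final int F_COMPLEX_MASK = 0x0004;     // bit 2: document was saved incrementally
    static final int C_QUICK_SAVES_MASK = 0x00F0; // bits 4..7: quick-save counter
    static final int C_QUICK_SAVES_SHIFT = 4;

    /** Reads the 16-bit flags word from the first bytes of the FIB. */
    static int readFlags(byte[] fib) {
        return (fib[FLAGS_OFFSET] & 0xFF) | ((fib[FLAGS_OFFSET + 1] & 0xFF) << 8);
    }

    static boolean isComplex(int flags) {
        return (flags & F_COMPLEX_MASK) != 0;
    }

    static int quickSaves(int flags) {
        return (flags & C_QUICK_SAVES_MASK) >>> C_QUICK_SAVES_SHIFT;
    }
}
```

A flags word of 0x00F0 would decode to quickSaves = 15 with isComplex = false, which is the combination described above: a counter of 15 while the file itself is stored as fully saved.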