Bug 48075 - Broken paragraph to text mapping in some documents
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.6-dev
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2009-10-28 07:00 UTC by Maxim Valyanskiy
Modified: 2011-07-12 10:40 UTC

document (35.00 KB, application/msword)
2009-10-28 07:01 UTC, Maxim Valyanskiy

Description Maxim Valyanskiy 2009-10-28 07:00:02 UTC
WordExtractor.getParagraphText() extracts incomplete and broken text from the attached document. However, WordExtractor.getTextFromPieces() extracts the complete, correct text (the same as in MS Office).

It seems that there is a problem in the paragraph-to-text mapping.

The problem occurs in a few documents from the same source; text extraction from many other documents works fine.

POI version poi-3.6-beta1-20091002 (svn trunk)
Comment 1 Maxim Valyanskiy 2009-10-28 07:01:05 UTC
Created attachment 24433
Comment 2 Maxim Valyanskiy 2010-08-04 07:52:38 UTC
The paragraph offsets (FC) in the PAPX of this file are 2048 bytes larger than the offsets of the real character data in the text pieces. Hm.
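To make the size of that shift concrete, here is a minimal sketch of the usual FC-to-CP arithmetic for a single text piece. It is an illustration only: the class and method names are hypothetical and not POI API, and the piece start of 2048 bytes with 2-byte (unicode) characters is an assumption taken from the first text piece reported for this file.

```java
// Illustration only: FcShiftDemo and fcToCp are hypothetical names, not POI API.
final class FcShiftDemo {

    /**
     * Converts a file byte offset (FC) inside one text piece into a
     * character position (CP). Assumes fc lies inside the piece.
     */
    static int fcToCp(int fc, int pieceFcStart, int pieceCpStart, boolean unicode) {
        int bytesPerChar = unicode ? 2 : 1;
        return pieceCpStart + (fc - pieceFcStart) / bytesPerChar;
    }

    public static void main(String[] args) {
        int pieceFcStart = 2048;         // assumed start of the piece in the file
        int realFc = 2048;               // where the character data really begins
        int shiftedFc = realFc + 2048;   // an offset that is 2048 bytes too large

        System.out.println(fcToCp(realFc, pieceFcStart, 0, true));
        System.out.println(fcToCp(shiftedFc, pieceFcStart, 0, true));
    }
}
```

Under these assumptions a 2048-byte shift in a unicode piece corresponds to 1024 characters, so a paragraph that should start at character 0 would be mapped to character 1024.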
Comment 3 Maxim Valyanskiy 2010-08-04 08:45:04 UTC
Fixed by workaround in r982238
Comment 4 Sergey Vladimirov 2011-07-11 16:58:17 UTC
This file seems so very wrong to me. OpenOffice or LibreOffice can't even show it correctly.

In more detail, it has 2 TextPieces:

TextPiece from 0 to 1199 (PieceDescriptor (pos: 2048; unicode))
TextPiece from 1199 to 2377 (PieceDescriptor (pos: 4608; unicode))

but all CHPX refer to the second text piece:

* CHPX from 1024 to 1037 (in bytes 4096 to 4122)
* CHPX from 1037 to 1038 (in bytes 4122 to 4124)
* ...
* CHPX from 2142 to 2377 (in bytes 6494 to 11776)

as well as PAPX:
* PAPX from 1185 to 1199 (in bytes 4418 to 4478)
* PAPX from 2142 to 2377 (in bytes 6494 to 12102)

so it is just a bad file, AFAIK.

Apart from that, there is a table without a single row or cell. I.e. there is a PAPX with inTable=true, but no end-of-cell marks.
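Going back to the piece table and the CHPX/PAPX byte ranges listed above, the numbers can be cross-checked with a few lines of code. This is a rough sketch, not POI code: the class and method names are made up, and it assumes each unicode piece stores 2 bytes per character starting at the byte position given in its PieceDescriptor.

```java
// Illustration only: PieceRangeCheck is a hypothetical helper, not POI API.
final class PieceRangeCheck {

    // {cpStart, cpEnd, fcStart} for the two unicode pieces quoted above.
    static final int[][] PIECES = {
        {0, 1199, 2048},
        {1199, 2377, 4608},
    };

    static final int BYTES_PER_CHAR = 2; // both pieces are flagged unicode

    /** First byte offset past the end of the given piece. */
    static int fcEnd(int[] piece) {
        return piece[2] + (piece[1] - piece[0]) * BYTES_PER_CHAR;
    }

    /** Maps a byte offset to a character position, or -1 if no piece stores that byte. */
    static int fcToCp(int fc) {
        for (int[] piece : PIECES) {
            if (fc >= piece[2] && fc < fcEnd(piece)) {
                return piece[0] + (fc - piece[2]) / BYTES_PER_CHAR;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] offsets = {4096, 4122, 4418, 4478, 6494, 11776, 12102};
        for (int fc : offsets) {
            System.out.println("byte " + fc + " -> char " + fcToCp(fc));
        }
    }
}
```

With this reading the pieces occupy bytes 2048 to 4446 and 4608 to 6964. The start offsets 4096, 4418 and 6494 map to characters 1024, 1185 and 2142, which matches the lists, while the end offsets 4478, 11776 and 12102 do not fall inside either piece.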
Comment 5 Maxim Valyanskiy 2011-07-11 19:43:03 UTC
Sergey, could it be an "autosaved" file? I have seen some strange format violations in such files.
Comment 6 Sergey Vladimirov 2011-07-12 10:40:03 UTC

No, it doesn't look quick-saved:

         .fComplex                 = false

Although it was quick-saved 15 times, it is currently marked as a fully saved file. Also, there are no additional grpprl(s) in the CPL section, i.e. there is no SPRM quick-save data.
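For anyone who wants to check these two header fields on other files, here is a rough sketch that reads them from the raw FIB bytes. It assumes the FibBase layout as I understand it from the [MS-DOC] specification: a 16-bit little-endian flags word at byte offset 10, with fComplex at bit 2 and cQuickSaves in bits 4 to 7. The class and method names are hypothetical and not POI API.

```java
// Illustration only: FibFlags is a hypothetical helper, not POI API.
// Assumed layout: 16-bit little-endian flags word at byte offset 10 of the FIB.
final class FibFlags {

    static final int FLAGS_OFFSET = 10;

    /** Reads the little-endian flags word from the start of the FIB. */
    static int flagsWord(byte[] fib) {
        return (fib[FLAGS_OFFSET] & 0xFF) | ((fib[FLAGS_OFFSET + 1] & 0xFF) << 8);
    }

    /** fComplex: assumed to be bit 2 (mask 0x0004). */
    static boolean isComplex(int flags) {
        return (flags & 0x0004) != 0;
    }

    /** cQuickSaves: assumed to be the 4-bit field in bits 4..7 (mask 0x00F0). */
    static int quickSaves(int flags) {
        return (flags >> 4) & 0x0F;
    }

    public static void main(String[] args) {
        byte[] fib = new byte[12];
        fib[FLAGS_OFFSET] = (byte) 0xF0; // fComplex clear, cQuickSaves = 15
        int flags = flagsWord(fib);
        System.out.println("fComplex = " + isComplex(flags));
        System.out.println("cQuickSaves = " + quickSaves(flags));
    }
}
```

One caveat worth double-checking against the specification: if I recall it correctly, cQuickSaves is required to be 0xF for files written by Word 97 and later, so a count of 15 may not reflect real quick saves.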