Bug 48075

Summary:	Broken paragraph to text mapping in some documents
Product:	POI	Reporter:	Maxim Valyanskiy <max.valjanski>
Component:	HWPF	Assignee:	POI Developers List <dev>
Status:	RESOLVED FIXED
Severity:	normal
Priority:	P2
Version:	3.6-dev
Target Milestone:	---
Hardware:	PC
OS:	Linux
Attachments:	document

Description Maxim Valyanskiy 2009-10-28 07:00:02 UTC

WordExtractor.getParagraphText() extracts incomplete and broken text from the attached document. However, WordExtractor.getTextFromPieces() extracts the complete, correct text (the same as in MS Office).

It seems that there is a problem in paragraph to text mapping.
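
A rough way to flag affected documents is to compare how many characters each extraction path returns. The sketch below is only an illustration: the class and method names are hypothetical, and the inputs are stand-in strings. In practice they would be the results of WordExtractor.getParagraphText() and WordExtractor.getTextFromPieces(). Small differences between the two may be normal, so only a large shortfall is a hint of a broken mapping (this threshold is an assumption, not something POI defines).

```java
public class ExtractionCheckSketch {

    /**
     * Returns how many characters of the piece text are not accounted
     * for by the paragraph texts. A large positive value suggests that
     * the paragraph-based extraction lost text.
     */
    static int missingChars(String[] paragraphs, String pieceText) {
        int total = 0;
        for (String p : paragraphs) {
            total += p.length();
        }
        return pieceText.length() - total;
    }

    public static void main(String[] args) {
        // Stand-in data; a real check would take these from WordExtractor.
        String[] paragraphs = { "ab\n", "cd\n" };
        String pieceText = "ab\ncd\nef\n";
        System.out.println("missing characters: "
                + missingChars(paragraphs, pieceText));
    }
}
```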

The problem exists in a few documents from the same source; text extraction from many other documents works fine.

POI version poi-3.6-beta1-20091002 (svn trunk)

Comment 1 Maxim Valyanskiy 2009-10-28 07:01:05 UTC

Created attachment 24433
document

Comment 2 Maxim Valyanskiy 2010-08-04 07:52:38 UTC

The paragraph offsets (FC) in the PAPX in this file are 2048 bytes larger than the offsets of the real character data in the text pieces. Hm.

Comment 3 Maxim Valyanskiy 2010-08-04 08:45:04 UTC

Fixed by a workaround in r982238.

Comment 4 Sergey Vladimirov 2011-07-11 16:58:17 UTC

This file seems so very wrong to me. OpenOffice or LibreOffice can't even show it correctly.

In more detail, it has 2 TextPieces:

TextPiece from 0 to 1199 (PieceDescriptor (pos: 2048; unicode))
TextPiece from 1199 to 2377 (PieceDescriptor (pos: 4608; unicode))

but all CHPX refer to the second text piece:

* CHPX from 1024 to 1037 (in bytes 4096 to 4122)
* CHPX from 1037 to 1038 (in bytes 4122 to 4124)
* ...
* CHPX from 2142 to 2377 (in bytes 6494 to 11776)

as well as PAPX:
* PAPX from 1185 to 1199 (in bytes 4418 to 4478)
* PAPX from 2142 to 2377 (in bytes 6494 to 12102)

so it is just a bad file, AFAIK.
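
For reference, the byte-to-character arithmetic behind the numbers above can be sketched as follows. This is a minimal illustration and not POI's actual code: the Piece class and toCp method are hypothetical, and the clamping of out-of-range offsets is an assumption chosen to reproduce the listed ranges. Each unicode piece uses 2 bytes per character, so a byte offset fc inside a piece maps to cpStart + (fc - pos) / 2.

```java
import java.util.Arrays;
import java.util.List;

public class FcToCpSketch {

    /** Hypothetical stand-in for a text piece; this is not a POI class. */
    static final class Piece {
        final int cpStart;
        final int cpEnd;
        final int fcStart;
        final int bytesPerChar;

        Piece(int cpStart, int cpEnd, int fcStart, boolean unicode) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.fcStart = fcStart;
            this.bytesPerChar = unicode ? 2 : 1;
        }

        /** First byte offset after this piece's character data. */
        int fcEnd() {
            return fcStart + (cpEnd - cpStart) * bytesPerChar;
        }
    }

    /** The two text pieces listed above. */
    static final List<Piece> PIECES = Arrays.asList(
            new Piece(0, 1199, 2048, true),
            new Piece(1199, 2377, 4608, true));

    /**
     * Maps a byte offset to a character position. Offsets in a gap
     * before a piece snap to that piece's start; offsets past the last
     * piece snap to the end of the text. The list must be non-empty
     * and sorted by byte offset.
     */
    static int toCp(List<Piece> pieces, int fc) {
        for (Piece p : pieces) {
            if (fc < p.fcStart) {
                return p.cpStart;
            }
            if (fc < p.fcEnd()) {
                return p.cpStart + (fc - p.fcStart) / p.bytesPerChar;
            }
        }
        return pieces.get(pieces.size() - 1).cpEnd;
    }

    public static void main(String[] args) {
        int[][] byteRanges = {
            {4096, 4122}, {4122, 4124}, {6494, 11776}, // CHPX
            {4418, 4478}, {6494, 12102}                // PAPX
        };
        for (int[] r : byteRanges) {
            System.out.println("bytes " + r[0] + " to " + r[1] + " -> chars "
                    + toCp(PIECES, r[0]) + " to " + toCp(PIECES, r[1]));
        }
    }
}
```

Note that the first CHPX starts at byte 4096, which is 2048 bytes past the start of the first piece's data at byte 2048. That is the same 2048-byte difference mentioned in comment 2.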

Apart from that, there is a table without a single row or cell, i.e. there is a PAPX with inTable=true, but there are no end-of-cell marks.

Comment 5 Maxim Valyanskiy 2011-07-11 19:43:03 UTC

Sergey, could it be an "autosaved" file? I have seen some strange format violations in such files.

Comment 6 Sergey Vladimirov 2011-07-12 10:40:03 UTC

Maxim,

No, it doesn't look like a quick-saved file:

[FIB]
...
         .fComplex                 = false
...
[/FIB]

Although it was quick-saved 15 times, it is currently marked as a fully saved file. Also, there are no additional grpprl(s) in the CPL section, i.e. there is no SPRM quick-save data.
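
For anyone who wants to check these two values without a full dump, here is a minimal sketch of reading them from the raw FIB bytes. The class and method names are hypothetical. The offset and masks (a 16-bit little-endian flags word at byte offset 0x0A, fComplex in bit 0x0004, the quick-save count in bits 0x00F0) are my reading of the Word binary format layout and should be verified against the specification before relying on them.

```java
public class FibFlagsSketch {

    // Assumed bit masks within the 16-bit flags word at FIB offset 0x0A.
    static final int F_COMPLEX_MASK = 0x0004;
    static final int C_QUICK_SAVES_MASK = 0x00F0;

    /** Reads the little-endian flags word from the start of the FIB. */
    static int readFlags(byte[] fib) {
        return (fib[0x0A] & 0xFF) | ((fib[0x0B] & 0xFF) << 8);
    }

    /** True if the complex (quick-saved) flag is set. */
    static boolean isComplex(int flags) {
        return (flags & F_COMPLEX_MASK) != 0;
    }

    /** Extracts the 4-bit quick-save counter. */
    static int quickSaveCount(int flags) {
        return (flags & C_QUICK_SAVES_MASK) >>> 4;
    }

    public static void main(String[] args) {
        // Made-up header bytes matching the values quoted above:
        // fComplex = false, quick-save counter = 15.
        byte[] fib = new byte[32];
        fib[0x0A] = (byte) 0xF0;

        int flags = readFlags(fib);
        System.out.println("fComplex    = " + isComplex(flags));
        System.out.println("cQuickSaves = " + quickSaveCount(flags));
    }
}
```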