Bug 50955 - An error occurred while retrieving the text file.
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.8-dev
Hardware: All All
Importance: P2 normal with 1 vote
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Duplicates: 51946 60936 60942
Depends on:
Blocks:
 
Reported: 2011-03-22 12:48 UTC by Arthur
Modified: 2017-05-02 23:53 UTC



Attachments
File that does not parse. (589.00 KB, application/msword)
2011-03-22 12:51 UTC, Arthur
Still have the problem in trunk (r1175705). (90.50 KB, application/msword)
2011-09-26 10:15 UTC, pqueixalos

Description Arthur 2011-03-22 12:48:52 UTC
When I attempt to extract text from a file, I get the following error:

java.lang.IllegalStateException: Told we're for characters 0 -> 173225, but actually covers 173211 characters!
	at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:50)
	at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:95)
	at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:54)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:68)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:42)

Here's the source code I am using to extract text from the file:

public Boolean parseFile(String pathToFile) {
        InputStream isr = null;
        try {
            isr = new FileInputStream(pathToFile);
            WordExtractor word = new WordExtractor(isr);
            String fileContent = "";
            String[] paragraphes = word.getParagraphText();
            for (String paragraph : paragraphes) {
                fileContent += " " + paragraph;
            }
            AddDataToIndex.class.newInstance().doAddData(fileContent, pathToFile);
            return true;
        } catch (OldWordFileFormatException ex) {
            return parseWord6(pathToFile);

        } catch (Exception ex) {
            Vars.logger.fatal(ex);
            return false;
        } finally {
            try {
                // isr stays null if the FileInputStream constructor threw
                if (isr != null) {
                    isr.close();
                }
            } catch (IOException ex) {
                Vars.logger.fatal(ex);
            }
        }
    }

    private Boolean parseWord6(String pathToFile) {
        FileInputStream fis = null;
        try {
            File docFile = new File(pathToFile);
            fis = new FileInputStream(docFile.getAbsolutePath());
            POIFSFileSystem pfs = new POIFSFileSystem(fis);
            HWPFOldDocument doc = new HWPFOldDocument(pfs);
            Word6Extractor docExtractor = new Word6Extractor(doc);
            return true;
        } catch (Exception ex) {
            Vars.logger.fatal("Error: ", ex);
            return false;
        } finally {
            try {
                // fis stays null if the FileInputStream constructor threw
                if (fis != null) {
                    fis.close();
                }
            } catch (IOException ex) {
                Vars.logger.fatal("Error", ex);
            }
        }
    }

The file I tried to parse is attached.
Comment 1 Arthur 2011-03-22 12:51:46 UTC
Created attachment 26789
File that does not parse.
Comment 2 Yegor Kozlov 2011-06-24 08:27:32 UTC
Still have the problem in trunk.

Yegor
Comment 3 Sergey Vladimirov 2011-07-09 14:54:40 UTC
This problem is related to detecting the text encoding in Word 95 files. Word 95 has no native Unicode text support and stores all text in an 8-bit encoding that is not necessarily Windows-1252.

We need to figure out a way to correctly extract the source text encoding from Word 95 files.
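
To illustrate why the encoding matters, here is a minimal sketch (not POI code) that decodes the same 8-bit bytes under two single-byte codepages. The class name, the helper and the sample bytes are made up for illustration, and the choice of windows-1251 is an assumption; nothing in this bug confirms which codepage the attached file uses.

```java
import java.nio.charset.Charset;

class CodepageDemo {
    // Decode raw 8-bit text with an explicitly chosen single-byte codepage.
    static String decode(byte[] raw, String codepage) {
        return new String(raw, Charset.forName(codepage));
    }

    public static void main(String[] args) {
        // Hypothetical sample: a six-letter Russian word encoded as windows-1251.
        byte[] raw = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8,
                      (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};

        // Same bytes, two interpretations: Cyrillic letters vs. accented Latin letters.
        System.out.println(decode(raw, "windows-1251"));
        System.out.println(decode(raw, "windows-1252"));
    }
}
```

Both calls return six characters, so the text length is unaffected here; only the characters themselves differ, which is why the codepage has to come from the file rather than from a fixed default.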
Comment 4 pqueixalos 2011-09-26 10:15:20 UTC
Created attachment 27597 [details]
Still have the problem in trunk (r1175705).

[...]
at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:73)
at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:111)
at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
[...]
Comment 5 Sergey Vladimirov 2012-11-05 15:54:37 UTC
*** Bug 51946 has been marked as a duplicate of this bug. ***
Comment 6 Muhammad Fazal 2014-08-26 05:23:17 UTC
Is there any update on this bug? I am getting the same error.

Please suggest a workaround if one is available. I notice this bug has been in the NEW state since 2012.
Comment 7 Tim Allison 2017-03-31 18:37:10 UTC
*** Bug 60936 has been marked as a duplicate of this bug. ***
Comment 8 Tim Allison 2017-03-31 18:37:44 UTC
*** Bug 60942 has been marked as a duplicate of this bug. ***
Comment 9 Tim Allison 2017-03-31 18:44:14 UTC
I figured out how to read the old font table which includes codepage info.  This doesn't solve all of our problems, but it helps.  Via testing with OpenOffice, I found that I can't have two different codepages in one document...that may be a feature of OpenOffice and not reality, but this hack/heuristic works with all files attached here, TIKA-2313 and files generated with OpenOffice.

So, the current temporary solution is to read through the font table and pick the codepage that isn't "default" or "symbol."
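
A minimal sketch of that heuristic, not the committed code. It assumes the font table exposes one LOGFONT-style charset id per font (0 = ANSI, 1 = DEFAULT, 2 = SYMBOL), treats ANSI as "default" as well, and uses a small illustrative id-to-charset map; the class, method and map are hypothetical names.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Map;

class FontTableCodepageGuess {
    // LOGFONT-style charset ids (assumed to be what the old font table stores).
    static final int ANSI_CHARSET = 0;
    static final int DEFAULT_CHARSET = 1;
    static final int SYMBOL_CHARSET = 2;

    // Partial, illustrative mapping from charset id to a Java charset name.
    static final Map<Integer, String> CHARSET_NAMES = Map.of(
            136, "Big5",          // CHINESEBIG5_CHARSET
            161, "windows-1253",  // GREEK_CHARSET
            204, "windows-1251",  // RUSSIAN_CHARSET
            238, "windows-1250"); // EASTEUROPE_CHARSET

    // Return the first codepage that is neither default nor symbol,
    // falling back to windows-1252 when no such font is present.
    static Charset guess(List<Integer> fontCharsetIds) {
        for (int id : fontCharsetIds) {
            if (id == ANSI_CHARSET || id == DEFAULT_CHARSET || id == SYMBOL_CHARSET) {
                continue;
            }
            String name = CHARSET_NAMES.get(id);
            if (name != null) {
                return Charset.forName(name);
            }
        }
        return Charset.forName("windows-1252");
    }
}
```

Picking the first match is only reasonable under the one-codepage-per-document assumption described above; a document with two non-default codepages would need per-run information.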

Ideally, we'd be able to map each run to a font table.  If anyone has recommendations, let me know.


Side note:
I also fixed a bug in PapInTable:

-   if ( papx.getGrpprl() == null || papx.getGrpprl().length == 0 )
+   if ( papx.getGrpprl() == null || papx.getGrpprl().length <= 2 )

The issue is that there were some grpprls with size 1 in the old docs, and this caused an array out of bounds exception when copying because we start at offset 2.
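
A minimal sketch of why the guard matters, assuming the copy skips a 2-byte prefix as described. The helper below is hypothetical and is not the PapInTable code; returning an empty array for short inputs is just an illustrative choice.

```java
import java.util.Arrays;

class GrpprlCopy {
    // Hypothetical helper: return the grpprl bytes that follow a 2-byte prefix.
    static byte[] payloadAfterPrefix(byte[] grpprl) {
        // With only a "length == 0" check, a 1-byte array would reach the copy
        // below with a start offset past the end of the array.
        if (grpprl == null || grpprl.length <= 2) {
            return new byte[0];
        }
        return Arrays.copyOfRange(grpprl, 2, grpprl.length);
    }
}
```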

Commit to come shortly.
Comment 10 Tim Allison 2017-03-31 19:51:34 UTC
Gah.  Bug51944.doc shows that the charset can be UTF-16LE.  The text reads correctly when we decode the bytes with UTF-16LE, but this file makes clear that we are correct not to check for the isUnicode byte and then do the /= 2, etc.

Solution will have to wait until next week...argh...
Comment 11 Tim Allison 2017-04-03 16:19:01 UTC
Turns out that 51944.doc is not UTF-16LE.  Judging from this file and 2 other files from our common crawl corpus, this is actually Big5, but MS appears to zero-pad ASCII characters.
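
A small sketch of why such text can pass for UTF-16LE at first glance. The sample bytes are made up, and the on-disk order of the two-byte unit (7C B7) is an assumption, not something read from the attached files.

```java
import java.nio.charset.StandardCharsets;

class Utf16LookalikeDemo {
    static String asUtf16le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Zero-padded ASCII is byte-for-byte identical to UTF-16LE.
        byte[] asciiOnly = {0x41, 0x00, 0x42, 0x00};
        System.out.println(asUtf16le(asciiOnly));

        // A two-byte unit that is not zero-padded still decodes under UTF-16LE,
        // but to an unrelated code point rather than the intended Big5 character.
        byte[] mixed = {0x41, 0x00, 0x7C, (byte) 0xB7};
        System.out.printf("U+%04X%n", (int) asUtf16le(mixed).charAt(1));
    }
}
```

The ASCII portions come out correctly either way, so a wrong UTF-16LE guess only shows up in the non-ASCII characters.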

Has anyone worked with this?  Do we have something in our codebase that deals with this already?

If not, we may need some extra code to imitate MS's Big5 en/decoding...not within the scope of this ticket.

Testing against ~1300 Word 6.0 files in our corpus suggests that the proposed solution works.  Unfortunately, only a few handfuls of those files aren't encoded with WIN-1252.
Comment 12 Tim Allison 2017-04-04 02:18:43 UTC
r1790061

If anyone has a chance to review this before the next release, that'd be great.

The current heuristic looks for a non-default/symbol codepage in the font table and then applies that. 

I was able to find only one file in ~1300 where this heuristic fails, and I'll open a follow up issue for that.

The other item that I worked towards fixing is that we need special handling for Big5. MS Word 6.0 stored e.g. 7C B7 in reverse order B7 7C, and it zero-padded ASCII characters.  Even if we flip the bytes, new String(byte[], "Big5") doesn't strip out the zero-padding in the ASCII.
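
One way the special handling could look, as a rough sketch rather than a proposal for the actual fix. It assumes text is stored in 2-byte units, that a unit whose second byte is 0 is zero-padded ASCII, and that every other unit holds a Big5 character with its two bytes swapped; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

class Word6Big5Sketch {
    // Normalize the 2-byte units into plain Big5 bytes, then decode once.
    static String decode(byte[] raw) {
        ByteArrayOutputStream big5 = new ByteArrayOutputStream();
        // A trailing odd byte, if any, is ignored in this sketch.
        for (int i = 0; i + 1 < raw.length; i += 2) {
            byte first = raw[i];
            byte second = raw[i + 1];
            if (second == 0) {
                big5.write(first);   // zero-padded ASCII: drop the padding byte
            } else {
                big5.write(second);  // swap back into lead byte, trail byte order
                big5.write(first);
            }
        }
        return new String(big5.toByteArray(), Charset.forName("Big5"));
    }
}
```

With this approach every 2-byte unit yields one character, so the character count stays at half the byte count even though the normalized Big5 byte stream is shorter.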

There remains the basic problem that TextPiece stores data in a StringBuilder, and the actual conversion of bytes to chars isn't straightforward.

For example, if we assume that Big5 requires 2x the number of bytes, all is well with storage, but then it contains the 0 padding, and our code assumes that the StringBuilder contains actual strings, not this zero-padded stuff...so we'd have to strip those out.  From a storage perspective, and a "closer to MSWord" perspective, this is probably better.  If we count the number of bytes read per # of chars, we get a mismatch.  There's no apparent easy solution to this.

Finally, I couldn't find a way of linking runs or text pieces to fonts. In the few files I found with multiple non-default encodings, the font encoding offset for the FFn in the runs was always 0, even though the actual font used was not 0, if we go by the codepage info.
Comment 13 Nick Burch 2017-04-04 08:22:51 UTC
I'd suggest raising a new bug for the Big5 stuff and attaching an example file there. Since we only need to support read (not write), I have some ideas on how we might solve it, but that's best tracked in another bug.
Comment 14 Tim Allison 2017-04-04 12:20:33 UTC
Thank you, Nick.  I'd greatly appreciate your help!

See Bug 60953.
Comment 15 Tim Allison 2017-04-04 12:22:16 UTC
>I was able to find only one file in ~1300 where this heuristic fails, and I'll open a follow up issue for that.

See Bug 60952
Comment 16 Andreas Beeker 2017-05-02 23:53:13 UTC
After patching HPSF (see #61062), the DocumentSummary heuristics no longer work, so I've reverted the changes to the codepage guessing with r1793601.