61045 – RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars

Bug 61045 - RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars

Summary: RecordFormatException: Expected to find a ContinueRecord in order to read rem...

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HSSF (show other bugs)
Version:	3.16-dev
Hardware:	PC All

Importance:	P2 normal (vote)
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-04-26 20:21 UTC by Tim Allison
Modified:	2017-06-20 18:12 UTC (History)
CC List:	0 users

Attachments
triggering doc from govdocs1 (10.50 KB, application/vnd.ms-excel) 2017-04-26 20:21 UTC, Tim Allison	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Tim Allison 2017-04-26 20:21:24 UTC

Created attachment 34955 [details]
triggering doc from govdocs1

In our regression corpus, we have a few handfuls of cases of this stack trace:
org.apache.poi.hssf.record.RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars
    at org.apache.poi.hssf.record.RecordInputStream.readStringCommon(RecordInputStream.java:404)
    at org.apache.poi.hssf.record.RecordInputStream.readCompressedUnicode(RecordInputStream.java:363)
    at org.apache.poi.hssf.record.FormatRecord.<init>(FormatRecord.java:57)
    at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:84)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:345)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:307)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:273)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:175)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:136)

In the file I examined, it looked like the record length is correct, but that the XLUnicodeString's length is off by one, and there is actually no continue record.

Excel is able to open this file without complaint.

I'm attaching a triggering file from govdocs1.

Comment 1 Tim Allison 2017-04-26 20:44:53 UTC

Offending 50 (not 51!) characters:
_("$"* #,##0_);_("$"* \(#,##0\);_("$"* "-"_);_(@_)

This is the same as the built-in (without escapes?):
_($* #,##0_);_($* (#,##0);_($* "-"_);_(@_)


The longest built-in custom format in my Excel is:
_($* #,##0.00_);_($* (#,##0.00);_($* "-"??_);_(@_)

which coincidentally has 50 characters.

I wonder if we're hitting a built in limitation on format size?

When I re-save the file, I get the proper 50 character length for the string:
2a 00 32 00 00

instead of the earlier 2a 00 33 00 00

Comment 2 Nick Burch 2017-04-26 21:58:38 UTC

An easy way to make the format string longer is to add colours, eg add [Red] to part of them. (Example at https://support.office.com/en-us/article/Format-negative-percentages-to-make-them-easy-to-find-d86b3d58-1cb4-40b3-a43b-8e1dae822112 ). Maybe try adding something like that to + / - / 0 rules to push the size out, and see if there's really a 50 limit or not?

Comment 3 Tim Allison 2017-04-27 15:21:02 UTC

Thank you, Nick.  It turns out that there isn't a hard limit of 50.  Adding [Red] can make for a format string > 50.

In our regression corpus, there are 13 examples of requested 51, there are only 50 bytes available; and there's one example from TIKA-2154 that shows some even greater, um, flexibility in RecordFormat.

If we add a custom readStringCommon to RecordFormat, all works[1]


    private String readStringCommon(RecordInputStream ris, int requestedLength, boolean pIsCompressedEncoding) {
        // Sanity check to detect garbage string lengths
        if (requestedLength < 0 || requestedLength > 0x100000) { // 16 million chars?
            throw new IllegalArgumentException("Bad requested string length (" + requestedLength + ")");
        }
        char[] buf = null;
        boolean isCompressedEncoding = pIsCompressedEncoding;
        int availableChars = isCompressedEncoding ? ris.remaining() : ris.remaining() / LittleEndianConsts.SHORT_SIZE;
        //everything worked out.  Great!
        int remaining = ris.remaining();
        if (requestedLength == availableChars) {
            buf = new char[requestedLength];
        } else {
            //sometimes in older Excel 97 .xls files,
            //the requested length is wrong.
            //Read all available characters.
            buf = new char[availableChars];
        }
        for (int i = 0; i < buf.length; i++) {
            char ch;
            if (isCompressedEncoding) {
                ch = (char) ris.readUByte();
            } else {
                ch = (char) ris.readShort();
            }
            buf[i] = ch;
        }

        //TIKA-2154's file shows that even in a unicode string
        //there can be a remaining byte (without proper final '00')
        //that should be read as a byte
        if (ris.available() == 1) {
            char[] tmp = new char[buf.length+1];
            System.arraycopy(buf, 0, tmp, 0, buf.length);
            tmp[buf.length] = (char)ris.readUByte();
            buf = tmp;
        }
        String ret = new String(buf);

        //swallow what's left
        while (ris.available() > 0) {
            ris.readByte();
        }
        return new String(buf);
    }

[1]  Well, not quite all, turns out that a DimensionsRecord can have an extra short in these files, too...argh...

Comment 4 Tim Allison 2017-04-27 15:22:52 UTC

If I added logging to warn that things have gone wonky in FormatRecord, are we willing to add this special handling code to deal with these non-standard files that Excel is able to read without complaint?

Comment 5 Tim Allison 2017-06-20 18:12:57 UTC

No objections were raised, I fixed this in r1799360