Bug 61045

Summary: RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars
Product: POI Reporter: Tim Allison <tallison>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: triggering doc from govdocs1

Description Tim Allison 2017-04-26 20:21:24 UTC
Created attachment 34955 [details]
triggering doc from govdocs1

In our regression corpus, we have a few handfuls of cases of this stack trace:
org.apache.poi.hssf.record.RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars
    at org.apache.poi.hssf.record.RecordInputStream.readStringCommon(RecordInputStream.java:404)
    at org.apache.poi.hssf.record.RecordInputStream.readCompressedUnicode(RecordInputStream.java:363)
    at org.apache.poi.hssf.record.FormatRecord.<init>(FormatRecord.java:57)
    at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:84)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:345)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:307)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:273)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:175)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:136)

In the file I examined, it looked like the record length is correct, but that the XLUnicodeString's length is off by one, and there is actually no continue record.

Excel is able to open this file without complaint.

I'm attaching a triggering file from govdocs1.
Comment 1 Tim Allison 2017-04-26 20:44:53 UTC
Offending 50 (not 51!) characters:
_("$"* #,##0_);_("$"* \(#,##0\);_("$"* "-"_);_(@_)

This is the same as the built-in (without escapes?):
_($* #,##0_);_($* (#,##0);_($* "-"_);_(@_)


The longest built-in custom format in my Excel is:
_($* #,##0.00_);_($* (#,##0.00);_($* "-"??_);_(@_)

which coincidentally has 50 characters.

I wonder if we're hitting a built in limitation on format size?

When I re-save the file, I get the proper 50 character length for the string:
2a 00 32 00 00

instead of the earlier 2a 00 33 00 00
Comment 2 Nick Burch 2017-04-26 21:58:38 UTC
An easy way to make the format string longer is to add colours, eg add [Red] to part of them. (Example at https://support.office.com/en-us/article/Format-negative-percentages-to-make-them-easy-to-find-d86b3d58-1cb4-40b3-a43b-8e1dae822112 ). Maybe try adding something like that to + / - / 0 rules to push the size out, and see if there's really a 50 limit or not?
Comment 3 Tim Allison 2017-04-27 15:21:02 UTC
Thank you, Nick.  It turns out that there isn't a hard limit of 50.  Adding [Red] can make for a format string > 50.

In our regression corpus, there are 13 examples of requested 51, there are only 50 bytes available; and there's one example from TIKA-2154 that shows some even greater, um, flexibility in RecordFormat.

If we add a custom readStringCommon to RecordFormat, all works[1]


    private String readStringCommon(RecordInputStream ris, int requestedLength, boolean pIsCompressedEncoding) {
        // Sanity check to detect garbage string lengths
        if (requestedLength < 0 || requestedLength > 0x100000) { // 16 million chars?
            throw new IllegalArgumentException("Bad requested string length (" + requestedLength + ")");
        }
        char[] buf = null;
        boolean isCompressedEncoding = pIsCompressedEncoding;
        int availableChars = isCompressedEncoding ? ris.remaining() : ris.remaining() / LittleEndianConsts.SHORT_SIZE;
        //everything worked out.  Great!
        int remaining = ris.remaining();
        if (requestedLength == availableChars) {
            buf = new char[requestedLength];
        } else {
            //sometimes in older Excel 97 .xls files,
            //the requested length is wrong.
            //Read all available characters.
            buf = new char[availableChars];
        }
        for (int i = 0; i < buf.length; i++) {
            char ch;
            if (isCompressedEncoding) {
                ch = (char) ris.readUByte();
            } else {
                ch = (char) ris.readShort();
            }
            buf[i] = ch;
        }

        //TIKA-2154's file shows that even in a unicode string
        //there can be a remaining byte (without proper final '00')
        //that should be read as a byte
        if (ris.available() == 1) {
            char[] tmp = new char[buf.length+1];
            System.arraycopy(buf, 0, tmp, 0, buf.length);
            tmp[buf.length] = (char)ris.readUByte();
            buf = tmp;
        }
        String ret = new String(buf);

        //swallow what's left
        while (ris.available() > 0) {
            ris.readByte();
        }
        return new String(buf);
    }

[1]  Well, not quite all, turns out that a DimensionsRecord can have an extra short in these files, too...argh...
Comment 4 Tim Allison 2017-04-27 15:22:52 UTC
If I added logging to warn that things have gone wonky in FormatRecord, are we willing to add this special handling code to deal with these non-standard files that Excel is able to read without complaint?
Comment 5 Tim Allison 2017-06-20 18:12:57 UTC
No objections were raised, I fixed this in r1799360