Created attachment 34955 [details] triggering doc from govdocs1 In our regression corpus, we have a few handfuls of cases of this stack trace: org.apache.poi.hssf.record.RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars at org.apache.poi.hssf.record.RecordInputStream.readStringCommon(RecordInputStream.java:404) at org.apache.poi.hssf.record.RecordInputStream.readCompressedUnicode(RecordInputStream.java:363) at org.apache.poi.hssf.record.FormatRecord.<init>(FormatRecord.java:57) at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:84) at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:345) at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:307) at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:273) at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:175) at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:136) In the file I examined, it looked like the record length is correct, but that the XLUnicodeString's length is off by one, and there is actually no continue record. Excel is able to open this file without complaint. I'm attaching a triggering file from govdocs1.
Offending 50 (not 51!) characters: _("$"* #,##0_);_("$"* \(#,##0\);_("$"* "-"_);_(@_) This is the same as the built-in (without escapes?): _($* #,##0_);_($* (#,##0);_($* "-"_);_(@_) The longest built-in custom format in my Excel is: _($* #,##0.00_);_($* (#,##0.00);_($* "-"??_);_(@_) which coincidentally has 50 characters. I wonder if we're hitting a built in limitation on format size? When I re-save the file, I get the proper 50 character length for the string: 2a 00 32 00 00 instead of the earlier 2a 00 33 00 00
An easy way to make the format string longer is to add colours, eg add [Red] to part of them. (Example at https://support.office.com/en-us/article/Format-negative-percentages-to-make-them-easy-to-find-d86b3d58-1cb4-40b3-a43b-8e1dae822112 ). Maybe try adding something like that to + / - / 0 rules to push the size out, and see if there's really a 50 limit or not?
Thank you, Nick. It turns out that there isn't a hard limit of 50. Adding [Red] can make for a format string > 50. In our regression corpus, there are 13 examples of requested 51, there are only 50 bytes available; and there's one example from TIKA-2154 that shows some even greater, um, flexibility in RecordFormat. If we add a custom readStringCommon to RecordFormat, all works[1] private String readStringCommon(RecordInputStream ris, int requestedLength, boolean pIsCompressedEncoding) { // Sanity check to detect garbage string lengths if (requestedLength < 0 || requestedLength > 0x100000) { // 16 million chars? throw new IllegalArgumentException("Bad requested string length (" + requestedLength + ")"); } char[] buf = null; boolean isCompressedEncoding = pIsCompressedEncoding; int availableChars = isCompressedEncoding ? ris.remaining() : ris.remaining() / LittleEndianConsts.SHORT_SIZE; //everything worked out. Great! int remaining = ris.remaining(); if (requestedLength == availableChars) { buf = new char[requestedLength]; } else { //sometimes in older Excel 97 .xls files, //the requested length is wrong. //Read all available characters. buf = new char[availableChars]; } for (int i = 0; i < buf.length; i++) { char ch; if (isCompressedEncoding) { ch = (char) ris.readUByte(); } else { ch = (char) ris.readShort(); } buf[i] = ch; } //TIKA-2154's file shows that even in a unicode string //there can be a remaining byte (without proper final '00') //that should be read as a byte if (ris.available() == 1) { char[] tmp = new char[buf.length+1]; System.arraycopy(buf, 0, tmp, 0, buf.length); tmp[buf.length] = (char)ris.readUByte(); buf = tmp; } String ret = new String(buf); //swallow what's left while (ris.available() > 0) { ris.readByte(); } return new String(buf); } [1] Well, not quite all, turns out that a DimensionsRecord can have an extra short in these files, too...argh...
If I added logging to warn that things have gone wonky in FormatRecord, are we willing to add this special handling code to deal with these non-standard files that Excel is able to read without complaint?
No objections were raised, I fixed this in r1799360