Bug 56893 - Overflow in UnicodeString results in corrupted file when setCellValue() is called with a string larger than 32767
Summary: Overflow in UnicodeString results in corrupted file when setCellValue() is ca...
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: 3.9-FINAL
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2014-08-27 13:15 UTC by Mirjan Merruko
Modified: 2015-03-01 21:58 UTC (History)
0 users

Small JUnit test which demonstrates the problem with setCellValue(String) (1.18 KB, text/x-java)
2014-08-29 18:50 UTC, Mirjan Merruko

Note You need to log in before you can comment on or make changes to this bug.
Description Mirjan Merruko 2014-08-27 13:15:57 UTC

In the following snippet from org.apache.poi.hssf.record.common.UnicodeString there is a cast from int to a short. In my case I had a string with a length of 65571 which when cast to short would be 35.

public void setString(String string)
        field_3_string = string;
        // scan for characters greater than 255 ... if any are
        // present, we have to use 16-bit encoding. Otherwise, we
        // can use 8-bit encoding
        boolean useUTF16 = false;
        int strlen = string.length();

        for ( int j = 0; j < strlen; j++ )
            if ( string.charAt( j ) > 255 )
                useUTF16 = true;
        if (useUTF16)
          //Set the uncompressed bit
          field_2_optionflags = highByte.setByte(field_2_optionflags);
        else field_2_optionflags = highByte.clearByte(field_2_optionflags);

Now as setCellValue(String value) in HSSFCell first creates an HSSFRichTextString then calls setCellValue(RichTextString value) this makes the check below valid while in fact it's not.

if(hvalue.length() > SpreadsheetVersion.EXCEL97.getMaxTextLength()){
  throw new IllegalArgumentException("The maximum length of cell contents (text) is 32,767 characters");

As I'm not very familiar with the concepts here, excuse me if I'm wrong, but it would seem that the cast could be removed from the setCharCount call. A UnicodeString wouldn't need to enforce the limit set by the Excel standard. That should be just enough to avoid this overflow problem, otherwise checking the size of the original string in setCellValue(String value) might be enough.
Comment 1 Nick Burch 2014-08-28 05:59:14 UTC
Are you able to create a short junit unit test which shows how to get round the check in HSSFRichTextString?

(We need to decide if we should remove the cast, or add an additional check, the unit test showing how to trigger it should help with that)
Comment 2 Mirjan Merruko 2014-08-29 18:50:41 UTC
Created attachment 31951 [details]
Small JUnit test which demonstrates the problem with setCellValue(String)
Comment 3 Dominik Stadler 2014-12-22 09:41:43 UTC
The call to HSSFRichTextString(value) and UnicodeString(value) is done before setCellValue(), so we potentially cut off via the case in there before we do the actual check currently.
Comment 4 Dominik Stadler 2015-03-01 21:58:55 UTC
It seems UnicodeString is used for multiple items listed in the spec. 

According to the spec under "2.5.294 XLUnicodeString" the length is specified as 2 bytes.

Another usage is XLUnicodeRichExtendedString under "2.5.293 XLUnicodeRichExtendedString", this one allows continuation records, however the lenght-information still only allows 2 bytes.

So it seems there is a limit of 65535 characters in the record-definitions.

SpreadsheetVersion.EXCEL97, which we use to verify text-length in other places has 32767, not sure if this is somewhere in the spec or imposed because of other issues. At least formula-text seems to be limited to this value by the spec.