Bug 38230

Summary: [PATCH] UnicodeString#fillFields invalid read of non US characters >=128 and <=255
Product: POI
Reporter: Perolo Silantico <per.sil>
Component: HSSF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED
Severity: major
CC: c.gosch
Priority: P2
Keywords: PatchAvailable
Version: 3.0-dev   
Target Milestone: ---   
Hardware: All   
OS: other   
Attachments: poi-UnicodeString-typecast.2006-01-11.diff

Description Perolo Silantico 2006-01-11 23:28:56 UTC
I had a problem reading HSSFCell values containing German-specific letters
(umlauts). Most probably the same difficulty applies to all characters with
integer values from 128 to 255.

They all ended up with every bit of the high byte set to 1. It turned out to
be a type cast problem, observed on J2SE 1.4.2(06).

Casting from byte to char sign-extends the value: byte is a signed type, so the
highest bit of the byte is used to fill the high byte of the char. The German
umlaut ä is 0xe4, or 11100100. Converting this value to char results in
1111111111100100.

See this small code:
----------------------
public class ByteConverterTest {
    public static void main(String[] args) {
        byte umlautChar = (byte) 0xe4;  // the German umlaut ä
        // plain cast: the byte is sign-extended, giving 0xffe4
        char badEncoded = (char) umlautChar;
        // mask to the low 8 bits before the cast, giving 0x00e4
        char goodEncoded = (char) (umlautChar & 0xff);

        System.out.println("Badly converted umlaut uses hex value: "
                + Integer.toHexString(badEncoded));
        System.out.println("Good converted umlaut uses hex value: "
                + Integer.toHexString(goodEncoded) + "\n");
    }
}
----------------------

Output is:
----------------------
Badly converted umlaut uses hex value: ffe4
Good converted umlaut uses hex value: e4
----------------------
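
The "ffe4" result follows from how Java defines the byte-to-char cast: the byte is first widened to int (with sign extension) and only then narrowed to char. The following is a minimal sketch of the intermediate steps; the class and method names are made up for illustration and are not part of POI.

```java
public class SignExtensionDemo {
    // Plain cast: the byte is widened to int with sign extension,
    // then narrowed to char, which keeps the low 16 bits.
    static char castDirectly(byte b) {
        return (char) b;
    }

    // Masking with 0xff keeps only the low 8 bits, so the byte is
    // treated as an unsigned value in the range 0..255.
    static char castUnsigned(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte umlaut = (byte) 0xe4;
        // The byte widened to int: prints ffffffe4
        System.out.println(Integer.toHexString(umlaut));
        // Narrowed to char without masking: prints ffe4
        System.out.println(Integer.toHexString(castDirectly(umlaut)));
        // Masked before the cast: prints e4
        System.out.println(Integer.toHexString(castUnsigned(umlaut)));
    }
}
```

Bytes in the range 0 to 127 have the highest bit clear, so both casts give the same result for them; only values from 128 to 255 are affected.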

Attached you will find a patch to resolve this issue in the class
UnicodeString. The function fillFields uses this type of improper type cast.
Perhaps other classes do as well.
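
As an illustration of the kind of change involved, here is a rough sketch of decoding a run of 8-bit characters with the mask applied. It is not the attached patch and not POI's actual code: the class name, method name and parameters are hypothetical, and it assumes each stored byte is the low byte of one character, as this report implies.

```java
public class CompressedStringReader {
    // Hypothetical helper: turns `length` bytes starting at `offset`
    // into a String, treating each byte as an unsigned value 0..255.
    static String readCompressed(byte[] data, int offset, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            // Without "& 0xff" any byte >= 0x80 would become 0xffXX.
            sb.append((char) (data[offset + i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Grüße" with ü = 0xfc and ß = 0xdf
        byte[] raw = { 'G', 'r', (byte) 0xfc, (byte) 0xdf, 'e' };
        String text = readCompressed(raw, 0, raw.length);
        System.out.println(text.equals("Gr\u00fc\u00dfe"));
    }
}
```

Under the same assumption, new String(data, offset, length, "ISO-8859-1") should give the same result, since ISO-8859-1 maps each byte to the character with the same numeric value.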

Reproducible: Always (see test code)
Platform: Windows 2k, Linux 2.6.x
JVM: J2SE 1.4.2(06) and J2SE 1.4.2(10)


For those who are experiencing the same problem but do not want to wait for this
patch to make its way into CVS, you can use the following code to convert your
cell value to a proper Java string:
----------------------
String cellValue = cell.getRichStringCellValue().getString();
// clean invalid type casts
if (cellValue != null) {
    char[] buffer = cellValue.toCharArray();
    StringBuffer newValue = new StringBuffer(buffer.length);
    for (int i = 0; i < buffer.length; i++) {
        char charValue = buffer[i];
        short numValue = (short)charValue;

        // strip high byte if all bits are set to 1
        if ((numValue & 0xff00) == 0xff00)
            charValue = (char)(numValue & 0xff);

        newValue.append(charValue);
    }
        
    cellValue = newValue.toString();
}

----------------------


I have tried to find a previously entered bug report on this subject but failed.
I am sorry if I have missed it.
Comment 1 Perolo Silantico 2006-01-11 23:36:16 UTC
Created attachment 17394
poi-UnicodeString-typecast.2006-01-11.diff


Patch to correct the issue with the type cast from byte to char.
Comment 2 Jason Height 2006-01-17 10:17:48 UTC
Wow. Good catch. Confirmed that this happens with other versions of the JDK,
e.g. 1.5.0_03.

The suggested change is OK. I have tidied it up a bit and fixed all occurrences
of the cast to char (i.e. in RecordInputStream and UnicodeString).

Committed to SVN.

Jason