Because there is a danger of mixing any mistakes or problems or confuses about the transformation between byte array and String object(because of platform and locale and any bugs:D),I request to refactor the operations in sources about it. The rest is only an idea,maybe there is better one,please discuss about it :). 1.Refactor the class org.apache.poi.util.StringUtil like; public static String getFromUnicodeHigh(final byte[] string, @final int offset, final int len){ try{ return new String(string,offset,len*2,"UTF-16LE"); } catch(UnsupportedEncodingException e){ throw new InternalError();/*unreachable*/ } } public static String getFromUnicode(final byte[] string, @final int offset, final int len){ try{ return new String(string,offset,len*2,"UTF-16BE"); } catch(UnsupportedEncodingException e){ throw new InternalError();/*unreachable*/ } } //add new method public static String getFromCompressedUnicode(final byte[] string, @final int offset, final int len){ try{ return new String(string,offset,len,"ISO-8859-1"); } catch(UnsupportedEncodingException e){ throw new InternalError();/*unreachable*/ } } public static void putCompressedUnicode(final String input, final byte[] output,final int offset){ try{ byte[] bytes = input.getBytes("ISO-8859-1"); System.arraycopy(output,offset,bytes,0,bytes.length); } catch(UnsupportedEncodingException e){ throw new InternalError();/*unreachable*/ } } public static void putUncompressedUnicode(final String input, final byte[] output,final int offset){ try{ byte[] bytes = input.getBytes("UTF-16LE"); System.arraycopy(output,offset,bytes,0,bytes.length); } catch(UnsupportedEncodingException e){ throw new InternalError();/*unreachable*/ } } public static void putUncompressedUnicodeHigh(final String input, final byte[] output,final int offset){ try{ byte[] bytes = input.getBytes("UTF-16BE"); System.arraycopy(output,offset,bytes,0,bytes.length); } catch(UnsupportedEncodingException e){ throw new InternalError();/*unreachable*/ } } 2.Replace all phrases "new String(args)" to "StringUtil.getFromCompressedUnicode(args)"
Created attachment 5765 [details] Sample to show that System Encoder/Decoder works like StringUtil want to do.Please execute main and see console.
We need a 6 way to transform from/to byte array to/from String Object in POI. 1. Encode char seq 16Bit Uncompressed Unicode -> byte array with 16Bit Unicode on LittleEndian(like "UTF-16LE"builtin Encoder) 2. Encode char seq 16Bit Uncompressed Unicode -> byte array with 16Bit Unicode on BigEndian(like "UTF-16BE"builtin Encoder) 3. Encode char seq 8Bit Compressed Unicode -> byte array with 8Bit Compressed Unicode(like "ISO-8859-1"builtin Encoder) 4. Decode byte array with 16Bit Unicode on LittleEndian ->char seq 16Bit Uncomressed Unicode(like "UTF-16LE"builtin Decoder) 5. Decode byte array with 16Bit Unicode on BigEndian ->char seq 16Bit Uncomressed Unicode(like "UTF-16BE"builtin Decoder) 6. Decode byte array with 8Bit Compressed Unicode ->char seq 8Bit Comressed Unicode(like "ISO-8859-1"builtin Decoder) And now,in POI,there are doing with the ways like; 1.call StringUtil#getFromUnicodeHigh() 2.perhaps call StringUtil#getFromUnicode() 3.call constractor of java.lang.String() with no Decode Indicator 4.call StringUtil#putUncompressedUnicode() 5.perhaps call StringUtil#putUncompressedUnicodeHigh() 6.call StringUtil#putCompressedUnicode() So,It seems for me,there is some confusion and problems. (1)StringUtil#getFromUnicode()&#putUncompressedUnicodeHigh() has bugs,so these are always not works as you expect(see Bug 18837) (2)To do 1 you should use StringUtil#getFromUnicodeHigh() but to do 4 you should use putUncompressedUnicode().the name of metods shows some confuse. (3)The way to do 3 is isolated from others. (4)The functions of thease method is already existed at J2SE API(Constractor of java.lang.String).The implementations of these methods may be "Reinvention of wheel". I feel,to refine thease places may makes your burden a little lighter. But the priority of this matter is low,because the change works no effect at runtime,but only refine the landscape of sources... I feel and afraid,this may be one of the "broken window" in POI.
I totally support refactoring this. It looks like you know the most about it and have the know-how to do it. Send patches...They will be applied!
After fixing Bug 18947,I'll submit a patch for implementations in StringUtil. ...I could only 4 testcases to find the problems about Bug 18947,please help me! I found the easy way to change "Default Encoding" at runtime on Sun JRE(see the submitted testcase at Bug 18947),so perhaps the way help your works to find and solve any "Problems on Encoding" after now... And,if you can,please change the names of method XXXUnicode and XXXUnicodeHigh. because of section(2). Renaming methods is a easy work over any IDE(I can do it easily on Eclipse JDT). But if I do so,many points will cange by the work,so it's difficurt to submit as a patch.And I feel,I shouldn't give new names at these methods,but you do:D
I disagree. I think submitting patches would be the perfect way to fix the problem and that you are the perfect person to rename methods.
OK,I'll make a PATCH. I'll rename these methods to; #getFromCompressedUnicode() #getFromUnicodeLE() #getFromUnicodeBE() #putCompressedUnicode() #putUnicodeLE() #putUnicodeBE() If there are opinions against it,please tell me your objections:D
Created attachment 6270 [details] PATCH for this matter. test passed in my env.
I submitted a PATCH. Because it is only a refactor,so there is no any new testcases. Plz check and eval it. Line feeds in StringUtil has broken now,so I fixed it with this works.
BTW,there is some calling get/setUnicodeBE() in hssf,but there may be no place to store the Unicode-character-sequences with BigEndian in XLS format... I guess the places may contain some problems without showing as any bugs with testcases now exist. But I didn't know what the places do.I made the PATCH only by replacing with the "REFACTOR" command of "Eclipse JDT":D
this patch looks funny... We should work on this though. Maybe Tetsuya can help.
Fixed Weird LineFeed. Toshiaki, please supply the "cvs diff -u" style patch again for other parts.