While researching encoding problems in POI, I found a bug in org.apache.poi.util.StringUtil#getFromUnicode(). It currently breaks double-byte character codes, because the method does not do the same thing as StringUtil#getFromUnicodeHigh().
Created attachment 5721 [details] Testcase to find this problem.
Created attachment 5722 [details] PATCH to fix the problem.
I feel the methods #getFromUnicode() and #getFromUnicodeHigh() don't have to be as complicated as they are now. An easier way to produce the result is: return String(string,offset,len*2,"UTF-16BE"); and return string(string,offset,len*2,"UTF-16LE"); How about it :D?
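For illustration, a minimal sketch of the charset-based decoding suggested above. The class and helper names are hypothetical and are not POI's actual methods, and `StandardCharsets` is used in place of the charset name strings only to avoid the checked `UnsupportedEncodingException`.

```java
import java.nio.charset.StandardCharsets;

public class Utf16DecodeSketch {
    // Hypothetical helper: decode len 16-bit characters stored high byte first.
    static String fromBigEndian(byte[] bytes, int offset, int len) {
        return new String(bytes, offset, len * 2, StandardCharsets.UTF_16BE);
    }

    // Hypothetical helper: decode len 16-bit characters stored low byte first.
    static String fromLittleEndian(byte[] bytes, int offset, int len) {
        return new String(bytes, offset, len * 2, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // U+3042 (HIRAGANA LETTER A) followed by U+0041 ('A'), in both byte orders.
        byte[] be = {0x30, 0x42, 0x00, 0x41};
        byte[] le = {0x42, 0x30, 0x41, 0x00};
        System.out.println(fromBigEndian(be, 0, 2).equals("\u3042A"));    // true
        System.out.println(fromLittleEndian(le, 0, 2).equals("\u3042A")); // true
    }
}
```

The `len` argument counts characters, so the byte count passed to the constructor is `len * 2`, as in the suggestion.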
>return String(string,offset,len*2,"UTF-16BE");
return new String(string,offset,len*2,"UTF-16BE");
>return string(string,offset,len*2,"UTF-16LE");
return new String(string,offset,len*2,"UTF-16LE");
BTW, XP! StringUtil#putUncompressedUnicodeHigh() clearly contains a bug...
public static void putUncompressedUnicodeHigh(final String input, final byte[] output, final int offset) {
    int strlen = input.length();
    for (int k = 0; k < strlen; k++) {
        char c = input.charAt(k);
>       output[offset + (2 * k)] = (byte) (c >> 8);
>       output[offset + (2 * k)] = (byte) c; //what is this???
    }
}
I don't think any testcase is needed to fix it...
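To make the effect of the quoted loop concrete, here is a small sketch comparing it with one possible correction. The thread does not state which byte order the method is meant to produce; writing the low byte to the next index (high byte first) is an assumption drawn from the first store in the quoted code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class PutUnicodeSketch {
    // Same stores as the quoted loop: both writes target index offset + 2k.
    static void putAsQuoted(String input, byte[] output, int offset) {
        for (int k = 0; k < input.length(); k++) {
            char c = input.charAt(k);
            output[offset + (2 * k)] = (byte) (c >> 8);
            output[offset + (2 * k)] = (byte) c; // overwrites the high byte just stored
        }
    }

    // Assumed intent: high byte at offset + 2k, low byte at offset + 2k + 1.
    static void putHighByteFirst(String input, byte[] output, int offset) {
        for (int k = 0; k < input.length(); k++) {
            char c = input.charAt(k);
            output[offset + (2 * k)] = (byte) (c >> 8);
            output[offset + (2 * k) + 1] = (byte) c;
        }
    }

    public static void main(String[] args) {
        String s = "\u3042"; // HIRAGANA LETTER A, code 0x3042
        byte[] quoted = new byte[2];
        byte[] assumedFix = new byte[2];
        putAsQuoted(s, quoted, 0);
        putHighByteFirst(s, assumedFix, 0);
        System.out.println(Arrays.toString(quoted));     // [66, 0]
        System.out.println(Arrays.toString(assumedFix)); // [48, 66]
    }
}
```

With the quoted loop the high byte 0x30 is lost and the second slot is never written, so any character above 0xFF is corrupted.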
We once did use new String with UTF-16LE etc. It didn't work. I believe the UTF-8 and UTF-16 constants are a misnomer. They really just mean "16-bit or 8-bit character set".
Oh, really? In my local testcase it works correctly when reading and writing Excel files. I suspect you may have used it the wrong way...
The "UTF-16" encoder/decoder has some special behavior. When decoding a byte array to a String, it checks the first 2 bytes for an endian indicator and decides whether the rest is big endian ("UTF-16BE") or little endian ("UTF-16LE"); if the first 2 bytes are not an indicator, the JVM decodes the whole byte array (including the first 2 bytes) as big endian "UTF-16BE". And when encoding a String to a byte array, the JVM always adds 2 bytes as the endian indicator at the head of the byte array and encodes the rest as big endian "UTF-16BE".
So if you use "UTF-16" as the charset name, you must take care of these side effects. "UTF-16" is NOT SYMMETRICAL between decoding and encoding, especially when the byte array is encoded as "UTF-16LE" with no endian indicator bytes, like the character sequences in Excel files.
But "UTF-16BE" and "UTF-16LE" do not behave like that. Their rules simply do what StringUtil is doing now (this is part of the J2SE API specification, so we don't have to worry that it depends on the environment). So we can use these charset names to encode/decode byte arrays without such effects. They work for byte arrays containing:
[16-bit Unicode high byte][16-bit Unicode low byte]... -> "UTF-16BE"
[16-bit Unicode low byte][16-bit Unicode high byte]... -> "UTF-16LE"
As far as I can see, the character sequences in Excel are 16-bit Unicode in little endian, except when the string is COMPRESSED_UNICODE (8-bit). So in theory it will work correctly, and I checked that it does. I'll submit a sample patch of StringUtil, so please test it in your environment.
BTW, I feel many people who use ASCII characters for their natural language often misunderstand this: "UTF-8" is not an 8-bit character encoding. It encodes each 16-bit Unicode character to 1 to 3 bytes. An ASCII character is encoded to 1 byte, as in "ISO-8859-1", but many Japanese and other DBCS characters in 16-bit Unicode are encoded to 2 or 3 bytes per character. The length of the byte array is variable, depending on each character's code. The "8" doesn't mean "8 bits per char" :D
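The behavior described above can be checked with a short sketch like the following. It is not taken from the attached patches; the class name and the `hex` helper are hypothetical, and it relies only on the standard `java.nio.charset.StandardCharsets` constants.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BomSketch {
    // Hypothetical helper: hex dump of a string's bytes in the given charset.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u3042"; // HIRAGANA LETTER A

        // UTF-16LE and UTF-16BE write exactly 2 bytes per char, with no indicator.
        System.out.println(hex(s, StandardCharsets.UTF_16LE)); // 42 30
        System.out.println(hex(s, StandardCharsets.UTF_16BE)); // 30 42
        // Plain UTF-16 prepends a big-endian indicator when encoding.
        System.out.println(hex(s, StandardCharsets.UTF_16));   // FE FF 30 42

        // Little-endian bytes without an indicator, as in Excel strings.
        byte[] le = s.getBytes(StandardCharsets.UTF_16LE);
        // Plain UTF-16 falls back to big endian and yields a different character.
        System.out.println(new String(le, StandardCharsets.UTF_16).equals(s));   // false
        System.out.println(new String(le, StandardCharsets.UTF_16LE).equals(s)); // true

        // UTF-8 is variable length: 1 byte for ASCII, 3 bytes for this kana.
        System.out.println(hex("A", StandardCharsets.UTF_8)); // 41
        System.out.println(hex(s, StandardCharsets.UTF_8));   // E3 81 82
    }
}
```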
Created attachment 5735 [details] Patches to show that the charset indicator works.
The patch I submitted at 2003-04-09 04:03 fixes a simple bug in org.apache.poi.util.StringUtil#getFromUnicode(). The comment I submitted at 2003-04-09 09:22 reports another bug. Please evaluate these; they are not related to my LONG comment...
Please put the new file in a zip that preserves the directory structure relative to the jakarta-poi module, and I'll then apply it (provided the unit tests pass, etc.).
Created attachment 5786 [details] Patches and testcases, zipped (patches StringUtil and adds 4 testcases to TestStringUtil).
doh... I dropped the ball on this one.
The patch was applied long ago. I'm closing this bug. Please try the latest POI 3.5-beta4. Yegor