Bug 18837 - [PATCH]StringUtil#getFromUnicode() has a bug
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF
Version: 2.0-pre3
Hardware: Other other
Importance: P3 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2003-04-09 04:01 UTC by Toshiaki Kamoshida
Modified: 2008-12-25 10:33 UTC



Attachments
Testcase to find this problem. (693 bytes, text/plain)
2003-04-09 04:02 UTC, Toshiaki Kamoshida
PATCH to fix the problem. (633 bytes, patch)
2003-04-09 04:03 UTC, Toshiaki Kamoshida
Patches to show,charset indicator works good. (1.09 KB, patch)
2003-04-09 14:40 UTC, Toshiaki Kamoshida
Patches and testcases are zipped.(patching StringUtil and added 4 testcase to TestStringUtil) (6.34 KB, application/zip)
2003-04-11 03:32 UTC, Toshiaki Kamoshida

Description Toshiaki Kamoshida 2003-04-09 04:01:01 UTC
While researching encoding problems in POI, I found a bug in
org.apache.poi.util.StringUtil#getFromUnicode().

It currently corrupts double-byte characters, because the method does not
handle the bytes the same way StringUtil#getFromUnicodeHigh() does.
Comment 1 Toshiaki Kamoshida 2003-04-09 04:02:41 UTC
Created attachment 5721
Testcase to find this problem.
Comment 2 Toshiaki Kamoshida 2003-04-09 04:03:30 UTC
Created attachment 5722
PATCH to fix the problem.
Comment 3 Toshiaki Kamoshida 2003-04-09 04:09:51 UTC
I feel that the methods #getFromUnicode() and #getFromUnicodeHigh() do not
need to work the way they do now, which is a little complicated.

An easier way to produce the result is:
return String(string,offset,len*2,"UTF-16BE");
and
return string(string,offset,len*2,"UTF-16LE");

How about it? :D
Comment 4 Toshiaki Kamoshida 2003-04-09 09:22:18 UTC
>return String(string,offset,len*2,"UTF-16BE");
return new String(string,offset,len*2,"UTF-16BE");
>return string(string,offset,len*2,"UTF-16LE");
return new String(string,offset,len*2,"UTF-16LE");
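The corrected one-liners above can be tried in isolation. The following is
only a minimal sketch, not the POI code: the class and method names are
hypothetical, and it uses java.nio.charset.StandardCharsets (Java 7+) instead
of the charset-name strings so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

public class Utf16DecodeSketch {
    // Hypothetical helper: decodes `len` 16-bit characters stored low byte first.
    static String fromUnicodeLE(byte[] data, int offset, int len) {
        return new String(data, offset, len * 2, StandardCharsets.UTF_16LE);
    }

    // Hypothetical helper: decodes `len` 16-bit characters stored high byte first.
    static String fromUnicodeBE(byte[] data, int offset, int len) {
        return new String(data, offset, len * 2, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // U+3042 (HIRAGANA LETTER A) followed by 'A' (U+0041), in both byte orders.
        byte[] le = {0x42, 0x30, 0x41, 0x00};
        byte[] be = {0x30, 0x42, 0x00, 0x41};
        System.out.println(fromUnicodeLE(le, 0, 2).equals("\u3042A"));
        System.out.println(fromUnicodeBE(be, 0, 2).equals("\u3042A"));
    }
}
```

Both calls decode the same two characters; only the byte order of the input
differs.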

BTW,
XP!
StringUtil#putUncompressedUnicodeHigh() clearly contains a bug...

public static void putUncompressedUnicodeHigh(final String input,
        final byte[] output,
        final int offset) {
        int strlen = input.length();
        for (int k = 0; k < strlen; k++) {
            char c = input.charAt(k);
>           output[offset + (2 * k)] = (byte) (c >> 8);
>           output[offset + (2 * k)] = (byte) c;
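// what is this??? Both lines write to the same index, so the second
// assignment overwrites the high byte with the low byte.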
        }
}

I don't think any testcase is needed to fix this one...
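For illustration, one plausible correction is sketched below. It assumes the
intent was to write the high byte first and the low byte at the next index, as
the order of the two assignments suggests; the class and method names are
hypothetical, and this is not necessarily the patch that was applied.

```java
public class PutUnicodeSketch {
    // Hypothetical corrected version: writes each char as two bytes,
    // high byte at the even position and low byte at the following one.
    static void putHighByteFirst(String input, byte[] output, int offset) {
        int strlen = input.length();
        for (int k = 0; k < strlen; k++) {
            char c = input.charAt(k);
            output[offset + (2 * k)] = (byte) (c >> 8);  // high byte
            output[offset + (2 * k) + 1] = (byte) c;     // low byte
        }
    }

    public static void main(String[] args) {
        byte[] out = new byte[4];
        putHighByteFirst("\u3042A", out, 0);
        for (byte b : out) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```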
Comment 5 Andy Oliver 2003-04-09 12:33:12 UTC
We once did use new String with UTF-16LE etc. It didn't work. I believe
the UTF-8 and UTF-16 constants are a misnomer. They really just mean "16-bit or
8-bit character set".
Comment 6 Toshiaki Kamoshida 2003-04-09 14:32:35 UTC
Oh, really?
But in my local testcase, it works correctly when reading/writing an Excel file.

I suspect you may have used it incorrectly...

The "UTF-16" encoder/decoder has some special behavior.
When decoding a byte array to a String, it checks the first 2 bytes for an
endianness indicator and decodes the rest as big-endian "UTF-16BE" or
little-endian "UTF-16LE" accordingly. If the first 2 bytes are not an
indicator, the JVM decodes the whole byte array (including the first 2 bytes)
as big-endian "UTF-16BE".

And when encoding a String to a byte array, the JVM always adds 2 bytes as the
endianness indicator at the head of the byte array, and encodes the rest as
big-endian "UTF-16BE".

So if you use "UTF-16" as the charset name, you must take care of these side
effects. "UTF-16" is NOT SYMMETRICAL between decoding and encoding a String,
especially when the byte array is encoded as "UTF-16LE" with no endianness
indicator bytes, like the character sequences in Excel files.
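The asymmetry described above can be observed with a few lines of Java. This
is a small sketch with hypothetical class and helper names; it uses
java.nio.charset.StandardCharsets (Java 7+) in place of the "UTF-16" charset
name.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomSketch {
    // Encodes with the generic "UTF-16" charset, which prepends an indicator.
    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Decodes with the generic "UTF-16" charset, which assumes big-endian
    // when the input does not start with an indicator.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Formats a byte array as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Encoding one character yields four bytes: indicator plus big-endian data.
        System.out.println(hex(encodeUtf16("A")));

        // 'A' as little-endian bytes without an indicator, as found in Excel files.
        byte[] leNoIndicator = {0x41, 0x00};
        String decoded = decodeUtf16(leNoIndicator);
        // The bytes were read as big-endian, so this is U+4100, not U+0041.
        System.out.println(Integer.toHexString(decoded.charAt(0)));
        // Re-encoding does not give back the original two bytes.
        System.out.println(hex(encodeUtf16(decoded)));
    }
}
```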

But "UTF-16BE" and "UTF-16LE" do not behave like that. These encodings simply
do what StringUtil does now (this is part of the J2SE API specification, so we
don't have to worry about it depending on the environment). So we can use these
charset names to encode/decode byte arrays without such effects. They work on
byte arrays that contain:
[16-bit Unicode high byte][16-bit Unicode low byte]... -> "UTF-16BE"
[16-bit Unicode low byte][16-bit Unicode high byte]... -> "UTF-16LE"

As far as I can see, the character sequences in Excel are 16-bit Unicode in
little-endian order, unless the string is COMPRESSED_UNICODE (8-bit).
So in theory it will work correctly, and I have checked that it works well.

I'll submit a sample patch for StringUtil, so please test it in your environment.

BTW,
I feel that many people whose natural language uses ASCII characters often
misunderstand this: "UTF-8" is not an 8-bit character encoding.
It encodes 1 character into 1 to 3 bytes. An ASCII character is encoded into
1 byte, as in "ISO-8859-1", but many Japanese and other DBCS characters in
16-bit Unicode are encoded into 2 or 3 bytes per character. The length of the
byte array is variable, depending on each character's code. The "8" doesn't
mean "8 bits per character" :D
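The variable length is easy to check. The following sketch (hypothetical
class and method names, using java.nio.charset.StandardCharsets from Java 7+)
prints the UTF-8 byte count for an ASCII letter, an accented Latin letter, and
a Japanese hiragana character.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthSketch {
    // Returns how many bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // ASCII letter
        System.out.println(utf8Length("\u00E9"));  // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(utf8Length("\u3042"));  // HIRAGANA LETTER A
    }
}
```

Each input is a single 16-bit character, yet the three byte counts differ.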
Comment 7 Toshiaki Kamoshida 2003-04-09 14:40:22 UTC
Created attachment 5735
Patches to show,charset indicator works good.
Comment 8 Toshiaki Kamoshida 2003-04-09 15:25:03 UTC
The patch I submitted at 2003-04-09 04:03 fixes a simple bug in
org.apache.poi.util.StringUtil#getFromUnicode().
The comment I submitted at 2003-04-09 09:22 reports another bug.

Please evaluate these; they are not related to my LONG comment...
Comment 9 Andy Oliver 2003-04-11 03:06:52 UTC
Please put the new file in a directory-preserving zip relative to the
jakarta-poi module and I'll then apply it (provided the unit tests pass, etc.).
Comment 10 Toshiaki Kamoshida 2003-04-11 03:32:46 UTC
Created attachment 5786
Patches and testcases are zipped.(patching StringUtil and added 4 testcase to TestStringUtil)
Comment 11 Andy Oliver 2003-07-24 16:10:20 UTC
doh...  I dropped the ball on this one.
Comment 12 Yegor Kozlov 2008-12-25 10:33:54 UTC
The patch was applied long ago. I'm closing this bug.
Please try the latest POI 3.5-beta4.

Yegor