Bug 10548 - [PATCH] Unicode Support for excel sheetname.
Status: RESOLVED INVALID
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF
Version: 2.0-dev
Hardware: PC Linux
Importance: P3 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Duplicates: 10777
Depends on:
Blocks:
 
Reported: 2002-07-08 09:14 UTC by SioLam Patrick Lee
Modified: 2004-11-16 19:05 UTC



Attachments
unicode name deserializing offered by Patrick Lee (3.29 KB, patch)
2002-07-12 16:19 UTC, Sergei Kozello
Let user to choose Unicode or no himself (1.15 KB, patch)
2002-07-12 16:20 UTC, Sergei Kozello
Allowing lowlevel to choose Unicode or not (1.10 KB, patch)
2002-07-12 16:21 UTC, Sergei Kozello
The code for fixing unicode sheet name and unittests for it. (44.73 KB, patch)
2002-07-20 22:42 UTC, Sergei Kozello
Unit test for testing BoundSheetRecord (6.71 KB, text/plain)
2002-07-20 22:45 UTC, Sergei Kozello
Tool file for BoundSheetRecordTest (3.36 KB, text/plain)
2002-07-20 22:47 UTC, Sergei Kozello
attachment of Unicode for sheetname & the refactored SSTDeserializer & UnicodeString class (28.14 KB, patch)
2002-07-22 03:28 UTC, SioLam Patrick Lee

Description SioLam Patrick Lee 2002-07-08 09:14:59 UTC
Hi all,
I have made some modifications to BoundSheetRecord.java (ugly ones,
admittedly) to support Unicode (Chinese, in my case) in sheet names.  Could
somebody review them, please?  I am willing to modify and/or refactor the code.

To read Unicode names, this patch extends protected void
fillFields(byte [] data, short size, int offset) so that it interprets the BIFF8
structure as needed.  It reuses SSTDeserializer.manufactureStrings(), which
already interprets the BIFF8 structure correctly.
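
For readers following along, the read path can be sketched without any POI classes. This is only a minimal illustration, not the code in the patch: the class and method names are hypothetical, it assumes the record body layout used by fillFields (name length at byte 6, encoding flag at byte 7, name starting at byte 8), and it assumes 8-bit names can be read as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not part of POI or of the attached patch.
public class SheetNameDecodeSketch {

    // data holds a BoundSheet record body; offset points at its first byte.
    static String decodeSheetName(byte[] data, int offset) {
        int length = data[6 + offset] & 0xFF;   // character count, read as unsigned
        int flag = data[7 + offset];            // bit 0 set => 16-bit characters
        if ((flag & 0x01) == 1) {
            // Uncompressed Unicode: two little-endian bytes per character.
            return new String(data, 8 + offset, length * 2, StandardCharsets.UTF_16LE);
        }
        // Compressed Unicode: one byte per character (assumed Latin-1 here).
        return new String(data, 8 + offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 4 bytes BOF position, 2 bytes flags, length = 2, flag = 1, then U+4E2D U+6587.
        byte[] body = {0, 0, 0, 0, 0, 0, 2, 1, 0x2D, 0x4E, (byte) 0x87, 0x65};
        System.out.println(decodeSheetName(body, 0));
    }
}
```

The point of the sketch is that the flag byte alone decides how many bytes the name occupies, so no general-purpose string table is strictly required for this one field.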

'setSheetname' is also modified to set field_4_compressed_unicode_flag
depending on whether the sheet name needs 16-bit encoding.

To write out a Unicode string, public int serialize(int offset, byte []data)
is extended.
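
The write path and its effect on the record size can be sketched in the same spirit. Again this is a hypothetical, stand-alone illustration rather than the patch itself: the names are invented, 8-bit names are assumed to be Latin-1, and the constant 12 is taken from getRecordSize in the diff (4 header bytes plus 8 fixed body bytes).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not part of POI or of the attached patch.
public class SheetNameEncodeSketch {

    // True if any character does not fit in one byte, so 16-bit encoding is needed.
    static boolean needs16Bit(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) > 255) {
                return true;
            }
        }
        return false;
    }

    // Bytes of the name as they would follow the flag byte in the record.
    static byte[] nameBytes(String name) {
        return name.getBytes(needs16Bit(name)
                ? StandardCharsets.UTF_16LE
                : StandardCharsets.ISO_8859_1);
    }

    // Total record size: 12 fixed bytes plus one or two bytes per character.
    static int recordSize(String name) {
        return 12 + name.length() * (needs16Bit(name) ? 2 : 1);
    }

    public static void main(String[] args) {
        System.out.println(recordSize("Sheet1"));
        System.out.println(recordSize("\u4E2D\u6587"));
    }
}
```

Whatever the real implementation looks like, the size written into the record header, the number of bytes serialize emits and the value getRecordSize returns all have to agree on the same per-character width.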

Attached below is the code.

Thanks
Patrick Lee


? unicodeSheetname.patch
Index: src/java/org/apache/poi/hssf/record/BoundSheetRecord.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/java/org/apache/poi/hssf/record/BoundSheetRecord.java,v
retrieving revision 1.4
diff -u -r1.4 BoundSheetRecord.java
--- src/java/org/apache/poi/hssf/record/BoundSheetRecord.java	1 Mar 2002 13:27:10 -0000	1.4
+++ src/java/org/apache/poi/hssf/record/BoundSheetRecord.java	8 Jul 2002 09:01:22 -0000
@@ -54,7 +54,7 @@
  */
 
 package org.apache.poi.hssf.record;
-
+import org.apache.poi.util.BinaryTree;
 import org.apache.poi.util.LittleEndian;
 import org.apache.poi.util.StringUtil;
 
@@ -117,6 +117,16 @@
         }
     }
 
+    /**
+     *  lifted from SSTDeserializer
+     */
+
+    private void arraycopy( byte[] src, int src_position,
+                            byte[] dst, int dst_position,
+                            int length )
+    {
+        System.arraycopy( src, src_position, dst, dst_position, length );
+    }
     protected void fillFields(byte [] data, short size, int offset)
     {
         field_1_position_of_BOF         = LittleEndian.getInt(data,
@@ -125,8 +135,28 @@
                 4 + offset);
         field_3_sheetname_length        = data[ 6 + offset ];
         field_4_compressed_unicode_flag = data[ 7 + offset ];
-        field_5_sheetname               = new String(data, 8 + offset,
-                LittleEndian.ubyteToInt( field_3_sheetname_length));
+        //field_5_sheetname               = new String(data, 8 + offset,
+        //        LittleEndian.ubyteToInt( field_3_sheetname_length));
+        BinaryTree tempBT = new BinaryTree();
+        SSTDeserializer deserializer;
+        deserializer = new SSTDeserializer(        tempBT);
+        int length = LittleEndian.ubyteToInt( field_3_sheetname_length);
+        if ((field_4_compressed_unicode_flag & 0x01)==1) {
+          byte [] newData = new byte[length*2 +3];
+          arraycopy(data,7+offset,newData,2,length*2+1);
+          LittleEndian.putShort(newData,0,(short)data[6+offset]);
+//          System.out.println("calling manufactureStrings!");
+          deserializer.manufactureStrings(newData,0, (short)(length *2+3));
+//          System.out.println("returned from manufactureStrings!");
+          field_5_sheetname = ((UnicodeString)tempBT.get(new Integer(0))).getString();
+
+          tempBT=null;
+        }
+        else {
+          field_5_sheetname =   new String(data, 8 + offset,
+              LittleEndian.ubyteToInt( field_3_sheetname_length));
+        }
+//        System.out.println("f_5_sn is "+field_5_sheetname);
     }
 
     /**
@@ -175,13 +205,39 @@
     }
 
     /**
+     * Check if String use 16-bit encoding character
+     * Lifted from SSTRecord.addString
+     */
+    public boolean is16bitString(String string)
+    {
+            // scan for characters greater than 255 ... if any are
+            // present, we have to use 16-bit encoding. Otherwise, we
+            // can use 8-bit encoding
+            boolean useUTF16 = false;
+            int strlen = string.length();
+
+            for ( int j = 0; j < strlen; j++ )
+            {
+                if ( string.charAt( j ) > 255 )
+                {
+                    useUTF16 = true;
+                    break;
+                }
+            }
+            return useUTF16 ;
+   }
+    /**
      * Set the sheetname for this sheet.  (this appears in the tabs at the bottom)
      * @param sheetname the name of the sheet
      */
 
     public void setSheetname(String sheetname)
     {
+        boolean is16bit = is16bitString(sheetname);
+        setSheetnameLength((byte) sheetname.length() );
+        setCompressedUnicodeFlag((byte ) (is16bit?1:0));
         field_5_sheetname = sheetname;
+
     }
 
     /**
@@ -263,20 +319,34 @@
     {
         LittleEndian.putShort(data, 0 + offset, sid);
         LittleEndian.putShort(data, 2 + offset,
-                              ( short ) (0x08 + getSheetnameLength()));
+                              ( short ) (0x08 + getSheetnameLength()* (getCompressedUnicodeFlag()==0?1:2)));
         LittleEndian.putInt(data, 4 + offset, getPositionOfBof());
         LittleEndian.putShort(data, 8 + offset, getOptionFlags());
         data[ 10 + offset ] = getSheetnameLength();
         data[ 11 + offset ] = getCompressedUnicodeFlag();
 
-        // we assume compressed unicode (bein the dern americans we are ;-p)
-        StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
+        if (getCompressedUnicodeFlag()==0){
+          // we assume compressed unicode (bein the dern americans we are ;-p)
+          StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
+        }
+        else {
+          try {
+            StringUtil.putUncompressedUnicode(getSheetname(), data, 12 + offset);
+  //          String unicodeString = new String(getSheetname().getBytes("Unicode"),"Unicode");
+  //          StringUtil.putUncompressedUnicode(unicodeString, data, 12 + offset);
+          }
+          catch (Exception e){
+            System.out.println("encoding exception in BoundSheetRecord.serialize!");
+          }
+
+
+        }
         return getRecordSize();
     }
 
     public int getRecordSize()
     {
-        return 12 + getSheetnameLength();
+        return 12 + getSheetnameLength()* (getCompressedUnicodeFlag()==0?1:2);
     }
 
     public short getSid()
Comment 1 Sergei Kozello 2002-07-12 16:19:59 UTC
Created attachment 2331
unicode name deserializing offered by Patrick Lee
Comment 2 Sergei Kozello 2002-07-12 16:20:58 UTC
Created attachment 2332
Let user to choose Unicode or no himself
Comment 3 Sergei Kozello 2002-07-12 16:21:40 UTC
Created attachment 2333
Allowing lowlevel to choose Unicode or not
Comment 4 Sergei Kozello 2002-07-12 16:24:40 UTC
This allows the user to choose whether the sheet name is stored as Unicode or
as compressed Unicode, so it can be used in the usermodel as
hssfWorkbook.setSheetName(0, "UnicodeName", HSSFWorkbook.ENCODING_UTF_16 );
or
hssfWorkbook.setSheetName(0, "NotUnicodeName", HSSFWorkbook.ENCODING_COMPRESSED_UNICODE );
Comment 5 Sergei Kozello 2002-07-13 19:53:27 UTC
Getting and putting a Unicode string is now simple.
Check it and use it. ;)


Index: src/java/org/apache/poi/hssf/record/BoundSheetRecord.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/java/org/apache/poi/hssf/record/BoundSheetRecord.java,v
retrieving revision 1.4
diff -r1.4 BoundSheetRecord.java
57a58,61
> import java.io.*;
> import java.io.UnsupportedEncodingException;
> 
> import org.apache.poi.util.BinaryTree;
59a64
> import sun.awt.image.ByteInterleavedRaster;
118a124,134
>     
>     /**
>      *  UTF8:
>      *	sid + len + bof + flags + len(str) + unicode +   str
> 	 *	 2  +  2  +  4  +   2   +    1     +    1    + len(str)
> 	 * 
> 	 * 	UNICODE:
>      *	sid + len + bof + flags + len(str) + unicode +   str
> 	 *	 2  +  2  +  4  +   2   +    1     +    1    + 2 * len(str)
> 	 * 
>      */
122,130c138,150
<         field_1_position_of_BOF         = LittleEndian.getInt(data,
<                 0 + offset);
<         field_2_option_flags            = LittleEndian.getShort(data,
<                 4 + offset);
<         field_3_sheetname_length        = data[ 6 + offset ];
<         field_4_compressed_unicode_flag = data[ 7 + offset ];
<         field_5_sheetname               = new String(data, 8 + offset,
<                 LittleEndian.ubyteToInt( field_3_sheetname_length));
<     }
---
>         field_1_position_of_BOF         = LittleEndian.getInt(data, 0 + offset);	// bof
>         field_2_option_flags            = LittleEndian.getShort(data, 4 + offset);	// flags
>         field_3_sheetname_length        = data[ 6 + offset ];						// len(str)
>         field_4_compressed_unicode_flag = data[ 7 + offset ];						// unicode
> 
> 		int nameLength = LittleEndian.ubyteToInt( field_3_sheetname_length );
>         if ( ( field_4_compressed_unicode_flag & 0x01 ) == 1 ) {
> 			field_5_sheetname = StringUtil.getFromUnicode( data, 8 + offset, nameLength );
>         }
>         else {
> 			field_5_sheetname = new String( data, 8 + offset, nameLength );
>         }
> 	}
172c192
<     public void setCompressedUnicodeFlag(byte flag)
---
>     public void setCompressedUnicodeFlag( byte flag )
181,182c201,202
< 
<     public void setSheetname(String sheetname)
---
>     
>     public void setSheetname( String sheetname )
218c238,252
<         return field_3_sheetname_length;
---
> 		return field_3_sheetname_length;
>     }
> 
>     /**
>      * get the length of the raw sheetname in characters
>      * the length depends on the unicode flag
>      * 
>      * @return number of characters in the raw sheet name
>      */
> 
>     public byte getRawSheetnameLength()
>     {
> 		return (byte)( ( ( field_4_compressed_unicode_flag & 0x01 ) == 1 )
> 						? 2 * field_3_sheetname_length
> 						: field_3_sheetname_length );
265,266c299
<         LittleEndian.putShort(data, 2 + offset,
<                               ( short ) (0x08 + getSheetnameLength()));
---
>         LittleEndian.putShort( data, 2 + offset, (short)( 8 + getRawSheetnameLength() ) );
269c302
<         data[ 10 + offset ] = getSheetnameLength();
---
>         data[ 10 + offset ] = (byte)( getSheetnameLength() );
270a304,309
>         
>         if ( ( field_4_compressed_unicode_flag & 0x01 ) == 1 )
> 	        StringUtil.putUncompressedUnicode( getSheetname(), data, 12 + offset );
> 	    else
> 	        StringUtil.putCompressedUnicode( getSheetname(), data, 12 + offset );
> 		
272,273d310
<         // we assume compressed unicode (bein the dern americans we are ;-p)
<         StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
274a312,332
>         
> 		/*
> 		byte[] fake = new byte[] {	(byte)0x85, 0x00,			// sid
> 		    							0x1a, 0x00,				// length
> 		    							0x3C, 0x09, 0x00, 0x00,	// bof
> 		    							0x00, 0x00,				// flags
> 		    							0x09,					// len( str )
> 		    							0x01,					// unicode
> 		    							// <str>
> 		    							0x21, 0x04, 0x42, 0x04, 0x40, 0x04, 0x30, 0x04, 0x3D,
> 		    							0x04, 0x38, 0x04, 0x47, 0x04, 0x3A, 0x04, 0x30, 0x04
> 		    							// </str>
> 		    						};
> 		    						
> 		    						sid + len + bof + flags + len(str) + unicode +   str
> 		    						 2  +  2  +  4  +   2   +    1     +    1    + len(str)
> 		
> 		System.arraycopy( fake, 0, data, offset, fake.length );
> 		
> 		return fake.length;
> 		*/
279,280c337,339
<         return 12 + getSheetnameLength();
<     }
---
>         // return 30;
>         return 12 + getRawSheetnameLength();
> 	}
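
As a cross-check of the layout comment in the diff above, the commented-out "fake" record can be walked through by hand. The sketch below is hypothetical and stand-alone (the class and method names are not from POI); it copies the byte values from the patch and only assumes that the name is stored as little-endian 16-bit characters when bit 0 of the flag byte is set.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone check of the sample record quoted in the patch.
public class FakeBoundSheetCheck {

    static final byte[] FAKE = {
        (byte) 0x85, 0x00,        // sid
        0x1a, 0x00,               // length of the record body
        0x3C, 0x09, 0x00, 0x00,   // bof
        0x00, 0x00,               // flags
        0x09,                     // len(str) in characters
        0x01,                     // unicode flag
        0x21, 0x04, 0x42, 0x04, 0x40, 0x04, 0x30, 0x04, 0x3D,
        0x04, 0x38, 0x04, 0x47, 0x04, 0x3A, 0x04, 0x30, 0x04
    };

    // Body length as declared in the 2-byte little-endian length field.
    static int declaredBodyLength(byte[] rec) {
        return (rec[2] & 0xFF) | ((rec[3] & 0xFF) << 8);
    }

    // Body length implied by the layout: 8 fixed bytes plus the name bytes.
    static int expectedBodyLength(byte[] rec) {
        int chars = rec[10] & 0xFF;
        boolean wide = (rec[11] & 0x01) == 1;
        return 8 + chars * (wide ? 2 : 1);
    }

    // Sheet name decoded according to the flag byte.
    static String sheetName(byte[] rec) {
        int chars = rec[10] & 0xFF;
        boolean wide = (rec[11] & 0x01) == 1;
        return wide
                ? new String(rec, 12, chars * 2, StandardCharsets.UTF_16LE)
                : new String(rec, 12, chars, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(FAKE.length + " bytes, body " + declaredBodyLength(FAKE)
                + ", expected " + expectedBodyLength(FAKE));
        System.out.println(sheetName(FAKE));
    }
}
```

With nine 16-bit characters the body is 8 + 18 = 26 bytes (0x1a) and the whole record is 30 bytes, which matches the commented-out "return 30" in getRecordSize.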
Comment 6 Andy Oliver 2002-07-15 01:52:29 UTC
Thank you ever so much for this patch.  In the future please create a single
patch file if possible (makes it easier to apply and inspect), and add yourself
to the @author tags of any class you modify (share the credit, share the blame). 

I attempted to apply the patch; however, I received the following unit test
error after applying it:

testSheetFunctions    Error    N/A

java.lang.NullPointerException
at org.apache.poi.hssf.usermodel.TestFormulas.testSheetFunctions(TestFormulas.java:782)
0.090

Please try running "./build.sh clean compile test"; this will do a clean
build and execute the unit tests.  Let me know if you cannot replicate the problem.

Thanks, -Andy
Comment 7 Sergei Kozello 2002-07-20 22:42:37 UTC
Created attachment 2422
The code for fixing unicode sheet name and unittests for it.
Comment 8 Sergei Kozello 2002-07-20 22:45:28 UTC
Created attachment 2423
Unit test for testing BoundSheetRecord
Comment 9 Sergei Kozello 2002-07-20 22:47:20 UTC
Created attachment 2424
Tool file for BoundSheetRecordTest
Comment 10 Sergei Kozello 2002-07-20 22:49:29 UTC
Attached: patches for StringUtil and BoundSheetRecord, with unit tests for
them.
Comment 11 Andy Oliver 2002-07-21 03:18:35 UTC
So I applied (new) #1, but not #2 and #3 yet.

A few issues:

1. I'm not sure we want individual listeners to wrap the records.  I'm asking
Glen for his opinion.  It does not seem like a bad idea to me, but it's 11:14p so
maybe I'm just tired :-)

2. Regardless of that, I don't like NameListener as the name because there is a
NameRecord and plenty of other things like it.  "SheetNameListener" strikes me
as less ambiguous (confusing).

3. The unit test should be either rewritten or moved/renamed.  It tests the
NameListener, not the BoundSheetRecord, meaning there should be a test that
directly tests the bound sheet.  We have a few unit tests that test
meta-functionality (like "Does POI support formulas and specific ones") for
entire subsystems, but the rest are 1-1 with the class they test.

Thanks for your work.  I'll let you know what Glen and Avik think on #1 (I'll
ask Avik on the list or maybe he'll see this).

-Andy
Comment 12 Andy Oliver 2002-07-21 13:11:57 UTC
*** Bug 10777 has been marked as a duplicate of this bug. ***
Comment 13 SioLam Patrick Lee 2002-07-22 03:24:51 UTC
Hi all

I have refined the Unicode support for sheetname patch.  Included in this patch
is a refactoring of the SSTDeserializer & UnicodeString classes that moves more
of the code relating to the BIFF8 format from the former class to the latter,
where it belongs.  Please review it and consider it for inclusion in the project.

Thanks
Patrick Lee

Note: I mistakenly resubmitted this as bug 10976.
Comment 14 SioLam Patrick Lee 2002-07-22 03:28:39 UTC
Created attachment 2430
attachment of Unicode for sheetname & the refactored SSTDeserializer & UnicodeString class
Comment 15 Andy Oliver 2002-07-28 22:44:28 UTC
I had to back out this code; it is the cause of the current (suspected) size
problems.  To replicate, run the HSSF test pattern

(java org.apache.poi.hssf.dev.HSSF /tmp/outputfile.xls write)

then 

(java org.apache.poi.hssf.dev.BiffViewer /tmp/outputfile.xls) 

or write any file out and then read it.  POI throws some Record Format
exceptions/etc.

Comment 16 Andy Oliver 2002-07-28 22:47:11 UTC
I had to back out this code; it is the cause of the current (suspected) getSize()
problems (http://nagoya.apache.org/bugzilla/show_bug.cgi?id=10393).  To
replicate, run the HSSF test pattern

(java org.apache.poi.hssf.dev.HSSF /tmp/outputfile.xls write)

then 

(java org.apache.poi.hssf.dev.BiffViewer /tmp/outputfile.xls) 

or write any file out and then read it.  POI throws some Record Format
exceptions/etc.