Hi all,

I have made some modifications to BoundSheetRecord.java (ugly ones, though) to support Unicode (Chinese in my case) in sheet names. Could somebody review them, please? I am willing to modify and/or refactor the code.

To read Unicode names, this patch extends protected void fillFields(byte [] data, short size, int offset) so that it interprets the BIFF8 structure as needed. It reuses SSTDeserializer.manufactureStrings(), which already interprets the BIFF8 structure correctly. setSheetname is also modified to set field_4_compressed_unicode_flag depending on whether the sheet name needs 16-bit encoding. To write out Unicode strings, public int serialize(int offset, byte [] data) is extended.

The code is attached below.

Thanks,
Patrick Lee

? unicodeSheetname.patch
Index: src/java/org/apache/poi/hssf/record/BoundSheetRecord.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/java/org/apache/poi/hssf/record/BoundSheetRecord.java,v
retrieving revision 1.4
diff -u -r1.4 BoundSheetRecord.java
--- src/java/org/apache/poi/hssf/record/BoundSheetRecord.java 1 Mar 2002 13:27:10 -0000 1.4
+++ src/java/org/apache/poi/hssf/record/BoundSheetRecord.java 8 Jul 2002 09:01:22 -0000
@@ -54,7 +54,7 @@
  */
 package org.apache.poi.hssf.record;
 
-
+import org.apache.poi.util.BinaryTree;
 import org.apache.poi.util.LittleEndian;
 import org.apache.poi.util.StringUtil;
 
@@ -117,6 +117,16 @@
         }
     }
 
+    /**
+     * lifted from SSTDeserializer
+     */
+
+    private void arraycopy( byte[] src, int src_position,
+                            byte[] dst, int dst_position,
+                            int length )
+    {
+        System.arraycopy( src, src_position, dst, dst_position, length );
+    }
     protected void fillFields(byte [] data, short size, int offset)
     {
         field_1_position_of_BOF         = LittleEndian.getInt(data,
@@ -125,8 +135,28 @@
                 4 + offset);
         field_3_sheetname_length        = data[ 6 + offset ];
         field_4_compressed_unicode_flag = data[ 7 + offset ];
-        field_5_sheetname               = new String(data, 8 + offset,
-                LittleEndian.ubyteToInt( field_3_sheetname_length));
+        //field_5_sheetname               = new String(data, 8 + offset,
+        //        LittleEndian.ubyteToInt( field_3_sheetname_length));
+        BinaryTree tempBT = new BinaryTree();
+        SSTDeserializer deserializer;
+        deserializer = new SSTDeserializer( tempBT);
+        int length = LittleEndian.ubyteToInt( field_3_sheetname_length);
+        if ((field_4_compressed_unicode_flag & 0x01)==1) {
+            byte [] newData = new byte[length*2 +3];
+            arraycopy(data,7+offset,newData,2,length*2+1);
+            LittleEndian.putShort(newData,0,(short)data[6+offset]);
+//          System.out.println("calling manufactureStrings!");
+            deserializer.manufactureStrings(newData,0, (short)(length *2+3));
+//          System.out.println("returned from manufactureStrings!");
+            field_5_sheetname = ((UnicodeString)tempBT.get(new Integer(0))).getString();
+
+            tempBT=null;
+        }
+        else {
+            field_5_sheetname = new String(data, 8 + offset,
+                    LittleEndian.ubyteToInt( field_3_sheetname_length));
+        }
+//      System.out.println("f_5_sn is "+field_5_sheetname);
     }
 
     /**
@@ -175,13 +205,39 @@
     }
 
     /**
+     * Check if String use 16-bit encoding character
+     * Lifted from SSTRecord.addString
+     */
+    public boolean is16bitString(String string)
+    {
+        // scan for characters greater than 255 ... if any are
+        // present, we have to use 16-bit encoding. Otherwise, we
+        // can use 8-bit encoding
+        boolean useUTF16 = false;
+        int strlen = string.length();
+
+        for ( int j = 0; j < strlen; j++ )
+        {
+            if ( string.charAt( j ) > 255 )
+            {
+                useUTF16 = true;
+                break;
+            }
+        }
+        return useUTF16 ;
+    }
+    /**
      * Set the sheetname for this sheet.  (this appears in the tabs at the bottom)
      * @param sheetname the name of the sheet
      */
     public void setSheetname(String sheetname)
     {
+        boolean is16bit = is16bitString(sheetname);
+        setSheetnameLength((byte) sheetname.length() );
+        setCompressedUnicodeFlag((byte ) (is16bit?1:0));
         field_5_sheetname = sheetname;
+
     }
 
     /**
@@ -263,20 +319,34 @@
     {
         LittleEndian.putShort(data, 0 + offset, sid);
         LittleEndian.putShort(data, 2 + offset,
-                ( short ) (0x08 + getSheetnameLength()));
+                ( short ) (0x08 + getSheetnameLength()* (getCompressedUnicodeFlag()==0?1:2)));
         LittleEndian.putInt(data, 4 + offset, getPositionOfBof());
         LittleEndian.putShort(data, 8 + offset, getOptionFlags());
         data[ 10 + offset ] = getSheetnameLength();
         data[ 11 + offset ] = getCompressedUnicodeFlag();
-        // we assume compressed unicode (bein the dern americans we are ;-p)
-        StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
+        if (getCompressedUnicodeFlag()==0){
+            // we assume compressed unicode (bein the dern americans we are ;-p)
+            StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
+        }
+        else {
+            try {
+                StringUtil.putUncompressedUnicode(getSheetname(), data, 12 + offset);
+                // String unicodeString = new String(getSheetname().getBytes("Unicode"),"Unicode");
+                // StringUtil.putUncompressedUnicode(unicodeString, data, 12 + offset);
+            }
+            catch (Exception e){
+                System.out.println("encoding exception in BoundSheetRecord.serialize!");
+            }
+
+
+        }
 
         return getRecordSize();
     }
 
     public int getRecordSize()
     {
-        return 12 + getSheetnameLength();
+        return 12 + getSheetnameLength()* (getCompressedUnicodeFlag()==0?1:2);
     }
 
     public short getSid()
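The encoding decision and the size arithmetic in the patch above can be tried in isolation. The following is only a minimal sketch, not POI code: the class SheetNameEncodingSketch and the method recordSize are hypothetical names made up for illustration, and the logic simply mirrors what is16bitString and the patched getRecordSize appear to do (12 fixed bytes plus one or two bytes per character).

```java
/** Hypothetical stand-alone helper; not part of POI. */
final class SheetNameEncodingSketch {

    /** True if any character is above 255, i.e. the name cannot be stored one byte per character. */
    static boolean is16BitString(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 255) {
                return true;
            }
        }
        return false;
    }

    /**
     * Total record size as computed by the patched getRecordSize():
     * 12 fixed bytes (sid, length, bof, flags, name length, unicode flag)
     * plus 1 or 2 bytes per character of the name.
     */
    static int recordSize(String name) {
        int bytesPerChar = is16BitString(name) ? 2 : 1;
        return 12 + name.length() * bytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println(recordSize("Sheet1"));              // 8-bit name
        System.out.println(recordSize("\u5de5\u4f5c\u8868"));  // 16-bit (Chinese) name
    }
}
```

Note that the sheet name length field is a single byte, so this arithmetic only holds for names short enough to fit in it.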
Created attachment 2331 [details] unicode name deserializing offered by Patrick Lee
Created attachment 2332 [details] Let user to choose Unicode or no himself
Created attachment 2333 [details] Allowing lowlevel to choose Unicode or not
This allows the user to choose whether the sheet name is written as Unicode or as compressed Unicode, so it can be used in the usermodel as

hssfWorkbook.setSheetName(0, "UnicodeName", HSSFWorkbook.ENCODING_UTF_16 );

or

hssfWorkbook.setSheetName(0, "NotUnicodeName", HSSFWorkbook.ENCODING_COMPRESSED_UNICODE );
Getting and putting Unicode strings is now simple. Check it and use it. ;)

Index: src/java/org/apache/poi/hssf/record/BoundSheetRecord.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/java/org/apache/poi/hssf/record/BoundSheetRecord.java,v
retrieving revision 1.4
diff -r1.4 BoundSheetRecord.java
57a58,61
> import java.io.*;
> import java.io.UnsupportedEncodingException;
>
> import org.apache.poi.util.BinaryTree;
59a64
> import sun.awt.image.ByteInterleavedRaster;
118a124,134
>
>     /**
>      *  UTF8:
>      *  sid + len + bof + flags + len(str) + unicode +   str
>      *   2  +  2  +  4  +   2   +    1     +    1    + len(str)
>      *
>      *  UNICODE:
>      *  sid + len + bof + flags + len(str) + unicode +   str
>      *   2  +  2  +  4  +   2   +    1     +    1    + 2 * len(str)
>      *
>      */
122,130c138,150
<         field_1_position_of_BOF         = LittleEndian.getInt(data,
<                 0 + offset);
<         field_2_option_flags            = LittleEndian.getShort(data,
<                 4 + offset);
<         field_3_sheetname_length        = data[ 6 + offset ];
<         field_4_compressed_unicode_flag = data[ 7 + offset ];
<         field_5_sheetname               = new String(data, 8 + offset,
<                 LittleEndian.ubyteToInt( field_3_sheetname_length));
<     }
---
>         field_1_position_of_BOF         = LittleEndian.getInt(data, 0 + offset);    // bof
>         field_2_option_flags            = LittleEndian.getShort(data, 4 + offset);  // flags
>         field_3_sheetname_length        = data[ 6 + offset ];                       // len(str)
>         field_4_compressed_unicode_flag = data[ 7 + offset ];                       // unicode
>
>         int nameLength = LittleEndian.ubyteToInt( field_3_sheetname_length );
>         if ( ( field_4_compressed_unicode_flag & 0x01 ) == 1 ) {
>             field_5_sheetname = StringUtil.getFromUnicode( data, 8 + offset, nameLength );
>         }
>         else {
>             field_5_sheetname = new String( data, 8 + offset, nameLength );
>         }
>     }
172c192
<     public void setCompressedUnicodeFlag(byte flag)
---
>     public void setCompressedUnicodeFlag( byte flag )
181,182c201,202
<
<     public void setSheetname(String sheetname)
---
>
>     public void setSheetname( String sheetname )
218c238,252
<         return field_3_sheetname_length;
---
>         return field_3_sheetname_length;
>     }
>
>     /**
>      * get the length of the raw sheetname in characters
>      * the length depends on the unicode flag
>      *
>      * @return number of characters in the raw sheet name
>      */
>
>     public byte getRawSheetnameLength()
>     {
>         return (byte)( ( ( field_4_compressed_unicode_flag & 0x01 ) == 1 )
>                        ? 2 * field_3_sheetname_length
>                        : field_3_sheetname_length );
265,266c299
<         LittleEndian.putShort(data, 2 + offset,
<                 ( short ) (0x08 + getSheetnameLength()));
---
>         LittleEndian.putShort( data, 2 + offset, (short)( 8 + getRawSheetnameLength() ) );
269c302
<         data[ 10 + offset ] = getSheetnameLength();
---
>         data[ 10 + offset ] = (byte)( getSheetnameLength() );
270a304,309
>
>         if ( ( field_4_compressed_unicode_flag & 0x01 ) == 1 )
>             StringUtil.putUncompressedUnicode( getSheetname(), data, 12 + offset );
>         else
>             StringUtil.putCompressedUnicode( getSheetname(), data, 12 + offset );
>
272,273d310
<         // we assume compressed unicode (bein the dern americans we are ;-p)
<         StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
274a312,332
>
>         /*
>         byte[] fake = new byte[] { (byte)0x85, 0x00,        // sid
>                                    0x1a, 0x00,              // length
>                                    0x3C, 0x09, 0x00, 0x00,  // bof
>                                    0x00, 0x00,              // flags
>                                    0x09,                    // len( str )
>                                    0x01,                    // unicode
>                                    // <str>
>                                    0x21, 0x04, 0x42, 0x04, 0x40, 0x04, 0x30, 0x04, 0x3D,
>                                    0x04, 0x38, 0x04, 0x47, 0x04, 0x3A, 0x04, 0x30, 0x04
>                                    // </str>
>                                  };
>
>         sid + len + bof + flags + len(str) + unicode +   str
>          2  +  2  +  4  +   2   +    1     +    1    + len(str)
>
>         System.arraycopy( fake, 0, data, offset, fake.length );
>
>         return fake.length;
>         */
279,280c337,339
<         return 12 + getSheetnameLength();
<     }
---
>         // return 30;
>         return 12 + getRawSheetnameLength();
>     }
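To make the record layout described in the patch comment concrete, here is a small stand-alone sketch that writes and reads the same byte layout without any POI classes. It is an illustration under assumptions, not the patch itself: BoundSheetLayoutSketch, serialize and readName are hypothetical names, java.nio.ByteBuffer stands in for LittleEndian/StringUtil, and compressed names are treated as ISO-8859-1 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of the layout in the patch comment; not POI code. */
final class BoundSheetLayoutSketch {
    static final short SID = 0x85;

    /** sid(2) + len(2) + bof(4) + flags(2) + len(str)(1) + unicode(1) + str */
    static byte[] serialize(int bof, int flags, String name, boolean unicode) {
        // Uncompressed names take two little-endian bytes per character,
        // compressed names one byte per character (assumed ISO-8859-1 here).
        byte[] nameBytes = name.getBytes(
                unicode ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1);
        ByteBuffer buf = ByteBuffer.allocate(12 + nameBytes.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort(SID);
        buf.putShort((short) (8 + nameBytes.length)); // body length, excluding sid and len
        buf.putInt(bof);
        buf.putShort((short) flags);
        buf.put((byte) name.length());                // length in characters, not bytes
        buf.put((byte) (unicode ? 1 : 0));
        buf.put(nameBytes);
        return buf.array();
    }

    /** Reads the sheet name back from a full record (header included). */
    static String readName(byte[] record) {
        int length = record[10] & 0xFF;
        boolean unicode = (record[11] & 0x01) == 1;
        return unicode
                ? new String(record, 12, length * 2, StandardCharsets.UTF_16LE)
                : new String(record, 12, length, StandardCharsets.ISO_8859_1);
    }
}
```

Serializing the nine-character Cyrillic name encoded in the commented-out fake array with bof 0x093C and flags 0 should reproduce those 30 bytes, which is consistent with the commented-out "return 30".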
Thank you ever so much for this patch. In the future, please create a single patch file if possible (it makes the patch easier to apply and inspect), and add yourself to the @author tags of any class you modify (share the credit, share the blame).

I attempted to apply the patch, but I received the following unit test error afterwards:

testSheetFunctions  Error  N/A
java.lang.NullPointerException
    at org.apache.poi.hssf.usermodel.TestFormulas.testSheetFunctions(TestFormulas.java:782)
0.090

Please try running "./build.sh clean compile test"; this will do a clean build and execute the unit tests. Let me know if you cannot replicate the problem.

Thanks,
-Andy
Created attachment 2422 [details] The code for fixing unicode sheet name and unittests for it.
Created attachment 2423 [details] Unit test for testing BoundSheetRecord
Created attachment 2424 [details] Tool file for BoundSheetRecordTest
Attached: patches for StringUtil and BoundSheetRecord, with unit tests for them.
So I applied (new) #1, but not #2 and #3 yet. A few issues:

1. I'm not sure we want individual listeners to wrap the records. I'm asking Glen for his opinion. It does not seem like a bad idea to me, but it's 11:14p, so maybe I'm just tired :-)

2. Regardless of that, I don't like NameListener as the name, because there is a NameRecord and plenty of other things like it. "SheetNameListener" strikes me as less ambiguous (confusing).

3. The unit test should be either rewritten or moved/renamed. It tests the NameListener, not the BoundSheetRecord, meaning there should be a test that directly tests the bound sheet. We have a few unit tests that test meta-functionality (like "Does POI support formulas and specific ones") for entire subsystems, but the rest are 1-1 with the class they test.

Thanks for your work. I'll let you know what Glen and Avik think on #1 (I'll ask Avik on the list, or maybe he'll see this).

-Andy
*** Bug 10777 has been marked as a duplicate of this bug. ***
Hi all,

I have refined the Unicode sheet name patch. This patch also refactors the SSTDeserializer and UnicodeString classes, moving more of the BIFF8-related code from the former to the latter, where it belongs. Please review it and consider it for inclusion in the project.

Thanks,
Patrick Lee

Note: I mistakenly resubmitted this as bug 10976.
Created attachment 2430 [details] attachment of Unicode for sheetname & the refactored SSTDeserializer & UnicodeString class
I had to back out this code; it is the suspected cause of the current getSize() problems (http://nagoya.apache.org/bugzilla/show_bug.cgi?id=10393). To replicate, run the HSSF test pattern (java org.apache.poi.hssf.dev.HSSF /tmp/outputfile.xls write) and then the viewer (java org.apache.poi.hssf.dev.BiffViewer /tmp/outputfile.xls), or write any file out and then read it back. POI throws Record Format exceptions and similar errors.
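For anyone wondering why a size mismatch in one record surfaces as format errors when the file is read back, the following sketch may help. It is not POI's reader; RecordStreamWalkSketch and countRecords are hypothetical names, and the only assumption is the framing already described in this bug (2-byte sid, 2-byte length, then that many body bytes). If a record declares a length that differs from the bytes actually written, every header after it is read from the wrong offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical illustration of sid/length framing; not POI's record reader. */
final class RecordStreamWalkSketch {

    /** Walks sid(2) + len(2) + body(len) records; throws if the framing does not line up. */
    static int countRecords(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int count = 0;
        while (buf.hasRemaining()) {
            if (buf.remaining() < 4) {
                throw new IllegalStateException(
                        "truncated record header at offset " + buf.position());
            }
            int sid = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF;
            if (length > buf.remaining()) {
                throw new IllegalStateException(
                        "record 0x" + Integer.toHexString(sid) + " declares " + length
                        + " bytes but only " + buf.remaining() + " remain");
            }
            buf.position(buf.position() + length); // skip the record body
            count++;
        }
        return count;
    }
}
```

A stream whose first record declares one byte fewer than it wrote makes the walker read the next header one byte early, so the following sid and length are garbage.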