Bug 27394

Summary: A space may be appended after a character when Enclosed Alphanumerics characters are read.
Product: POI
Reporter: Kenji Sasaki <sasaki>
Component: HSSF
Assignee: POI Developers List <dev>
Severity: normal    
Priority: P3    
Version: 2.0-FINAL   
Target Milestone: ---   
Hardware: Other   
OS: Windows XP   
Attachments: Excel file for this problem.
Excel file for this problem.
Excel file for this problem.

Description Kenji Sasaki 2004-03-03 05:33:56 UTC
When Windows special characters (①, ②, etc.) are read, a space may be
appended after the character.

Steps to reproduce:

1. Prepare two or more cells of CellType "Numeric" and cells of CellType "String".
2. Insert special characters such as ①, ②, and ③ into the cells.
3. Read the value with HSSFCell.getStringCellValue().
4. Print it with System.out.println("[" + value + "]");
5. In some cases a space appears between the value and "]" (e.g. [① ]).
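
To tell whether the extra character in step 5 is really U+0020 or some other code point, it may help to print the code points of the returned value. The following is a minimal sketch using only the JDK; the class name TrailingCharDump and the sample string are hypothetical, and the sample merely stands in for whatever HSSFCell.getStringCellValue() returns.

```java
public class TrailingCharDump {
    /** Lists each code point of the string as U+XXXX, separated by spaces. */
    public static String codePoints(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical value standing in for the string read from the cell.
        String value = "\u2460 ";
        System.out.println("[" + value + "] -> " + codePoints(value));
    }
}
```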

I have not been able to work out the pattern that triggers the problem.
Comment 1 Michael Zalewski 2004-03-05 04:36:34 UTC
I am not sure which characters are being reported as broken. But I have noticed 
a similar problem:

1.  Using Excel, set a sheet name with one of the non-ISO 8859-1 characters in 
Windows ANSI Code Page 1252. For example, this includes s caron and z caron 
(but not a caron). Caron is the little cap diacritic drawn above the character. 
Don't use any non CP 1252 characters.

2. Try to read the sheet name from POI. The character will show as a little 
square (because it is an invalid code point).

3. If you use Biff Viewer, you will see that the sheet name has been stored by 
Excel as a compressed 8 bit string.

4. If you look at the string that HSSF returns for the sheet name, you will see 
that, for example, lower case z caron is converted by HSSF into \u009e, which 
is actually not a valid code point. The proper representation of z caron is 

When I read the original report, I cannot tell which 'special characters' are 
being used. (They appear as little boxes in my browser). I suspect that 
the 'special characters' may be those in ASCII range 0x80 - 0x9f, which cannot 
be translated to Unicode simply by assuming that the high order byte is zero.
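
The difference can be illustrated with a small JDK-only sketch (not POI code; the class and method names are hypothetical). It assumes the byte 0x9E, which is lower case z caron in code page 1252, and compares zero-extending the byte with decoding it through the windows-1252 charset. Zero-extension yields U+009E, a C1 control character rather than a letter, while the charset decoder yields U+017E.

```java
import java.nio.charset.Charset;

public class Cp1252Decode {
    /** Zero-extends each byte to a char, i.e. treats the data as ISO 8859-1. */
    public static String zeroExtend(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes the bytes using the Windows-1252 code page. */
    public static String decodeCp1252(byte[] data) {
        return new String(data, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        byte[] zCaron = { (byte) 0x9E };
        System.out.printf("zero-extended: U+%04X%n", (int) zeroExtend(zCaron).charAt(0));
        System.out.printf("windows-1252:  U+%04X%n", (int) decodeCp1252(zCaron).charAt(0));
    }
}
```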

Also a question for sasaki: what is your Windows code page (if it is not 
1252)? Attaching a sample spreadsheet would probably help.

Did I even get this right? The problem is that a special character is entered 
inside a cell, with Excel. When HSSF reads the value of that cell, it displays 
the character as a little box, or possibly a space.
Comment 2 Kenji Sasaki 2004-03-05 07:01:43 UTC
The summary has been changed.
Comment 3 Kenji Sasaki 2004-03-05 07:19:40 UTC
I'm sorry. The characters I showed above were garbled.

The special characters I meant are the Enclosed Alphanumerics characters 
in the Unicode range \u2460 - \u24FF.

The code page which I use is Win 932.  
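
For anyone scanning cell values for affected characters, the range above corresponds to the JDK's Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS block. A minimal sketch, with a hypothetical class and method name:

```java
public class EnclosedCheck {
    /** Returns true if the code point lies in the Enclosed Alphanumerics block. */
    public static boolean isEnclosedAlphanumeric(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS;
    }

    public static void main(String[] args) {
        System.out.println(isEnclosedAlphanumeric(0x2460));
        System.out.println(isEnclosedAlphanumeric('A'));
    }
}
```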

Comment 4 Kenji Sasaki 2004-03-05 07:28:24 UTC
Created attachment 10671 [details]
Excel file for this problem.
Comment 5 Kenji Sasaki 2004-03-05 07:29:25 UTC
Created attachment 10672 [details]
Excel file for this problem.
Comment 6 Kenji Sasaki 2004-03-05 07:31:09 UTC
Created attachment 10673 [details]
Excel file for this problem.
Comment 7 Kenji Sasaki 2004-03-05 07:35:25 UTC
I'm sorry. All the above attached files are the same files.
Comment 8 Michael Zalewski 2004-03-06 01:40:07 UTC
The string contains 'Far East Info', which I don't think is handled properly by 

Here is what the String element looks like inside the SST record (beginning at 
offset 0x08)

01 00         Length of String = 1 character
05            Flags Far East Info, Unicode Characters
10 00 00 00   Length of Far East Info = 16 bytes
60 24         Unicode code points = \u2460 = ①
01 00 0C 00   Far East Info (Undocumented 16 bytes)
05 00 35 00
00 00 00 00
00 00 00 00

I looked at the class org.apache.poi.hssf.record.UnicodeString, and I believe 
this class does not understand Far East information Strings (Option = 0x04 
through 0x07).

This class assumes that the first byte of the String will be at offset 3 from 
the beginning of the SST element. But it's not that simple. The options flag at 
offset 2 may indicate that the String contains 'Far East Information' (if the 
bit at 0x04 is set). In this case, the length of the Far East Information is at 
offset 3, and the first character of the string begins at offset 7.
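
The layout described above can be sketched as follows. This is not the UnicodeString implementation, only an illustration with hypothetical names that parses the bytes from the dump. It assumes bit 0x01 of the options byte means 16-bit characters and bit 0x04 means a 4-byte Far East length follows. Other option bits (for example rich text) are deliberately not handled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SstStringSketch {
    static final int FLAG_16BIT = 0x01;    // characters are stored as 16-bit units
    static final int FLAG_FAR_EAST = 0x04; // a 4-byte Far East length precedes the characters

    /** Reads the characters, skipping the Far East length field when the flag is set. */
    public static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int charCount = buf.getShort() & 0xFFFF; // offset 0: length in characters
        int flags = buf.get() & 0xFF;            // offset 2: options flag
        if ((flags & FLAG_FAR_EAST) != 0) {
            buf.getInt();                        // offset 3: Far East length, skipped here
        }
        return readChars(buf, charCount, flags);
    }

    /** Naive variant that assumes the characters always start at offset 3. */
    public static String readIgnoringFarEast(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int charCount = buf.getShort() & 0xFFFF;
        int flags = buf.get() & 0xFF;
        return readChars(buf, charCount, flags);
    }

    private static String readChars(ByteBuffer buf, int charCount, int flags) {
        StringBuilder sb = new StringBuilder(charCount);
        for (int i = 0; i < charCount; i++) {
            boolean wide = (flags & FLAG_16BIT) != 0;
            sb.append(wide ? buf.getChar() : (char) (buf.get() & 0xFF));
        }
        return sb.toString();
    }

    /** The SST string element from the dump in this comment. */
    public static byte[] sample() {
        return new byte[] {
            0x01, 0x00, 0x05, 0x10, 0x00, 0x00, 0x00, 0x60, 0x24,
            0x01, 0x00, 0x0C, 0x00, 0x05, 0x00, 0x35, 0x00,
            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
        };
    }
}
```

With the sample bytes, read() finds the character at offset 7, whereas readIgnoringFarEast() picks up the first two bytes of the Far East length instead.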

Also, the code page present in the spreadsheet is 1200.
Comment 9 Yegor Kozlov 2008-12-29 08:48:36 UTC
The reported problem is not reproducible with the latest trunk.
Please try the latest 3.5-beta4 or download daily builds from http://encore.torchbox.com/poi-svn-build/