Summary: | A space may appear after a character when Enclosed Alphanumerics characters are read. | ||
---|---|---|---|
Product: | POI | Reporter: | Kenji Sasaki <sasaki> |
Component: | HSSF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P3 | ||
Version: | 2.0-FINAL | ||
Target Milestone: | --- | ||
Hardware: | Other | ||
OS: | Windows XP | ||
Attachments: | Excel file for this problem. (attachments 10671, 10672, 10673) | ||
Description
Kenji Sasaki
2004-03-03 05:33:56 UTC
I am not sure which characters are being reported as broken, but I have noticed a similar problem:

1. Using Excel, set a sheet name containing one of the non-ISO 8859-1 characters in Windows ANSI code page 1252. For example, this includes s caron and z caron (but not a caron). The caron is the little cap diacritic drawn above the character. Don't use any non-CP 1252 characters.
2. Try to read the sheet name from POI. The character will show as a little square (because it is an invalid code point).
3. If you use Biff Viewer, you will see that Excel has stored the sheet name as a compressed 8-bit string.
4. If you look at the string that HSSF returns for the sheet name, you will see that, for example, lower case z caron is converted by HSSF into \u009e, which is actually not a valid code point. The proper representation of z caron is \u017e.

When I read the original report, I cannot tell which 'special characters' are being used (they appear as little boxes in my browser). I suspect that the 'special characters' may be those in the 8-bit range 0x80 - 0x9f, which cannot be translated to Unicode simply by assuming that the high-order byte is zero.

Also, a question for sasaki: he should report what his Windows code page is (if it is not 1252). Attaching a sample spreadsheet would probably help.

Did I even get this right? The problem is that a special character is entered inside a cell with Excel. When HSSF reads the value of that cell, it displays the character as a little box, or possibly a space.

The summary has been changed. I'm sorry. The characters which I showed above were broken. The special characters which I meant are the Enclosed Alphanumerics characters in the Unicode range \u2460 - \u24FF. The code page which I use is Windows 932.

Created attachment 10671 [details]
Excel file for this problem.
Created attachment 10672 [details]
Excel file for this problem.
Created attachment 10673 [details]
Excel file for this problem.
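The earlier point about bytes in the range 0x80 - 0x9f can be illustrated with a short sketch. It is not HSSF code: the class and method names are hypothetical, and it assumes the JDK in use provides the `windows-1252` charset. It only contrasts zero-extending each byte (equivalent to decoding as ISO 8859-1) with decoding through code page 1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of POI).
final class Cp1252Sketch {

    // Treats every byte as the low byte of a code point whose high byte is zero.
    // This is what decoding as ISO 8859-1 does.
    static String zeroExtend(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes through Windows code page 1252, which maps
    // most of 0x80 - 0x9f to printable characters outside Latin-1.
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }
}
```

For the byte 0x9e, `zeroExtend` yields \u009e while `decodeCp1252` yields \u017e (lower case z caron). Whether Excel actually writes code page 1252 bytes into a compressed string is the commenter's hypothesis, not something this sketch establishes.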
I'm sorry. All the above attached files are the same file.

The string contains 'Far East Info', which I don't think is handled properly by HSSF. Here is what the String element looks like inside the SST record (beginning at offset 0x08):

    01 00          Length of String = 1 character
    05             Flags: Far East Info, Unicode Characters
    10 00 00 00    Length of Far East Info = 16 bytes
    60 24          Unicode code point = \u2460 = ①
    01 00 0C 00 05 00 35 00 00 00 00 00 00 00 00 00
                   Far East Info (undocumented, 16 bytes)

I looked at the class org.apache.poi.hssf.record.UnicodeString, and I believe this class does not understand Far East information Strings (Option = 0x04 through 0x07). This class assumes that the first byte of the String will be at offset 3 from the beginning of the SST element. But it's not that simple. The options flag at offset 2 may indicate that the String contains 'Far East Information' (if the bit at 0x04 is set). In this case, the length of the Far East Information is at offset 3, and the first character of the string begins at offset 7.

Also, the code page present in the spreadsheet is 1200.

The reported problem is not reproducible with the latest trunk. Please try the latest 3.5-beta4 or download daily builds from http://encore.torchbox.com/poi-svn-build/

Yegor
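The SST string layout described in the comment above can be sketched as a small standalone parser. This is not the POI implementation or its fix: the class, constants, and method are hypothetical names. It handles only the two flags the comment discusses (0x01 for 16-bit characters, 0x04 for Far East Info), rejects rich-text strings, and decodes compressed strings as ISO 8859-1, which is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of reading one string element from an SST record.
final class SstStringSketch {
    static final int FLAG_UNICODE = 0x01;   // characters are stored as 16 bits each
    static final int FLAG_FAR_EAST = 0x04;  // a Far East Info block is present
    static final int FLAG_RICH_TEXT = 0x08; // formatting runs present (not handled here)

    static String readString(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(offset);

        int charCount = buf.getShort() & 0xFFFF; // offsets 0-1: length in characters
        int flags = buf.get() & 0xFF;            // offset 2: option flags

        if ((flags & FLAG_RICH_TEXT) != 0) {
            throw new UnsupportedOperationException("rich text is outside this sketch");
        }
        if ((flags & FLAG_FAR_EAST) != 0) {
            // Offsets 3-6: byte length of the Far East Info block. Reading it
            // moves the first character from offset 3 to offset 7. The block
            // itself follows the characters and is ignored here.
            buf.getInt();
        }

        boolean wide = (flags & FLAG_UNICODE) != 0;
        byte[] chars = new byte[wide ? charCount * 2 : charCount];
        buf.get(chars);

        // Assumption: compressed (8-bit) strings are decoded as ISO 8859-1.
        return new String(chars,
                wide ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1);
    }
}
```

Called on the 25 bytes listed in the dump with an offset of 0, `readString` returns the single character \u2460. A reader that starts at offset 3 regardless of the flags would instead interpret the first two bytes of the Far East length, 10 00, as the character \u0010.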