Excel saves these values differently depending on which other characters are in the cell. If the cell only contains (CASE 1) ÜÖÄüöä and you try to read it:

    HSSFCell cell = (HSSFCell) cellI.next();
    if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
        cell_value = cell.getStringCellValue();
    }

then the resulting String consists entirely of chars with the integer value 65533. On the other hand, if you write other characters into the cell (CASE 2), sometimes they are read correctly. This is because normally the chars (viewed as hexadecimal values) are separated by 00 bytes (CASE 2), but sometimes they are not (CASE 1), and when they are not, the usermodel and eventmodel can't read them. I hope I have missed something about character sets; if not, this is a nasty bug.
What is the value of HSSFCell.getEncoding() on the cells that return Short.MAX_VALUE? In the case where you have umlauts it should be 1, and HSSF should read them as 16-bit characters. If not... that is a most nasty bug in 1.5.1. Glen, correct me if I'm wrong.
Hello again, I ran the good old:

    System.out.println("encoding: " + cell.getEncoding());

and got the same result for every single cell (well, all 7 anyway):

    encoding: 0
I may have left out an important piece of information. Basically, I:

1. Create my Excel sheets using HSSF on a WebLogic server
2. Download them
3. Change the sheets
4. Upload them
5. Read them using HSSF

Actually, if I DON'T make any changes, they are read just fine, but as soon as I press the save button in Excel, the format changes a bit, and HSSF is not able to read the cells with the characters with ascii values above 128. I use Microsoft Excel 2000. /Tom PS. This change is what I tried to explain in the initial bug report, except then I didn't know exactly what happened.
Correction, the cells are read ok, I meant it's the characters IN the cells that are wrong.
eeenteresting... So it sounds like Excel is not updating the encoding flag when it changes to 16 bit. Can you attach to this bug:

1. A sheet created with Excel containing the characters
2. An identical sheet created with HSSF containing those characters
3. A sheet created with HSSF and modified with Excel containing those characters (the broken case)

All 3 sheets should be identical. This will help myself or someone else try and figure out the problem. If you'd like to give it a go, what we'll do is run org.apache.poi.hssf.BiffViewer on them and compare the results via "diff" (or whatever the windoze equivalent is) -Andy
Created attachment 2551 [details] The file made with HSSF and saved in excel
Created attachment 2552 [details] The file made with HSSF, has not been opened in excel, notice it is smaller.
Created attachment 2553 [details] File made with excel
I think what I've observed is that Excel 2000 treats fields like <Über> as normal text, but if there's a trademark in there, Excel changes from the 8-bit to the 16-bit flag, so <Über™> would be treated as 16 bit. Whereas HSSF reads both <Über> (Excel 8 bit) and <Über™> (Excel 16 bit) as 16-bit fields, and so it reads out a REALLY big value for the Ü in <Über> (Excel 8 bit). Correct? /Tom PS. Hope HTML can show "™"... if not, then it's supposed to be a trademark (ascii 0153).
By the way, this bug also occurs in your latest releases.
QUESTION 1: Is there any way for me to tell if there is any progress in tracking/fixing this bug? QUESTION 2: Can I help speed up the process of fixing this bug in some way, more than I already have?
It would be nice if someone who cares would simply write: Yes, my friend, we are working on this bug. or: No we are not working on your bug, so stick it up your **s. Because I'd like to know how YOU feel...!? Thanks in advance, /Tom
One thing I realize I didn't ask for: can you give me the source you used to generate the HSSF version? If you can supply it in the form of a JUnit test it would be perfect. (I'd apply that in advance of fixing it!) The fact that I haven't closed the bug or marked it as fixed implies that I will look at it when I have time. (If you'd like it fixed faster, I can give you a mailing address and you can contribute to paying my mortgage.)
Thanks, that's all I wanted to know. The source is very simple:

    HSSFWorkbook wb = new HSSFWorkbook();
    HSSFSheet sheet = wb.createSheet();
    wb.setSheetName(sheetcounter, sheetName);
    HSSFRow row = sheet.createRow((short) rowcounter);
    HSSFCell cell = row.createCell((short) 0);
    cell.setEncoding(HSSFCell.ENCODING_UTF_16);
    cell.setCellType(HSSFCell.CELL_TYPE_STRING);
    cell.setCellValue(iETI.getLanguage());
    /* multiple cells... */
    byte[] bytes = wb.getBytes();
    POIFSFileSystem fs = new POIFSFileSystem();
    fs.createDocument(new ByteArrayInputStream(bytes), "Workbook");
    ByteArrayOutputStream byteos = new ByteArrayOutputStream();
    fs.writeFilesystem(byteos);
    byteos.close();
    return byteos;

The thing is, as I've explained, the error occurs when you write out the file, manually change it, and read the file again with HSSF. Then the above mentioned errors occur. HSSF has problems recognizing whether the cells are saved as 8 or 16 bit, depending on whether the characters in the cell are between ascii 0-128, 129-159, or 160-255.
So basically, the error that should be fixed is the following. You have 3 different ascii sets:

A) 0-128
B) 129-159
C) 160-255

Cells containing only A should be read as 8 bit.
Cells containing A & B should be read as 16 bit.
Cells containing A & C should be read as 16 bit.
Cells containing A & B & C should be read as 16 bit.
Cells containing B & C should be read as 16 bit.

But HSSF reads cells containing A & C as an 8-bit cell, which is wrong, because Excel handles these as 16 bit.
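The rule above can be sketched as a small check. This is a hypothetical helper illustrating the reporter's observed rule, not POI code; the name `needs16Bit` and the 128 cut-off come from this thread, not from any spec:

```java
public class EncodingRule {
    // Hypothetical helper based on the rule stated above: any character
    // outside set A (0-128), i.e. anything in sets B or C, forces the
    // cell into 16-bit storage.
    static boolean needs16Bit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 128) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(needs16Bit("Uber"));            // only set A -> false
        System.out.println(needs16Bit("\u00DCber"));       // U+00DC is in set C -> true
        System.out.println(needs16Bit("\u00DCber\u2122")); // trademark is above 255 -> true
    }
}
```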
cool, can you submit a patch changing this behavior? Start at org.apache.poi.hssf.record.SSTRecord
Hi Thomas and Andy, I have looked at a recent CVS snapshot. I do not believe that there is a problem when reading in the strings from the attached files. I have traced through the SSTDeserializer class using the BiffViewer and the source code attached below to read the workbook and have found that I can correctly read all cells. Both the 1st and last attachments above correctly read the Uber cell as 8 bit and the tmUber as 16 bit. It is only the second attachment where the Uber is read as 16 bit. Interestingly, the tm character is unicode \u2122 rather than ascii 0153 (which you mention in the bug report); I guess the character set that the sheet was originally created in is something other than ISOLatin-1. I postulate that the only problem here is the fact that a supposed 8-bit string has been written out as 16-bit (i.e. the second attachment). As such, we would need to look at the exact code that created the second attachment (the code attached to the bug doesn't have the values that were being allocated to the cell values). I think the problem would become evident quickly. Jason

<source code>
import java.io.*;
import org.apache.poi.hssf.usermodel.*;

public class Tester {
    public Tester() {
        try {
            HSSFWorkbook wb = new HSSFWorkbook(new FileInputStream("c:/at1.xls"));
            HSSFSheet sheet = wb.getSheetAt(0);
            HSSFRow row = sheet.getRow(0);
            for (int i = 0; i < row.getPhysicalNumberOfCells(); i++) {
                HSSFCell cell = row.getCell((short) i);
                System.out.println("Cell " + i + "=" + cell.getStringCellValue());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static void main(String[] args) {
        Tester tester1 = new Tester();
    }
}
Just a quick comment: Thank you very much for your help. I know I've been annoying, but being an end user, I feel it is my duty. I appreciate how Andy carefully and fondly responds to every one of my blundering e-mails in particular. To the point: I've run a few test programs myself and you're right, Jason, it does write out the correct values on NT. Well, actually it writes out Ö▄Í─³÷õ, not ÜÖÄüöä, in my cmd prompt, but I'm not picky; in the ascii table the values are the correct ones. BUT, on my BEA WebLogic Application Server 6.1, the separate "case C" chars are read as 65533, which is something I'll look into, and report back to you if I find out what the ... is going on. Guess I'll begin with char sets. I really do appreciate the help, and the brilliant software. Who knows, maybe I'll contribute something constructive yet.
By the way, I ALWAYS set the useUTF16 value to true, but I'd like to know where you decide what's READ, because it doesn't matter how you save it, Excel changes the format when you're playing around... I'm looking into it myself right now. The values are wrong when read from the binary tree, so the next thing I'll try to figure out is whether they're wrong when they're put() into it. I'll need a bit more time for that though, since I have to read up on the filesystem.
What I've narrowed it down to is that in the SSTDeserializer, when you call processString and say:

    UnicodeString string = new UnicodeString(UnicodeString.sid,
            (short) unicodeStringBuffer.length, unicodeStringBuffer);
    String chars = string.getString();
    for (int i = 0; i < chars.length(); i++) {
        System.out.print((int) chars.charAt(i) + " ");
    }
    chars = null;

then if there are chars between 160-255 (keep in mind the bytes actually have the int values -1 and -64), they all get converted to the value 65533. I don't know exactly where the conversion takes place, and I don't have time to look any further today, but if you know, please tell me. I'd like to solve this bug. Tomorrow I'll have a good look at the UnicodeString class.
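For context on where that flag lives: in the BIFF8 file format a string starts with a 2-byte character count, then a 1-byte option-flags field whose low bit (fHighByte) says whether the characters are stored as compressed 8-bit bytes or as 16-bit little-endian pairs. A minimal decoding sketch under that assumption — the class, method, and variable names here are illustrative, not POI's actual code:

```java
import java.io.UnsupportedEncodingException;

public class UnicodeStringSketch {
    // Simplified sketch of BIFF8 string decoding: a 2-byte little-endian
    // character count, a 1-byte option-flags field, then the character data.
    static String decode(byte[] data) throws UnsupportedEncodingException {
        int charCount = (data[0] & 0xFF) | ((data[1] & 0xFF) << 8);
        int optionFlags = data[2] & 0xFF;
        boolean highByte = (optionFlags & 0x01) != 0; // the fHighByte bit
        if (highByte) {
            // 16-bit characters, two bytes each, little-endian.
            return new String(data, 3, charCount * 2, "UTF-16LE");
        }
        // Compressed strings are Latin-1 bytes; decoding them with the
        // platform default charset instead is what can produce the
        // 65533 (U+FFFD) characters seen above.
        return new String(data, 3, charCount, "ISO-8859-1");
    }

    public static void main(String[] args) throws Exception {
        // "Über" stored compressed: count=4, flags=0, then Latin-1 bytes.
        byte[] compressed = { 4, 0, 0, (byte) 0xDC, 'b', 'e', 'r' };
        System.out.println(decode(compressed)); // prints: Über
    }
}
```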
wooops, decided to snoop around just a little bit longer, and now I think I've located the error, so now I guess I just have to solve it. In UnicodeString the field_2_optionflags is 0, and the toString conversion is carried out in fillFields(byte[] data, short size), which results in the 65533 chars on my WebLogic server. PS. Sorry to anyone who feels I'm spamming their e-mail account.
Thank you for your patience and insights. I've narrowed it down to one line now, in the 1.5.1 class UnicodeString, in the function fillFields. The line:

    field_3_string = new String(data, 3, getCharCount());

makes a new String regardless of the ISO character set standard, and since I use "ISO-8859-1" it doesn't work. Changing it to:

    field_3_string = new String(data, 3, getCharCount(), "ISO-8859-1");

makes it work though. Guess I'll have to make a few tests now to see if anything else is messed up. Thanks again, Thomas
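The effect of that one-line change can be reproduced with plain JDK calls, independent of POI. A standalone sketch (the byte values are just "Über" encoded in ISO-8859-1; UTF-8 stands in here for any default charset that can't map the 0xDC byte):

```java
import java.io.UnsupportedEncodingException;

public class CharsetDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] data = { (byte) 0xDC, 'b', 'e', 'r' }; // "Über" in ISO-8859-1
        // new String(data) uses the platform default charset; if that charset
        // treats 0xDC as a malformed sequence (as UTF-8 does), the character
        // is replaced with U+FFFD — exactly the 65533 reported in this bug.
        String wrong = new String(data, "UTF-8");
        String right = new String(data, "ISO-8859-1");
        System.out.println((int) wrong.charAt(0)); // 65533
        System.out.println((int) right.charAt(0)); // 220
    }
}
```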
Try this with a recent nightly build. I think its fixed.
Is it fixed? Can I close this bug? Thomas?
no response