11322 – can't read

Bug 11322 - can't read

Summary: can't read

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HSSF (show other bugs)
Version:	1.5.1
Hardware:	Sun other

Importance:	P3 major (vote)
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2002-07-31 10:04 UTC by Thomas Hessellund Nielsen
Modified:	2004-11-16 19:05 UTC (History)
CC List:	0 users

Attachments
The file made with HSSF and saved in excel (13.50 KB, application/vnd.ms-excel) 2002-08-01 06:47 UTC, Thomas Hessellund Nielsen	Details
The file made with HSSF, has not been opened in excel, notice it is smaller. (5.50 KB, application/vnd.ms-excel) 2002-08-01 06:48 UTC, Thomas Hessellund Nielsen	Details
File made with excel (13.50 KB, application/vnd.ms-excel) 2002-08-01 06:49 UTC, Thomas Hessellund Nielsen	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Thomas Hessellund Nielsen 2002-07-31 10:04:38 UTC

Excel saves these values differently depending on which other characters are in 
the cell, which means if it only says(CASE 1):

ÜÖÄüöä

and you try to read the cell:

	HSSFCell cell = (HSSFCell)cellI.next();
	if(cell.getCellType() == HSSFCell.CELL_TYPE_STRING )
	{	cell_value = cell.getStringCellValue();	}

Then the resulting String is comprised of chars, all with the integer value 
65533.

On the other hand, if you write other characters into the cell(CASE 2), 
sometimes they are read correctly.

This is because normally the chars(as read in hexidecimal values) are seperated 
by 00(CASE 2) chars, but somtimes not. And when they are not(CASE 1), the 
usermodel and eventmodel can't read them.

I hope I missed out on something about character sets or what do I know, if 
not, this is a nasty bug.

Comment 1 Andy Oliver 2002-07-31 12:41:03 UTC

What is the value of HSSFCell.getEncoding() on the Short.MAX_VALUE returning
cells?  In the case where you have umlauts it should be 1 and HSSF should read
it as 16-bit characters.  If not...that is a most nasty bug in 1.5.1.  
Glen, correct me if I'm wrong.

Comment 2 Thomas Hessellund Nielsen 2002-07-31 12:58:33 UTC

Hello again, I made the good old:

System.out.println( "encoding: " + cell.getEncoding() ); 

and got the result for every single cell(well all 7 anyway):

encoding: 0

Comment 3 Thomas Hessellund Nielsen 2002-07-31 13:10:11 UTC

I may have left out an important piece of information:

Basically, I:
Create my Excel sheets using HSSF on a weblogic
Download them
Change the sheets
Upload them
Read them using HSSF.

Actually, if I DON'T make any changes, they are read just fine, but as soon as 
I press the save button in Excel, the format changes a bit, and HSSF is not 
able to read the cells with the characters with ascii values above 128.

I use Microsoft Excel 2000.

/Tom

PS. this change is what I tried to explain in the initial bug report, except 
then I didn't know excaclty what happened.

Comment 4 Thomas Hessellund Nielsen 2002-07-31 13:11:36 UTC

Correction, the cells are read ok, I meant it's the characters IN the cells 
that are wrong.

Comment 5 Andy Oliver 2002-07-31 18:13:27 UTC

eeenteresting...  So it sounds like Excel is not updating the encoding flag when
it changes to 16 bit.  Can you attach to this bug

1. A sheet created with Excel containing the characters
2. An identical sheet created with HSSF containing those characters
3. An sheet created with HSSF and modified with Excel containing those characters
   (the broken case)

All 3 sheets should be identical.  This will help myself or someone else try and
figure out the problem.

If you'd like to give it a go, what we'll do is run
org.apache.poi.hssf.BiffViewer on them and compare the results via "diff" (or
whatever the windoze equivilent is)

-Andy

Comment 6 Thomas Hessellund Nielsen 2002-08-01 06:47:33 UTC

Created attachment 2551 [details]
The file made with HSSF and saved in excel

Comment 7 Thomas Hessellund Nielsen 2002-08-01 06:48:32 UTC

Created attachment 2552 [details]
The file made with HSSF, has not been opened in excel, notice it is smaller.

Comment 8 Thomas Hessellund Nielsen 2002-08-01 06:49:00 UTC

Created attachment 2553 [details]
File made with excel

Comment 9 Thomas Hessellund Nielsen 2002-08-01 08:22:44 UTC

I think what I've observed is that Excel 2000 treats fields like 
<Über>
as normal text, but if there's a Trademark in there, excel changes from 8 to 
the 16 bit flag, so 
<Über™>
would be treated as 16 bit.

Whereas HSSF reads both <Über>(Excel 8 bit) and <Über™>(Excel 16 bit) as 16 bit 
fields, and so it reads out a REALLY big value for the Ü in <Über>(Excel 8 bit).

Correct?

/Tom

PS. hope HTML can show "™"...if not, then it's supposed to be a trademark(ascii 
0153).

Comment 10 Thomas Hessellund Nielsen 2002-08-02 06:22:26 UTC

By the way, this bug also occurs in your latest releases.

Comment 11 Thomas Hessellund Nielsen 2002-08-05 12:15:51 UTC

QUESTION 1: Is there anyway for me to tell if there is any progress in 
tracking/fixing this bug?

QUESTION 2: can I help speed up the process of fixing this bug in some way, 
more then I allready have?

Comment 12 Thomas Hessellund Nielsen 2002-08-16 13:13:56 UTC

It would be nice if someone who cares would simply write:
Yes, my friend, we are working on this bug.
or:
No we are not working on your bug, so stick it up your **s.
Because I'd like to know how YOU feel...!?

Thanks in advance,

/Tom

Comment 13 Andy Oliver 2002-08-16 14:11:43 UTC

One thing I relize I didn't ask for. .  Can you give me the source you used to
generate the HSSF version?  If you can supply it in the form of a junit test it
would be perfect.  (I'd apply that in advance of fixing it!)

The fact that I haven't closed the bug or marked it as fixed implies that I will
look at it when I have time.  (If you'd like it fixed faster, I can give you a
mailing address and you can contribute to paying my mortgage.)

Comment 14 Thomas Hessellund Nielsen 2002-08-20 11:40:27 UTC

Thanks, that's all I wanted to know.

The source is very simple:

HSSFWorkbook wb = new HSSFWorkbook();
HSSFSheet sheet = wb.createSheet();
wb.setSheetName( sheetcounter, sheetName );
HSSFRow row = sheet.createRow((short)rowcounter);
HSSFCell cell = row.createCell((short)0);	
cell.setEncoding( HSSFCell.ENCODING_UTF_16 );
cell.setCellType( HSSFCell.CELL_TYPE_STRING );
cell.setCellValue(iETI.getLanguage());
/*multiple cells...*/
byte[] bytes = wb.getBytes();
POIFSFileSystem fs = new POIFSFileSystem();
fs.createDocument(new ByteArrayInputStream(bytes), "Workbook");
ByteArrayOutputStream byteos = new ByteArrayOutputStream();
fs.writeFilesystem(byteos);
byteos.close();
return byteos;

The thing is, which I've explained, the error occurs when you write out the 
file, manually change it, and read the file again with HSSF. Then the above 
mentioned errors occur. HSSF has propblems recognizing if the cells are saved 
as 8 or 16 bit, depending on if the characters in the cell are between ascii: 0-
128 or 129-159 or 160-255.

Comment 15 Thomas Hessellund Nielsen 2002-08-28 07:10:46 UTC

So basically, the error that should be fixed is the following, you have 3 
different ascii sets:

A) 0-128
B) 129-159
C) 160-255

Cells containing A, should be read as an 8 bit cell.
Cells containing A & B should be read as 16 bit
Cells containing A & C should be read as 16 bit
Cells containing A & B & C should be read as 16 bit
Cells containing B & C should be read as 16 bit

But HSSF reads cells containing A & C as an 8 bit cell. Which is wrong, because 
Excel handles these as 16 bit.

Comment 16 Andy Oliver 2002-08-28 13:55:03 UTC

cool, can you submit a patch changing this behavior?  Start at
org.apache.poi.hssf.record.SSTRecord

Comment 17 Jason Height 2002-08-28 23:30:51 UTC

Hi Thomas and Andy,

I have looked at a recent CVS snapshot.

I do not believe that there is a problem when reading in the strings from the 
attached files. I have traced through the SSTDeserializer class using the 
BiffViewer and the source code attached below to read the workbook and have 
found that I can correctly read all cells.

Both the 1st and last attachments above correctly read the Uber cell as 8bit 
and the tmUber as 16bit. It is only the second attachment where the Uber is 
read as 16bit. Interestingly the tm character is unicode \u2122 rather than 
ascii 0153 (which you mention in the bug report), I guess the character set 
that the sheet was originally created in is something other that ISOLatin-1

I postulate that the only problem here is the fact that a supposed 8bit string 
has been written out as 16bit (ie the second attachment). As such we would need 
to look at the exact code that created the second attachment (The code that is 
attached to the bug doesnt have the values that were being allocated to the 
cell values). I think that the problem would become evident quickly.

Jason

<source code>
import java.io.*;
import org.apache.poi.hssf.usermodel.*;

public class Tester {

  public Tester() {
    try {
    HSSFWorkbook wb = new HSSFWorkbook(new FileInputStream("c:/at1.xls"));
    HSSFSheet sheet = wb.getSheetAt(0);
    HSSFRow row = sheet.getRow(0);
    for (int i=0; i<row.getPhysicalNumberOfCells();i++) {
      HSSFCell cell = row.getCell((short)i);
      System.out.println("Cell "+i+"="+cell.getStringCellValue());
    }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
  public static void main(String[] args) {
    Tester tester1 = new Tester();
  }
}

Comment 18 Thomas Hessellund Nielsen 2002-08-29 07:45:19 UTC

Just a quick comment:

Thank you very much for your help, I know I've been annoying, but being an end 
user, I feel it is my duty. I appreciate how Andy carefully and fondly responds 
to every one of my blundering e-mails in particular.

To the point: 
I've run a few test programs myself and you're right Jason, it does write out 
the correct values on NT, well, actually it writes out Ö&#9604;Í&#9472;³÷õ, not ÜÖÄüöä, in 
my cmdprompt, but I'm not picky. In the ascii table the values are the correct 
ones.
BUT, on my BEA WebLogic Application Server 6.1, the seperate "case C" chars are 
read as 65533, which is something I'll look into, and report back to you with 
if I find out what the ... is going on. Guess I'll begin with char sets.

I really do appreciate the help, and the brilliant software. Who knows, maybe 
I'll contribute with something constructive yet.

Comment 19 Thomas Hessellund Nielsen 2002-08-29 10:13:14 UTC

By the way, I ALLWAYS set the value useUTF16-thing to true, but I'd like to 
know where you decide what's READ, because it doesn't matter how you save it, 
Excel changes the format when you're playing around...

I'm looking into it myself right now, the values are wrong when read from the 
Binary tree, so the next thing I'll try to figure out is if they're wrong when 
they're put() into it. I'll need a bit more time for that though, since I have 
to read up on the filesystem.

Comment 20 Thomas Hessellund Nielsen 2002-08-29 13:54:08 UTC

What I've narrowed it down to, is that in the SSTDeserializer, when you call 
processString, and say:

UnicodeString string = new UnicodeString(UnicodeString.sid,
                                         (short) unicodeStringBuffer.length,
                                         unicodeStringBuffer );

String chars = string.getString();
for( int i = 0; i < chars.length(); i++ ) 
{
System.out.print((int)chars.charAt(i) + " ");
}chars = null;

then if it is a there are chars between 160-255(keep in mind the bytes actually 
have the int value -1 & -64), then they all get converted to the value 65533

I don't know exactly where the conversion takes place, and I don't have time to 
look any further today, but if you know, please tell me. I'd like to solve this 
bug.

Tomorrow I'll have a good look at the UnicodeString class.

Comment 21 Thomas Hessellund Nielsen 2002-08-29 14:12:46 UTC

wooops, decided to snoop around, just a little bit longer, and now I think I've 
located the error, so now I guess I just have to solve it.

In UnicodeString the field_2_optionflags is 0, and the toString conversion is 
carried out in fillFields(byte [] data, short size), which results in the 65533 
chars on my WebLogic server.

PS. Sorry to anyone who feels I'm spamming their e-mail account.

Comment 22 Thomas Hessellund Nielsen 2002-08-30 07:24:38 UTC

thanks you for your patience and insights, I've narrowed it down to one line 
now, in the 1.5.1 class:
UnicodeString
in the function:
fillFields

the line:
field_3_string = new String(data, 3, getCharCount());

makes a new String regardless of the ISO character set standard, and since I 
use "ISO-8859-1" it doesn't work. Adding:

field_3_string = new String(data, 3, getCharCount(), "ISO-8859-1");

makes it work though, guess I'll have to make a few tests now, to see if 
anything else is messed up.

Thanks again,

Thomas

Comment 23 Andy Oliver 2002-12-12 12:12:12 UTC

Try this with a recent nightly build.  I think its fixed.

Comment 24 Avik Sengupta 2002-12-25 17:59:38 UTC

Is it fixed? can i close this bug? Thomas?

Comment 25 Andy Oliver 2003-07-24 13:18:27 UTC

no response