|Summary:||I have crosschecked the changes|
|Product:||POI||Reporter:||Sergei Kozello <sergeikozello>|
|Component:||HSSF||Assignee:||POI Developers List <dev>|
The XLS file with Unicode sheet name, its creator and reader. All in one ZIP.
This is the dump of the bad (formed by HSSF) and good (opened and saved by Excel). Good and bad from the view of the event API. %)
The bug is still alive. Here is my thought and suggestions on it.
Unicode support for sheet name, format records and "print titles" of namerecord
Description Sergei Kozello 2002-07-20 22:40:01 UTC
I have found that if I create one Excel sheet without any data and then try to read it with eventmodel. It failes. The MS Excel opening the file correctly.
Comment 1 Sergei Kozello 2002-07-21 10:57:38 UTC
Created attachment 2427 [details] The XLS file with Unicode sheet name, its creator and reader. All in one ZIP.
Comment 2 Sergei Kozello 2002-07-21 10:59:32 UTC
Here is stack from the reader attached above: org.apache.poi.hssf.record.RecordFormatException: Error reading bytes at org.apache.poi.hssf.eventmodel.HSSFEventFactory.genericProcessEvents (HSSFEventFactory.java:241) at org.apache.poi.hssf.eventmodel.HSSFEventFactory.processEvents (HSSFEventFactory.java:139) at ExcelReadingTest.main(ExcelReadingTest.java:111) Exception in thread "main"
Comment 3 Andy Oliver 2002-07-21 13:10:18 UTC
This is bad. This is happening when its trying to get 2 more bytes. This probably means that BoundSheetRecord is returning a bogus size and we're trying to read past the end of the "Workbook" stream.
Comment 4 Andy Oliver 2002-07-21 13:21:25 UTC
One more interesting note, if I open this in Excel then save it back, POI can read it. It also grows to about 14k.
Comment 5 Andy Oliver 2002-07-21 13:21:44 UTC
(from about 6k)
Comment 6 Andy Oliver 2002-07-21 13:42:47 UTC
just checked...this now happens with *any* workbook created by HSSF (I ran the "org.apache.poi.hssf.dev.HSSF output.xls write" target and got the same problem!). They are all openable by both BiffViewer and Excel but not the event api. I can't look into this further right now, perhaps someone can take a gander and figure out what is being done differently. Also check to figure out whether this problem exists in 1.5.x (I don't think it does). Most likely candidates: Glen's SSTRecord stuff Sergi's unicode stuff Why? These most dramatically affect record size calculations (not that the effects need to be dramatic) and have modes of operation that change how size is calculated depending on how they're called. Thats my best guess at the moment.
Comment 7 Glen Stampoultzis 2002-07-21 13:59:15 UTC
Sleep for me then work so I probably will not be able to look at it for another 16 hours. There's a lot of tests for the recordsize calculation but it's a complex beast and it's entirely possible I broke it in some way.
Comment 8 Sergei Kozello 2002-07-25 19:42:24 UTC
Created attachment 2490 [details] This is the dump of the bad (formed by HSSF) and good (opened and saved by Excel). Good and bad from the view of the event API. %)
Comment 9 Sergei Kozello 2002-07-25 19:45:29 UTC
I have maid some dump, but still do not have any idea whay this happens. It seems that the content is correct in both cases, but many extra zero records in one case are suitable for getting them 2 bytes (sid) by 2 bytes (size). In another they ends on the first byte of supposed sid word. %( What do you think about it. Could you explain what are these zero records mean?
Comment 10 Andy Oliver 2002-07-26 12:08:50 UTC
This means that one of the record size calculations is off by 2. One of the "getSize()" in records.. So its miscalculating the record size and misinterperating.
Comment 11 Sergei Kozello 2002-07-26 16:52:21 UTC
But I do not understand a thing: if the record is not last and has incorrect length, then the SID and SIZE of the next record will be shifted and wrong. Am I right? The error in size as I see may be only in the last record with non-zero size. And as you can see in the txt file, for the bad book only 1 byte is available in the input stream, while it wants to get 2. About zero records: why there are so many zero records after the EOF? Please, look at the files. There are not only 2 bytes not enough. The inner structure is correct (as I see) , but the extra zero records are different. :( Why so many zero records and what do they mean? What they for in Excel structure? Sorry, if it is silly question, but I really do not understand and want to understand.
Comment 12 Andy Oliver 2002-07-26 17:04:32 UTC
Hey. I'm swamped at the moment and cannot devote sufficent time to this. Its very serious. In the next few days here is what I'm going to do: 1. Test the last "dev" build (1.7-dev) for this error (I assume it is not present) 2. Revert patches applied 1 by 1 (locally) until it no longer occurs 3. Apply back any patches that I confirm are not the cause 4. Apply that to the CVS repository and ask the patch submitter to correct the problem. In the meantime please continue investigating this. -andy
Comment 13 Glen Stampoultzis 2002-07-27 01:49:27 UTC
FYI... I've looked at the bound sheet record changes and they appear to be okay as far as I can tell.
Comment 14 Glen Stampoultzis 2002-07-27 12:06:40 UTC
Okay, I've just tried this on 1.5.1 and 1.5.0 and got the same problem... This seems to be a long standing issue.
Comment 15 Glen Stampoultzis 2002-07-28 11:41:29 UTC
Just a thought... I noticed the HSSFWorkbook class (line 531) check's if the serialized size is < 4096 and sets it equal to 4096 if it is. Could this be the problem and why is it done?
Comment 16 Andy Oliver 2002-07-28 14:13:53 UTC
That may be where the 0s are coming from but that is correct behavior. The "Workbook" must be at least 4096. (Long boring reason if you'd like to hear it let me know). However the fact that HSSF is reading into this is usually a result of being off by some bytes in calculated size.
Comment 17 Andy Oliver 2002-07-28 22:28:37 UTC
I've confirmed the last set of patches for unicode stuff caused this I'll back that out and post a note to the author to check this code.
Comment 18 Glen Stampoultzis 2002-07-29 00:01:16 UTC
Andy, I really dont understand that at all. How could the last set of changes have cuased this? I have confirmed that this exists in 1.5.0 and 1.5.1.
Comment 19 Andy Oliver 2002-07-29 00:21:48 UTC
okay there were two sets of problems. I was having trouble reading files outputted by POI in biffviewer (from the head). backing out the unicode patches fixed that.
Comment 20 Sergei Kozello 2002-07-29 19:13:59 UTC
Created attachment 2519 [details] The bug is still alive. Here is my thought and suggestions on it.
Comment 21 Sergei Kozello 2002-07-29 19:14:51 UTC
The problem is still alive after the patch was back. In the attach the maker of the BUG and two examples. As I see, it is not the size problem, and not Unicode problem, as Glen maked this problem was before. My assumption and suggestion on this error are below. What do you think about it? My assumption of this error is here: For some reasons we have to make the workbook to be at least 4096 bytes but if we have such workbook we fill the end of it with zeros (many zeros) It is not good: if the length( all zero records ) % 4 = 1 e.g.: any zero record would be readed as 4 bytes at once ( 2 - id and 2 - size ). And the last 1 byte will be readed WRONG ( the id must be 2 bytes ) The suggestion: So we should better to check if the sid is zero and not to read more data The zero sid shows us that rest of the stream data is a fake to make workbook certain size. ( As in the sample HSSFEventFactory.java attached )
Comment 22 Sergei Kozello 2002-07-29 19:15:24 UTC
The attach is a zip file.
Comment 23 Andy Oliver 2002-08-15 20:02:18 UTC
applied, please verify
Comment 24 Sergei Kozello 2002-08-29 08:34:27 UTC
Created attachment 2865 [details] Unicode support for sheet name, format records and "print titles" of namerecord
Comment 25 Sergei Kozello 2002-08-29 08:39:58 UTC
In the last attach the checking for zero id in event model, unicode support for sheet name and format record. Also there is the aproache to the name record "print tiltles". As I found it gets the "Print Titles" and writes them down, but Excel can't see them. :( Could you suggest something on this point? May be it is some info in another stream to tell Excel to get these "Print Titles". I am stuck on this...
Comment 26 Andy Oliver 2002-09-02 02:14:02 UTC
applied, please crosscheck. Also PLEASE use "cvs diff -u" instead of "cvs diff" next time. (it makes it much easier to verify the patch). This broke several (crappy) asserts in the unit tests... We used to check for file length (which was really important at one time because if the file length was off...we'd screwed something up)...its not very important anymore and these asserts should simply be removed as file length will change. -Andy
Comment 27 Sergei Kozello 2002-09-03 19:29:12 UTC
Thanks, I checked and it seems that it is OK. I will continue to find the classes that has strings and can be unicode or UTF8. The next is Header and Footer Records. Or may be you have the best idea to work at? :) Also, please, help me with NameRecord. I do not know how works: the data is in the Excel file, but MS Excel is not displaying/applying it. :(
Comment 28 Andy Oliver 2002-09-04 13:38:41 UTC
list please. I hate using bugzilla as a mail too.