I have found that if I create one Excel sheet without any data and then try to read it with eventmodel. It failes. The MS Excel opening the file correctly.
Created attachment 2427 [details] The XLS file with Unicode sheet name, its creator and reader. All in one ZIP.
Here is stack from the reader attached above: org.apache.poi.hssf.record.RecordFormatException: Error reading bytes at org.apache.poi.hssf.eventmodel.HSSFEventFactory.genericProcessEvents (HSSFEventFactory.java:241) at org.apache.poi.hssf.eventmodel.HSSFEventFactory.processEvents (HSSFEventFactory.java:139) at ExcelReadingTest.main(ExcelReadingTest.java:111) Exception in thread "main"
This is bad. This is happening when its trying to get 2 more bytes. This probably means that BoundSheetRecord is returning a bogus size and we're trying to read past the end of the "Workbook" stream.
One more interesting note, if I open this in Excel then save it back, POI can read it. It also grows to about 14k.
(from about 6k)
just checked...this now happens with *any* workbook created by HSSF (I ran the "org.apache.poi.hssf.dev.HSSF output.xls write" target and got the same problem!). They are all openable by both BiffViewer and Excel but not the event api. I can't look into this further right now, perhaps someone can take a gander and figure out what is being done differently. Also check to figure out whether this problem exists in 1.5.x (I don't think it does). Most likely candidates: Glen's SSTRecord stuff Sergi's unicode stuff Why? These most dramatically affect record size calculations (not that the effects need to be dramatic) and have modes of operation that change how size is calculated depending on how they're called. Thats my best guess at the moment.
Sleep for me then work so I probably will not be able to look at it for another 16 hours. There's a lot of tests for the recordsize calculation but it's a complex beast and it's entirely possible I broke it in some way.
Created attachment 2490 [details] This is the dump of the bad (formed by HSSF) and good (opened and saved by Excel). Good and bad from the view of the event API. %)
I have maid some dump, but still do not have any idea whay this happens. It seems that the content is correct in both cases, but many extra zero records in one case are suitable for getting them 2 bytes (sid) by 2 bytes (size). In another they ends on the first byte of supposed sid word. %( What do you think about it. Could you explain what are these zero records mean?
This means that one of the record size calculations is off by 2. One of the "getSize()" in records.. So its miscalculating the record size and misinterperating.
But I do not understand a thing: if the record is not last and has incorrect length, then the SID and SIZE of the next record will be shifted and wrong. Am I right? The error in size as I see may be only in the last record with non-zero size. And as you can see in the txt file, for the bad book only 1 byte is available in the input stream, while it wants to get 2. About zero records: why there are so many zero records after the EOF? Please, look at the files. There are not only 2 bytes not enough. The inner structure is correct (as I see) , but the extra zero records are different. :( Why so many zero records and what do they mean? What they for in Excel structure? Sorry, if it is silly question, but I really do not understand and want to understand.
Hey. I'm swamped at the moment and cannot devote sufficent time to this. Its very serious. In the next few days here is what I'm going to do: 1. Test the last "dev" build (1.7-dev) for this error (I assume it is not present) 2. Revert patches applied 1 by 1 (locally) until it no longer occurs 3. Apply back any patches that I confirm are not the cause 4. Apply that to the CVS repository and ask the patch submitter to correct the problem. In the meantime please continue investigating this. -andy
FYI... I've looked at the bound sheet record changes and they appear to be okay as far as I can tell.
Okay, I've just tried this on 1.5.1 and 1.5.0 and got the same problem... This seems to be a long standing issue.
Just a thought... I noticed the HSSFWorkbook class (line 531) check's if the serialized size is < 4096 and sets it equal to 4096 if it is. Could this be the problem and why is it done?
That may be where the 0s are coming from but that is correct behavior. The "Workbook" must be at least 4096. (Long boring reason if you'd like to hear it let me know). However the fact that HSSF is reading into this is usually a result of being off by some bytes in calculated size.
I've confirmed the last set of patches for unicode stuff caused this I'll back that out and post a note to the author to check this code.
Andy, I really dont understand that at all. How could the last set of changes have cuased this? I have confirmed that this exists in 1.5.0 and 1.5.1.
okay there were two sets of problems. I was having trouble reading files outputted by POI in biffviewer (from the head). backing out the unicode patches fixed that.
Created attachment 2519 [details] The bug is still alive. Here is my thought and suggestions on it.
The problem is still alive after the patch was back. In the attach the maker of the BUG and two examples. As I see, it is not the size problem, and not Unicode problem, as Glen maked this problem was before. My assumption and suggestion on this error are below. What do you think about it? My assumption of this error is here: For some reasons we have to make the workbook to be at least 4096 bytes but if we have such workbook we fill the end of it with zeros (many zeros) It is not good: if the length( all zero records ) % 4 = 1 e.g.: any zero record would be readed as 4 bytes at once ( 2 - id and 2 - size ). And the last 1 byte will be readed WRONG ( the id must be 2 bytes ) The suggestion: So we should better to check if the sid is zero and not to read more data The zero sid shows us that rest of the stream data is a fake to make workbook certain size. ( As in the sample HSSFEventFactory.java attached )
The attach is a zip file.
applied, please verify
Created attachment 2865 [details] Unicode support for sheet name, format records and "print titles" of namerecord
In the last attach the checking for zero id in event model, unicode support for sheet name and format record. Also there is the aproache to the name record "print tiltles". As I found it gets the "Print Titles" and writes them down, but Excel can't see them. :( Could you suggest something on this point? May be it is some info in another stream to tell Excel to get these "Print Titles". I am stuck on this...
applied, please crosscheck. Also PLEASE use "cvs diff -u" instead of "cvs diff" next time. (it makes it much easier to verify the patch). This broke several (crappy) asserts in the unit tests... We used to check for file length (which was really important at one time because if the file length was off...we'd screwed something up)...its not very important anymore and these asserts should simply be removed as file length will change. -Andy
Thanks, I checked and it seems that it is OK. I will continue to find the classes that has strings and can be unicode or UTF8. The next is Header and Footer Records. Or may be you have the best idea to work at? :) Also, please, help me with NameRecord. I do not know how works: the data is in the Excel file, but MS Excel is not displaying/applying it. :(
list please. I hate using bugzilla as a mail too.