Bug 50939

Summary: RecordFormatException (another)
Product: POI Reporter: Mats Wolpers <MatsWolpers>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.8-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: the smallest file i could manage to show the problem.
Throwing ArrayIndexOutOfBoundsException

Description Mats Wolpers 2011-03-17 03:05:06 UTC
Created attachment 26779 [details]
the smallest file i could manage to show the problem.

Hello,

i have another case of record format exception. i am aware that similar issues have been reported, and that one has been fixed in 3.8.1 beta, but mine is still there with that version.

1. the evidence:
=============
1.1 the file
--------------
i have tried to boil my file down to the absolute minimum, and here it is:


1.2 my code (fragment)
--------------------------
my code that tries to read this is as simple as 
       private static HSSFWorkbook readFile(String filename) throws IOException {
                mfis = new FileInputStream(filename);
                HSSFWorkbook wb = null;
61              try {
62                      wb = new HSSFWorkbook(mfis);
63              } catch ( IOException ioe ) {
                        // blah
                } catch(RecordFormatException rfe){
                        // blahblah
                }
                return wb;
        }

1.3 the stack trace
----------------------
the stack trace is ( copied from eclipse console)
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
        at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
        at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:300)
        at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:270)
        at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:236)
        at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:442)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:298)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:260)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:204)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:340)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:321)
        at org.panda.giggling.in.ChecklistTransformer.readFile(ChecklistTransformer.java:62)
        at org.panda.giggling.in.ChecklistTransformer.readAndTransform(ChecklistTransformer.java:159)
        at org.panda.giggling.Main.extractChecklistsFromFiles(Main.java:339)
        at org.panda.giggling.Main.main(Main.java:113)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (6) bytes
        at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
        at org.apache.poi.hssf.record.RecordInputStream.readFully(RecordInputStream.java:288)
        at org.apache.poi.hssf.record.RecordInputStream.readFully(RecordInputStream.java:284)
        at org.apache.poi.hssf.record.chart.ChartEndObjectRecord.<init>(ChartEndObjectRecord.java:44)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
        ... 13 more

note: i couldn't persuade eclipse to show the 13 more ( is there a trick to this? please tell me )

2. circumstantial notes
=====================
2.-1 i first observed this using poi 3.7 around december. now i discovered 3.8.1beta and tried again. no discernible difference. the above stack trace is against 3.8.1 beta 
(unless i screwed up, which is at least possible if not likely).
2.0 the recently solved issue was about requesting 2 bytes and getting only one ( i think). my numbers are different so maybe this _is_ a different issue 
2.1  my code runs on a win xt box at work. the same problem can however be observed on my home mac.
2.2 the file was generated using excel 2003.
2.3 i tried eliminating that embedded graphic on "Sheet 2", and the crash went away
2.4 one apparent workaround is to take this file to a mac, open it with neo_office, write a blank in an otherwise empty cell, save, remove the blank and save again, and done. the same exorcism does not work using OpenOffice (or indeed Excel2003) on that same win box that held the original file. All cited programs read the file without apparent effort.
2.5 there are other files, similar in structure, but different in content details that are read (by that same code snippet of mine) without problem. in particular, those files also contain embedded graphics like on "sheet3". all these files were forked off the same template some 6 years ago  but have seen profuse editing since. research into the recent change history of the original (non-stripped) file revealed no obvious reasons for suspicion. ( if "sheet 2" still held the orig data, "sheet3" would show bar graphs presenting some aspect of those data. i earnestly believe you don't need to know more. ) 
2.6 i read in some tickets about BiffViewer analysis. I hope i can be excused for not managing to get ahold of BiffViewer (where is it? how is it run? i'm getting too old for this sort of thing...). I found other purported BiffViewers on the web and tried to look at a working file vs this nonworking file (in the pre-reduced state) but could make no sense of what i saw there.
2.7 and finally, i looked at the same working and the non-working file (as in 2.6) using notepad++ (Plugins > Compare) and found one potential strangeness: the working file sported a line (near the end) of VersionCompatible32="393222000", where the nonworking file said  VersionCompatible32="393B".
The long number is all but ubiquitous according to google, the short number was a zero-hit search. A search in an alleged file format spec for ms excel showed no evidence of the term VersionCompatible so this dead-ended, too.

i'd appreciate developer's comments on whether this is indeed a bug and not some rudeness on the part of my own code, please.
many thanks,
mats.
Comment 1 Mats Wolpers 2011-03-17 03:06:04 UTC
one answer received from nick:


On Wed, 16 Mar 2011, Matthias Wolpers wrote:
1.3 the stack trace
----------------------
      at org.apache.poi.hssf.record.chart.ChartEndObjectRecord.<init>(ChartEndObjectRecord.java:44)

This is the bit that matters. There's a problem with creating one of the chart records for your file

In theory, the record should be padded with 6 bytes at the end, but it looks like in your case it isn't. Can you share the problem file so we can look into it?

Also, if you fancy checking the Microsoft docs on this record (it'll be in the [MS-XLS].pdf file, look it up by the sid of 0x0855) and see what they say about the contents of the record, that'd be handy.

2.6 i read in some tickets about BiffViewer analysis. I hope i can be excused for not managing to get ahold of BiffViewer (where is it? how is it run? i'm getting too old for this sort of thing...). I found other purported BiffViewers on the web and tried to look at a working file vs this nonworking file (in the pre-reduced state) but could make no sense of what i saw there.

It's included in the main POI jar. The class is org.apache.poi.hssf.dev.BiffViewer, and it has a main method - just call it with the name of the file to be analyzed

Nick
Comment 2 Mats Wolpers 2011-03-17 03:35:18 UTC
re MS spec: is this the info you wanted?

quote Excel97-2007BinaryFileFormat(xls)Specification.pdf, page 343f:

ENDOBJECT: Chart Future Record Type End Object (855h)
Introduced in Excel 9 (2000), this BIFF record is an FRT record for Charts that indicates the end of an object's scope for Excel 9 and later objects.
Record Data
Offset	Field Name	Size      Contents
4              rt                     2	     Record type; this matches the BIFF rt in the first two 
                                                     bytes of the record; =0855h
6             grbitFrt             2          FRT flags; must be zero
8	       iObjectKin	        2	    Sanity check for object scope being ended
              d
10	      (unused)	        6	    Reserved; must be zero

(continues with CATLAB: Category Labels (856h))

if it isn't, i'm stuck, for "sid" appears only as substring of words such as "inside" or "consider".
Comment 3 Nick Burch 2011-03-17 06:11:19 UTC
(In reply to comment #2)
> re MS spec: is this the info you wanted?

Yup

> 10          (unused)            6        Reserved; must be zero

This tells us that microsoft does expect there to be the 6 trailing bytes in the record. The stack trace you provided tells us that in your file those 6 bytes aren't there...

Do you know how the file was produced?
Comment 4 Mats Wolpers 2011-03-17 14:23:08 UTC
How was that file produced?

1. Cannot be too sure: as i explained in section 2.5, the file's history is long and difficult to track given the fluctuation we've seen in file handling staff. Obviously, nobody left a check-in tag saying : incidentally, i ate them 6 reserve bytes.

2.1 of the former editors, those that are still available for questioning remember nothing noteworthy. As far as can be told now, this file has never been subjected to anything other than treatment by Excel 2003.
2.2 the template the file was made from is still available, and it is sound in the sense that if i re-create the file with today's content straight from the template, the resulting file can be opened through POI without throwing RFEs. In other words, something happened along the road, but it is impossible to say what with any certainty.

3. One potential source of strangeness is the fact that for a while this file was in the hands of another group who confessed to using the "Extras > Arbeitsmappe freigeben" feature ( apologies: my excel speaks German, if i translate this roughly into "Extras > Share workbook", would that be recognizable? ). These people not only did part of their work in a different LAN (imagine: different building on another campus), they carried computers from the one LAN to the other with the file open but the machine in sleep mode, or similar. To my untrained eye ( i haven't studied this feature, my toe nails curl up at the very thought...) , this is a potential source of trouble.
A trivial "2 users hit the same file on some shared drive" test failed to reproduce my original  symptom.
(please understand i'm stealing this research from company time, and their patience will only stretch that far.)

We found one other file that exhibits the same behaviour (want 6, receive 0, throw), and that second file had also been with that other group, which is all the corroboration i can offer. Inconclusive, of course.

4. This probably is a silly question, but bear with me, please: is there anything that my code could do to manipulate the file without actually opening it? If i could get at its content  at all, life might become easy: delete the sheets i don't need (losing the offending data) and proceed. piece of cake.

5. I still wonder why/how all those office programs handle this file so gracefully: they wouldn't just add the missing 6 bytes of prescribed zero content and proceed, would they? i mean, if even i can think of that, there must be something wrong with the idea. right?
Comment 5 Nick Burch 2011-03-18 10:31:09 UTC
Fixed in r1082936.

I've added code to do what I'm fairly sure Excel must also be doing, which is to add the padding back in if it has gone missing. Thanks for all your help diagnosing this one!
Comment 6 pqueixalos 2011-09-26 09:42:55 UTC
Created attachment 27593 [details]
Throwing ArrayIndexOutOfBoundsException