Bug 33681

Summary: Cannot load 500kb file with usermodel or event model
Product: POI
Component: HSSF
Status: RESOLVED FIXED
Severity: major
Priority: P3
Version: 2.5-FINAL
Target Milestone: ---
Hardware: PC
OS: Linux
Reporter: Mike Serra <mserra>
Assignee: POI Developers List <dev>
CC: dave
Attachments: The file that won't open

Description Mike Serra 2005-02-22 00:41:53 UTC
Whenever I try to load a simple, 500k file, I find that my CPU usage goes up to
98% and my memory slowly gets eaten... very slowly.  The file never loads.  But
when I try to do this with a much smaller file (between 50-100k), I have no
problems.  This happens regardless of whether I use the usermodel or the
eventmodel.  Here is the code I am using for the usermodel:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;

public class LoadTest {

  public static void main(String[] args) {
    HSSFWorkbook workbook;
    File inputFile = new File("big_test.xls");
    try {
      InputStream fileIn = new BufferedInputStream(new FileInputStream(inputFile));
      workbook = new HSSFWorkbook(fileIn);
      fileIn.close();
    } catch (FileNotFoundException ex) {
      ex.printStackTrace();
    } catch (IOException ex) {
      ex.printStackTrace();
    }
  }
}

And for the eventmodel, I am using the example from the HSSF howto homepage.
It gets this far:

Encountered workbook
New sheet named: Sheet1
New sheet named: Sheet2
New sheet named: Sheet3

And then it hangs the same way the usermodel did.
I have also had this problem on a Windows XP box, and I've tried it with what
appears to be a recent CVS build (poi-2.1-20050221).  It never works.  What is
going on here?
Comment 1 Mike Serra 2005-02-22 00:52:24 UTC
Created attachment 14339 [details]
The file that won't open
Comment 2 Mike Serra 2005-02-22 00:53:59 UTC
I should add that I think the determining factor is not the file size but the
number of rows.  I can easily load a file with more total KB but fewer rows.
The file that will not load has over 1000 rows.

Comment 3 Avik Sengupta 2005-05-25 20:25:03 UTC
I can verify that this is indeed the case with the attached file. 
Comment 4 Mike Serra 2005-05-25 20:56:18 UTC
(In reply to comment #3)
> I can verify that this is indeed the case with the attached file. 

After reporting this bug, I added that I thought the problem was having too many
rows in the file, rather than simply too large of a file.

Since then I've found that creating a file with an equal number of rows, but
with numeric data rather than string data, causes no problem.  So, I'm pretty
sure the issue here is having a file with too many strings.
Comment 5 James Doe 2005-05-29 22:07:13 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > I can verify that this is indeed the case with the attached file. 
> 
> After reporting this bug, I added that I thought the problem was having too many
> rows in the file, rather than simply too large of a file.
> 
> Since then I've found that creating a file with an equal number of rows, but
> with numeric data rather than string data, causes no problem.  So, I'm pretty
> sure the issue here is having a file with too many strings.

I have verified this problem under the cited scenario as well.  Tracing with JDB
shows the hang occurs when calling the constructor
HSSFWorkbook(java.io.InputStream) [ example: workbook = new
HSSFWorkbook(streamIn); ].  The trace never returns from the constructor call
although I have not yet traced any deeper than the call to the constructor
itself at the time of this writing.
Comment 7 Avik Sengupta 2005-07-06 13:13:06 UTC
The error occurs in SSTDeserializer: it seems to spend all its time in
SSTDeserializer.addToStringTable (calling into BinaryTree.put).

Debugging, it looks like the issue is due to duplicate strings. The algorithm
in this method is probably polynomial in the number of duplicates of any
string. When a file contains many instances of the same string, this becomes a
bottleneck!

I'm confused... I thought we had problems with duplicate strings only because
of rich text. Are the strings in this file rich text?
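To illustrate the blow-up described above, here is a minimal, self-contained
sketch (not POI code; the class, method, and key names are invented for
illustration) of a pad-with-spaces scheme like the one in addToStringTable,
run against a map that rejects duplicate keys. The k-th duplicate of a string
has to probe k padded keys before finding a free one, so total work grows
quadratically with the number of duplicates:

```java
import java.util.TreeMap;

public class DuplicatePaddingDemo {
    // Count how many containsKey probes the pad-with-spaces scheme
    // performs when the same string is inserted n times.
    static int countRetries(int n) {
        TreeMap<String, Integer> table = new TreeMap<>();
        int retries = 0;
        for (int i = 0; i < n; i++) {
            String key = "repeated";
            while (table.containsKey(key)) {
                key = key + " ";  // pad until unique
                retries++;
            }
            table.put(key, i);
        }
        return retries;
    }

    public static void main(String[] args) {
        // Total probes grow quadratically: n*(n-1)/2.
        System.out.println(countRetries(1000)); // prints 499500
    }
}
```

With 1000 copies of one string the scheme performs 499,500 probes, each of
which also allocates a new padded String, which matches the "slowly eaten
memory" symptom in the original report.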
Comment 8 Amol Deshmukh 2005-07-25 18:23:55 UTC
*** Bug 35849 has been marked as a duplicate of this bug. ***
Comment 9 Jason Height 2005-07-28 06:24:54 UTC
Ok, I know what is happening here. SST records, it seems, *CAN* have duplicate
entries @#^&%!

This means that the current implementation, which adds spaces onto the string
to prevent duplicate entries being put into the BinaryTree, is called
continually! That of course raises an exceptionally large number of exceptions
and re-creates a large number of strings, effectively killing performance.

As part of the patch to http://issues.apache.org/bugzilla/show_bug.cgi?id=31906
I have implemented the rich text support in UnicodeRecord, including comparison
for uniqueness based on the rich text data.  After taking out the code in
SSTDeserializer.addToStringTable which ensured uniqueness, I now get exceptions
being raised for duplicates.

So in actual fact we should allow for duplicates in the SST record. I will look
into this over the next few days.

Jason
Comment 10 Andy Oliver 2005-07-28 08:16:54 UTC
Are you POSITIVE?  I could have sworn we put that in to prevent dupes because
they caused problems.  If this IS true, then part of me wants to say
"de-duplicate" and give the referencing records the other index... While that
would help file size, I guess it wouldn't help write times (depending on the
length of the string, I guess)...

We should still reference existing strings when creating new cells with the
same string...
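As a hedged sketch of the behavior being discussed (hypothetical class and
method names, not POI's actual API): a string table can tolerate duplicate
entries when parsed from a file, while still handing API callers the index of
the first occurrence so that new cells reuse existing strings:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the real POI SST classes.
public class SstSketch {
    private final List<String> strings = new ArrayList<>();
    private final Map<String, Integer> firstIndex = new HashMap<>();

    // Used when parsing: duplicates are stored as-is, each at its own index,
    // so a "template" file written by Excel round-trips unchanged.
    int addFromFile(String s) {
        strings.add(s);
        firstIndex.putIfAbsent(s, strings.size() - 1);
        return strings.size() - 1;
    }

    // Used by the API: a cell with a string already in the table gets the
    // index of the first occurrence instead of creating a new entry.
    int addViaApi(String s) {
        Integer idx = firstIndex.get(s);
        if (idx != null) {
            return idx;
        }
        return addFromFile(s);
    }

    String get(int index) { return strings.get(index); }
    int size() { return strings.size(); }
}
```

This keeps reads tolerant of Excel's duplicates while writes through the API
still avoid creating new ones.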
Comment 11 Jason Height 2005-07-28 08:41:06 UTC
Well, to assure myself, I opened the file in Excel, re-saved it as a different
file, and used the MS BIFFVIEW. Sure enough, the SST record contains the same
repeated string information.

So it seems that Excel doesn't mind duplicates. So YES, I am POSITIVE... 95%,
UNLESS I have missed something here! Chalk this up to yet-another-Excel-oddity.

Yes, we will continue to avoid duplicates via the API, but we will need to
allow for duplicates in "template" files read from Excel.

I will look into the difficulty of de-duplicating when I try to get this all
to work.

Jason
Comment 12 Andy Oliver 2005-07-28 08:47:30 UTC
And this is a file written by Excel and not some defective implementation?  

Okay...I'm glad you're working on this :-)  We've needed rich strings for a
looong time.
Comment 13 Jason Height 2005-07-29 00:16:24 UTC
Ok, so I opened the file, saved it as CSV, then opened the CSV and re-saved it
as Excel.

The "pure" Excel version at the end of this process did not cause errors, i.e.
there were no duplicates in the SST record. Interestingly, the file size was
larger than the initial file that contained duplicates!!!

I have sent emails to the reporters asking whether they are sure that the
attached files were created by Excel. At this stage I'll hold off making any
changes until I hear back from them.

Jason
Comment 14 Jason Height 2005-07-29 00:18:42 UTC
Oops. I saved the file in the Excel 95 format as well; that's why it was
larger. If I save only as Excel 97, the file size is reduced, but there are
still no duplicates.
Comment 15 Jason Height 2005-08-04 01:20:49 UTC
Ok, I am 100% sure that the SST record can contain duplicates.

There are even a number of Excel files among our unit test files that contain
duplicates in the SST record.

This makes some sense: look at the fields of the SST record. One is the total
number of strings in the record; another is the number of *unique* strings in
the record.

The good news is I'm almost done getting this all solved.

Jason
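Following the comment's description of those two header fields, a small sketch
(assuming a plain list of SST entries; the class and method names are invented,
and this is not the actual BIFF record layout) of how the two counts can
differ only when duplicates are legal:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the two counts described in the comment above:
// a total-strings count and a unique-strings count over the same entries.
public class SstCounts {
    // Returns {total, unique} for a list of SST entries.
    static int[] counts(List<String> sstEntries) {
        Set<String> unique = new HashSet<>(sstEntries);
        return new int[] { sstEntries.size(), unique.size() };
    }
}
```

If the two fields were always required to be equal, there would be no reason
for the format to store both; a file with duplicates is exactly the case where
they diverge.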
Comment 16 Jason Height 2005-08-22 05:23:48 UTC
Now works with the HEAD of CVS, albeit slowly (it takes about 30 seconds on my
machine to read the file and write it back out).

Interestingly, 90% of the time is spent on the writing-out side of things...
Hmm, time to look into performance.

Jason