Whenever I try to load a simple 500k file, my CPU usage goes up to 98% and my memory slowly gets eaten... very slowly. The file never loads. But when I try this with a much smaller file (between 50k and 100k), I have no problems. This happens regardless of whether I use the usermodel or the eventmodel. Here is the code I am using for the usermodel:

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;

    public class LoadTest {
        public static void main(String[] args) {
            HSSFWorkbook workbook;
            File inputFile = new File("big_test.xls");
            try {
                InputStream fileIn = new BufferedInputStream(new FileInputStream(inputFile));
                workbook = new HSSFWorkbook(fileIn);  // never returns on the 500k file
                fileIn.close();
            } catch (FileNotFoundException ex) {
                ex.printStackTrace();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }

And for the eventmodel, I am using the example from the HSSF howto homepage. It gets this far:

    Encountered workbook
    New sheet named: Sheet1
    New sheet named: Sheet2
    New sheet named: Sheet3

and then it hangs the same way the usermodel did. I have also had this problem on a Windows XP box, and I've tried it with what appears to be a recent CVS build (poi-2.1-20050221). It never works. What is going on here?
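For reference, here is roughly what the eventmodel side looks like, reconstructed from the howto (a sketch only; the exact example on the howto page may differ in details):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
    import org.apache.poi.hssf.eventusermodel.HSSFListener;
    import org.apache.poi.hssf.eventusermodel.HSSFRequest;
    import org.apache.poi.hssf.record.BOFRecord;
    import org.apache.poi.hssf.record.BoundSheetRecord;
    import org.apache.poi.hssf.record.Record;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    public class EventLoadTest implements HSSFListener {
        public void processRecord(Record record) {
            switch (record.getSid()) {
                case BOFRecord.sid:
                    if (((BOFRecord) record).getType() == BOFRecord.TYPE_WORKBOOK) {
                        System.out.println("Encountered workbook");
                    }
                    break;
                case BoundSheetRecord.sid:
                    System.out.println("New sheet named: "
                            + ((BoundSheetRecord) record).getSheetname());
                    break;
            }
        }

        public static void main(String[] args) throws Exception {
            InputStream fin = new FileInputStream("big_test.xls");
            POIFSFileSystem poifs = new POIFSFileSystem(fin);
            InputStream din = poifs.createDocumentInputStream("Workbook");
            HSSFRequest req = new HSSFRequest();
            req.addListenerForAllRecords(new EventLoadTest());
            HSSFEventFactory factory = new HSSFEventFactory();
            factory.processEvents(req, din);  // hangs while parsing the SST record
            din.close();
            fin.close();
        }
    }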
Created attachment 14339: the file that won't open
(In reply to comment #0)

I should add that I think the determining factor is not the file size but the number of rows. I can easily load a file with more total kb but fewer rows. The file that will not load has over 1000 rows.
I can verify that this is indeed the case with the attached file.
(In reply to comment #3)
> I can verify that this is indeed the case with the attached file.

After reporting this bug, I added that I thought the problem was having too many rows in the file, rather than simply too large a file. Since then I've found that creating a file with an equal number of rows, but with numeric data rather than string data, causes no problem. So I'm pretty sure the issue here is having a file with too many strings.
(In reply to comment #4)
> Since then I've found that creating a file with an equal number of rows, but
> with numeric data rather than string data, causes no problem. So, I'm pretty
> sure the issue here is having a file with too many strings.

I have verified this problem under the cited scenario as well. Tracing with JDB shows that the hang occurs when calling the constructor HSSFWorkbook(java.io.InputStream) [example: workbook = new HSSFWorkbook(streamIn);]. The trace never returns from the constructor call, although I have not yet traced any deeper than the call itself at the time of this writing.
The error occurs in SSTDeserializer: it seems to spend all its time in SSTDeserializer.addToStringTable (calling into BinaryTree.put). Debugging, it looks like the issue is duplicate strings. The algorithm in this method is probably quadratic in the number of duplicates of any string: the k-th duplicate has to retry against every padded copy already in the tree. When a file contains many instances of the same string, this becomes a bottleneck!

I'm confused... I thought we had problems with duplicate strings only because of rich text. Are the strings in this file rich text?
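To illustrate the shape of the problem, here is a simplified, hypothetical sketch of the padding workaround, with a plain TreeMap standing in for BinaryTree (the real code catches the exception BinaryTree.put raises on a duplicate key, which is costlier still; names here are illustrative, not the actual POI code):

    import java.util.Map;
    import java.util.TreeMap;

    public class PaddingDemo {
        // Simplified stand-in for SSTDeserializer.addToStringTable: the tree
        // rejects duplicate keys, so the deserializer pads the string with
        // spaces until it no longer collides.
        static void addToStringTable(Map<String, Integer> table, String s, int index) {
            while (table.containsKey(s)) {
                s = s + " ";  // allocates a brand-new, ever-longer string
            }
            table.put(s, index);
        }

        public static void main(String[] args) {
            Map<String, Integer> table = new TreeMap<String, Integer>();
            long start = System.currentTimeMillis();
            // The k-th duplicate collides k times before finding a free key,
            // so n duplicates cost O(n^2) lookups on ever-longer keys.
            for (int i = 0; i < 1000; i++) {
                addToStringTable(table, "same value", i);
            }
            System.out.println(table.size() + " entries in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }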
*** Bug 35849 has been marked as a duplicate of this bug. ***
OK, I know what is happening here. SST records, it seems, *CAN* have duplicate entries @#^&%! This means that the current implementation, which appends spaces to a string to prevent duplicate entries being put into the BinaryTree, is invoked continually, which of course raises an exceptionally large number of exceptions and re-creates a large number of strings, effectively killing performance.

As part of the patch to http://issues.apache.org/bugzilla/show_bug.cgi?id=31906 I have implemented the rich-text stuff in UnicodeRecord, including comparison for uniqueness based on the rich-text data. After taking out the code in SSTDeserializer.addToStringTable which ensured uniqueness, I now get exceptions raised on duplicates. So in actual fact we should allow for duplicates in the SST record. I will look into this over the next few days.

Jason
Are you POSITIVE? I could have sworn we put that in to prevent dupes because they caused problems. If this IS true, then part of me wants to say "de-duplicate" and give the referencing records the other index... While that would help file size, I guess it wouldn't help write times (depending on the length of the string, I guess)... We should still reference existing strings when creating new cells with the same string...
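A hypothetical sketch of that split, with illustrative names (not the actual POI classes): keep duplicates when reading an SST from a file, but hand back the existing index when the API adds a string we already have:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class StringTableSketch {
        private final List<String> strings = new ArrayList<String>();
        private final Map<String, Integer> firstIndex = new HashMap<String, Integer>();

        // Read path: an SST loaded from an Excel file may legally contain
        // duplicates, so just append and index by position.
        void addFromFile(String s) {
            if (!firstIndex.containsKey(s)) {
                firstIndex.put(s, Integer.valueOf(strings.size()));
            }
            strings.add(s);
        }

        // Write/API path: new cells reuse the first matching entry, so files
        // we create stay de-duplicated.
        int addString(String s) {
            Integer existing = firstIndex.get(s);
            if (existing != null) {
                return existing.intValue();
            }
            firstIndex.put(s, Integer.valueOf(strings.size()));
            strings.add(s);
            return strings.size() - 1;
        }
    }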
Well, to assure myself, I opened the file in Excel, re-saved it as a different file, and used the MS BIFFVIEW tool. Sure enough, the SST record contains the same repeated string information, so it seems that Excel doesn't mind duplicates. So YES, I am POSITIVE... 95%, UNLESS I have missed something here! Chalk this up to yet-another-Excel-oddity. Yes, we will continue to avoid duplicates via the API, but we will need to allow for duplicates in "template" files read from Excel. I will look into the difficulty of de-duplicating when I try to get this all to work.

Jason
And this is a file written by Excel and not some defective implementation? Okay...I'm glad you're working on this :-) We've needed rich strings for a looong time.
OK, so I opened the file, saved it as CSV, then opened the CSV and re-saved it as Excel. The "pure" Excel version at the end of this process did not cause errors, i.e. there were no duplicates in the SST record. Interestingly, the file size was larger than the initial file that contained duplicates!!! I have sent emails to the reporters asking whether they are sure that the attached files were created by Excel. At this stage I'll hold off making any changes until I hear back from them.

Jason
Oops. I had saved the file in Excel 95 format as well; that's why it was larger. If I save only as Excel 97, the file size is reduced, but there are still no duplicates.
OK, I am 100% sure that the SST record can contain duplicates. There are even a number of Excel files among our unit-test files that contain duplicates in the SST record. This makes some sense if you look at the fields of the SST record: one is the total number of string references in the workbook, another is the number of *unique* strings in the table. The good news is I'm almost done getting this all solved.

Jason
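For reference, a sketch of pulling out those two counts, assuming the standard BIFF8 layout and POI's LittleEndian helper (field names follow the BIFF documentation; this is illustrative, not POI's parser):

    import org.apache.poi.util.LittleEndian;

    public class SstHeader {
        // Parse the two leading counts of an SST record payload.
        // BIFF8 layout (simplified):
        //   offset 0: cstTotal  (4 bytes) - total string references in the workbook
        //   offset 4: cstUnique (4 bytes) - number of unique strings in the table
        // cstTotal >= cstUnique, and the string array that follows is not
        // required to be duplicate-free.
        static void printCounts(byte[] sstPayload) {
            int cstTotal  = LittleEndian.getInt(sstPayload, 0);
            int cstUnique = LittleEndian.getInt(sstPayload, 4);
            System.out.println("total refs: " + cstTotal + ", unique: " + cstUnique);
        }
    }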
Now works with the HEAD of CVS, albeit slowly (it takes about 30 seconds on my machine to read the file and write it back out). Interestingly, 90% of the time is spent on the writing-out side of things... Hmm, time to look into performance.

Jason
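A round-trip timer along these lines (hypothetical; file names are illustrative) is enough to see the read/write split:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;

    public class RoundTripTimer {
        public static void main(String[] args) throws Exception {
            long t0 = System.currentTimeMillis();
            HSSFWorkbook wb = new HSSFWorkbook(
                    new BufferedInputStream(new FileInputStream("big_test.xls")));
            long t1 = System.currentTimeMillis();
            FileOutputStream out = new FileOutputStream("big_test_copy.xls");
            wb.write(out);
            out.close();
            long t2 = System.currentTimeMillis();
            System.out.println("read:  " + (t1 - t0) + " ms");
            System.out.println("write: " + (t2 - t1) + " ms");
        }
    }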