Bug 56440

Summary: removeRow in XSSF is much slower than with HSSF
Product: POI Reporter: ki <tersitain>
Component: XSSFAssignee: POI Developers List <dev>
Status: RESOLVED WONTFIX    
Severity: normal CC: tersitain
Priority: P2    
Version: 3.10-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: test-file

Description ki 2014-04-22 01:55:15 UTC
API: XSSF
file: *.xlsx

When i use removeRow or shiftRows to handle some rows with FormulaCell,
its almost cost 2Sec for one Row.

Do the samething use HSSF is fast.
Comment 1 Nick Burch 2014-04-22 02:07:38 UTC
HSSF is often a tiny bit faster than XSSF, but rarely more than a few times

Are you able to use a profiler to work out where the extra time on XSSF goes?

Also, does it happen for simple files too? i.e. is this related to one specific class of file, or all files?
Comment 2 ki 2014-04-22 02:41:26 UTC
thx for answer me.
Juet the simple file whth the FormulaCell[=IF((COUNT(U224,U225,U226,U227))>0,ROUND(SUM(U224,U225,U226,U227),6),0)].
The file has 2000rows and 100 cols,size is 1.3 MB,not ON ie.

to delete 20rows for xlsx cost 40sec,but xls is 1sec....

the test soure is under..

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ReadExcelSheetTest {
    public static void main(final String[] args) throws InvalidFormatException {
        try {
            String templatePath = "C:\\temp\\test.xlsx";
            FileInputStream fis = null;
            fis = new FileInputStream(templatePath);
            Workbook workbook = WorkbookFactory.create(fis);
            Sheet sheet = workbook.getSheet("a");

            for (int i = 20; i >= 1; i--) {
                Row row = sheet.getRow(i);
                if (row != null) {
                    sheet.removeRow(row);
                    System.out.println("-----" + i);
                }
            }
            workbook.write(new FileOutputStream("C:\\temp\\3.xlsx"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Comment 3 ki 2014-04-22 04:34:39 UTC
thx for answer me.
just the simple file whth the FormulaCell[=IF((COUNT(U224,U225,U226,U227))>0,ROUND(SUM(U224,U225,U226,U227),6),0)].
The file has 2000rows and 100 cols,size is 1.3 MB,not ON ie.

to delete 20rows for xlsx cost 40sec,but xls is 1sec....

the test soure is under..

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ReadExcelSheetTest {
    public static void main(final String[] args) throws InvalidFormatException {
        try {
            String templatePath = "C:\\temp\\test.xlsx";
            FileInputStream fis = null;
            fis = new FileInputStream(templatePath);
            Workbook workbook = WorkbookFactory.create(fis);
            Sheet sheet = workbook.getSheet("a");

            for (int i = 20; i >= 1; i--) {
                Row row = sheet.getRow(i);
                if (row != null) {
                    sheet.removeRow(row);
                    System.out.println("-----" + i);
                }
            }
            workbook.write(new FileOutputStream("C:\\temp\\3.xlsx"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Comment 4 Nick Burch 2014-04-22 12:25:53 UTC
Are you able to use a profiler to work out where the extra time goes on XSSF?
Comment 5 ki 2014-04-23 05:41:29 UTC
To debug the source  I knew c.length = 202176....

package org.apache.poi.xssf.model;
public class CalculationChain extends POIXMLDocumentPart {
・・・・
public void removeItem(int sheetId, String ref){
        //sheet Id of a sheet the cell belongs to
        int id = -1;
        CTCalcCell[] c = chain.getCArray();

        for (int i = 0; i < c.length; i++){
            //If sheet Id  is omitted, it is assumed to be the same as the value of the previous cell.
            if(c[i].isSetI()) id = c[i].getI();

            if(id == sheetId && c[i].getR().equals(ref)){
                if(c[i].isSetI() && i < c.length - 1 && !c[i+1].isSetI()) {
                    c[i+1].setI(id);
                }
                chain.removeC(i);
                break;
            }
        }
    }
Comment 6 Dominik Stadler 2015-03-11 18:40:44 UTC
I could not reproduce this with a simple file created as you describe below.

Can you attach the test-file so we can try to reproduce this.
Comment 7 ki 2015-03-12 01:52:59 UTC
Created attachment 32561 [details]
test-file

(In reply to Dominik Stadler from comment #6)
> I could not reproduce this with a simple file created as you describe below.
> 
> Can you attach the test-file so we can try to reproduce this.

Please confirm it.
Comment 8 Dominik Stadler 2015-03-13 17:25:17 UTC
Your assessment with the time being spent in CalculationChain is correct. The method currently needs to iterate over the large array of entries and needs to perform checks for every removed cell and remove the entry thus copying around the array.

Unfortunately the data structure in this case is defined by how the Microsoft XML format is structured, i.e. all formulas are listed in one large list so it is hard to build optimizations into how the data is stored in memory... hm.
Comment 9 ki 2015-04-06 10:23:24 UTC
thx for answered me.
So theres no way to remove rows from the sheet used a lot of FormulaCell quickly?
Comment 10 Dominik Stadler 2015-09-13 20:00:30 UTC
I tried a few local optimizations, but none really made a difference, the cleanup of formulas on cell removal is unfortunately deeply intertwined with Cell removal and there is no easy way to kind of "remember" all the removed cells and do the operation in bulk at the end. 

So as long as nobody else comes up with a clever way of doing something like this while still keeping the code in maintainable order I don't think we can do much, removing many formulas is also likely not something that many people perform all the time. 

If you want to take a look at improving this then you probably need to look at starting in XSSFRow.removeRow() which calls XSSFRow.removeCell() and from there directs to XSSFWorkbook.onDeleteFormula()...

Anyway I am closing this WONTFIX now until there is some code-idea of how it can be done better.