Summary: | removeRow in XSSF is much slower than with HSSF | ||
---|---|---|---|
Product: | POI | Reporter: | ki <tersitain> |
Component: | XSSF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED WONTFIX | ||
Severity: | normal | CC: | tersitain |
Priority: | P2 | ||
Version: | 3.10-FINAL | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: | test-file |
Description
ki
2014-04-22 01:55:15 UTC
HSSF is often a tiny bit faster than XSSF, but rarely more than a few times Are you able to use a profiler to work out where the extra time on XSSF goes? Also, does it happen for simple files too? i.e. is this related to one specific class of file, or all files? thx for answer me. Juet the simple file whth the FormulaCell[=IF((COUNT(U224,U225,U226,U227))>0,ROUND(SUM(U224,U225,U226,U227),6),0)]. The file has 2000rows and 100 cols,size is 1.3 MB,not ON ie. to delete 20rows for xlsx cost 40sec,but xls is 1sec.... the test soure is under.. import org.apache.poi.openxml4j.exceptions.InvalidFormatException; import org.apache.poi.ss.usermodel.Row; import org.apache.poi.ss.usermodel.Sheet; import org.apache.poi.ss.usermodel.Workbook; import org.apache.poi.ss.usermodel.WorkbookFactory; public class ReadExcelSheetTest { public static void main(final String[] args) throws InvalidFormatException { try { String templatePath = "C:\\temp\\test.xlsx"; FileInputStream fis = null; fis = new FileInputStream(templatePath); Workbook workbook = WorkbookFactory.create(fis); Sheet sheet = workbook.getSheet("a"); for (int i = 20; i >= 1; i--) { Row row = sheet.getRow(i); if (row != null) { sheet.removeRow(row); System.out.println("-----" + i); } } workbook.write(new FileOutputStream("C:\\temp\\3.xlsx")); } catch (FileNotFoundException e) { e.printStackTrace(); } catch (IOException e) { e.printStackTrace(); } } } thx for answer me. just the simple file whth the FormulaCell[=IF((COUNT(U224,U225,U226,U227))>0,ROUND(SUM(U224,U225,U226,U227),6),0)]. The file has 2000rows and 100 cols,size is 1.3 MB,not ON ie. to delete 20rows for xlsx cost 40sec,but xls is 1sec.... the test soure is under.. import org.apache.poi.openxml4j.exceptions.InvalidFormatException; import org.apache.poi.ss.usermodel.Row; import org.apache.poi.ss.usermodel.Sheet; import org.apache.poi.ss.usermodel.Workbook; import org.apache.poi.ss.usermodel.WorkbookFactory; public class ReadExcelSheetTest { public static void main(final String[] args) throws InvalidFormatException { try { String templatePath = "C:\\temp\\test.xlsx"; FileInputStream fis = null; fis = new FileInputStream(templatePath); Workbook workbook = WorkbookFactory.create(fis); Sheet sheet = workbook.getSheet("a"); for (int i = 20; i >= 1; i--) { Row row = sheet.getRow(i); if (row != null) { sheet.removeRow(row); System.out.println("-----" + i); } } workbook.write(new FileOutputStream("C:\\temp\\3.xlsx")); } catch (FileNotFoundException e) { e.printStackTrace(); } catch (IOException e) { e.printStackTrace(); } } } Are you able to use a profiler to work out where the extra time goes on XSSF? To debug the source I knew c.length = 202176.... package org.apache.poi.xssf.model; public class CalculationChain extends POIXMLDocumentPart { ・・・・ public void removeItem(int sheetId, String ref){ //sheet Id of a sheet the cell belongs to int id = -1; CTCalcCell[] c = chain.getCArray(); for (int i = 0; i < c.length; i++){ //If sheet Id is omitted, it is assumed to be the same as the value of the previous cell. if(c[i].isSetI()) id = c[i].getI(); if(id == sheetId && c[i].getR().equals(ref)){ if(c[i].isSetI() && i < c.length - 1 && !c[i+1].isSetI()) { c[i+1].setI(id); } chain.removeC(i); break; } } } I could not reproduce this with a simple file created as you describe below. Can you attach the test-file so we can try to reproduce this. Created attachment 32561 [details] test-file (In reply to Dominik Stadler from comment #6) > I could not reproduce this with a simple file created as you describe below. > > Can you attach the test-file so we can try to reproduce this. Please confirm it. Your assessment with the time being spent in CalculationChain is correct. The method currently needs to iterate over the large array of entries and needs to perform checks for every removed cell and remove the entry thus copying around the array. Unfortunately the data structure in this case is defined by how the Microsoft XML format is structured, i.e. all formulas are listed in one large list so it is hard to build optimizations into how the data is stored in memory... hm. thx for answered me. So theres no way to remove rows from the sheet used a lot of FormulaCell quickly? I tried a few local optimizations, but none really made a difference, the cleanup of formulas on cell removal is unfortunately deeply intertwined with Cell removal and there is no easy way to kind of "remember" all the removed cells and do the operation in bulk at the end. So as long as nobody else comes up with a clever way of doing something like this while still keeping the code in maintainable order I don't think we can do much, removing many formulas is also likely not something that many people perform all the time. If you want to take a look at improving this then you probably need to look at starting in XSSFRow.removeRow() which calls XSSFRow.removeCell() and from there directs to XSSFWorkbook.onDeleteFormula()... Anyway I am closing this WONTFIX now until there is some code-idea of how it can be done better. |