Summary: | [PATCH] A way to autosize columns on SXSSF | ||
---|---|---|---|
Product: | POI | Reporter: | Dmitriy Likhten <dlikhten> |
Component: | SXSSF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | enhancement | CC: | dlikhten, dtn-asfbugs, jerome.neuville, onealj, st.mailinglists |
Priority: | P2 | Keywords: | PatchAvailable |
Version: | unspecified | ||
Target Milestone: | --- | ||
Hardware: | All | ||
OS: | All | ||
Attachments: |
SXSSF test case showing the problem
SXSSF Autosizing Fix SXSSF Autosizing Fix SXSSF Autosizing fix SXSSF Autosizing fix SXSSF Autosizing Fix, excludes tracking rows SXSSF Autosizing Fix, excludes tracking rows |
Description
Dmitriy Likhten
2015-01-16 21:22:42 UTC
You'd need to refactor the current autosize column code out into a new tool, updating the existing code there to use that instead, then add this sxssf helper, but overall looks do-able and interesting. Unit testing might be tricky, as the sizing varies depending on the fonts available, but you could probably manage something similar to the current fuzzy matching the existing code's tests use As a first step, I'd suggest you sketch out a plan for how you'd refactor the current code, and post it here or on the dev list for review. We can then give you some advice, before you start working on the patch! 1) The logic in SXSSFSheet#autoSizeColumn for width calculation needs to be moved over into some generic code to convert the width to that 256 inflated number (sorry I'm still not 100% familiar with why that happens) 2) SheetUtil.getColumnWidth has a ton of logic, but that logic is intertwined with looping through every single row in the sheet for a given column. It would be really useful to have a SheetUtil.getCellWidth which would allow me to do my count-tracking manually *but* it would still re-use the logic that you guys built for figuring out the width of a column. That way I don't actually have the ability to get things wrong. 3) Unit testing: Since all the interesting logic is going to be left in *your* maintained code, my testing will be simple. I can stub all the width calculations and just focus on my (trivial) summation logic. This way my code cannot introduce new width calculation bugs. I forgot to mention: - SheetUtil.getCellWidth does exist, but the logic inside .getColumnWidth seems to also be significant enough that I don't want any of the knowledge inside my tool (if possible). Or I just don't grok the code well enough. - SXSSFSheet#autoSizeColumn also has some logic regarding max width. Once again I rather not have my code even have the possibility of getting that logic wrong. Also it seems XSSFSheet also has the exact same logic copy-pasta'd. Ditto for HSSFSheet. Other features I can see coming out of this tool: - ability to skip rows for auto-sizing. Example is ignore the header row when doing auto sizing calculations. - ability to set a minimum / maximum value. The tool can be made to allow the user to autosize with a minimum value, or with a maximum value that is smaller than the allowable amount. I am looking into this issue, trying to follow what was proposed by Dmitriy. Any thoughts on how the existing autosizeColumn() impl in SXSSFSheet should be handled? Left as it is, with @Deprecated? Or should it be an empty impl? Created attachment 33194 [details]
SXSSF test case showing the problem
Added patch file with an SXSSF test showing the problem described in this issue.
Created attachment 33207 [details]
SXSSF Autosizing Fix
This patch introduces an AutosizeColumnTracker class for handling autosizing when creating SXSSF files:
* monitorColumn(int) to register which columns will be autosized
* trackRow(Row) to include rows in autosize calculations
* applyAutosizeToColumn(int) to finally apply autosizing after all rows have been added.
Additional memory overhead introduced by this class:
* Data structures to save list of monitored columns and current maxWidth for every column
* Caching DataFormatter instance
Also I'm unsure how to handle backward compatibility, I see three possibilities:
A. throw exception in SXSSFSheet.autosizeColumn(int)
B. keep current behaviour as-is and only add new fixed behaviour
C. remove Sheet.autosizeColumn() and migrate all Sheet implementations to this new autosizing implementation
I'm willing to improve the patch if required, so hopefully it will be included in the next Apache POI release :-)
Created attachment 33251 [details] SXSSF Autosizing Fix See feedback posted to poi-dev mailing list [1]. I applied supporting changes in following commits: r1712213 - (minor) typos in BaseTestBugzillaIssues.java r1712214 - (minor) TestSXSSFBugs.java svn props r1712216 - (minor) SheetUtil svn props r1712217 - SheetUtil factor out getDefaultCharWidth and getColumnWidthForRow methods (note - these are private--just change them to protected, /*package*/, or public, optionally with @Internal if you need to elevate the visibility). r1712219 - (minor) renamed canComputeColumnWidht, see bug 58576 r1691341 These fixes created a few merge conflicts, so I've resolved them in the attached patch. [1] http://mail-archives.apache.org/mod_mbox/poi-dev/201511.mbox/%3CCAM%2BTppJWzhz50TExdHRAzPyFbMz8hgCKuTT%3DLxReq%2BkoBE4bxQ%40mail.gmail.com%3E Attaching an updated patch file with a refactored solution (after discussion on DEV mailing list). Main aspects: * AutoSizeColumnTracker is now completely internal and "hidden" behind Sheet API * by default SXSSF auto-sizing behaviour is identical to Apache POI 3.13 * to get the fixed SXSSF auto-sizing behaviour, one must explicitly call ((SXSSFSheet)sheet).setAutoSizeTrackAllFlushedRows(true). This will track all rows before they get flushed to disk * Ability to manually add/remove rows to/from tracking is not implemented. This should be handled in separate issue since it affects all types (HSSF/XSSF/SXSSF) Created attachment 33269 [details]
SXSSF Autosizing fix
Looks pretty good to me. Could you write some unit tests for the new/modified functionality and submit those as well? Created attachment 33277 [details]
SXSSF Autosizing fix
@Javen: Sorry my test class somehow didn't make it into the patch file. I'm attaching an ant-built patch archive now, it contains all my changes :-)
I've also added a note about auto-sizing behaviour in the SXSSF section in documentation, feel free to comment.
Created attachment 33307 [details] SXSSF Autosizing Fix, excludes tracking rows (In reply to Stefan Thurnherr from comment #9) > Attaching an updated patch file with a refactored solution (after discussion > on DEV mailing list). Main aspects: > * AutoSizeColumnTracker is now completely internal and "hidden" behind Sheet > API > * by default SXSSF auto-sizing behaviour is identical to Apache POI 3.13 > * to get the fixed SXSSF auto-sizing behaviour, one must explicitly call > ((SXSSFSheet)sheet).setAutoSizeTrackAllFlushedRows(true). This will track > all rows before they get flushed to disk > * Ability to manually add/remove rows to/from tracking is not implemented. > This should be handled in separate issue since it affects all types > (HSSF/XSSF/SXSSF) Stefan, I've been working on your patch, attachment 33277 [details], removing the row-tracking code, adding more unit tests and java docs. The attached patch is what I've come up with. I wanted to minimize the amount of changes people will need to make to their SXSSFWorkbook code while still behaving consistently regardless of the number of rows in the random-access window. For example, if the random-access window is the first 5 rows, autoSizeColumn has all the information it needs in the window to compute the best-fit width without telling the AutoSizeColumnTracker the columns they wish to track. While I could allow that, as soon as the first row gets flushed, the user would need to track the columns of interest, though it's now too late. Additionally, this would look like a bug and would be difficult to track down because of the latent failure. It's better to have consistent behavior and fail early. In this patch, autoSizeColumn uses the AutoSizeColumnTracker to determine the best-fit width of all tracked flushed columns, then computes the best-fit column width in the active window, takes the maximum of the two values, and sets the column width to that value. The only change users will need to make to SXSSF code is to add sxSheet.trackAllColumnsForAutoSizing() or sxSheet.trackColumnForAutoSizing(column) (this is faster and uses less memory than tracking all columns) before calling autoSizeColumn. To have consistent behavior with H/XSSFSheet, tracking columns should occur immediately after the sheet is created and before the first row is created, though nothing stops it from occurring later (for example, if you want to exclude the first 3 rows from auto-sizing, you could wait to register the columns until after the first 3 rows have been flushed). Columns can be un-tracked as well, which is needed if user code auto-sizes several columns after writing just a few rows, and then doesn't auto-size anything else. Without the ability to untrack rows, computing the best-fit width for all the remaining rows would be wasted effort. # Scenario 1: Auto-size header row only > SXSSFWorkbook wb = new SXSSFWorkbook(); > SXSSFSheet sh = wb.createSheet(); > sh.trackAllColumnsForAutoSizing(); //do this immediately > Row header = sh.createRow(0); > // populate row with data > header.createCell(0).setCellValue("Some really long header"); > header.createCell(1).setCellValue("short"); > sh.autoSizeColumn(0); > sh.untrackAllColumns(); # Scenario 2: Exclude header row from auto-sizing > SXSSFWorkbook wb = new SXSSFWorkbook(); > SXSSFSheet sh = wb.createSheet(); > Row header = sh.createRow(0); > header.createCell(0).setCellValue("Some really long header"); > header.createCell(1).setCellValue("short"); > sh.flushRows(); > sh.trackAllColumnsForAutoSizing(); > for (int r=0; r<1000; r++) { > Row data = sh.createRow(r); > // populate row with data > data.createCell(0).setCellValue("Cell[r="+r + ",c=0]"); > } > sh.autoSizeColumn(0); Let me know if I need to make any changes before committing this new functionality. Created attachment 33308 [details]
SXSSF Autosizing Fix, excludes tracking rows
Added in r1717146. Updated docs in r1717147. Note to implementers: This fix may break existing code. If your application uses or potentially uses SXSSFSheets and calls autoSizeColumn, you will need to make sure that the columns that are auto-sized are tracked prior to auto-sizing. The best place to do this tracking is immediately after the sheet is created. Use any of the following: * sxssfSheet.trackAllColumnsForAutoSizing() * sxssfSheet.trackColumnsForAutoSizing(int column) * sxssfSheet.trackColumnsForAutoSizing(Collection<Integer> columns) Realize that tracking all columns may have a larger memory footprint than tracking individual columns if you only intend on tracking some of the columns. If the sheet type isn't known until runtime, you'll need to add the following to your code. Sheet sheet = workbook.createSheet(); + if (sheet instanceof SXSSFSheet) { + SXSSFSheet sxSheet = (SXSSFSheet) sheet; + sxSheet.trackAllColumnsForAutoSizing(); + // or track columns individually + } ... sheet.autoSizeColumn(0); See comment 13 for more info. *** Bug 51740 has been marked as a duplicate of this bug. *** Updated spreadsheet quickguide in r1718764. |