Bug 57450

Summary: [PATCH] A way to autosize columns on SXSSF
Product: POI Reporter: Dmitriy Likhten <dlikhten>
Component: SXSSFAssignee: POI Developers List <dev>
Severity: enhancement CC: dlikhten, dtn-asfbugs, jerome.neuville, onealj, st.mailinglists
Priority: P2 Keywords: PatchAvailable
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: SXSSF test case showing the problem
SXSSF Autosizing Fix
SXSSF Autosizing Fix
SXSSF Autosizing fix
SXSSF Autosizing fix
SXSSF Autosizing Fix, excludes tracking rows
SXSSF Autosizing Fix, excludes tracking rows

Description Dmitriy Likhten 2015-01-16 21:22:42 UTC
So it makes sense why the standard "autoSize" mechanism doesn't function on SXSSF Worksheets. However if you provided a tool which allows us to feed one row at a time and keep track of the largest value for each column desired, which can then be applied to the worksheet, we could get the autosize behavior without having to loop through everything at the end.

tracker = new AutosizeTracker(sheet) // assume flush size is 10

tracker.trackRow(row1) // tracks widths, holds no reference to row1

I was considering writing one of these, but it seems that there is no well factored tool for getting the width of a single cell with all the knowledge cooked into the autoSizeColumn method of SXSSFSheet
Comment 1 Nick Burch 2015-01-18 18:40:25 UTC
You'd need to refactor the current autosize column code out into a new tool, updating the existing code there to use that instead, then add this sxssf helper, but overall looks do-able and interesting.

Unit testing might be tricky, as the sizing varies depending on the fonts available, but you could probably manage something similar to the current fuzzy matching the existing code's tests use

As a first step, I'd suggest you sketch out a plan for how you'd refactor the current code, and post it here or on the dev list for review. We can then give you some advice, before you start working on the patch!
Comment 2 Dmitriy Likhten 2015-01-20 16:29:09 UTC
1) The logic in SXSSFSheet#autoSizeColumn for width calculation needs to be moved over into some generic code to convert the width to that 256 inflated number (sorry I'm still not 100% familiar with why that happens)

2) SheetUtil.getColumnWidth has a ton of logic, but that logic is intertwined with looping through every single row in the sheet for a given column. It would be really useful to have a SheetUtil.getCellWidth which would allow me to do my count-tracking manually *but* it would still re-use the logic that you guys built for figuring out the width of a column. That way I don't actually have the ability to get things wrong.

3) Unit testing: Since all the interesting logic is going to be left in *your* maintained code, my testing will be simple. I can stub all the width calculations and just focus on my (trivial) summation logic. This way my code cannot introduce new width calculation bugs.
Comment 3 Dmitriy Likhten 2015-01-20 16:33:40 UTC
I forgot to mention:

- SheetUtil.getCellWidth does exist, but the logic inside .getColumnWidth seems to also be significant enough that I don't want any of the knowledge inside my tool (if possible). Or I just don't grok the code well enough.

- SXSSFSheet#autoSizeColumn also has some logic regarding max width. Once again I rather not have my code even have the possibility of getting that logic wrong. Also it seems XSSFSheet also has the exact same logic copy-pasta'd. Ditto for HSSFSheet.
Comment 4 Dmitriy Likhten 2015-01-20 21:07:39 UTC
Other features I can see coming out of this tool:

- ability to skip rows for auto-sizing. Example is ignore the header row when doing auto sizing calculations.

- ability to set a minimum / maximum value. The tool can be made to allow the user to autosize with a minimum value, or with a maximum value that is smaller than the allowable amount.
Comment 5 Stefan Thurnherr 2015-10-22 16:15:05 UTC
I am looking into this issue, trying to follow what was proposed by Dmitriy.

Any thoughts on how the existing autosizeColumn() impl in SXSSFSheet should be handled? Left as it is, with @Deprecated? Or should it be an empty impl?
Comment 6 Stefan Thurnherr 2015-10-22 18:58:38 UTC
Created attachment 33194 [details]
SXSSF test case showing the problem

Added patch file with an SXSSF test showing the problem described in this issue.
Comment 7 Stefan Thurnherr 2015-10-25 16:22:31 UTC
Created attachment 33207 [details]
SXSSF Autosizing Fix

This patch introduces an AutosizeColumnTracker class for handling autosizing when creating SXSSF files:
* monitorColumn(int) to register which columns will be autosized
* trackRow(Row) to include rows in autosize calculations
* applyAutosizeToColumn(int) to finally apply autosizing after all rows have been added.

Additional memory overhead introduced by this class:
* Data structures to save list of monitored columns and current maxWidth for every column
* Caching DataFormatter instance

Also I'm unsure how to handle backward compatibility, I see three possibilities:
A. throw exception in SXSSFSheet.autosizeColumn(int)
B. keep current behaviour as-is and only add new fixed behaviour
C. remove Sheet.autosizeColumn() and migrate all Sheet implementations to this new autosizing implementation

I'm willing to improve the patch if required, so hopefully it will be included in the next Apache POI release :-)
Comment 8 Javen O'Neal 2015-11-03 07:49:49 UTC
Created attachment 33251 [details]
SXSSF Autosizing Fix

See feedback posted to poi-dev mailing list [1].
I applied supporting changes in following commits:
r1712213 - (minor) typos in BaseTestBugzillaIssues.java
r1712214 - (minor) TestSXSSFBugs.java svn props
r1712216 - (minor) SheetUtil svn props
r1712217 - SheetUtil factor out getDefaultCharWidth and getColumnWidthForRow methods (note - these are private--just change them to protected, /*package*/, or public, optionally with @Internal if you need to elevate the visibility).
r1712219 - (minor) renamed canComputeColumnWidht, see bug 58576 r1691341

These fixes created a few merge conflicts, so I've resolved them in the attached patch.

[1] http://mail-archives.apache.org/mod_mbox/poi-dev/201511.mbox/%3CCAM%2BTppJWzhz50TExdHRAzPyFbMz8hgCKuTT%3DLxReq%2BkoBE4bxQ%40mail.gmail.com%3E
Comment 9 Stefan Thurnherr 2015-11-10 18:27:42 UTC
Attaching an updated patch file with a refactored solution (after discussion on DEV mailing list). Main aspects:
* AutoSizeColumnTracker is now completely internal and "hidden" behind Sheet API
* by default SXSSF auto-sizing behaviour is identical to Apache POI 3.13 
* to get the fixed SXSSF auto-sizing behaviour, one must explicitly call ((SXSSFSheet)sheet).setAutoSizeTrackAllFlushedRows(true). This will track all rows before they get flushed to disk
* Ability to manually add/remove rows to/from tracking is not implemented. This should be handled in separate issue since it affects all types (HSSF/XSSF/SXSSF)
Comment 10 Stefan Thurnherr 2015-11-10 18:28:54 UTC
Created attachment 33269 [details]
SXSSF Autosizing fix
Comment 11 Javen O'Neal 2015-11-16 06:14:12 UTC
Looks pretty good to me. Could you write some unit tests for the new/modified functionality and submit those as well?
Comment 12 Stefan Thurnherr 2015-11-16 20:08:56 UTC
Created attachment 33277 [details]
SXSSF Autosizing fix

@Javen: Sorry my test class somehow didn't make it into the patch file. I'm attaching an ant-built patch archive now, it contains all my changes :-)

I've also added a note about auto-sizing behaviour in the SXSSF section in documentation, feel free to comment.
Comment 13 Javen O'Neal 2015-11-28 12:27:37 UTC
Created attachment 33307 [details]
SXSSF Autosizing Fix, excludes tracking rows

(In reply to Stefan Thurnherr from comment #9)
> Attaching an updated patch file with a refactored solution (after discussion
> on DEV mailing list). Main aspects:
> * AutoSizeColumnTracker is now completely internal and "hidden" behind Sheet
> * by default SXSSF auto-sizing behaviour is identical to Apache POI 3.13 
> * to get the fixed SXSSF auto-sizing behaviour, one must explicitly call
> ((SXSSFSheet)sheet).setAutoSizeTrackAllFlushedRows(true). This will track
> all rows before they get flushed to disk
> * Ability to manually add/remove rows to/from tracking is not implemented.
> This should be handled in separate issue since it affects all types

Stefan, I've been working on your patch, attachment 33277 [details], removing the row-tracking code, adding more unit tests and java docs. The attached patch is what I've come up with.

I wanted to minimize the amount of changes people will need to make to their SXSSFWorkbook code while still behaving consistently regardless of the number of rows in the random-access window.

For example, if the random-access window is the first 5 rows, autoSizeColumn has all the information it needs in the window to compute the best-fit width without telling the AutoSizeColumnTracker the columns they wish to track. While I could allow that, as soon as the first row gets flushed, the user would need to track the columns of interest, though it's now too late. Additionally, this would look like a bug and would be difficult to track down because of the latent failure. It's better to have consistent behavior and fail early.

In this patch, autoSizeColumn uses the AutoSizeColumnTracker to determine the best-fit width of all tracked flushed columns, then computes the best-fit column width in the active window, takes the maximum of the two values, and sets the column width to that value.

The only change users will need to make to SXSSF code is to add sxSheet.trackAllColumnsForAutoSizing() or sxSheet.trackColumnForAutoSizing(column) (this is faster and uses less memory than tracking all columns) before calling autoSizeColumn. To have consistent behavior with H/XSSFSheet, tracking columns should occur immediately after the sheet is created and before the first row is created, though nothing stops it from occurring later (for example, if you want to exclude the first 3 rows from auto-sizing, you could wait to register the columns until after the first 3 rows have been flushed).

Columns can be un-tracked as well, which is needed if user code auto-sizes several columns after writing just a few rows, and then doesn't auto-size anything else. Without the ability to untrack rows, computing the best-fit width for all the remaining rows would be wasted effort.

# Scenario 1: Auto-size header row only
> SXSSFWorkbook wb = new SXSSFWorkbook();
> SXSSFSheet sh = wb.createSheet();
> sh.trackAllColumnsForAutoSizing(); //do this immediately
> Row header = sh.createRow(0);
> // populate row with data
> header.createCell(0).setCellValue("Some really long header");
> header.createCell(1).setCellValue("short");
> sh.autoSizeColumn(0);
> sh.untrackAllColumns();

# Scenario 2: Exclude header row from auto-sizing
> SXSSFWorkbook wb = new SXSSFWorkbook();
> SXSSFSheet sh = wb.createSheet();
> Row header = sh.createRow(0);
> header.createCell(0).setCellValue("Some really long header");
> header.createCell(1).setCellValue("short");
> sh.flushRows();
> sh.trackAllColumnsForAutoSizing();
> for (int r=0; r<1000; r++) {
>     Row data = sh.createRow(r);
>     // populate row with data
>     data.createCell(0).setCellValue("Cell[r="+r + ",c=0]");
> }
> sh.autoSizeColumn(0);

Let me know if I need to make any changes before committing this new functionality.
Comment 14 Javen O'Neal 2015-11-28 12:31:17 UTC
Created attachment 33308 [details]
SXSSF Autosizing Fix, excludes tracking rows
Comment 15 Javen O'Neal 2015-11-30 00:35:19 UTC
Added in r1717146. Updated docs in r1717147.

Note to implementers:
This fix may break existing code.
If your application uses or potentially uses SXSSFSheets and calls autoSizeColumn, you will need to make sure that the columns that are auto-sized are tracked prior to auto-sizing. The best place to do this tracking is immediately after the sheet is created.
Use any of the following:
* sxssfSheet.trackAllColumnsForAutoSizing()
* sxssfSheet.trackColumnsForAutoSizing(int column)
* sxssfSheet.trackColumnsForAutoSizing(Collection<Integer> columns)

Realize that tracking all columns may have a larger memory footprint than tracking individual columns if you only intend on tracking some of the columns.

If the sheet type isn't known until runtime, you'll need to add the following to your code.
Sheet sheet = workbook.createSheet();
+ if (sheet instanceof SXSSFSheet) {
+    SXSSFSheet sxSheet = (SXSSFSheet) sheet;
+    sxSheet.trackAllColumnsForAutoSizing();
+    // or track columns individually
+ }

See comment 13 for more info.
Comment 16 Javen O'Neal 2015-12-07 19:50:23 UTC
*** Bug 51740 has been marked as a duplicate of this bug. ***
Comment 17 Javen O'Neal 2015-12-09 06:04:40 UTC
Updated spreadsheet quickguide in r1718764.