Bug 57919 - Support in-place write when opened from a File
Summary: Support in-place write when opened from a File
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall (show other bugs)
Version: 3.16-dev
Hardware: All All
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks: 59252
  Show dependency tree
 
Reported: 2015-05-11 19:07 UTC by Nick Burch
Modified: 2017-06-20 07:29 UTC (History)
0 users



Attachments
timing saving an HSSFWorkbook on local/network drive and buffered/unbuffered (1.87 KB, text/plain)
2016-07-19 02:57 UTC, Javen O'Neal
Details
read from NPOIFSFileSystem and write in place or buffered/unbuffered outputstream (4.39 KB, text/plain)
2016-07-19 06:23 UTC, Javen O'Neal
Details
timing spreadsheet (5.52 KB, text/csv)
2016-07-19 06:35 UTC, Javen O'Neal
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Nick Burch 2015-05-11 19:07:47 UTC
Because the older POIFSFileSytem only ever supported read or write, and only read from an InputStream, all the POIDocument classes only supported writing out to a new stream. When the OOXML support was added, this followed the same pattern

With most things using NPOIFSFileSystem, which supports in-place write when using a File, we can revisit this. It should be possible to have POIDocument / POIXMLDocument classes write in-place if opened from a File, avoiding the need to do the write-as-new + close + rename shuffle on Windows, amongst other benefits

The existing write(OutputStream) would remain, both for compatibility and for those needing to load from streams rather than files, but it should be possible to add a write() method as well for file-based in-place write. Just needs some work!
Comment 1 Nick Burch 2016-07-17 19:04:00 UTC
Initial HSSFWorkbook in-place write support added in r1753103. If we're happy with how this looks and works, should be quite quick to generalise for the other OLE2 (POIDocument) classes which support write
Comment 2 Javen O'Neal 2016-07-17 19:38:38 UTC
Nick said in r1753103
> This will fail (with an {@link IllegalStateException} if the
> Workbook was opened read-only, opened from an {@link InputStream}
> instead of a File, or if this is not the root document. For those cases, 
> you must use {@link #write(OutputStream)} to write to a brand new stream.

We urge users to read spreadsheets from a File rather than an InputStream to reduce memory consumption. Is there any we we could also reduce memory consumption by writing a File instead of an OutputStream?

I noticed a significant performance improvement when writing to a BufferedOutputStream(FileOutputStream) on a network drive (Windows share). We should be cognizant of that, either in the code or as a disclaimer in the JavaDocs.
Comment 3 Javen O'Neal 2016-07-17 20:26:05 UTC
Added performance note about using BufferedOutputStreams in r1753112, since it wasn't mentioned there yet.
Comment 4 Nick Burch 2016-07-18 09:57:36 UTC
Currently (N)POIFS doesn't support opening an empty file, so you'd (for now) need to create a new empty POIFS, save it to a file (would be ~2kb), open that, then save the document into it. I guess we might want a static create method on POIFSFileSystem that hides all that, which could later be replaced with logic to create the new file from scratch

As for if it would help... Any chance you could, on your somewhat shonky filesystem, open a HSSFWorkbook from a POIFS loaded r/w from a file, make some changes, do an in-place write, and see how quick it saves back vs OutputStream vs BufferedOutputStream?
Comment 5 Javen O'Neal 2016-07-19 02:57:47 UTC
Created attachment 34054 [details]
timing saving an HSSFWorkbook on local/network drive and buffered/unbuffered

Here's the timing results:

> ncalls  tottime     cumtime    function
> 30      62.083488   62.083488  save_network_unbuffered
> 30      6.488765    6.488765   save_network_buffered
> 30      1.807957    1.807957   save_local_unbuffered
> 30      1.397861    1.397861   save_local_buffered
Comment 6 Javen O'Neal 2016-07-19 02:59:42 UTC
I used https://svn.apache.org/viewvc/poi/trunk/test-data/spreadsheet/53446.xls?view=log  from r1614697, which was one of the larger spreadsheets.
Comment 7 Javen O'Neal 2016-07-19 06:23:12 UTC
Created attachment 34055 [details]
read from NPOIFSFileSystem and write in place or buffered/unbuffered outputstream
Comment 8 Javen O'Neal 2016-07-19 06:35:01 UTC
Created attachment 34056 [details]
timing spreadsheet

Averaged over 40 runs.

> Method               Open     Modify   Write    Close
> unbuffered local     0.214    0.004    0.137    0.035
> unbuffered network   0.321    0.005    0.219    0.035
> buffered local       0.189    0.002    0.141    0.038
> buffered network     0.326    0.002    0.215    0.035
> inplace local        0.178    0.002    0.097    0.040
> inplace network      0.278    0.003    0.131    0.069
Comment 9 Nick Burch 2016-07-19 14:10:35 UTC
Looks like the in-place write stuff is faster than buffered streams over the network, and locally. Am I reading that right?

If so, I can look to add a write(File) method to HSSFWorkbook which would create a new file and write into that. Oh, and do write() + write(File) on the other write-supporting H??F classes.
Comment 10 Javen O'Neal 2016-07-19 16:20:44 UTC
I think the data from comment 8 is inconclusive because the same open_workbook function was called for each test but has results varying by a factor of two. The order that the tests ran affected the results. What we're seeing here is disk cache performance. I'm going to run fewer trials spaced over more time during peak business hours when the network drive is busiest. I expected to see the order of magnitude difference between local ad network unbuffered.
Comment 11 Javen O'Neal 2016-07-20 02:31:21 UTC
I've been trying to get meaningful data that excludes the effect of JVM JIT (which I've noticed in the form of "modify" running much slower on the first test than subsequent tests and disk caching.

So far it looks like in-place writing is faster than writing to a different location, and in-place writing to network is only slightly slower than in-place writing to local.

I think the utility and simplicity of writing in-place is worth it regardless of the timing results, so I'd say go ahead and continue adding in-placing writing. Is it easy to make in-place writing work for WorkbookFactory.create(File) and HSSFWorkbook(File)?
Comment 12 Nick Burch 2016-07-20 11:15:31 UTC
(In reply to Javen O'Neal from comment #11)
> Is it easy to make in-place writing work for
> WorkbookFactory.create(File) and HSSFWorkbook(File)?

in-place write works for both of those as long as you set read-only to false when opening

> I think the utility and simplicity of writing in-place is worth it
> regardless of the timing results, so I'd say go ahead and continue adding
> in-placing writing.

OK, I've had a go at adding HSSFWorkbook.write(File). Will need to fix a POIFS bug later before it works... Does HSSFWorkbook now look how you'd expect / like all the document classes to look for write methods?
Comment 13 Nick Burch 2016-07-21 23:29:56 UTC
In-place-write and write-to-new-File now done for HSSF, HSLF and HPSF

We could probably use some feedback on it before making more major changes to HWPF to add the support there, and before delving into OPC / X??F (for which we've at least one other bug around close vs revert vs write!)
Comment 14 Javen O'Neal 2017-06-20 07:29:56 UTC
r1799308 might come in handy for handling nested try...finally statements with closeable resources. It should also make it easier to write unit tests that don't leak resources without indenting the code with try...finally statements.