Because the older POIFSFileSytem only ever supported read or write, and only read from an InputStream, all the POIDocument classes only supported writing out to a new stream. When the OOXML support was added, this followed the same pattern With most things using NPOIFSFileSystem, which supports in-place write when using a File, we can revisit this. It should be possible to have POIDocument / POIXMLDocument classes write in-place if opened from a File, avoiding the need to do the write-as-new + close + rename shuffle on Windows, amongst other benefits The existing write(OutputStream) would remain, both for compatibility and for those needing to load from streams rather than files, but it should be possible to add a write() method as well for file-based in-place write. Just needs some work!
Initial HSSFWorkbook in-place write support added in r1753103. If we're happy with how this looks and works, should be quite quick to generalise for the other OLE2 (POIDocument) classes which support write
Nick said in r1753103 > This will fail (with an {@link IllegalStateException} if the > Workbook was opened read-only, opened from an {@link InputStream} > instead of a File, or if this is not the root document. For those cases, > you must use {@link #write(OutputStream)} to write to a brand new stream. We urge users to read spreadsheets from a File rather than an InputStream to reduce memory consumption. Is there any we we could also reduce memory consumption by writing a File instead of an OutputStream? I noticed a significant performance improvement when writing to a BufferedOutputStream(FileOutputStream) on a network drive (Windows share). We should be cognizant of that, either in the code or as a disclaimer in the JavaDocs.
Added performance note about using BufferedOutputStreams in r1753112, since it wasn't mentioned there yet.
Currently (N)POIFS doesn't support opening an empty file, so you'd (for now) need to create a new empty POIFS, save it to a file (would be ~2kb), open that, then save the document into it. I guess we might want a static create method on POIFSFileSystem that hides all that, which could later be replaced with logic to create the new file from scratch As for if it would help... Any chance you could, on your somewhat shonky filesystem, open a HSSFWorkbook from a POIFS loaded r/w from a file, make some changes, do an in-place write, and see how quick it saves back vs OutputStream vs BufferedOutputStream?
Created attachment 34054 [details] timing saving an HSSFWorkbook on local/network drive and buffered/unbuffered Here's the timing results: > ncalls tottime cumtime function > 30 62.083488 62.083488 save_network_unbuffered > 30 6.488765 6.488765 save_network_buffered > 30 1.807957 1.807957 save_local_unbuffered > 30 1.397861 1.397861 save_local_buffered
I used https://svn.apache.org/viewvc/poi/trunk/test-data/spreadsheet/53446.xls?view=log from r1614697, which was one of the larger spreadsheets.
Created attachment 34055 [details] read from NPOIFSFileSystem and write in place or buffered/unbuffered outputstream
Created attachment 34056 [details] timing spreadsheet Averaged over 40 runs. > Method Open Modify Write Close > unbuffered local 0.214 0.004 0.137 0.035 > unbuffered network 0.321 0.005 0.219 0.035 > buffered local 0.189 0.002 0.141 0.038 > buffered network 0.326 0.002 0.215 0.035 > inplace local 0.178 0.002 0.097 0.040 > inplace network 0.278 0.003 0.131 0.069
Looks like the in-place write stuff is faster than buffered streams over the network, and locally. Am I reading that right? If so, I can look to add a write(File) method to HSSFWorkbook which would create a new file and write into that. Oh, and do write() + write(File) on the other write-supporting H??F classes.
I think the data from comment 8 is inconclusive because the same open_workbook function was called for each test but has results varying by a factor of two. The order that the tests ran affected the results. What we're seeing here is disk cache performance. I'm going to run fewer trials spaced over more time during peak business hours when the network drive is busiest. I expected to see the order of magnitude difference between local ad network unbuffered.
I've been trying to get meaningful data that excludes the effect of JVM JIT (which I've noticed in the form of "modify" running much slower on the first test than subsequent tests and disk caching. So far it looks like in-place writing is faster than writing to a different location, and in-place writing to network is only slightly slower than in-place writing to local. I think the utility and simplicity of writing in-place is worth it regardless of the timing results, so I'd say go ahead and continue adding in-placing writing. Is it easy to make in-place writing work for WorkbookFactory.create(File) and HSSFWorkbook(File)?
(In reply to Javen O'Neal from comment #11) > Is it easy to make in-place writing work for > WorkbookFactory.create(File) and HSSFWorkbook(File)? in-place write works for both of those as long as you set read-only to false when opening > I think the utility and simplicity of writing in-place is worth it > regardless of the timing results, so I'd say go ahead and continue adding > in-placing writing. OK, I've had a go at adding HSSFWorkbook.write(File). Will need to fix a POIFS bug later before it works... Does HSSFWorkbook now look how you'd expect / like all the document classes to look for write methods?
In-place-write and write-to-new-File now done for HSSF, HSLF and HPSF We could probably use some feedback on it before making more major changes to HWPF to add the support there, and before delving into OPC / X??F (for which we've at least one other bug around close vs revert vs write!)
r1799308 might come in handy for handling nested try...finally statements with closeable resources. It should also make it easier to write unit tests that don't leak resources without indenting the code with try...finally statements.