66532 – [PATCH] Improve performance of SheetDataWriter

Bug 66532 - [PATCH] Improve performance of SheetDataWriter

Summary: [PATCH] Improve performance of SheetDataWriter

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	XSSF (show other bugs)
Version:	unspecified
Hardware:	All All

Importance:	P2 enhancement (vote)
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-03-17 13:38 UTC by Matthias Raschhofer
Modified:	2023-03-18 12:08 UTC (History)
CC List:	0 users

Attachments
Patchset of SheetDataWriter (2.77 KB, patch) 2023-03-17 13:38 UTC, Matthias Raschhofer	Details \| Diff
Benchmark (6.95 KB, text/plain) 2023-03-17 19:38 UTC, Matthias Raschhofer	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Matthias Raschhofer 2023-03-17 13:38:42 UTC

Created attachment 38524 [details]
Patchset of SheetDataWriter

Performance tests showed that creating rows in a SXSSF sheet with lots of data spend a substantial amount of cpu time escaping the strings to write using #outputEscapedString. 

By simplifying the loop and avoiding to convert between string and codepoints a bunch of times, we can improve the writing speed by a good amount.

Comment 1 PJ Fanning 2023-03-17 14:18:44 UTC

It's hard to read this patch on my phone but a quick look makes me think there is a bug with use of string length - the number of chars as opposed the number of codepoints. I'm very reluctant to take this change. Can you provide a jmh benchmark to back up your claims? If the codepoint code is super slow, we could consider an option where users configure whether they want char iteration or codepoint iteration.

Is there any chance of using GitHub instead of patch files?

Comment 2 PJ Fanning 2023-03-17 14:32:02 UTC

https://github.com/apache/poi/pull/405 is an unreleased perf change that may improve perf time of existing code.

Comment 3 Matthias Raschhofer 2023-03-17 19:38:08 UTC

Created attachment 38525 [details]
Benchmark

The attached file contains a benchmark comparing the performance of the proposed patch against previous iterations of the same method.

Comment 4 Matthias Raschhofer 2023-03-17 19:43:52 UTC

Thank you for your comments. I'm happy to use git, however I thought it was readonly and I should provide patches here. I opened a pull request in git (https://github.com/apache/poi/pull/443).

I attached a benchmark also comparing the performance against the change made with https://github.com/apache/poi/pull/405. I think the difference is significant.

I don't think there is an issue with the number of chars vs. the number of codepoints, since the loop counter is increased in case the codepoint is in fact a pair of characters. There are unit tests in the TestSheetDataWriter asserting the correct behaviour for unicode surrogates as well as the 'replaceWithQuestionMark' behaviour.

Comment 5 PJ Fanning 2023-03-18 12:08:26 UTC

Thanks Matthias - I'll close this based on what you provided in https://github.com/apache/poi/pull/443 - if you could close that PR, it be great.