Created attachment 38524: Patchset of SheetDataWriter. Performance tests showed that creating rows in an SXSSF sheet with lots of data spends a substantial amount of CPU time escaping the strings to be written in #outputEscapedString. By simplifying the loop and avoiding repeated conversions between strings and codepoints, we can improve the writing speed considerably.
It's hard to read this patch on my phone, but a quick look makes me think there is a bug in the use of string length: the number of chars as opposed to the number of codepoints. I'm very reluctant to take this change. Can you provide a JMH benchmark to back up your claims? If the codepoint code is super slow, we could consider an option where users configure whether they want char iteration or codepoint iteration. Is there any chance of using GitHub instead of patch files?
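To illustrate the concern about chars versus codepoints, here is a minimal, self-contained sketch. The class and method names are hypothetical and are not taken from the patch or from POI. It only shows that `String.length()` counts UTF-16 chars, so a character outside the Basic Multilingual Plane counts as two.

```java
public class LengthVsCodePoints {

    // Number of UTF-16 code units (what String.length() reports).
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode codepoints; a valid surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as a surrogate pair, then 'b'
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println("chars:      " + charCount(s));
        System.out.println("codepoints: " + codePointCount(s));
    }
}
```

A loop bounded by `length()` therefore has to handle surrogate pairs explicitly if it treats each index as one character.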
https://github.com/apache/poi/pull/405 is an unreleased performance change that may already speed up the existing code.
Created attachment 38525: Benchmark. The attached file contains a benchmark comparing the performance of the proposed patch against previous iterations of the same method.
Thank you for your comments. I'm happy to use git; however, I thought it was read-only and that I should provide patches here. I opened a pull request on GitHub (https://github.com/apache/poi/pull/443). The attached benchmark also compares the performance against the change made in https://github.com/apache/poi/pull/405. I think the difference is significant. I don't think there is an issue with the number of chars vs. the number of codepoints, since the loop counter is incremented an extra time when the codepoint is in fact a pair of characters. There are unit tests in TestSheetDataWriter asserting the correct behaviour for Unicode surrogates as well as the 'replaceWithQuestionMark' behaviour.
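The char-based loop described above can be sketched roughly as follows. This is not the code from the patch or from SheetDataWriter: the class `EscapeSketch`, the method `escape`, and the set of escaped characters are assumptions made for illustration. Replacing an unpaired surrogate with `?` is only loosely modelled on the 'replaceWithQuestionMark' behaviour mentioned in the comment.

```java
public class EscapeSketch {

    // Hypothetical helper: escapes XML special characters while iterating
    // over chars instead of converting to and from codepoints.
    static String escape(String s) {
        final int n = s.length();
        StringBuilder out = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (Character.isHighSurrogate(c)
                            && i + 1 < n
                            && Character.isLowSurrogate(s.charAt(i + 1))) {
                        // Valid pair: copy both chars and advance the
                        // loop counter past the low surrogate.
                        out.append(c).append(s.charAt(i + 1));
                        i++;
                    } else if (Character.isSurrogate(c)) {
                        // Unpaired surrogate: assumed here to be replaced.
                        out.append('?');
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c"));
    }
}
```

The extra `i++` inside the pair branch is what keeps a `length()`-bounded loop from visiting the second half of a surrogate pair as if it were a separate character.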
Thanks Matthias - I'll close this based on what you provided in https://github.com/apache/poi/pull/443 - if you could close that PR, it would be great.