Bug 62995 - Unzip performance regression on Windows due to BZ 62502
Status: NEW
Alias: None
Product: Ant
Classification: Unclassified
Component: Core tasks
Version: 1.9.13
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Ant Notifications List
Reported: 2018-12-09 22:45 UTC by Falko Modler
Modified: 2020-05-23 06:06 UTC
Description Falko Modler 2018-12-09 22:45:05 UTC
While experimenting with maven-antrun-plugin and Ant versions newer than the 1.9.4 that ships with the plugin, I discovered that in 1.9.13 the unzip task is considerably slower for larger zip files than in 1.9.12.

Unfortunately, I am not allowed to share the JBoss EAP 6.4 zip file I was testing with (a Red Hat subscription is required).
The file is around 300 MB and contains 1340 directories and 1517 files.

I guess the problem is the use of File.getCanonicalPath() (twice for each file being extracted), introduced by this change:
https://github.com/apache/ant/commit/6a41d62cb9ab4e640b72cb4de42a6c211dea645d#diff-28908450670f05abc5779fa0c9291510
Respective ticket: https://bz.apache.org/bugzilla/show_bug.cgi?id=62502

One simple fix might be to check File.getPath() for ".." and to call isLeadingPath() if (and only if) at least one occurrence is found.
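
To make the proposal concrete, here is a minimal sketch of such a pre-check. It is not Ant's code: the class EscapeCheck and its methods are hypothetical, and isInside() only stands in for what FileUtils.isLeadingPath() is assumed to do with canonical paths.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical helper (not part of Ant): canonicalize only when an entry contains "..". */
final class EscapeCheck {

    /** Cheap pre-check: true if any path segment of the entry name is exactly "..". */
    static boolean hasParentSegment(String entryName) {
        for (String segment : entryName.split("[/\\\\]")) {
            if (segment.equals("..")) {
                return true;
            }
        }
        return false;
    }

    /** Expensive check: canonicalizes both paths and tests whether target lies in destDir. */
    static boolean isInside(File destDir, File target) {
        try {
            String dir = destDir.getCanonicalPath();
            String path = target.getCanonicalPath();
            // Avoid a doubled separator when destDir is a file system root.
            String prefix = dir.endsWith(File.separator) ? dir : dir + File.separator;
            return path.equals(dir) || path.startsWith(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if the entry may be extracted below destDir. */
    static boolean mayExtract(File destDir, String entryName) {
        if (!hasParentSegment(entryName)) {
            return true; // fast path: no ".." segment, so no canonicalization
        }
        return isInside(destDir, new File(destDir, entryName));
    }
}
```

Note that the fast path is not equivalent to the full canonical check: it does not cover symbolic links below the destination directory or absolute entry names, so it may not be acceptable as a complete replacement.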

Environment:
Java version: 1.8.0_192, vendor: Oracle Corporation
Default locale: de_DE, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "dos"

PS: Maybe Linux is affected as well, didn't test...
Comment 1 Jaikiran Pai 2018-12-10 07:24:55 UTC
Hello Falko,

Is this target using "unzip" in just a basic manner or is there anything more to it? If possible, can you attach that target.

>> I guess the problem is the use of File.getCanonicalPath() (twice for each file being extracted)

Looking at the code in question (the one which calls FileUtils.isLeadingPath()), the "dir" will always be the same destination directory. So even if we end up calling getCanonicalPath twice, I would expect the JRE implementation to return a cached value (as far as I remember, the latest 1.8.x version of Java still uses a canonical path cache). So I think it shouldn't be that expensive (relatively). However, I'm not stating that this isn't the cause, and you are probably right that this change ended up being expensive.

>> Unfortunately I am not allowed to share the JBoss EAP 6.4 zipfile I was testing with (a RedHat subscription is required).
>> The file is around 300 MB and contains 1340 directories and 1517 files.

Can you post the timings though, with 1.9.12 and 1.9.13? That will give us some idea of what kind of performance regression we are seeing. In the meantime, I'll see if I can reproduce this on a *nix system with a similarly large file.
Comment 2 Jaikiran Pai 2018-12-11 09:48:06 UTC
FWIW - I gave it a try with the community version of JBoss EAP, the WildFly zip file (which is around 172MB and somewhat similar to the EAP zip contents), but I couldn't notice any difference in timing between the latest and older versions of Ant. I am on a *nix system. It would be good to have a build file or other additional data to reproduce this.
Comment 3 Falko Modler 2018-12-11 11:08:34 UTC
>> Is this target using "unzip" in just a basic manner or is there anything more to it?

It's really just basic unzip.

>> I would expect the JRE implementation to return a cached value

Ok, maybe - I don't know. I just had a quick look at the code and did some research on the method and found various blog entries etc. that contained a warning that this method should be used sparingly.
If there is some caching, it must reside in native code. Another question is: Does it cache sub-paths? Each file from a zip (normally) has a unique path string...
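
One way to probe the caching question is a rough micro-benchmark that canonicalizes many distinct paths sharing a few parent directories and compares that with getAbsolutePath(), which does not resolve the path against the file system. The following is only a sketch: the class name, directory layout and path count (two calls for each of the 1517 files) are assumptions for illustration, the paths do not exist on disk (existing files may behave differently), and results will vary with OS and JRE.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical timing sketch, not taken from Ant. */
final class CanonicalPathTiming {

    /** Builds the i-th distinct path below base; the file does not need to exist. */
    static File entry(File base, int i) {
        return new File(base, "dir" + (i % 50) + File.separator + "file" + i + ".txt");
    }

    /** Canonicalizes count distinct paths starting at index offset; returns elapsed nanoseconds. */
    static long timeCanonical(File base, int offset, int count) {
        long start = System.nanoTime();
        for (int i = offset; i < offset + count; i++) {
            try {
                entry(base, i).getCanonicalPath();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return System.nanoTime() - start;
    }

    /** Same loop with getAbsolutePath() as a cheap baseline. */
    static long timeAbsolute(File base, int offset, int count) {
        long start = System.nanoTime();
        for (int i = offset; i < offset + count; i++) {
            entry(base, i).getAbsolutePath();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir"), "canonical-timing");
        int count = 2 * 1517; // two canonicalizations per extracted file
        // Warm-up on one index range, then measure on a different range so that
        // the measured file names have not been canonicalized before.
        timeCanonical(base, 0, count);
        timeAbsolute(base, 0, count);
        System.out.printf("getCanonicalPath: %d ms%n", timeCanonical(base, count, count) / 1_000_000);
        System.out.printf("getAbsolutePath:  %d ms%n", timeAbsolute(base, count, count) / 1_000_000);
    }
}
```

If sub-paths were cached effectively, the two numbers should end up close together; a large gap would suggest that each unique file name still costs a file system lookup.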

>> Can you post the timings though, with 1.9.12 and 1.9.13?

I ran some tests and while there *is* a noticeable/distinct delay, it is *not* "threefold" as I stated initially. I am not sure where/how I saw this massive delay.

1.9.12 allowFilesToEscapeDest=false: ~8s
1.9.12 allowFilesToEscapeDest=true : ~8s
1.9.13 allowFilesToEscapeDest=false: ~10s
1.9.13 allowFilesToEscapeDest=true : ~8s

In the end this leaves us with a ~25% penalty (~10s instead of ~8s) for 1.9.13 with allowFilesToEscapeDest=false.

PS: As I don't use standalone Ant, the tests were conducted with:
- maven-antrun-plugin (with updated ant dependency)
- Maven 3.3.9
- Java version: 1.8.0_192, vendor: Oracle Corporation
- Default locale: de_DE, platform encoding: Cp1252
- OS name: "windows 10", version: "10.0", arch: "amd64", family: "dos"

Hardware:
- ThinkPad P51
- Intel Core i7-7820HQ
- 64GB RAM
- BitLocker-enabled NTFS partition on a Samsung SSD 960 Pro 1TB
- Virus Scanner (McAfee) is *not* active for the involved files/folders