Bug 57253 - DirectoryScanner holds on to too many strings
Summary: DirectoryScanner holds on to too many strings
Status: NEW
Alias: None
Product: Ant
Classification: Unclassified
Component: Core (show other bugs)
Version: 1.8.4
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: Ant Notifications List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-24 15:21 UTC by Jesse Glick
Modified: 2014-11-26 10:36 UTC (History)
1 user (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jesse Glick 2014-11-24 15:21:12 UTC
Found in 1.8.4 but I do not see any relevant changes since then. Upstream issue is

  https://issues.jenkins-ci.org/browse/JENKINS-25759

When traversing a large directory structure looking for matches, the filesNotIncluded, dirsNotIncluded, and scannedDirs fields of DirectoryScanner (and probably also notFollowedSymlinks) grow to be quite large.

In the case of filesNotIncluded this is senseless if you never plan on calling getNotIncludedFiles() etc. Even if you did, this method forces a slow scan to be performed! Yet the field is built up during a fast scan too. Same for dirsNotIncluded. Not sure about notFollowedSymlinks.

scannedDirs is trickier, because it actually seems to be used during the fast scan, though https://github.com/apache/ant/commit/9e89cec932a798d958cf6bb310936d9a00c09a9d does not offer any explanation (or test) other than that it “avoids double scanning”.

The partial workaround is to subclass DirectoryScanner and periodically clear the filesNotIncluded and dirsNotIncluded fields. notFollowedSymlinks is private but does not seem to hold anything in typical cases. scannedDirs is private but (as above) it is unclear whether it is safe to clear it anyway. clearResults() can clear all these, but also throws out things which are wanted (filesIncluded, for example), and setting these fields back to their former values after calling clearResults() seems risky.
Comment 1 Stefan Bodewig 2014-11-25 08:50:14 UTC
It's not quite fair to ask me to explain a code change I've created more than eleven years ago ;-)

Looking through my archives this was the first time we tried various options speeding up DirectoryScanner - this has been long before we introduced TokenizedPath/Pattern.  It is quite possible later changes rendered the hasBeenScanned test redundant.  I'll perform some experiments.

Note getScannedDirs is used by a test (and nothing else AFAICT), so if you wanted to remove the extra check, this will need to be adapted.

I haven't thought through all the reasons we keep the structures you mention, will do so later today, just wanted to provide an early feedback.
Comment 2 Stefan Bodewig 2014-11-25 10:49:19 UTC
If you look at how slowScan works you'll see the code needs to record not-included and excluded files and dirs it encounters during a fast scan.  The slow scan only scans the excluded/not-included dirs and would miss all the files and directories that have been seen previously.

We could avoid holding on to those files and dirs at the cost of making the slow scan even slower (it would have to traverse the whole filesystem again rather than just the parts that haven't been scanned already).
Comment 3 Jesse Glick 2014-11-25 18:31:36 UTC
(In reply to Stefan Bodewig from comment #2)
> We could avoid holding on to those files and dirs at the cost of making the
> slow scan even slower

For a caller which does not intend to use the slow scan mode at all, this would be fine. The problem is that the current API forces a mode of operation which uses unbounded heap.
Comment 4 Stefan Bodewig 2014-11-26 10:36:48 UTC
Yes, I understand that, I was just pointing out why the code is what it is and what consequences a change would have.

Unfortunately we don't have any proper way to tell DirectoryScanner whether we intend to ever use the slow scan results.

AFAICT Ant itself never uses the results of a slow-scan, only <delete> needs the list of not-followed symlinks.