Caching Module Resources - DRAFT

Version: February 19, 2002
Author: Jesse Glick
Abstract:

A significant portion of NetBeans' startup time is spent opening module JAR files, retrieving classes and resources from these JARs, examining manifests, and parsing and merging XML layers, followed by firing changes in the layers despite the fact that no files have yet been changed. This document proposes a way to collapse these tasks into a streamlined process involving opening a limited number of JARs containing classes and resources essential to startup and loading a premerged layer before any changes might be fired. The optimization should be transparent to module authors and users alike and require no API modifications.

Justification and Filed Issues

Too much startup time is spent loading certain kinds of information from modules, where this information does not typically change between consecutive runs.

Time drains

Opening JARs

Opening too many JARs during startup is undesirable because it requires some operating-system resources to physically open the JARs; time to read the full ZIP entry list for all enabled modules; and memory to keep the entry lists. Searching through JAR entry lists when loading resources also consumes a bit of time.

It would be preferable to open only a limited number of JAR files during startup. For example, open just JARs in the classpath (as controlled by the JVM's application classloader), and one "jumbo jar" containing critical resources.

Since most modules are not likely to be used during startup, their actual JARs should be opened only when the user initiates an unusual action that requires loading classes or resources not normally used. This can be accomplished by keeping a special JAR containing only those resources expected to be used during the VM session, with the selection adapted based on previous runs.

Loading layers

A moderate amount of time is spent loading XML layers. Some of this is simply XML parsing time. Layers as written by a module author often include comments and whitespace which are not needed at runtime, or CDATA sections whose contents rarely need to be loaded. A little time is also spent switching parser streams, and some time is spent merging layers together.

A cache could keep a precomputed merged master layer containing files from all module layers, which could be loaded in one sweep. URL-referenced file contents as well as literally included contents could be stored separately from the merged layer XML.

One optimization not covered here is to minimize the number of file attributes actually present, specifically SystemFileSystem.localizingBundle, SystemFileSystem.icon, and relative ordering constraints. Other proposed optimizations cover these improvements.

Firing changes in the system file system

A large chunk of time during startup (a couple of seconds) is spent firing changes in the system file system after the module layers have been added to the layer stack. This is gratuitous.

A cache could load merged layers before creating the system file system, assuring that no file events need be fired.

Computing lookup

A lot of time is spent preparing lookup. Effectively this means parsing *.instance and especially *.settings files to find their declared interface types and remembering which files can provide instances for a given class.

Yarda Tulach already has a proposed way of speeding this up by serializing the index portion of lookup between sessions so that instances can be found directly. A cache system could help manage expiration and recreation of such a serialized lookup, rather than having to implement it as a separate component.

Filed Issues

Proposed Implementation

The proposed implementation involves a compact cache kept in the user directory which stores module manifests, merged layers, and commonly-used classes and resources. The cache is controlled by a single hash key which should change if significant changes are made to the execution environment but not between typical consecutive runs of the IDE. If there is a cache hit on startup, the cached contents are used in place of the original module resources. If there is a miss, resources are loaded the slow way, and upon exiting the IDE the cache is recreated according to state at shutdown time.

It should be possible to dynamically switch off the cache completely. Effectively this means that startup behaves exactly like a cache miss, and no cache recreation will be done at shutdown time.
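The hit, miss, and switched-off cases described above can be sketched as a small decision routine. This is only an illustration: the class CacheDecision, its Mode enum, and both method names are hypothetical and do not exist in NetBeans. The proposal does not say whether a hit also triggers recreation at shutdown, so the sketch assumes it does not.

```java
import java.util.OptionalInt;

/** Hypothetical sketch of the startup decision; not an existing NetBeans class. */
public class CacheDecision {
    public enum Mode { HIT, MISS, DISABLED }

    /**
     * @param enabled    whether the cache is switched on
     * @param storedKey  key read from an existing cache, or empty if no cache exists
     * @param currentKey key computed from the current execution environment
     */
    public static Mode decide(boolean enabled, OptionalInt storedKey, int currentKey) {
        if (!enabled) {
            return Mode.DISABLED; // startup then behaves exactly like a miss
        }
        if (storedKey.isPresent() && storedKey.getAsInt() == currentKey) {
            return Mode.HIT;
        }
        return Mode.MISS;
    }

    /** Assumption: only a miss leads to recreation; a disabled cache never does. */
    public static boolean recreateAtShutdown(Mode mode) {
        return mode == Mode.MISS;
    }
}
```

Keeping DISABLED distinct from MISS is what lets the shutdown hook skip recreation even though the two startup paths load resources identically.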

Physical Storage Format

XXX the package mapping will not work, since two modules may have their own private copies of some library, etc.

The cache should be kept in the user directory, as this provides an assuredly writable area, and unlike temporary directories it has a good chance of corresponding directly to which user is running NetBeans and in which configuration. It should be a single JAR file, say $userdir/cache.jar. It can serve multiple functions, to avoid needing to open more than one JAR.

Although it might be a bit faster to use a special binary format to store such things, using XML, text files, and a JAR has the advantage of being much easier to debug and diagnose problems in. Compared to all the activity that takes place during startup, using textual formats will probably not be significant overhead. Of course, the cache format could be changed in the future to use a binary format such as a B-tree, if this proves desirable enough to justify the opacity.

The following files in the JAR serve various functions:

key.txt

One line containing the cache key, as an integer.

jars.txt

An unsorted list of JARs (modules only - not their locale variants, extensions, or patches), one per line as an absolute filename.

packages.txt

An unsorted list of packages found in module JARs (and their extensions, locale variants, and patches) as well as on the classpath. Each line consists of a package name (format: org/netbeans/modules/foo/), then a space, then an integer. The integer is either a JAR index into jars.txt, starting at one, in case the package is to be found in a module JAR (or its variants, extensions, or patches), or zero in case the package is to be found in the classpath. Note that NetBeans already does not permit two modules to split up a package.
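As an illustration of the line format just described, the following sketch parses packages.txt lines and resolves a package to its JAR via jars.txt. The class PackagesIndex and its methods are hypothetical names chosen for this example. It does not attempt to address the open concern noted above about two modules carrying private copies of the same library; a later line for the same package simply replaces an earlier one.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for the packages.txt format; not an existing NetBeans class. */
public class PackagesIndex {
    /** Maps a package path to 0 (classpath) or a one-based index into jars.txt. */
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            int space = line.lastIndexOf(' ');
            if (space <= 0) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            int jar = Integer.parseInt(line.substring(space + 1));
            if (jar < 0) {
                throw new IllegalArgumentException("Negative JAR index: " + line);
            }
            index.put(line.substring(0, space), jar);
        }
        return index;
    }

    /** Returns the module JAR for a package, or null if it is on the classpath or unknown. */
    public static String jarFor(String pkg, Map<String, Integer> index, List<String> jars) {
        Integer i = index.get(pkg);
        if (i == null || i == 0) {
            return null;
        }
        return jars.get(i - 1); // jars.txt indices start at one
    }
}
```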

META-INF/pathremainder/integer/integer

All files found in META-INF/ and its subdirectories, including META-INF/MANIFEST.MF manifests, are included in the cache JAR, with the filename suffixed with a slash, the module number, another slash, and a sequence number (starting at zero and allocated incrementally for each module number). For a given META-INF/** resource name found among a module JAR, its variants, extensions, and patches, all available instances of that resource are included, each with its own sequence number - patches first in an arbitrary order, then the module, then extensions and variants in an arbitrary order.
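The naming rule for META-INF/ entries can be sketched as a pair of helper methods. The class MetaInfEntryNames and its methods are hypothetical, and the sketch assumes the "module number" is the same integer index used for jars.txt, which the text does not state explicitly.

```java
/** Hypothetical helpers for cache entry names under META-INF/; not an existing NetBeans class. */
public class MetaInfEntryNames {
    /** Builds the cache entry name: original path, slash, module number, slash, sequence number. */
    public static String cacheName(String resource, int moduleNumber, int sequence) {
        if (!resource.startsWith("META-INF/")) {
            throw new IllegalArgumentException("Not a META-INF resource: " + resource);
        }
        return resource + "/" + moduleNumber + "/" + sequence;
    }

    /** Recovers the original resource path by stripping the two trailing numeric segments. */
    public static String originalName(String cacheName) {
        int lastSlash = cacheName.lastIndexOf('/');
        int prevSlash = cacheName.lastIndexOf('/', lastSlash - 1);
        if (prevSlash < 0) {
            throw new IllegalArgumentException("Not a cache entry name: " + cacheName);
        }
        return cacheName.substring(0, prevSlash);
    }
}
```

Under these assumptions, the manifest of module 3 would be stored as META-INF/MANIFEST.MF/3/0 if no patch supplies its own copy, since patches receive the lower sequence numbers.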

all other paths

All other paths (which must be in some package, else they are not loadable anyway, and must not be in META-INF/ because of the above rule) represent that exact class or resource in some module JAR (or variant or extension or patch). Only classes and resources considered essential by the cache are kept this way; others can be found in the original JAR according to packages.txt.

Cache Key

XXX hash of $CLASSPATH, locale, branding, file names and timestamps of JARs in systemClassLoader (incl. branding- and locale-variants)
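One possible shape for such a key is sketched below. The inputs are the ones listed in the note above; everything else is an assumption, including the class name CacheKey, the method signature, and the way the values are combined into a single integer. How the JAR list is gathered from the class loader is left out.

```java
import java.io.File;
import java.util.List;
import java.util.Locale;
import java.util.Objects;

/** Hypothetical cache key computation; the combination scheme is an assumption. */
public class CacheKey {
    /**
     * @param classpath value of the classpath at startup
     * @param locale    current locale
     * @param branding  branding token, or null if none
     * @param jars      JARs to fingerprint, including branding and locale variants
     */
    public static int compute(String classpath, Locale locale, String branding, List<File> jars) {
        // String.hashCode is fully specified, so the result is stable across VM sessions.
        int h = Objects.hash(classpath, locale.toString(), branding);
        for (File jar : jars) {
            h = 31 * h + jar.getAbsolutePath().hashCode();
            // lastModified() returns 0 for a missing file, so deleting a JAR also changes the key.
            h = 31 * h + Long.hashCode(jar.lastModified());
        }
        return h;
    }
}
```

A 32-bit result fits the single integer stored in key.txt. Because the loop is order-sensitive, the JAR list would need to be enumerated in a consistent order between runs to avoid spurious misses.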

Startup Procedure During Cache Hit

XXX

Startup Procedure During Cache Miss

XXX

Cache Recreation

XXX

Issues

XXX interaction with project & session switching