Bug 59432

Summary: createName is slow when there are many Names (10000)
Product: POI Reporter: GuoHui Chen <cgh_chen>
Component: XSSFAssignee: POI Developers List <dev>
Status: RESOLVED DUPLICATE    
Severity: enhancement    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: All   

Description GuoHui Chen 2016-05-06 03:29:43 UTC
XSSFName

public void setNameName(String name)
  {
    validateName(name);

    int sheetIndex = getSheetIndex();

    for (int i = 0; i < this._workbook.getNumberOfNames(); i++) {
      XSSFName nm = this._workbook.getNameAt(i);
      if ((nm != this) && 
        (name.equalsIgnoreCase(nm.getNameName())) && (sheetIndex == nm.getSheetIndex())) {
        String msg = new StringBuilder().append("The ").append(sheetIndex == -1 ? "workbook" : "sheet").append(" already contains this name: ").append(name).toString();
        throw new IllegalArgumentException(msg);
      }
    }

    this._ctName.setName(name);
  }
Comment 1 Javen O'Neal 2016-05-17 14:35:23 UTC
Adding a name requires checking the name manager for existing names to avoid defining the same name at the same scope (my guess is this would result in a corrupt workbook). POI uses a naïve implementation, shown in comment 0, which requires O(n) time for a naïve implementation. We could perform this check in O(1) using a hash table with a hashable tuple (scope, name) as the key. A less elegant, inferior solution that runs in O(1) uses nested hash tables: the first layer having scope (sheet name or global) keys and the second layer having name keys (inner and outer key could swapped). This comes at the cost of higher memory consumption and increasing the complexity of the code (and therefore higher chance for bugs).

Given the code from comment 0, I'm not surprised that adding N names is slow, as it is performing O(N²) operations.

Here's what you could do:
1) Provide a patch with a speed-optimized implementation with 100% test coverage.
2) Provide a patch with a non-validating version of setNameName (probably called setNameNameUnsafe) [1]
3) access the CT* classes yourself, either with introspection, subclassing, or forking POI, which gives you direct access to the CTName data structure. This would complicate upgrading POI in the future.

[1] Relevant discussion on dev@poi mailing list http://apache-poi.1045710.n5.nabble.com/Preventing-corrupt-workbooks-td5722973.html
Comment 2 Javen O'Neal 2016-05-17 15:00:45 UTC
class NameManagerKeyTuple() {
    public final String scope, name;
    public constructor NameManagerKeyTuple(String scope, String name) {
        this.scope = scope; // how will you represent global scope? Null?, How will you update this data structure if a sheet is renamed or re-ordered? Keeping this data structure in sync with sheet renames is the biggest hurdle.
        this.name = name; // and assert not null and convert to canonical case a la String.toLower(Locale.ROOT)
    }
    @Override
    public int hashCode() {
        return scope.hashCode() + name.hashCode(); // whatever you use for global scope, make sure it has a consistent hash code. Null requires null check to prevent NPE.
    }
    @Override
    public interest equals(Object other) {
        if (!(other instanceof NameManagerKeyTuple)) {
            return false;
        }
        NameManagerKeyTuple o = (NameManagerKeyTuple) other;
        return scope.equals(o.scope) && name.equals(o.name); // again, check for NPE if values could be null;
    }
}

NameManager
private final Map<NameManagerKeyTuple, CTName> names = new HashMap<>();
Comment 3 Javen O'Neal 2016-06-17 11:10:17 UTC
Moved getNumberOfNames outside of the loop in r1748832.
Using a StringBuilder as suggested in comment 0 provides no performance benefit.
Comment 4 Dominik Stadler 2016-10-15 20:26:17 UTC
It seems this was actually fixed already in POI 3.15 via r1754521 and bug 59734, there we changed the access pattern to fetch by name from a ArrayListValuedHashMap from commons-collections4.

*** This bug has been marked as a duplicate of bug 59734 ***