On https://issues.apache.org/jira/browse/TIKA-3017, a user shared a 19kb pptx file that requires 2GB to parse or else throws an OOM. The problem appears to be that BitSet is horribly memory inefficient on large integers and there are some shape ids with very large numbers. Let's switch this to a simple HashSet and look for other BitSet oom vulnerabilities throughout the codebase. Executor task launch worker for task 47360 at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48) at java.util.Arrays.copyOf([JI)[J (Arrays.java:3308) at java.util.BitSet.ensureCapacity(I)V (BitSet.java:337) at java.util.BitSet.expandTo(I)V (BitSet.java:352) at java.util.BitSet.set(I)V (BitSet.java:447) at org.apache.poi.xslf.usermodel.XSLFSheet.registerShapeId(I)V (XSLFSheet.java:123) at org.apache.poi.xslf.usermodel.XSLFDrawing.<init>
Please don't use HashSet and search for a sparse BitSet class instead. The reason for BitSet is/was an efficient way of searching for the next available id. looping through and testing for an id is not the way to go ...
Sorry. I misunderstood. Thank you.
Turns out you can trigger an oom with -Xmx900m with just three integer values: BitSet bs = new BitSet(); bs.set(1794795584); bs.set(1270275541); bs.set(2127760244); Any objections to: https://github.com/brettwooldridge/SparseBitSet I've run a handful of speed/memory/accuracy tests and it _seems_ far better than Java's BitSet. If there are no objections, should we copy/paste the class or import another dependency?
+1 ... I wanted to point out the same ... well it's one of the first google matches anyway when searching for sparse bitset
What's your preference for adding a dependency vs copy/pasting one large class. The benefit to copy/paste is that we can strip out object serialization, which I now detest. And, right, those are only private methods, but still...
Copying means duplicating code and is usually not done unless the code is "submitted" to the project. It would just hide the fact that use code from other projects. Using a dependency means we can more easily update to newer versions, so I would prefer that. Hopefully we can switch to a modern dependency management at some point to get rid of the current changes in the build.xml for updated dependencies. On the subject: I had good experiences with https://github.com/lemire/javaewah in another project. This one also seems to be at least more actively maintained than the other one and used by many more projects on GitHub, 316 vs. 16.
Sounds good. If we're adding a dependency, then, y, I'm happy to go with javaewah. Let me kick the same tires on that one. The single class in wooldridge's was appealing if we were to copy/paste.
-1 to use javaewah. I have only read the intro/readme.md about compressed bitmaps, which says that the sparse scenario is not a use case for compressed bitmaps. I don't know how many shape ids would be used at maximum, but I guess something like 20-30k should be the maximum goal. The SparseBitSet is ASL licensed so I don't see a problem importing it into our source - although I'm a bit puzzled how lines 4-5 of the source should be treated ... and usually ASL licensed code should have the ASL header on top ...
I'd suggest to look at Lemire's RoaringBitmap (https://github.com/RoaringBitmap/RoaringBitmap), it's a bit newer and it works fine with both sparse and dense bitmaps.
I ran two tests of roaring bitmap, java bitset and zaxxers sparsebitset. I wanted to simulate "worst case scenario" which is random ints from 0 to Integer.MAX_Value. First test: 1000000 random numbers (well above anything we'd think we'd need) roaring bitmap (tested with -Xmx128m) 48018 total ms, 539.5280898876405 avg ms java bitset (tested with -Xmx2g) 23826 total ms, 267.70786516853934 avg ms zaxxer (requires -Xmx256m) 27730 total ms, 311.5730337078652 avg ms 100000 random numbers (still a good deal more that we think we'd need) roaring bitmap (tested with -Xmx128m) 6508 total ms, 73.12359550561797 avg ms java bitset (tested with -Xmx2g) 7607 total ms, 85.47191011235955 avg ms zaxxer (tested with -Xmx128m) 3353 total ms, 37.674157303370784 avg ms
I'm wary of microbenchmarks (especially when I write them), and these results could be misleading. I did include a warmup phase at. At the very least, they show memory consumption is awful with Java, much better with Zaxxer and still better with roaring. Zaxxer is the fastest, with roaring doing better than Java at 100k, which is well above what we think we'd need. roaring does look to be better supported/more recent/more active community; zaxxer has an interface that is closer to BitSet and would be an easier drop in place kind of thing. I did get an "IllegalArgumentException" with zaxxer on one run, which gives me pause...
Hackery of code: Random r = new Random(); int sz = 100000; int iterations = 100; long absents = 0; long gets = 0; long totalElapsed = 0; int warmup = 10; int observations = 0; for (int it = 0; it < iterations; it++) { long start = System.currentTimeMillis(); System.out.println("iteration " + it); int[] ints = new int[sz]; for (int i = 0; i < sz; i++) { ints[i] = Math.abs(r.nextInt()); } // BitSet bs = new BitSet(); RoaringBitmap rb = new RoaringBitmap(); for (int i = 0; i < sz; i++) { /*bs.set(ints[i]); long absent = bs.nextClearBit(ints[i]); absents += absent; if (bs.get(Math.abs(r.nextInt()))) { gets++; }*/ rb.add(ints[i]); long absent = rb.nextAbsentValue(ints[i]); absents += absent; if (rb.contains(Math.abs(r.nextInt()))) { gets++; } } if (it > warmup) { long elapsed = System.currentTimeMillis() - start; System.out.println(absents + " " + gets + " " + elapsed); totalElapsed += elapsed; observations++; } } System.out.println(totalElapsed + " total ms, " + (double) totalElapsed / (double) observations + " avg ms");
Sorry. The exception I got with Zaxxer was when I tried to get an Integer.MAX_VALUE: Exception in thread "main" java.lang.IndexOutOfBoundsException: i=2147483647
And ewah is far, far, far slower...I didn't even wait for completion. Times on 100000 random integers: iteration 0 19198ms iteration 1 18822ms iteration 2 19034ms
could we implement a wrapping API and allow people to plugin their own implementations (providing a reference implementation that is the default)?
(In reply to PJ Fanning from comment #15) > could we implement a wrapping API and allow people to plugin their own > implementations (providing a reference implementation that is the default)? I hope we can achieve this without adding another ThreadLocal. With all those tweaks we provide elsewhere, I think it's time to think about some kind of POI context/configuration class, which can be optionally provided on the top level usermodel classes.
>I hope we can achieve this without adding another ThreadLocal. +1 I propose picking roaring or zaxxer and just running with it for now.
Created attachment 36927 [details] PATCH I'm attaching a first draft of a patch with zaxxer. I'm not going to commit it and head off on holiday. :) Longer term, Robert Muir recommended Lucene's SparseBitSet, and it looks good. I've opened https://issues.apache.org/jira/browse/COLLECTIONS-743 and will submit a patch there. If we can get it into collections, we can drop the zaxxer dependency.
r1872445 added zaxxer for now. We can open a new ticket for other options if needed. Please re-open if there are any surprises or areas for improvements.
Hello Team, We are trying to upgrade Apache POI 4.1.12. As per https://github.com/brettwooldridge/SparseBitSet/blob/master/LICENSE, We can see SparseBitSet 1.2 is license under Apache License 2.0. However when we see src/main/java/com/zaxxer/sparsebits/SparseBitSet.java, it has following comment - /*- This software is the work of Paladin Software International, Incorporated, * based upon previous work done for and by Sun Microsystems, Inc. */ Our legal is asking what this comments means and asking for a clarification if it is indeed licensed under Apache License 2.0. We are trying to reach out to owner Mr. Brett Wooldridge but no success yet. In case do you know about the licensing impact, please guide us. Thanks Saurabh K.
Hi Saurabh, That is not a question for Apache POI team - that is not our code. It is a question for https://github.com/brettwooldridge/SparseBitSet
Yes, thanks. Can we configure/customize to use our own implementation and not SparseBitSet?
Not everything in POI uses this code - so you could try excluding this lib and see if your code still works. You could also stick to an older version of POI - one released before we introduced this dependency. Another option would be for your developers to fork POI and use a different BitSet implementation.
Thanks everyone. We get confirmation from Mr. Brett Wooldridge that SparseBitSet is fully under the Apache Software License V2 and free from any encumbrances. From: Brett Wooldridge <brett.wooldridge@gmail.com> Sent: Thursday, March 4, 2021 1:18 PM Subject: Re: Clarification Needed for SparseBitSet, a dependent component for using Apache POI XXXX, SparseBitSet is fully under the Apache Software License V2, with no other copyright, patent, or distribution encumbrances. The Apache Foundation itself would have never created such a dependency on a non-ASLv2 library. Best regards, Brett Wooldridge