Bug 64015 - Replace usage of Java BitSet with a memory-optimized/sparse implementation
Summary: Replace usage of Java BitSet with a memory-optimized/sparse implementation
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSLF (show other bugs)
Version: unspecified
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-18 16:48 UTC by Tim Allison
Modified: 2021-03-05 05:39 UTC (History)
0 users



Attachments
PATCH (18.88 KB, patch)
2019-12-20 21:00 UTC, Tim Allison
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tim Allison 2019-12-18 16:48:06 UTC
On https://issues.apache.org/jira/browse/TIKA-3017, a user shared a 19kb pptx file that requires 2GB to parse or else throws an OOM.  The problem appears to be that BitSet is horribly memory inefficient on large integers and there are some shape ids with very large numbers.  Let's switch this to a simple HashSet and look for other BitSet oom vulnerabilities throughout the codebase.

Executor task launch worker for task 47360
 at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
 at java.util.Arrays.copyOf([JI)[J (Arrays.java:3308)
 at java.util.BitSet.ensureCapacity(I)V (BitSet.java:337)
 at java.util.BitSet.expandTo(I)V (BitSet.java:352)
 at java.util.BitSet.set(I)V (BitSet.java:447)
 at org.apache.poi.xslf.usermodel.XSLFSheet.registerShapeId(I)V (XSLFSheet.java:123)
 at org.apache.poi.xslf.usermodel.XSLFDrawing.<init>
Comment 1 Andreas Beeker 2019-12-18 16:56:13 UTC
Please don't use HashSet and search for a sparse BitSet class instead. The reason for BitSet is/was an efficient way of searching for the next available id. looping through and testing for an id is not the way to go ...
Comment 2 Tim Allison 2019-12-18 16:59:11 UTC
Sorry. I misunderstood.  Thank you.
Comment 3 Tim Allison 2019-12-18 20:22:47 UTC
Turns out you can trigger an oom with -Xmx900m with just three integer values:
        
BitSet bs = new BitSet();
bs.set(1794795584);
bs.set(1270275541);
bs.set(2127760244);

Any objections to: https://github.com/brettwooldridge/SparseBitSet

I've run a handful of speed/memory/accuracy tests and it _seems_ far better than Java's BitSet.

If there are no objections, should we copy/paste the class or import another dependency?
Comment 4 Andreas Beeker 2019-12-18 20:32:18 UTC
+1 ... I wanted to point out the same ... well it's one of the first google matches anyway when searching for sparse bitset
Comment 5 Tim Allison 2019-12-18 20:36:39 UTC
What's your preference for adding a dependency vs copy/pasting one large class.  

The benefit to copy/paste is that we can strip out object serialization, which I now detest.  And, right, those are only private methods, but still...
Comment 6 Dominik Stadler 2019-12-18 20:55:47 UTC
Copying means duplicating code and is usually not done unless the code is "submitted" to the project. It would just hide the fact that use code from other projects. 

Using a dependency means we can more easily update to newer versions, so I would prefer that. 

Hopefully we can switch to a modern dependency management at some point to get rid of the current changes in the build.xml for updated dependencies.

On the subject: I had good experiences with https://github.com/lemire/javaewah in another project. This one also seems to be at least more actively maintained than the other one and used by many more projects on GitHub, 316 vs. 16.
Comment 7 Tim Allison 2019-12-18 20:57:53 UTC
Sounds good.  If we're adding a dependency, then, y, I'm happy to go with javaewah.  Let me kick the same tires on that one.

The single class in wooldridge's was appealing if we were to copy/paste.
Comment 8 Andreas Beeker 2019-12-18 21:31:04 UTC
-1 to use javaewah. I have only read the intro/readme.md about compressed bitmaps, which says that the sparse scenario is not a use case for compressed bitmaps.
I don't know how many shape ids would be used at maximum, but I guess something like 20-30k should be the maximum goal.

The SparseBitSet is ASL licensed so I don't see a problem importing it into our source - although I'm a bit puzzled how lines 4-5 of the source should be treated ... and usually ASL licensed code should have the ASL header on top ...
Comment 9 Konstantin Gribov 2019-12-19 16:39:39 UTC
I'd suggest to look at Lemire's RoaringBitmap (https://github.com/RoaringBitmap/RoaringBitmap), it's a bit newer and it works fine with both sparse and dense bitmaps.
Comment 10 Tim Allison 2019-12-19 17:26:55 UTC
I ran two tests of roaring bitmap, java bitset and zaxxers sparsebitset.

I wanted to simulate "worst case scenario" which is random ints from 0 to Integer.MAX_Value.

First test: 1000000 random numbers (well above anything we'd think we'd need)

roaring bitmap (tested with -Xmx128m)
48018 total ms, 539.5280898876405 avg ms

java bitset (tested with -Xmx2g)
23826 total ms, 267.70786516853934 avg ms

zaxxer (requires -Xmx256m)
27730 total ms, 311.5730337078652 avg ms


100000 random numbers (still a good deal more that we think we'd need)

roaring bitmap (tested with -Xmx128m)
6508 total ms, 73.12359550561797 avg ms

java bitset (tested with -Xmx2g)
7607 total ms, 85.47191011235955 avg ms

zaxxer (tested with -Xmx128m)
3353 total ms, 37.674157303370784 avg ms
Comment 11 Tim Allison 2019-12-19 17:31:10 UTC
I'm wary of microbenchmarks (especially when I write them), and these results could be misleading.  I did include a warmup phase at.

At the very least, they show memory consumption is awful with Java, much better with Zaxxer and still better with roaring.

Zaxxer is the fastest, with roaring doing better than Java at 100k, which is well above what we think we'd need.

roaring does look to be better supported/more recent/more active community; zaxxer has an interface that is closer to BitSet and would be an easier drop in place kind of thing.  I did get an "IllegalArgumentException" with zaxxer on one run, which gives me pause...
Comment 12 Tim Allison 2019-12-19 17:32:52 UTC
Hackery of code:
        Random r = new Random();
        int sz = 100000;
        int iterations = 100;
        long absents = 0;
        long gets = 0;
        long totalElapsed = 0;
        int warmup = 10;
        int observations = 0;
        for (int it = 0; it < iterations; it++) {
            long start = System.currentTimeMillis();
            System.out.println("iteration " + it);
            int[] ints = new int[sz];
            for (int i = 0; i < sz; i++) {
                ints[i] = Math.abs(r.nextInt());
            }
     //       BitSet bs = new BitSet();
            RoaringBitmap rb = new RoaringBitmap();
            for (int i = 0; i < sz; i++) {
                /*bs.set(ints[i]);
                long absent = bs.nextClearBit(ints[i]);
                absents += absent;
                if (bs.get(Math.abs(r.nextInt()))) {
                    gets++;
                }*/

                rb.add(ints[i]);
                long absent = rb.nextAbsentValue(ints[i]);
                absents += absent;
                if (rb.contains(Math.abs(r.nextInt()))) {
                    gets++;
                }
            }
            if (it > warmup) {
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(absents + " " + gets + " " + elapsed);
                totalElapsed += elapsed;
                observations++;
            }
        }
        System.out.println(totalElapsed + " total ms, " +
                (double) totalElapsed / (double) observations + " avg ms");
Comment 13 Tim Allison 2019-12-19 17:34:58 UTC
Sorry.  The exception I got with Zaxxer was when I tried to get an Integer.MAX_VALUE:

Exception in thread "main" java.lang.IndexOutOfBoundsException: i=2147483647
Comment 14 Tim Allison 2019-12-19 17:41:07 UTC
And ewah is far, far, far slower...I didn't even wait for completion.

Times on 100000 random integers:
iteration 0 19198ms
iteration 1 18822ms
iteration 2 19034ms
Comment 15 PJ Fanning 2019-12-19 19:38:24 UTC
could we implement a wrapping API and allow people to plugin their own implementations (providing a reference implementation that is the default)?
Comment 16 Andreas Beeker 2019-12-19 19:57:09 UTC
(In reply to PJ Fanning from comment #15)
> could we implement a wrapping API and allow people to plugin their own
> implementations (providing a reference implementation that is the default)?

I hope we can achieve this without adding another ThreadLocal.
With all those tweaks we provide elsewhere, I think it's time to think about some kind of POI context/configuration class, which can be optionally provided on the top level usermodel classes.
Comment 17 Tim Allison 2019-12-19 21:22:41 UTC
>I hope we can achieve this without adding another ThreadLocal.
+1

I propose picking roaring or zaxxer and just running with it for now.
Comment 18 Tim Allison 2019-12-20 21:00:51 UTC
Created attachment 36927 [details]
PATCH

I'm attaching a first draft of a patch with zaxxer.  I'm not going to commit it and head off on holiday. :)

Longer term, Robert Muir recommended Lucene's SparseBitSet, and it looks good. I've opened https://issues.apache.org/jira/browse/COLLECTIONS-743 and will submit a patch there.  If we can get it into collections, we can drop the zaxxer dependency.
Comment 19 Tim Allison 2020-01-07 15:46:34 UTC
r1872445

added zaxxer for now.

We can open a new ticket for other options if needed.

Please re-open if there are any surprises or areas for improvements.
Comment 20 Saurabh 2021-03-03 10:56:51 UTC
Hello Team,

We are trying to upgrade Apache POI 4.1.12.

As per https://github.com/brettwooldridge/SparseBitSet/blob/master/LICENSE, We can see SparseBitSet 1.2 is license under Apache License 2.0.

However when we see src/main/java/com/zaxxer/sparsebits/SparseBitSet.java, it has following comment -

/*- This software is the work of Paladin Software International, Incorporated,
 *  based upon previous work done for and by Sun Microsystems, Inc. */

Our legal is asking what this comments means and asking for a clarification if it is indeed licensed under Apache License 2.0.

We are trying to reach out to owner Mr. Brett Wooldridge but no success yet.

In case do you know about the licensing impact, please guide us.

Thanks
Saurabh K.
Comment 21 PJ Fanning 2021-03-03 11:07:37 UTC
Hi Saurabh,
That is not a question for Apache POI team - that is not our code. It is a question for https://github.com/brettwooldridge/SparseBitSet
Comment 22 Saurabh 2021-03-03 11:42:15 UTC
Yes, thanks.

Can we configure/customize to use our own implementation and not SparseBitSet?
Comment 23 PJ Fanning 2021-03-03 11:46:09 UTC
Not everything in POI uses this code - so you could try excluding this lib and see if your code still works.

You could also stick to an older version of POI - one released before we introduced this dependency.

Another option would be for your developers to fork POI and use a different BitSet implementation.
Comment 24 Saurabh 2021-03-05 05:39:09 UTC
Thanks everyone. We get confirmation from Mr. Brett Wooldridge that 
SparseBitSet is fully under the Apache Software License V2 and free from any encumbrances.


From: Brett Wooldridge <brett.wooldridge@gmail.com> 
Sent: Thursday, March 4, 2021 1:18 PM
Subject: Re: Clarification Needed for SparseBitSet, a dependent component for using Apache POI

XXXX,

SparseBitSet is fully under the Apache Software License V2, with no other copyright, patent, or distribution encumbrances. The Apache Foundation itself would have never created such a dependency on a non-ASLv2 library.

Best regards,
Brett Wooldridge