Bug 67080 - ImplicitObjectELResolverImpl.getValue() is slow
Summary: ImplicitObjectELResolverImpl.getValue() is slow
Status: VERIFIED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 9.0.x
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-29 17:48 UTC by John Engebretson
Modified: 2023-11-03 16:51 UTC (History)
2 users (show)



Attachments
Reproducer & speed test (3.17 KB, text/plain)
2023-08-29 17:49 UTC, John Engebretson
Details
decompiled reproducer (28.06 KB, text/plain)
2023-08-30 20:19 UTC, John Engebretson
Details
JMH test to compare binarySearch to switch(scope) (2.98 KB, text/plain)
2023-09-20 15:59 UTC, Christopher Schultz
Details
Alternative enum lookup approach JMH (3.82 KB, text/plain)
2023-09-20 17:21 UTC, Anurag Dubey
Details

Note You need to log in before you can comment on or make changes to this bug.
Description John Engebretson 2023-08-29 17:48:39 UTC
A high-volume, latency sensitive application reports that `ImplicitObjectELResolverImpl.getValue()` uses approximately 0.2% of cpu, of which most (0.18%) comes from the call to `Arrays.binarySearch` for an array of strings.  This problem was previously noted on https://bz.apache.org/bugzilla/show_bug.cgi?id=53895 but the solution was rejected for various reasons.

I instead propose replacing the binary search with a switch statement on the string values.  This requires Java 7 and was not an option when the previous issue was closed.

The attached speed test shows that the `switch` statement is approximately 66% faster.  I don't think it matters but I ordered the case statements in a rough approximation of their frequency in our application.


The attached speed test returns results such as the following:

Done with old in 1522
Done with old in 1636
Done with old in 1606
Done with old in 1641
Done with old in 1583
Done with old in 1587
Done with old in 1629
Done with old in 1627
Done with old in 1607
Done with old in 1569
Done with new in 474
Done with new in 453
Done with new in 450
Done with new in 452
Done with new in 449
Done with new in 450
Done with new in 454
Done with new in 452
Done with new in 459
Done with new in 447
Comment 1 John Engebretson 2023-08-29 17:49:27 UTC
Created attachment 38935 [details]
Reproducer & speed test
Comment 2 Christopher Schultz 2023-08-30 19:55:04 UTC
I haven't studied what the Java compiler does with switch(String) statements, but I believe that numeric-based switch statements have two flavors: lookup vs table. One of them is faster but is only possible under certain circumstances. Introducing Strings likely complicates things further.

If we just need to convert a String value into a number, there's a data structure pretty much purpose-built for that: a hash table. What if you implement the lookup as HashMap<String,Integer>.get()?
Comment 3 John Engebretson 2023-08-30 20:18:50 UTC
I used javap to examine the bytecode of the reproducer - the full output is attached but the critical portion is below.  In short, it hashes each constant and switches based on that hash.

This is superior to binary search because it's constant-time and superior to HashMap because there is no object instantiation/traversal.


 protected void runNewTest();
    descriptor: ()V
    flags: (0x0004) ACC_PROTECTED
    Code:
      stack=5, locals=5, args_size=1
         0: iconst_m1
         1: istore_1
         2: iconst_0
         3: istore_2
         4: goto          363
         7: getstatic     #40                 // Field TEST_SCOPES:[Ljava/lang/String;
        10: iload_2
        11: aaload
        12: astore_3
        13: aload_3
        14: dup
        15: astore        4
        17: invokevirtual #62                 // Method java/lang/String.hashCode:()I
        20: lookupswitch  { // 11
             -1354757532: 120
             -1221270899: 133
              -697829026: 146
              -272077475: 159
              -214877152: 172
               -10806556: 185
               106436749: 198
               872295685: 211
               889144335: 224
              1291689221: 237
              2073857551: 250
                 default: 324
            }
       120: aload         4
       122: ldc           #18                 // String cookie
       124: invokevirtual #66                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
       127: ifne          294
       130: goto          324
       133: aload         4
       135: ldc           #20                 // String header
       137: invokevirtual #66                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
       140: ifne          283
       143: goto          324
       146: aload         4
Comment 4 John Engebretson 2023-08-30 20:19:25 UTC
Created attachment 38938 [details]
decompiled reproducer
Comment 5 Anurag Dubey 2023-09-07 15:41:35 UTC
Create Pull Request for the proposed change, have also shared the performance comparison for Old vs Switch implementation vs Hashmap implementation -
https://github.com/apache/tomcat/pull/666
Comment 6 John Engebretson 2023-09-07 15:45:06 UTC
Anurag's solution is clearly faster than mine, I like it.
Comment 7 Anurag Dubey 2023-09-20 09:24:18 UTC
Hey Christopher Schultz, can you help out with the Pull Request i raised.
Comment 8 Mark Thomas 2023-09-20 15:43:19 UTC
The alternative approach is not faster. The test that showed it was was not a fair test. I will be applying a variation of the original patch.
Comment 9 Christopher Schultz 2023-09-20 15:59:05 UTC
Created attachment 39050 [details]
JMH test to compare binarySearch to switch(scope)

# JMH version: 1.37
# VM version: JDK 21, OpenJDK 64-Bit Server VM, 21+35-2513
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 5 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.ScopeLookup.binarySearch

# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 32873899.937 ops/s
# Warmup Iteration   2: 32037578.845 ops/s
# Warmup Iteration   3: 32101040.458 ops/s
# Warmup Iteration   4: 32619813.449 ops/s
# Warmup Iteration   5: 32459923.794 ops/s
Iteration   1: 32878096.534 ops/s
Iteration   2: 32782993.884 ops/s
Iteration   3: 32802020.501 ops/s
Iteration   4: 32357564.918 ops/s
Iteration   5: 34414429.950 ops/s


Result "org.sample.ScopeLookup.binarySearch":
  33047021.158 ±(99.9%) 3046287.685 ops/s [Average]
  (min, avg, max) = (32357564.918, 33047021.158, 34414429.950), stdev = 791111.238
  CI (99.9%): [30000733.472, 36093308.843] (assumes normal distribution)


# JMH version: 1.37
# VM version: JDK 21, OpenJDK 64-Bit Server VM, 21+35-2513
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 5 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.ScopeLookup.caseSearch

# Run progress: 50.00% complete, ETA 00:00:50
# Fork: 1 of 1
# Warmup Iteration   1: 45942620.330 ops/s
# Warmup Iteration   2: 43150765.515 ops/s
# Warmup Iteration   3: 44932581.558 ops/s
# Warmup Iteration   4: 45360554.270 ops/s
# Warmup Iteration   5: 43506970.095 ops/s
Iteration   1: 44393187.971 ops/s
Iteration   2: 46287069.970 ops/s
Iteration   3: 46461809.778 ops/s
Iteration   4: 45619371.053 ops/s
Iteration   5: 46167433.097 ops/s


Result "org.sample.ScopeLookup.caseSearch":
  45785774.374 ±(99.9%) 3233722.237 ops/s [Average]
  (min, avg, max) = (44393187.971, 45785774.374, 46461809.778), stdev = 839787.396
  CI (99.9%): [42552052.137, 49019496.610] (assumes normal distribution)


# Run complete. Total time: 00:01:41

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

NOTE: Current JVM experimentally supports Compiler Blackholes, and they are in use. Please exercise
extra caution when trusting the results, look into the generated code to check the benchmark still
works, and factor in a small probability of new VM bugs. Additionally, while comparisons between
different JVMs are already problematic, the performance difference caused by different Blackhole
modes can be very significant. Please make sure you use the consistent Blackhole mode for comparisons.

Benchmark                  Mode  Cnt         Score         Error  Units
ScopeLookup.binarySearch  thrpt    5  33047021.158 ± 3046287.685  ops/s
ScopeLookup.caseSearch    thrpt    5  45785774.374 ± 3233722.237  ops/s
Comment 10 Christopher Schultz 2023-09-20 16:12:08 UTC
This test seems to show that there is indeed a speed improvement by using a switch statement instead of a binary search, somewhere in the 30% range, and it seems consistent. I didn't get the kind of speedup that John sees, but that may be due to environmental differences.

Scope lookups probably occur many times in a given JSP, so this can probably add-up, especially when seeing a report that this is a hot-spot.

In order to get a good randomized test where the JIT and/or CPU were not cheating by discovering that we were e.g. always searching for "applicationScope", I used java.util.Random.nextInt to randomly-choose one of the scopes for each test. I did not check to see how much of the testing time was taken up by the Random.nextInt invocation, but both compared methods are using the same random technique.

I'm happy to receive a critique of my JMH test if it does not show what I think it shows.
Comment 11 Mark Thomas 2023-09-20 16:30:22 UTC
(In reply to Mark Thomas from comment #8)
> The alternative approach is not faster. The test that showed it was was not
> a fair test. I will be applying a variation of the original patch.

To be clear the original patch/approach from John Engebretson is faster than alternative patch/approach proposed by Anurag Dubey.

Both are faster than binary search.
Comment 12 Anurag Dubey 2023-09-20 17:21:52 UTC
Created attachment 39051 [details]
Alternative enum lookup approach JMH
Comment 13 Anurag Dubey 2023-09-20 17:23:20 UTC
(In reply to Mark Thomas from comment #11)
> (In reply to Mark Thomas from comment #8)
> > The alternative approach is not faster. The test that showed it was was not
> > a fair test. I will be applying a variation of the original patch.
> 
> To be clear the original patch/approach from John Engebretson is faster than
> alternative patch/approach proposed by Anurag Dubey.
> 
> Both are faster than binary search.

I agree that the earlier test shared for Alternative enum lookup approach didn't involve a switch lookup. I have created another test involving switch after the enum lookup, which looks faster than other 2, thoughts on this ?


Benchmark                            Mode  Cnt   Score   Error   Units
LookupVsSwitch.binarySearch         thrpt   15  12.175 ? 0.472  ops/us
LookupVsSwitch.fromLookupAndSwitch  thrpt   15  19.759 ? 0.878  ops/us
LookupVsSwitch.fromSwitch           thrpt   15  15.106 ? 0.520  ops/us
Comment 14 Christopher Schultz 2023-09-20 18:01:21 UTC
The fromLookupAndSwitch technique seems to be implemented in an awkward way. Why bother converting from String -> Enum and then using a switch on the enum, instead of just (a) implementing a getScopeNumber() method in each enum instance or (b) just using convenient ordinal values which match the value you actually want?

I would think this would be the best implementation if you wanted to use Enums:

enum Scope {
    applicationScope(0), cookie(1), header(2), ...

  public static int fromLookupAndSwitch(String scope) {
    Scope scope = lookupMap.get(scope);
    if(null == scope) {
    } else {
      return scope.ordinal();
    }
  }
}

I understand that this test is no longer "from lookup and switch" but I decided to keep the method name the same.

I guess the point is if you use a hash, you don't have to use a switch at all.
Comment 15 Christopher Schultz 2023-09-20 18:05:46 UTC
... also what about Scope.valueOf(scopeName) and then you don't have to implement your own map? It's possible that a manually-constructed Map would be faster than the default mapping provided by java.lang.Enum, but there is less code required.
Comment 16 Anurag Dubey 2023-09-20 18:09:25 UTC
(In reply to Christopher Schultz from comment #14)
> The fromLookupAndSwitch technique seems to be implemented in an awkward way.
> Why bother converting from String -> Enum and then using a switch on the
> enum, instead of just (a) implementing a getScopeNumber() method in each
> enum instance or (b) just using convenient ordinal values which match the
> value you actually want?
> 
> I would think this would be the best implementation if you wanted to use
> Enums:
> 
> enum Scope {
>     applicationScope(0), cookie(1), header(2), ...
> 
>   public static int fromLookupAndSwitch(String scope) {
>     Scope scope = lookupMap.get(scope);
>     if(null == scope) {
>     } else {
>       return scope.ordinal();
>     }
>   }
> }
> 
> I understand that this test is no longer "from lookup and switch" but I
> decided to keep the method name the same.
> 
> I guess the point is if you use a hash, you don't have to use a switch at
> all.

That is to mimic the outer switch which anyways applies in the getValue method. ref- https://github.com/anuragdy/tomcat/blob/4257836d0930c853f862dfe5f0aad4c24cefec31/java/javax/servlet/jsp/el/ImplicitObjectELResolver.java#L90
Comment 17 Christopher Schultz 2023-09-20 18:39:09 UTC
Fair enough, but if we are going to try to squeeze as much performance out of this, then the correct process is with delegated dispatch:

enum Scope {
  applicationScope(0) {
    public Object getScopedValue(ELContext ctx, Object base, Object property) {
      PageContext page =  (PageContext) context
           .getContext(JspContext.class);
      context.setPropertyResolved(base, property);

      return ScopeManager.get(page).getApplicationScope();
    }
  },

  cookie(1) {
    public Object getScopedValue(ELContext ctx, Object base, Object property) {
      PageContext page =  (PageContext) context
           .getContext(JspContext.class);
      context.setPropertyResolved(base, property);

      return ScopeManager.get(page).getCookie();
    }
  },

  ...
}

Then:

Scope scope = Scope.valueOf(property);
if(null != scope) {
  return scope.getScopedValue(ctx, base, property);
} else {
  return null;
}

If you are going to perform a lookup, make sure you get the best bang for your buck.
Comment 18 Mark Thomas 2023-09-21 11:48:31 UTC
(In reply to Christopher Schultz from comment #17)

> If you are going to perform a lookup, make sure you get the best bang for
> your buck.

+1

I'll put together a patch along those lines.
Comment 19 Mark Thomas 2023-09-21 14:23:56 UTC
Fixed in:
- 11.0.x for 11.0.0-M12 onwards
- 10.1.x for 10.1.14 onwards
-  9.0.x for  9.0.81 onwards
-  8.5.x for  8.5.94 onwards
Comment 20 John Engebretson 2023-09-21 15:49:33 UTC
Thank you to everyone involved!
Comment 21 John Engebretson 2023-11-03 16:51:41 UTC
This change reached production and decreased the cost of javax.servlet.jsp.el.ImplicitObjectELResolver.getValue() by approximately 80%.