Down the Rabbit Hole
This is where we give away the recipe to the secret sauce. When you come in with benchmarks like ours there is a certain amount of skepticism that must be addressed.
In order to make HikariCP as fast as it is, we went down to bytecode-level engineering, and beyond. We pulled out every trick we know to help the JIT help you. We studied the bytecode output of the compiler, and even the assembly output of the JIT, to keep key routines under the JIT inline threshold. We flattened inheritance hierarchies, shadowed member variables, and eliminated casts.
Sometimes seeing that a routine was surprisingly over the inline-threshold, we would figure out how to squeeze a few extra bytecodes out. Take this simple example:
```java
public SQLException checkException(SQLException sqle) {
   String sqlState = sqle.getSQLState();
   if (sqlState == null)
      return sqle;

   if (sqlState.startsWith("08"))
      _forceClose = true;
   else if (SQL_ERRORS.contains(sqlState))
      _forceClose = true;

   return sqle;
}
```
A simple enough method, checking whether the SQLSTATE of an exception indicates a disconnection error. Here is the bytecode:
```
 0: aload_1
 1: invokevirtual #148  // Method java/sql/SQLException.getSQLState:()Ljava/lang/String;
 4: astore_2
 5: aload_2
 6: ifnonnull     11
 9: aload_1
10: areturn
11: aload_2
12: ldc           #154  // String 08
14: invokevirtual #156  // Method java/lang/String.startsWith:(Ljava/lang/String;)Z
17: ifeq          28
20: aload_0
21: iconst_1
22: putfield      #144  // Field _forceClose:Z
25: goto          45
28: getstatic     #41   // Field SQL_ERRORS:Ljava/util/Set;
31: aload_2
32: invokeinterface #162, 2  // InterfaceMethod java/util/Set.contains:(Ljava/lang/Object;)Z
37: ifeq          45
40: aload_0
41: iconst_1
42: putfield      #144  // Field _forceClose:Z
45: aload_1
46: areturn
```
Smart rabbits know that the default inline threshold for a JVM running the server HotSpot compiler is 35 bytecodes, so we gave this routine some love. That early return is costing us, and maybe those conditionals can be combined. The second attempt was this:
```java
String sqlState = sqle.getSQLState();
if (sqlState != null && (sqlState.startsWith("08") || SQL_ERRORS.contains(sqlState)))
   _forceClose = true;

return sqle;
```
Close, but no cigar: at 36 bytecodes it is one over the threshold. How about:
```java
String sqlState = sqle.getSQLState();
_forceClose |= (sqlState != null && (sqlState.startsWith("08") || SQL_ERRORS.contains(sqlState)));
return sqle;
```
Looks simpler, right? It's actually worse: 45 bytecodes. The final solution:
```java
String sqlState = sqle.getSQLState();
if (sqlState != null)
   _forceClose |= sqlState.startsWith("08") | SQL_ERRORS.contains(sqlState);

return sqle;
```
Note the use of the non-short-circuiting OR operator (|). A nice hack that sacrifices theoretical performance (both operands are always evaluated) for concrete performance (the method gets inlined, which more than makes up for it). And the resulting bytecode:
```
 0: aload_1
 1: invokevirtual #146  // Method java/sql/SQLException.getSQLState:()Ljava/lang/String;
 4: astore_2
 5: aload_2
 6: ifnull        34
 9: aload_0
10: dup
11: getfield      #142  // Field _forceClose:Z
14: aload_2
15: ldc           #152  // String 08
17: invokevirtual #154  // Method java/lang/String.startsWith:(Ljava/lang/String;)Z
20: getstatic     #40   // Field SQL_ERRORS:Ljava/util/Set;
23: aload_2
24: invokeinterface #160, 2  // InterfaceMethod java/util/Set.contains:(Ljava/lang/Object;)Z
29: ior
30: ior
31: putfield      #142  // Field _forceClose:Z
34: aload_1
35: areturn
```
Right under the wire at 35 bytecodes. A small routine, and actually not a particularly high-traffic one, but you get the idea. Multiply that level of effort across the HikariCP library and you start to get an inkling of why it is fast.
Pretty much every connection pool, dare we say every pool available, has to "wrap" your real Connection, Statement, PreparedStatement, etc. instances and intercept methods like close() so that the Connection isn't actually closed but is instead returned to the pool. Statement and its subclasses must be wrapped, and SQLException caught and inspected to see whether the exception reflects a disconnection that warrants ejecting the Connection from the pool.
What this means is "delegation". The Connection wrapper cares about intercepting close() or execute(sql), for example, but for almost all of the other methods of Connection it simply delegates (exception handling omitted). Something like:
```java
public Clob createClob() throws SQLException {
   return delegate.createClob();
}
```
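For completeness, here is roughly what that same method looks like once the omitted exception handling is added back -- a hand-written illustration of the pattern, not HikariCP's actual source. It assumes the proxy holds the wrapped Connection in a delegate field and reuses the checkException routine shown earlier:

```java
public Clob createClob() throws SQLException {
   try {
      return delegate.createClob();
   }
   catch (SQLException e) {
      // checkException inspects the SQLSTATE and may flag the connection for eviction
      throw checkException(e);
   }
}
```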
The first iteration of HikariCP also did this, and it still provides a "fallback" mode. An interface like PreparedStatement contains some 50+ methods, only 4 of which we are interested in intercepting. Rather than creating a wrapper class with 50+ hand-written "delegate" methods like the one above, we use Javassist to generate all of the delegate methods. While this provides no inherent performance increase, it means that our "proxy" (wrapper) class need only contain the overridden methods. The Statement proxy class in HikariCP is only ~160 lines of code including comments, compared to 1100+ lines of code in other pools. This approach is in keeping with our minimalist ethos.
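For the curious, a minimal sketch of the general technique -- not HikariCP's actual generator -- might look something like the following. The proxy class name (com.example.StatementProxy) and its delegate field are assumptions made for illustration:

```java
import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;
import javassist.CtNewMethod;
import javassist.Modifier;

public class DelegateGenerator {
   public static Class<?> generateStatementProxy() throws Exception {
      ClassPool pool = ClassPool.getDefault();
      CtClass proxy = pool.get("com.example.StatementProxy");   // hand-written: contains only the intercepted methods
      CtClass iface = pool.get("java.sql.Statement");

      // Declared methods only; a real generator would also walk inherited interfaces.
      for (CtMethod method : iface.getDeclaredMethods()) {
         if (hasDeclaredMethod(proxy, method)) {
            continue;   // close(), execute(), etc. are already intercepted by hand
         }
         // "$$" expands to the parameter list and "$r" casts to the declared
         // return type in Javassist's source fragments.
         CtMethod delegating = CtNewMethod.copy(method, proxy, null);
         delegating.setModifiers(delegating.getModifiers() & ~Modifier.ABSTRACT);
         String call = "((java.sql.Statement) delegate)." + method.getName() + "($$);";
         String body = method.getReturnType() == CtClass.voidType
                     ? "{ " + call + " }"
                     : "{ return ($r) " + call + " }";
         delegating.setBody(body);
         proxy.addMethod(delegating);
      }
      return proxy.toClass();
   }

   private static boolean hasDeclaredMethod(CtClass clazz, CtMethod method) {
      for (CtMethod m : clazz.getDeclaredMethods()) {
         if (m.getName().equals(method.getName()) && m.getSignature().equals(method.getSignature())) {
            return true;
         }
      }
      return false;
   }
}
```

The hand-written proxy stays tiny; the boilerplate pass-through methods are stamped out once at startup.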
Our delegates perform quite admirably:
Pool | Med (ms) | Avg (ms) | Max (ms) |
---|---|---|---|
BoneCP | 5049 | 3249 | 6929 |
HikariCP | 13 | 11 | 58 |
The fact that HikariCP achieves a 13ms median on this benchmark, even while using delegates like everyone else, is attributable more to Hikari's efficient core than to anything else.
And yet, looking at the bytecode for all of those delegate methods, with their getfield, checkcast, and invokeinterface opcodes, it really touched a nerve. Is it possible to go faster? Can we actually eliminate delegation itself?
"I've always been mad, I know I've been mad,
like the most of us,
very hard to explain why you're mad,
even if you're not mad..."
- Pink Floyd
But how? How could we eliminate delegation and still intercept the methods we need? On top of that, we need to wrap every "delegate" method with a try..catch to interrogate SQLExceptions -- which is itself interception, isn't it?
In order to eliminate delegation the user needs to run against the "bare metal" of their driver classes, yet we still need to intercept methods and wrap them with exception handlers. We were already using Javassist to generate our classes for delegation. Why not use Javassist to inject our code directly into the driver's classes?
However, the classes must be altered before they are loaded ... because convincing the JVM to reload classes is no trivial task, particularly when you don't own the ClassLoader in question. The answer lay in java.lang.instrument. We built an instrumentation "agent" that "instruments" the driver classes on the fly as they are loaded, injecting our code into them, including try..catch blocks where necessary. The instrumentation agent is dynamically loaded and unloaded so that it doesn't spend time inspecting classes that have nothing to do with JDBC and have no need for instrumentation.
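A bare-bones sketch of the shape such an agent can take -- not HikariCP's actual agent; the driver package name, the choice of close() as the wrapped method, the injected _forceClose field, and the use of premain rather than dynamic attach/detach are all simplifying assumptions -- might look like this:

```java
import java.io.ByteArrayInputStream;
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtField;
import javassist.CtMethod;

public class JdbcInstrumentationAgent implements ClassFileTransformer {

   // Registered via the Premain-Class manifest attribute of the agent jar.
   public static void premain(String agentArgs, Instrumentation inst) {
      inst.addTransformer(new JdbcInstrumentationAgent());
   }

   @Override
   public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                           ProtectionDomain protectionDomain, byte[] classfileBuffer) {
      // Only touch the driver classes we care about; returning null leaves the class untouched.
      if (className == null || !className.startsWith("org/example/jdbc/")) {   // hypothetical driver package
         return null;
      }
      try {
         ClassPool pool = ClassPool.getDefault();   // a real agent would scope the pool to the class loader
         CtClass ct = pool.makeClass(new ByteArrayInputStream(classfileBuffer));

         // Inject a field and wrap a method with a catch block, directly in the driver class.
         ct.addField(CtField.make("private boolean _forceClose;", ct));
         CtMethod close = ct.getDeclaredMethod("close");
         close.addCatch("{ _forceClose = true; throw $e; }", pool.get("java.sql.SQLException"));

         byte[] transformed = ct.toBytecode();
         ct.detach();
         return transformed;
      }
      catch (Throwable t) {
         return null;   // any failure means the class loads unmodified, and we stay in delegation mode
      }
   }
}
```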
As slim as our "delegate" proxies are, there is still a fair amount of code, especially in the ConnectionProxy class. The prospect of "inlining" the bytecode, or worse, the source code, into the instrumentation code had a bad smell about it. We've already written the intercept code once in our proxies; can't we just use that somehow? But the code is in our classes, not in the target driver's classes.
This is where we think code can sometimes become art. We created an annotation, @HikariInject, and with it we annotate all of the fields and methods that we want injected from the existing proxy classes. The instrumentation agent inspects our proxy classes and injects the fields and methods tagged with @HikariInject into the target driver class -- with some special logic for handling collisions. The pure gold is that the exact same class code used in "delegation" mode is the code injected in "instrumentation" mode. There is only one canonical source for both.
The instrumenter is extremely robust, but if there is any kind of failure injecting the code, HikariCP drops back to delegation mode (and logs a message to that effect). The JVM is smart enough to know that if an instrumentation agent throws an exception, the class is loaded cleanly without it -- nothing can be corrupted. Injection takes place at pool startup time, and typically takes only about 200ms.
The result of this is:
Pool | Med (ms) | Avg (ms) | Max (ms) |
---|---|---|---|
BoneCP | 5049 | 3249 | 6929 |
HikariCP | 8 | 7 | 13 |
While going from 13ms (delegates) to 8ms (instrumentation) may not seem like much, it represents a 40% improvement.
Still, even without instrumentation, how do we get anywhere near 13ms for 60+ million JDBC API invocations? Well, we're obviously running against a stub JDBC implementation, so the JIT is doing a lot of inlining. However, the same stub-level inlining occurs for BoneCP in the benchmark, so there is no inherent advantage to us.
But inlining is part of the equation, and I will say that BoneCP has at least 10 methods that are flagged as "hot" by the JVM but that the JIT considers too large to inline -- at least two of which are on the critical path. HikariCP has none. Additionally, some of the features in BoneCP require it to do much more work (I thought "bone" stood for bare-bones; maybe I'm mistaken). Which brings us to another topic...
Some light reading. TL;DR: Obviously, when you're running 400 threads "at once", you aren't really running them "at once" unless you have 400 cores. The operating system, using N cores, switches between your threads, giving each a small "slice" of time to run called a quantum.
But with 400 threads, when your time runs out (as a thread), it may be a "long time" before the scheduler gives you a chance to run again. With that many threads, if a thread cannot complete what it needs to get done during its time-slice, there is a performance penalty to be paid -- and not a small one.
We have combed through HikariCP, crushing and optimizing the critical code paths to ensure they can fully execute 60+ million JDBC invocations within a single "quantum" -- with, of course, the exception of a truly blocked condition, such as no available connections.
Which brings us to...
Another big hit incurred when you can't get your work done within a quantum is CPU cache-line invalidation. If your thread is preempted by the scheduler, when it finally gets a chance to run again, all of the data it was frequently accessing is likely no longer in the core's L1 or core-pair L2 cache. This is even more likely because you have no control over which core you will be scheduled on next.
Almost certainly. Our original goal when moving from delegates to instrumentation was to reach sub-millisecond times for 60+ million JDBC API invocations, and that goal still remains. We have some ideas that really get into the esoterica of modern CPU architectures, such as "false sharing".
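As a taste of what "false sharing" means: two fields that are written by different threads can land on the same CPU cache line, so every write by one thread invalidates the other core's cached copy. A tiny illustration (not HikariCP code) of the classic padding workaround:

```java
// Illustration only. Each core writing its own counter still pays for the other's
// writes if both longs share a 64-byte cache line; the padding pushes them apart.
// Field layout is ultimately up to the JVM, which is why JDK 8+ offers
// @sun.misc.Contended (with -XX:-RestrictContended) as a cleaner alternative.
public class PerThreadCounters {
   volatile long counterA;
   long p1, p2, p3, p4, p5, p6, p7;   // padding: 56 bytes between the two hot fields
   volatile long counterB;
}
```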