INTRO
Scenario: A high-throughput Java microservice performs flawlessly in staging but shows unpredictable latency spikes in production under real traffic. CPU usage fluctuates, GC pauses are inconsistent, and profiling reveals hotspots that “disappear” after a few minutes of uptime.
Problem Insight: What you’re observing isn’t just GC or thread contention—it’s the JVM itself dynamically optimizing your code through JIT compilation. The transition from interpreted bytecode to optimized machine code introduces warm-up phases, compilation thresholds, and speculative optimizations that can significantly impact latency and throughput.
Why It Matters at Scale:
At scale, JIT behavior directly affects:
- Tail latency (p99/p999)
- CPU utilization and cloud cost
- Startup performance vs steady-state throughput
- Predictability of performance in containerized environments
Outcome:
By the end of this tutorial, you will:
- Understand how HotSpot’s JIT (C1, C2, Tiered Compilation) works internally
- Identify and control JIT behavior in production systems
- Apply advanced JVM tuning strategies to reduce latency and improve throughput
- Analyze JIT decisions using real profiling and diagnostic tools
Problem Definition: JIT-Induced Performance Variability
A familiar production story goes like this: a Java service looks healthy under sustained load, yet the first few thousand requests after deployment are slower, noisier, and far less predictable than the steady-state profile captured in a benchmark report. The root cause is often not application logic, network jitter, or even garbage collection in isolation. It is the JVM changing the execution strategy of your code while the system is live. HotSpot does not simply “run Java”; it observes method invocation frequency, loop back-edges, branch profiles, and type distributions, then promotes hot code from interpretation to increasingly optimized machine code. That adaptive behavior is one of the JVM’s greatest strengths, but it also means performance is time-dependent.
Consider a simplified request path in a service that is usually deployed behind autoscaling:
public final class PriceService {
    // Simple interest: rateBps is in basis points (1 bp = 0.01%), on a 365-day year.
    public long calculate(long notional, long rateBps, int days) {
        long interest = (notional * rateBps * days) / 3_650_000L;
        return interest > 0 ? interest : 0;
    }

    public long handle(long n, long r, int d) {
        return calculate(n, r, d);
    }
}
On the first few calls, handle() and calculate() typically execute in the interpreter. That keeps startup responsive, but interpreted dispatch, profiling hooks, and unresolved runtime assumptions make per-request latency materially higher than it will be later. Once HotSpot decides these methods are hot enough, it compiles them, first with a cheaper compiler tier and later with more aggressive optimization. The important detail is that this transition happens during live traffic, so the latency curve is not flat; it evolves. In low-latency APIs or cold-start-heavy microservices, that warm-up tax can become visible in p95 and p99 distributions even when average throughput looks acceptable.
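To make the evolving latency curve concrete, the following is a minimal sketch that times consecutive batches of calls to a small method. The class name WarmupCurve, the work() method, and the batch sizes are illustrative assumptions, not part of the service above. On a typical HotSpot run the early batches tend to be slower than the later ones, although the exact numbers vary by machine and JVM version, and System.nanoTime() timing like this is only a rough indicator, not a benchmark.

```java
final class WarmupCurve {
    // Hypothetical stand-in for a small, hot request-path method.
    static long work(long n, long r, int d) {
        long v = (n * r * d) / 1_000L;
        return v > 0 ? v : 0;
    }

    // Times `batches` consecutive batches of `callsPerBatch` calls and
    // returns the elapsed nanoseconds per batch, oldest batch first.
    static long[] timeBatches(int batches, int callsPerBatch) {
        long[] nanos = new long[batches];
        long sink = 0; // consumed below so the calls cannot be discarded
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < callsPerBatch; i++) {
                sink += work(1_000_000L + i, 425L, 30);
            }
            nanos[b] = System.nanoTime() - start;
        }
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // practically never taken; keeps sink live
        }
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeBatches(20, 10_000);
        for (int b = 0; b < nanos.length; b++) {
            System.out.printf("batch %2d: %,d ns%n", b, nanos[b]);
        }
    }
}
```

Printing per-batch times rather than a single average is the point of the sketch: an average would hide exactly the early-batch cost that shows up in tail latency.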
You can usually see this transition directly:
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation PriceServiceApp
Interpreted execution leaves no entry in this log; a method first appears when it is compiled at a lower tier, and it typically appears again when it is recompiled at a higher tier as profiling matures. Internally, HotSpot is not merely translating bytecode to native code; it is using runtime evidence to decide whether inlining, constant folding, and other optimizations are now safe and profitable. That matters because the CPU is experiencing two very different execution modes: one dominated by interpreter overhead and indirect dispatch, and another dominated by optimized machine code with better locality and fewer dynamic checks. The trade-off is obvious in production: the JVM earns long-term throughput by spending CPU on observation and compilation up front.
A small benchmark makes the problem harder to ignore:
@Benchmark
public long coldPath() {
    return service.handle(1_000_000L, 425L, 30);
}
In a JMH-style harness, the same method can show dramatically different numbers depending on whether warm-up has completed. Without sufficient warm-up, you are benchmarking tier transitions as much as business logic. In real systems, that translates into uneven autoscaling behavior: new pods consume extra CPU compiling hot paths just when traffic is rising, and a deoptimization triggered by a broken type assumption can briefly push latency upward again. For trading engines, payment gateways, and request-driven microservices, JIT-induced variability is therefore not a theoretical curiosity. It is a runtime characteristic that directly shapes SLA compliance, capacity planning, and the true cost of “fast enough” Java.
Naive Approach: Interpreted Execution and Static Compilation
When teams first encounter JVM performance variability, the instinct is often to wish the runtime would just pick a single execution model and stay there. If interpreted execution causes warm-up noise, why not run everything interpreted and keep behavior simple? If JIT introduces compilation overhead and latency spikes, why not compile ahead of time and avoid runtime adaptation altogether? Both ideas sound attractive when viewed through the narrow lens of startup consistency, but both break down quickly in production because they optimize for one phase of execution while ignoring how Java applications actually behave under sustained, changing workloads.
Start with the interpreter. In the earliest moments of a JVM process, bytecode is not yet optimized machine code; it is executed instruction by instruction through HotSpot’s interpreter. That design is not an accident. Interpretation gives the JVM immediate executability, which is why a service can begin handling work before any serious compilation has happened. For short-lived processes and cold-start-sensitive functions, that matters. But the cost is that every method invocation, branch, and loop carries the overhead of dispatch machinery that optimized native code would largely remove.
public final class CurrencyNormalizer {
    public long normalize(long amount, int scale) {
        long factor = 1;
        for (int i = 0; i < scale; i++) {
            factor *= 10;
        }
        return amount / factor;
    }
}
When this method first runs, the interpreter walks its bytecode rather than executing a tightly optimized sequence of native instructions. Internally, that means repeated bytecode dispatch, runtime checks, and limited use of the processor’s full optimization potential. On a cold path, this is acceptable because the JVM prioritizes responsiveness over peak throughput. Under load, however, the loop and arithmetic begin to matter. A method like this may look trivial in source form, yet interpreted execution amplifies its cost across millions of requests. The trade-off is not theoretical: a low-latency service that depends on many such small methods will often spend its earliest traffic window paying for indirection rather than useful work.
You can observe that “run now, optimize later” posture directly with bytecode inspection:
javap -c CurrencyNormalizer.class
The output shows the JVM’s real starting point: stack-based bytecode instructions such as lmul, ldiv, and branch jumps around the loop. What matters here is that the JVM does not begin with machine code specialized for your CPU, branch patterns, or actual runtime types. It begins with portable instructions and only later decides whether the method is hot enough to deserve compilation. That portability is a strength of the platform, but it also explains why pure interpretation does not scale as a general strategy. It gets you to first execution quickly, yet it leaves too much CPU on the table once the code path becomes important.
The obvious counterproposal is static compilation: precompile everything and remove JIT from the equation. In practice, that solves only part of the problem. A statically compiled image can improve startup behavior and reduce warm-up sensitivity, which is why ahead-of-time strategies are attractive for serverless workloads and short-lived containers. But a naive AOT mindset assumes the compiler knows enough at build time to make the same decisions the JIT can make after observing the application under real traffic. In HotSpot, many of the most valuable optimizations are profile-driven. Inlining decisions depend on call-site behavior, speculative optimizations depend on observed class hierarchies, and loop optimizations become more accurate when the runtime has real frequency data rather than guesses.
java -Xint PriceServiceApp
java -Xcomp PriceServiceApp
These two diagnostics are useful because they expose the extremes. -Xint forces interpreted execution, which usually produces highly stable but predictably poor throughput and elevated CPU consumption. -Xcomp forces methods to be compiled on their first invocation, pushing the JVM toward eager compilation. Neither mode reflects the best production strategy. The first sacrifices steady-state efficiency for immediate simplicity; the second spends CPU and startup time compiling code before the runtime has gathered enough profile information to make the best optimization choices. In real services, especially autoscaled ones, that can mean shifting cost rather than eliminating it: interpreted mode inflates request latency, while over-eager compilation inflates startup CPU and may still generate code that later proves suboptimal.
That is why neither pure interpretation nor naive static compilation is enough. Interpreted execution is a useful bootstrap mechanism, not a serious end-state for hot paths. Static compilation can reduce cold-start pain, but without runtime feedback it cannot fully exploit the dynamic behavior that makes HotSpot effective under sustained load. The real challenge is not choosing one side of the spectrum; it is understanding why the JVM moves between them, and how that movement shapes both latency and throughput in live systems.
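When comparing runs under -Xint, -Xcomp, and the default mode, it helps for the process to report which mode it was started in and how much time the JIT has spent so far. The sketch below uses the standard java.lang.management beans; the class name JitModeReport and the mode labels are illustrative assumptions. The API documents that getCompilationMXBean() returns null when the JVM has no compilation system, so the code checks for that case rather than assuming a compiler is present.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class JitModeReport {
    // Classifies the launch flags; "default" means neither extreme was requested.
    static String modeFrom(List<String> jvmArgs) {
        if (jvmArgs.contains("-Xint")) {
            return "interpreter-only";
        }
        if (jvmArgs.contains("-Xcomp")) {
            return "compile-on-first-call";
        }
        return "default";
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("mode: " + modeFrom(jvmArgs));

        // May be null if this JVM has no compilation system.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            System.out.println("no JIT compiler reported");
        } else if (jit.isCompilationTimeMonitoringSupported()) {
            System.out.println(jit.getName() + ": "
                    + jit.getTotalCompilationTime() + " ms spent compiling so far");
        } else {
            System.out.println(jit.getName() + ": compilation time not monitored");
        }
    }
}
```

Logging the accumulated compilation time at a few points after startup gives a cheap view of how much CPU the warm-up phase is consuming in each mode.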
Deep Dive: HotSpot JIT Architecture (C1, C2, Tiered Compilation)
If you have ever watched a Java service “settle down” after deployment, you have already seen HotSpot’s compiler architecture at work. The steady-state profile that engineers usually care about is not the JVM’s starting state. HotSpot begins with loaded classes and bytecode, executes that bytecode in the interpreter, collects runtime profiling data, and only then decides which methods deserve native code generation. That adaptive pipeline exists because most programs spend most of their time in a relatively small fraction of their code, so the JVM can afford to observe broadly and optimize selectively. Oracle’s HotSpot documentation describes this as adaptive optimization: the VM first runs the program, identifies hot spots during execution, and then concentrates compilation effort on the code that matters most. (Oracle)
A small service method is enough to show the mechanics:
public final class OrderPricer {
    public long total(long subtotal, long taxBps, long shipping) {
        long taxed = subtotal + ((subtotal * taxBps) / 10_000);
        return taxed + shipping;
    }

    public long handle(long subtotal, long taxBps, long shipping) {
        return total(subtotal, taxBps, shipping);
    }
}
When handle() starts receiving traffic, HotSpot does not immediately emit its best machine code. It first interprets the bytecode while counting method invocations and loop back-edges, building the profiling picture that later compilation decisions rely on. That distinction matters in production because early requests are paying for both business logic and runtime observation. The CPU is not yet executing the fully optimized path; it is executing a more instrumented, more conservative path whose job is partly to learn. In a cold pod behind autoscaling, that means latency is not just a function of your application code. It is also a function of how quickly the JVM gathers enough evidence to promote code through the compilation pipeline.
You can see the promotions directly:
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation OrderPricerApp
With tiered compilation enabled, the output will show methods moving through compilation levels rather than leaping from interpreter straight to peak optimization. Oracle’s performance documentation states that tiered compilation is the default mode for the server VM, and that it combines interpreter profiling with client-compiler-generated code that can continue collecting profile information while running substantially faster than the interpreter. The same documentation notes that this can improve startup and even improve peak performance because the faster profiling phase gives the VM more opportunity to learn before handing code to the server compiler. (Oracle Docs)
That is the key to understanding the division of labor between C1 and C2. C1, historically the client compiler, is optimized for speed of compilation rather than maximum aggressiveness of generated code. Its purpose in a tiered system is not merely to “make code faster” in the abstract; it gets hot methods out of the interpreter quickly, reducing early execution cost while continuing to attach profiling instrumentation to compiled code paths. In other words, C1 is an operational compromise: compile quickly, keep the application moving, and keep learning. This is why tiered compilation generally behaves better than an interpreter-only startup model. Oracle’s documentation explicitly says compiled profiling is substantially faster than interpreter profiling, and that advantage is precisely what makes modern JVM warm-up less punishing than older all-or-nothing approaches. (Oracle Docs)
C2, by contrast, is where HotSpot starts acting like a truly aggressive optimizing compiler. Once runtime profiles are considered mature enough, C2 recompiles the hot code with a much stronger optimization budget. This is where inlining becomes transformative rather than incremental. Oracle’s HotSpot architecture paper notes that once the VM has gathered hot-spot information, it performs extensive method inlining, including for virtual invocations that many developers assume must remain expensive. (Oracle) In practice, that means object-oriented call chains that appear deep and fragmented in source form may collapse into a single optimized trace at the machine-code level.
Consider a slightly more realistic example:
interface DiscountPolicy {
    long apply(long amount);
}

final class FixedRatePolicy implements DiscountPolicy {
    public long apply(long amount) {
        return amount - (amount / 20);
    }
}

final class CheckoutService {
    private final DiscountPolicy policy = new FixedRatePolicy();

    public long checkout(long amount) {
        return policy.apply(amount) + 499;
    }
}
At source level, checkout() performs an interface call and a small arithmetic adjustment. Under C2, if the call site is observed to be effectively monomorphic, HotSpot may inline apply() and then optimize the surrounding arithmetic as one unit. That matters because once a call disappears, the optimizer can expose further opportunities: constant propagation, branch simplification, dead code elimination, and better register allocation all become easier. The performance trade-off is that such optimizations are speculative. They depend on the runtime profile staying valid. If a previously monomorphic call site becomes polymorphic, HotSpot may deoptimize and fall back to a less optimized state. That is not a flaw in the design; it is the cost of using runtime evidence aggressively.
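The speculative nature of this inlining can be provoked on purpose. The sketch below warms a single call site with one receiver type and only then introduces a second one. FlatPolicy, the class name CallSiteShapeDemo, and the iteration counts are illustrative assumptions. When the program is run with -XX:+PrintCompilation, the log may show previously compiled versions of checkout() or run() marked as made not entrant around the switch, though whether and when that happens depends on the JVM version and thresholds.

```java
final class CallSiteShapeDemo {
    interface DiscountPolicy {
        long apply(long amount);
    }

    static final class FixedRatePolicy implements DiscountPolicy {
        public long apply(long amount) {
            return amount - (amount / 20);
        }
    }

    // Hypothetical second implementation, first used only after warm-up.
    static final class FlatPolicy implements DiscountPolicy {
        public long apply(long amount) {
            return amount - 100;
        }
    }

    // One shared call site: the policy.apply(...) call below is what gets profiled.
    static long checkout(DiscountPolicy policy, long amount) {
        return policy.apply(amount) + 499;
    }

    static long run(DiscountPolicy policy, int calls) {
        long sum = 0;
        for (int i = 0; i < calls; i++) {
            sum += checkout(policy, 1_000 + (i & 7));
        }
        return sum;
    }

    public static void main(String[] args) {
        long warm = run(new FixedRatePolicy(), 200_000); // single receiver type
        long changed = run(new FlatPolicy(), 200_000);   // profile assumption changes
        System.out.println(warm + changed);
    }
}
```

Keeping both phases on the same checkout() method matters here: if each policy had its own call site, each site would stay monomorphic and nothing would need to be invalidated.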
The same pattern appears in escape analysis and loop optimization. C2 can eliminate or scalar-replace allocations whose objects do not escape a compilation unit, and it can transform loops more aggressively once it has enough confidence in branch and trip-count behavior. These optimizations are not merely compiler tricks for benchmark slides. They directly affect CPU efficiency, memory traffic, and instruction-cache pressure in long-running services. The HotSpot optimization documentation and associated OpenJDK material tie C2 compilation to capabilities such as inlining and escape-analysis-driven transformations, which is why steady-state Java performance often looks radically different from cold-start performance even without any code changes. (Oracle)
The tier numbers make this progression more concrete. OpenJDK’s tiered-compilation material documents the commonly used levels as an execution ladder: level 0 is the interpreter; levels 1 through 3 are C1-generated code with different profiling states; and level 4 is fully optimized C2 code. The common transition pattern is 0 -> 3 -> 4, though other paths occur depending on queue pressure, method shape, and profile maturity. OpenJDK’s tiered-compilation slides also show that deoptimization can send compiled code back to level 0, and that alternate paths such as 0 -> 2 -> 3 -> 4 or direct 0 -> 4 promotion can appear in special cases. (OpenJDK CR)
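The ladder described above can be written down as a small lookup table, which is handy when annotating compilation logs. The enum below is a sketch: the type name CompilationTier and the helper methods are illustrative, and the per-level profiling descriptions follow the commonly documented HotSpot tier definitions rather than anything specific to this article.

```java
enum CompilationTier {
    T0(0, "interpreter", "collects invocation and back-edge counts"),
    T1(1, "C1", "no profiling"),
    T2(2, "C1", "invocation and back-edge counters only"),
    T3(3, "C1", "full profiling"),
    T4(4, "C2", "no profiling; consumes the collected profile");

    final int level;
    final String compiler;
    final String profiling;

    CompilationTier(int level, String compiler, String profiling) {
        this.level = level;
        this.compiler = compiler;
        this.profiling = profiling;
    }

    // Looks up a tier by its numeric level as printed in compilation logs.
    static CompilationTier of(int level) {
        for (CompilationTier t : values()) {
            if (t.level == level) {
                return t;
            }
        }
        throw new IllegalArgumentException("unknown tier: " + level);
    }

    // Renders a transition path such as 0 -> 3 -> 4 with the compiler at each step.
    static String describePath(int... levels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < levels.length; i++) {
            if (i > 0) {
                sb.append(" -> ");
            }
            CompilationTier t = of(levels[i]);
            sb.append(t.level).append(" (").append(t.compiler).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describePath(0, 3, 4));
        System.out.println(describePath(0, 2, 3, 4));
    }
}
```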
A representative log often looks like this:
92 3 com.example.OrderPricer::total (18 bytes)
114 4 com.example.OrderPricer::total (18 bytes)
121 3 com.example.OrderPricer::handle (10 bytes)
138 4 com.example.OrderPricer::handle (10 bytes)
The important point is not memorizing the exact formatting of PrintCompilation, but recognizing what the JVM is telling you. The excerpt is simplified to a timestamp, a tier, and a method; real output also carries a compilation ID and attribute flags between the timestamp and the tier. A level-3 entry usually means the method has left the interpreter and is running as C1-compiled code with profiling still active. A later level-4 entry means the method has been promoted again, now compiled by C2 with a more aggressive optimization strategy. Internally, this is HotSpot reconciling two competing production goals: get code out of the interpreter early enough to protect startup and responsiveness, but delay peak optimization until the runtime has enough evidence to make that optimization worthwhile. Oracle and OpenJDK both emphasize this balance as the core value of tiered compilation: faster early execution, richer profiling, and better eventual peak performance. (Oracle Docs)
This balance is why tiered compilation remains the default in modern HotSpot. In a real service, compilation thresholds, queue lengths, and profile maturity are not abstract VM trivia; they shape how quickly a pod becomes efficient, how much CPU is spent compiling under load, and how much latency turbulence appears before the system reaches steady state. C1 buys time and responsiveness. C2 buys throughput and deeper optimization. Tiered compilation is the mechanism that lets HotSpot have both, at the cost of accepting that performance is a moving target during the lifetime of the process.
Implementation: Observing and Controlling JIT in Practice
The practical challenge with JIT tuning is that the JVM rarely fails loudly. A service does not emit an exception because a hot method was compiled too late, because C2 compilation queued behind other work, or because a speculative optimization was invalidated during peak traffic. Instead, the symptoms appear indirectly: p99 latency rises after a deployment, new pods take longer to become efficient, CPU burns unusually high during warm-up, or a benchmark result changes after a minor refactor that should not have affected the algorithm. Observing the JIT is therefore less about “turning on a debug mode” and more about correlating compiler activity with the shape of real workload execution.
A useful first step is to run the service with compilation diagnostics in a controlled environment, ideally against production-like traffic rather than synthetic happy-path requests:
java \
-XX:+UnlockDiagnosticVMOptions \
-XX:+PrintCompilation \
-XX:+PrintInlining \
-jar pricing-service.jar
PrintCompilation exposes when methods are compiled, at which tier, and whether previous compiled versions are replaced or invalidated. PrintInlining goes deeper by showing whether call sites were inlined or rejected, often with reasons such as method size, insufficient profile data, or polymorphic dispatch. Internally, this is HotSpot making cost-model decisions based on invocation counters, branch profiles, type feedback, and compiler thresholds. For performance work, this matters because a method that appears trivial in source code may remain expensive if it sits behind an uninlined virtual call or reaches C2 only after the service has already absorbed real traffic. The trade-off is that these flags are noisy and can add overhead, so they belong in reproduction environments, load tests, or short diagnostic windows rather than as a permanent production default.
For latency-sensitive services, TieredStopAtLevel is useful because it allows you to intentionally constrain the compiler pipeline and compare behavior across execution modes:
java -XX:TieredStopAtLevel=1 -jar pricing-service.jar
java -XX:TieredStopAtLevel=4 -jar pricing-service.jar
Stopping at level 1 keeps execution in a lightweight C1-compiled mode, which can reduce the cost of deeper optimization during startup but sacrifices the aggressive C2 transformations that usually produce the best steady-state throughput. Allowing level 4 gives HotSpot the full tiered path, where methods can progress toward optimized C2 code once profiling matures. This comparison is valuable in autoscaled environments because it separates two different problems that often get confused: poor cold-start behavior and poor steady-state performance. If level 1 improves early latency but loses throughput later, the service may need pre-warming, readiness gating, or more realistic capacity buffers rather than a blanket compiler restriction.
Compilation logs become much more useful when treated as data. Even a simple parser can reveal which methods are repeatedly compiled, promoted, or made non-entrant:
java -XX:+PrintCompilation -jar pricing-service.jar 2>&1 \
| awk '/com\.example/ { print $0 }' \
| tee jit-compilation.log
This kind of filtering helps isolate application methods from framework noise, which is essential in Spring, Netty, Kafka, or gRPC services where startup paths can dominate the log. A method repeatedly appearing as compiled and then invalidated is often a sign of unstable profiling assumptions, class loading after warm-up, or polymorphism that the optimizer initially underestimated. Internally, HotSpot may be discarding previously optimized machine code because the assumptions attached to that code are no longer valid. That protects correctness, but it can also produce short bursts of latency and CPU activity as execution falls back and recompilation occurs.
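Treating the log as data can start with something as small as the sketch below, which tallies how often each method appears and the highest tier it reached. It assumes tiered PrintCompilation output captured without PrintInlining, and it takes the integer immediately before the Class::method token as the tier; lines that do not fit that shape are skipped. The class and field names are illustrative, and the sample lines in main are made up for the demonstration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CompilationLogSummary {
    // Per-method tally: number of log lines seen and highest tier observed.
    static final class Stats {
        int events;
        int maxTier = -1;
    }

    static Map<String, Stats> summarize(List<String> lines) {
        Map<String, Stats> byMethod = new TreeMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            for (int i = 1; i < tokens.length; i++) {
                if (!tokens[i].contains("::")) {
                    continue;
                }
                int tier;
                try {
                    // Assumption: the token before the method name is the tier.
                    tier = Integer.parseInt(tokens[i - 1]);
                } catch (NumberFormatException e) {
                    break; // not a line this sketch understands
                }
                Stats stats = byMethod.computeIfAbsent(tokens[i], k -> new Stats());
                stats.events++; // also counts invalidation notices for the method
                stats.maxTier = Math.max(stats.maxTier, tier);
                break;
            }
        }
        return byMethod;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "  92   11       3       com.example.OrderPricer::total (18 bytes)",
                " 114   12       4       com.example.OrderPricer::total (18 bytes)",
                " 121   13       3       com.example.OrderPricer::handle (10 bytes)");
        summarize(sample).forEach((method, stats) ->
                System.out.println(method + " events=" + stats.events + " maxTier=" + stats.maxTier));
    }
}
```

A method whose event count keeps climbing long after warm-up is a reasonable first candidate when looking for the unstable assumptions described above.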
For deeper analysis, JFR provides a lower-friction production diagnostic path than raw compiler logs:
java \
-XX:StartFlightRecording=filename=jit-profile.jfr,settings=profile,dumponexit=true \
-jar pricing-service.jar
A JFR recording lets you connect compilation activity with allocation pressure, CPU samples, lock contention, and request timing, which is closer to how performance problems appear in real systems. JITWatch then complements this by visualizing compilation logs, bytecode, inlining decisions, and generated assembly where available. At the system boundary, Linux perf or async-profiler can show whether CPU time is being spent in application code, JVM runtime code, compiler threads, or generated machine code. The trade-off is observability overhead and operational complexity: the more deeply you instrument the JVM, the more careful you must be about measurement distortion. In production-grade tuning, the goal is not to force the JIT into a preconceived shape, but to gather enough evidence to understand whether the compiler is helping, arriving too late, or repeatedly changing its mind under real workload conditions.
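JFR can also be driven from inside the process through the jdk.jfr API, which is useful for capturing compiler activity around one specific phase such as warm-up. The sketch below enables only the jdk.Compilation event, runs a hypothetical hot loop, and counts the events recorded. The class name, the hotWork() workload, and the round count are assumptions made for illustration, and the number of events seen will differ between JVM versions and machines.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

final class CompilationEventCounter {
    // Hypothetical workload that becomes hot enough to be compiled.
    static long hotWork(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += (i * 31L) ^ (acc >>> 3);
        }
        return acc;
    }

    // Records jdk.Compilation events while the workload runs and returns their count.
    static long countCompilationEvents(int iterations) {
        try {
            Path file = Files.createTempFile("jit-events", ".jfr");
            try (Recording recording = new Recording()) {
                // Drop the duration threshold so short compilations are kept too.
                recording.enable("jdk.Compilation").withoutThreshold();
                recording.start();
                long sink = 0;
                for (int round = 0; round < 200; round++) {
                    sink += hotWork(iterations);
                }
                recording.stop();
                recording.dump(file);
                if (sink == Long.MIN_VALUE) {
                    System.out.println(sink); // keeps the workload result live
                }
            }
            List<RecordedEvent> events = RecordingFile.readAllEvents(file);
            Files.deleteIfExists(file);
            return events.stream()
                    .filter(e -> e.getEventType().getName().equals("jdk.Compilation"))
                    .count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("compilation events: " + countCompilationEvents(100_000));
    }
}
```

Scoping the recording to a single event type keeps overhead and file size small, at the cost of losing the cross-correlation with allocation and CPU samples that a full profile recording provides.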
Advanced Optimization Techniques
Once a Java service reaches its hot path, the most important JVM optimizations are no longer obvious from source code. The methods that dominate CPU time in systems like Netty event loops, Kafka request handling, matching engines, or pricing services are often small, allocation-heavy, and deeply layered behind abstractions. HotSpot’s C2 compiler is effective precisely because it does not optimize the code as written; it optimizes the code as observed. That distinction is central to understanding why the same Java code can behave very differently before and after warm-up, and why minor changes to allocation shape, call-site stability, or loop structure can move a workload from highly optimized machine code back into a slower, more defensive execution path.
A common example is escape analysis. Consider a request-processing method that appears to allocate a short-lived object on every call:
final class QuoteCalculator {
    static final class Money {
        final long cents;

        Money(long cents) {
            this.cents = cents;
        }

        Money add(Money other) {
            return new Money(this.cents + other.cents);
        }
    }

    long calculate(long base, long fee) {
        Money subtotal = new Money(base);
        Money total = subtotal.add(new Money(fee));
        return total.cents;
    }
}
At source level, this method allocates multiple Money objects. In a CPU-bound service, that looks dangerous because allocation rate translates into memory bandwidth pressure and, eventually, GC work. Under C2, however, these objects may never become heap objects at all. If HotSpot proves that the Money instances do not escape the compilation scope, it can scalar-replace them, keeping their fields in registers or stack slots and eliminating the allocation entirely. Internally, the optimizer is no longer thinking in terms of Java objects; it is reducing object state into primitive values that flow through the compiled graph. The trade-off is that this optimization is fragile. If the object is stored in a field, returned through an interface boundary, captured by a lambda, or passed into a method that cannot be inlined, the escape analysis proof may fail, and the allocation returns.
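One way to check whether scalar replacement is actually happening is to measure how many bytes a thread allocates across repeated calls. The sketch below relies on com.sun.management.ThreadMXBean, a HotSpot-specific extension that may be missing or disabled on other JVMs; the probe class name and the round sizes are illustrative assumptions. On a typical HotSpot run, allocated bytes per round tend to fall sharply once C2 has compiled the loop, and running again with -XX:-DoEscapeAnalysis should keep them high, though the exact figures are not guaranteed.

```java
import java.lang.management.ManagementFactory;

final class EscapeAnalysisProbe {
    static final class Money {
        final long cents;

        Money(long cents) {
            this.cents = cents;
        }

        Money add(Money other) {
            return new Money(this.cents + other.cents);
        }
    }

    static long calculate(long base, long fee) {
        Money subtotal = new Money(base);
        Money total = subtotal.add(new Money(fee));
        return total.cents;
    }

    // Returns the bytes allocated by the current thread while making `calls` invocations.
    static long allocatedBytesFor(int calls) {
        // HotSpot-specific bean; the cast fails on JVMs that do not provide it.
        com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = threads.getThreadAllocatedBytes(id);
        long sink = 0;
        for (int i = 0; i < calls; i++) {
            sink += calculate(i, 499);
        }
        long after = threads.getThreadAllocatedBytes(id);
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // keeps the results live
        }
        return after - before;
    }

    public static void main(String[] args) {
        for (int round = 0; round < 10; round++) {
            System.out.println("round " + round + ": " + allocatedBytesFor(1_000_000) + " bytes");
        }
    }
}
```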
Inlining is usually the gateway to these deeper optimizations. Without inlining, the compiler sees method boundaries; with inlining, it sees a larger region of behavior that can be optimized as a unit. That is why call-site shape matters so much in production systems built on abstractions:
interface Encoder {
    int encode(byte[] source, byte[] target);
}

final class JsonEncoder implements Encoder {
    public int encode(byte[] source, byte[] target) {
        int written = 0;
        for (byte b : source) {
            target[written++] = b;
        }
        return written;
    }
}

final class ResponseWriter {
    private final Encoder encoder = new JsonEncoder();

    int write(byte[] payload, byte[] buffer) {
        return encoder.encode(payload, buffer);
    }
}
If profiling shows that encoder.encode() is monomorphic, C2 can inline through the interface call and optimize the loop inside JsonEncoder. That can reduce virtual dispatch, expose bounds-check elimination opportunities, and improve instruction locality. If the same call site later sees several encoder implementations, the JVM may have to deoptimize code that was compiled under a narrower assumption. This is where speculative optimization becomes both powerful and risky. HotSpot optimizes for the world it has observed, not every world that could exist. When the observed world changes, correctness is preserved through deoptimization, but the service may briefly pay for uncommon traps, fallback execution, and recompilation.
You can inspect some of these decisions with inlining diagnostics:
java \
-XX:+UnlockDiagnosticVMOptions \
-XX:+PrintCompilation \
-XX:+PrintInlining \
-jar response-service.jar
The output often reveals that performance is limited not by the obvious hot method, but by a call that failed to inline because it was too large, too polymorphic, or insufficiently profiled. This matters because inlining failure can prevent escape analysis, loop optimizations, and dead code elimination from firing later. The trade-off is that aggressive inlining also increases compiled code size. More code is not always faster; larger compiled bodies can pressure the instruction cache, increase compilation time, and contribute to code cache usage. In high-throughput services, the best optimization is often the one that improves the hot path without bloating everything around it.
On-stack replacement adds another layer to this runtime adaptability. Long-running loops cannot always wait for a method to return before benefiting from compiled code. In batch processors, stream consumers, or event-loop-style workloads, a hot loop may run for a long time inside a method that was originally entered through the interpreter. OSR allows HotSpot to compile the loop body and transfer execution into optimized code while the method is still active. That improves throughput for long-running loops, but it can also create measurement surprises: a benchmark may start in interpreted mode, transition mid-loop, and report a number that blends multiple execution states.
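OSR is easy to provoke with a method that is entered once and then loops for a long time. In the sketch below, sumTo() is called a single time, so its invocation counter never makes it hot; only the loop's back-edge counter can trigger compilation. The class name and loop bound are illustrative assumptions. When the compiled class is run with -XX:+PrintCompilation, OSR compilations are flagged with a % attribute and an @ bytecode index next to the method name, although the tiers and timing shown depend on the JVM.

```java
final class OsrLoopDemo {
    // Entered once; the long-running loop is the only source of "hotness".
    static long sumTo(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A large bound gives the JVM time to compile and switch mid-loop.
        System.out.println(sumTo(200_000_000L));
    }
}
```

Timing this single call from the outside would blend interpreted and compiled iterations into one number, which is the measurement surprise described above.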
These optimizations explain why JVM performance engineering is rarely about adding flags blindly. Escape analysis, scalar replacement, inlining, speculative optimization, deoptimization, and OSR all interact. In a stable workload, they let Java approach the efficiency of carefully optimized native code while preserving a high-level programming model. In an unstable workload with changing type profiles, late class loading, excessive abstraction, or insufficient warm-up, the same machinery can create latency cliffs. The production skill is knowing when the JIT is successfully removing abstraction cost, and when it is repeatedly paying to rediscover that the runtime profile has changed.
Performance Analysis: Benchmarking JIT Behavior
Benchmarking JVM code is difficult because the thing being measured is not static. A method may begin life in the interpreter, move through C1-compiled profiling code, reach C2-optimized machine code, and later deoptimize if its assumptions are invalidated. This is why many hand-written Java benchmarks lie. A loop around System.nanoTime() often measures class loading, interpreter execution, tier transitions, branch-profile instability, and dead code elimination rather than the cost of the operation itself. In production, this mistake becomes expensive: a team may approve a change that looks faster in a microbenchmark but increases p99 latency once the service is deployed under real traffic and realistic type distributions.
A minimal JMH benchmark starts by separating warm-up from measurement:
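import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class PricingBenchmark {
    private final PriceService service = new PriceService();
    // Non-final fields keep the inputs from being constant-folded by C2.
    private long notional = 1_000_000L;
    private long rateBps = 425L;
    private int days = 30;

    @Benchmark
    @Warmup(iterations = 5, time = 1)
    @Measurement(iterations = 10, time = 1)
    @Fork(3)
    public long calculatePrice() {
        return service.calculate(notional, rateBps, days);
    }
}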
The warm-up phase is not cosmetic. It gives HotSpot time to collect invocation counts, compile hot methods, perform inlining, and stabilize the generated code before the measurement phase starts. Forking matters because it runs the benchmark in fresh JVM processes, reducing the chance that one measurement is polluted by earlier compiler state. Internally, this structure acknowledges that JVM performance has a lifecycle. The cost of a cold method and the cost of a fully optimized method are different measurements, and combining them produces numbers that are rarely useful for capacity planning.
The next trap is dead code elimination. If a benchmark computes a value that is never consumed, C2 may legally remove the work because the result has no observable effect:
@Benchmark
public void brokenBenchmark() {
    long result = service.calculate(1_000_000L, 425L, 30);
}

@Benchmark
public long validBenchmark() {
    return service.calculate(1_000_000L, 425L, 30);
}
In the first method, the optimizer may decide that result is irrelevant and eliminate part or all of the computation. The benchmark then reports the cost of doing almost nothing, which is worse than being inaccurate because it encourages false confidence. Returning the value, or consuming it through JMH’s Blackhole, forces the operation to remain observable. Constant inputs carry a related risk: when every argument is a compile-time constant, C2 may fold the computation after inlining, so JMH’s own guidance is to read inputs from non-final @State fields. This matters for real systems because production code usually has side effects, object lifetimes, branch behavior, and memory visibility constraints that synthetic benchmarks often erase accidentally.
Before-and-after optimization comparisons should also be designed around the CPU behavior you expect to improve. For example, replacing a polymorphic call site with a stable implementation may reduce virtual dispatch and improve inlining, but the benefit only appears if the benchmark reflects the same call-site shape as production:
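// fixedPolicy, policies, and index are fields of the enclosing @State class (not shown)
@Benchmark
public long monomorphicPath() {
    return fixedPolicy.apply(1_000_000L);
}

@Benchmark
public long polymorphicPath() {
    // alternate between the two implementations stored in policies
    DiscountPolicy policy = policies[index++ & 1];
    return policy.apply(1_000_000L);
}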
The monomorphic path gives HotSpot a clean profile and may allow C2 to inline through the call. The polymorphic path introduces multiple receiver types, which can block or limit inlining and keep more dispatch overhead in the generated code. At the CPU level, this can change instruction locality, branch prediction behavior, and register allocation. The trade-off is that the monomorphic benchmark may look better while being less representative if production actually sees multiple implementations.
For CI/CD performance testing, the goal is not to turn every benchmark into a perfect simulation. Microbenchmarks are useful for isolating mechanisms such as allocation elimination, inlining, and arithmetic cost. Macrobenchmarks are needed to capture request routing, serialization, cache pressure, GC interaction, thread scheduling, and network boundaries. JIT behavior sits across both layers. A strong performance process therefore treats JMH results as evidence about a specific hot path, not as proof that the whole service will scale. The final validation still needs production-like load, realistic warm-up, and latency percentiles that expose how the JVM behaves before and after it reaches steady state.
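One way to expose the before-and-after-steady-state difference in a load-test report is to compute percentiles separately for the warm-up window and the steady-state window rather than over the whole run. The sketch below uses the nearest-rank percentile definition; the class name `LatencyWindows`, the sample values, and the choice to treat the first five samples as warm-up are all illustrative assumptions.

```java
import java.util.Arrays;

final class LatencyWindows {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical latencies in ms, in arrival order.
        long[] latencies = {180, 95, 60, 41, 30, 9, 8, 8, 7, 9, 8, 7, 8, 10, 8};
        int warmupCount = 5; // assumption: the first 5 requests hit a cold JVM

        long[] warmup = Arrays.copyOfRange(latencies, 0, warmupCount);
        long[] steady = Arrays.copyOfRange(latencies, warmupCount, latencies.length);

        System.out.println("warm-up p99: " + percentile(warmup, 99) + " ms");
        System.out.println("steady  p99: " + percentile(steady, 99) + " ms");
        System.out.println("overall p50: " + percentile(latencies, 50) + " ms");
    }
}
```

Reporting the two windows side by side keeps a healthy overall median from hiding a slow warm-up tail. In a real pipeline the window boundary would more plausibly come from a timestamp or a readiness signal than from a fixed sample count.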
Edge Cases and Pitfalls
The hardest JIT problems in production are rarely caused by the compiler “being slow” in a simple sense. They usually come from the JVM making reasonable assumptions that become unstable under real traffic. A service may warm up cleanly during deployment, hit excellent throughput in a load test, and then degrade hours later when a new tenant, plugin, feature flag, or class-loading path changes the runtime profile. HotSpot’s speculative optimizations are designed around observed behavior, so when the observed behavior shifts, the VM may invalidate compiled code and return execution to a safer tier. One deoptimization is normal. A stream of them can become visible as latency noise, CPU spikes, and sudden loss of steady-state performance.
A common source is profile pollution at a hot call site:
interface RiskRule {
    boolean accept(Transaction tx);
}

final class CardRule implements RiskRule {
    public boolean accept(Transaction tx) {
        return tx.amount() < 10_000;
    }
}

final class WireRule implements RiskRule {
    public boolean accept(Transaction tx) {
        return tx.countryRiskScore() < 70;
    }
}

final class RiskEngine {
    // hot call site: the receiver type of rule drives the JIT's type profile
    boolean evaluate(RiskRule rule, Transaction tx) {
        return rule.accept(tx);
    }
}
If evaluate() initially sees only CardRule, C2 may treat the call as monomorphic, inline accept(), and optimize the surrounding code aggressively. If production traffic later introduces WireRule, and then several more implementations, the original compiled code no longer matches the actual type profile. HotSpot preserves correctness by deoptimizing and recompiling with broader assumptions, but that transition is not free. The CPU moves from a compact, inlined path back toward dispatch-heavy execution while the compiler catches up. The trade-off is the core bargain of speculative optimization: the JVM wins when profiles are stable, but unstable polymorphism can turn those wins into periodic latency cliffs.
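A small self-contained harness can reproduce that shift in receiver types so it can be observed under compilation logging. This is a sketch under stated assumptions: the `Transaction` class, the third `LimitRule` implementation, and the `drive` helper are hypothetical additions for illustration, and whether a given JVM actually deoptimizes `evaluate` between the two phases depends on its version and flags.

```java
final class CallSiteShapeDemo {
    // Hypothetical transaction shape matching the accessors used above.
    static final class Transaction {
        private final long amount;
        private final int countryRiskScore;

        Transaction(long amount, int countryRiskScore) {
            this.amount = amount;
            this.countryRiskScore = countryRiskScore;
        }

        long amount() { return amount; }
        int countryRiskScore() { return countryRiskScore; }
    }

    interface RiskRule { boolean accept(Transaction tx); }

    static final class CardRule implements RiskRule {
        public boolean accept(Transaction tx) { return tx.amount() < 10_000; }
    }

    static final class WireRule implements RiskRule {
        public boolean accept(Transaction tx) { return tx.countryRiskScore() < 70; }
    }

    // Hypothetical third receiver type, so the call site sees more than two types.
    static final class LimitRule implements RiskRule {
        public boolean accept(Transaction tx) { return tx.amount() < 1_000_000; }
    }

    // The single hot call site whose receiver profile changes between phases.
    static boolean evaluate(RiskRule rule, Transaction tx) {
        return rule.accept(tx);
    }

    // Cycles through the given rules and counts accepted transactions.
    static int drive(RiskRule[] rules, int calls) {
        int accepted = 0;
        for (int i = 0; i < calls; i++) {
            Transaction tx = new Transaction(i % 20_000, i % 100);
            if (evaluate(rules[i % rules.length], tx)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Phase 1: only CardRule reaches evaluate(), so the profile is monomorphic.
        int phase1 = drive(new RiskRule[] { new CardRule() }, 1_000_000);
        // Phase 2: three receiver types now hit the same call site.
        int phase2 = drive(
                new RiskRule[] { new CardRule(), new WireRule(), new LimitRule() }, 1_000_000);
        System.out.println("phase 1 accepted: " + phase1);
        System.out.println("phase 2 accepted: " + phase2);
    }
}
```

Running this with -XX:+PrintCompilation and looking for lines that mention evaluate around the start of phase 2 is one way to see whether the earlier compiled version is discarded and replaced.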
You can often detect this pattern by combining compilation diagnostics with deoptimization visibility:
java \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+PrintCompilation \
  -XX:+LogCompilation \
  -jar risk-service.jar
With LogCompilation enabled, HotSpot writes an XML compilation log (by default a hotspot_pid<pid>.log file in the working directory) that can be inspected with JITWatch to find methods that are repeatedly compiled, made not entrant, or recompiled at different tiers. A method being made not entrant means existing compiled code is no longer suitable for new invocations, often because an assumption attached to that code was invalidated. This matters in long-running services because performance regressions may not correlate with deployment time. They can correlate with traffic mix, late-loaded classes, tenant-specific behavior, or rarely exercised error paths that poison the profile of a previously clean hot path.
Another pitfall is treating tiered compilation flags as universal tuning knobs. For example:
java -XX:TieredStopAtLevel=1 -jar api-service.jar
This may reduce warm-up turbulence because the JVM avoids the cost of deeper C2 optimization, but it also prevents the service from reaching its best steady-state code. In a container with tight CPU limits, that can look attractive during startup and then become expensive under sustained load because more cores are needed to handle the same throughput. The reverse problem appears when the code cache is too constrained: compiled methods compete for limited native-code storage, causing flushing, recompilation, or missed optimization opportunities. In practice, JIT stability is not achieved by forcing the most aggressive settings everywhere. It comes from keeping hot call sites predictable, warming realistic paths before exposing instances to full traffic, monitoring compiler activity alongside latency, and tuning only when the evidence shows that HotSpot’s default balance between startup, profiling, and optimization is not matching the workload.
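Code cache pressure can be watched from inside the process through the standard memory pool MXBeans. The sketch below lists the non-heap pools whose names start with "Code"; matching on that prefix is an assumption about HotSpot's pool naming (for example "CodeCache" or the segmented "CodeHeap '...'" pools) and may need adjusting on other JVMs. The class name `CodeCacheReport` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class CodeCacheReport {
    // Collects non-heap pools whose name starts with "Code".
    // The name prefix is an assumption about HotSpot's naming, not a JVM guarantee.
    static List<MemoryPoolMXBean> codePools() {
        List<MemoryPoolMXBean> pools = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP && pool.getName().startsWith("Code")) {
                pools.add(pool);
            }
        }
        return pools;
    }

    // Formats used and max size for one pool; max is reported as -1 when undefined.
    static String describe(MemoryPoolMXBean pool) {
        MemoryUsage usage = pool.getUsage();
        if (usage == null) {
            return pool.getName() + ": usage unavailable";
        }
        long maxKiB = usage.getMax() < 0 ? -1 : usage.getMax() / 1024;
        return String.format("%s: used=%d KiB, max=%d KiB",
                pool.getName(), usage.getUsed() / 1024, maxKiB);
    }

    public static void main(String[] args) {
        List<MemoryPoolMXBean> pools = codePools();
        if (pools.isEmpty()) {
            System.out.println("no code cache pools found on this JVM");
        }
        for (MemoryPoolMXBean pool : pools) {
            System.out.println(describe(pool));
        }
    }
}
```

Exporting these values as a gauge next to latency percentiles makes it easier to tell whether a throughput drop coincides with a code cache that is close to its limit.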
Real-World Case Study
A representative case is a fintech order-pricing service deployed on Kubernetes behind an autoscaler. The service was not algorithmically slow: once warm, it handled requests comfortably within its latency budget. The issue appeared only during scale-out events. New pods entered rotation almost immediately after passing a shallow health check, and the first burst of real customer traffic paid the full JVM warm-up cost. For roughly the first two to three minutes, p99 latency was materially higher than the steady-state profile, even though CPU and GC dashboards looked normal at a coarse level.
Deployment window: scale-out during market open

Before warm-up strategy:
    p50 latency: 8 ms
    p95 latency: 42 ms
    p99 latency: 180–240 ms
    CPU: short spikes to 85–95%
    Compilation activity: high during first live traffic window

After warm-up strategy:
    p50 latency: 7 ms
    p95 latency: 24 ms
    p99 latency: 55–70 ms
    CPU: smoother ramp-up
    Compilation activity: mostly completed before readiness
The key signal was not average latency; it was the mismatch between cold and warm execution. Internally, the newly started JVM was still interpreting request paths, collecting method and branch profiles, compiling frequently executed methods through C1, and eventually promoting the hottest code to C2. During that period, production traffic was competing with compiler threads for CPU. The trade-off was subtle: the JVM was doing the right thing for long-term throughput, but it was doing it at exactly the wrong time from an SLA perspective.
The first fix was to make readiness reflect performance readiness, not just process availability:
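import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
final class WarmupController {
    private final PricingService pricingService;

    WarmupController(PricingService pricingService) {
        this.pricingService = pricingService;
    }

    // Called before the pod is marked ready; drives the live pricing path
    // often enough for HotSpot to profile and compile it.
    @PostMapping("/internal/warmup")
    void warmup() {
        for (int i = 0; i < 250_000; i++) {
            pricingService.priceOrder(
                1_000_000L + i,
                "USD",
                CustomerTier.INSTITUTIONAL
            );
        }
    }
}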
This deliberately exercises the same pricing path used by live traffic, including currency normalization, customer-tier logic, and fee calculation. The goal is not to “cache” a result; it is to feed HotSpot realistic profiles before the pod receives customer requests. The loop shown here uses a single currency and customer tier for brevity; a production warm-up should mix inputs in roughly the proportions live traffic does, so the profiles HotSpot collects are not narrower than the ones it will see. That gives C1 and C2 enough evidence to inline stable call sites, optimize arithmetic-heavy paths, and reduce interpreter involvement. The trade-off is deployment latency: pods take longer to become ready, but the fleet avoids exposing users to the most volatile execution phase.
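The warm-up endpoint only helps if readiness actually waits for it. A framework-free sketch of that gating logic follows, assuming the readiness probe handler can consult a shared flag; the class name `WarmupGate`, the `LongUnaryOperator` stand-in for the pricing path, and the fee formula in `main` are hypothetical and not taken from the service above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongUnaryOperator;

final class WarmupGate {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Exercises the hot path a fixed number of times, then opens the gate.
    // Returns a checksum so the warm-up work has an observable result.
    long warmUp(LongUnaryOperator hotPath, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += hotPath.applyAsLong(1_000_000L + i);
        }
        ready.set(true);
        return checksum;
    }

    // What a readiness probe handler would consult before reporting "ready".
    boolean isReady() {
        return ready.get();
    }

    public static void main(String[] args) {
        WarmupGate gate = new WarmupGate();
        System.out.println("ready before warm-up: " + gate.isReady());

        // Hypothetical hot path: the amount plus a 4.25% fee.
        long checksum = gate.warmUp(amount -> amount + amount * 425 / 10_000, 250_000);

        System.out.println("ready after warm-up: " + gate.isReady()
                + " (checksum " + checksum + ")");
    }
}
```

A fixed iteration count is the simplest possible exit condition. A stricter variant could also wait until compilation activity has quietened down before setting the flag, at the cost of a longer and less predictable startup.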
The second change was to test tier behavior explicitly rather than guessing:
java \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+PrintCompilation \
  -XX:TieredStopAtLevel=4 \
  -jar pricing-api.jar
Keeping tiered compilation enabled through level 4 (the default) preserved C2’s steady-state benefits, while PrintCompilation in staging confirmed that the major request-path methods were promoted before readiness. A lower TieredStopAtLevel reduced startup compilation cost but left too much throughput on the table during market-open load. The final production posture was therefore not “disable JIT complexity”; it was to move unavoidable JIT work out of the user-facing request window. That distinction is important. For high-throughput APIs, the practical win often comes from controlling when warm-up happens, not from pretending the JVM has no warm-up phase.
Comparison with Alternatives
The JVM JIT is not the only way to execute production services, and it is not always the best fit. Its strength is adaptive optimization: HotSpot can observe the workload as it actually behaves, specialize hot paths, inline through stable call sites, eliminate allocations, and recompile when assumptions change. That gives long-running Java services excellent peak throughput, but it also creates warm-up cost, compiler CPU activity, and a larger runtime footprint. In cloud environments, where workloads are frequently scaled, restarted, and packed into constrained containers, those trade-offs become architectural decisions rather than runtime trivia.
GraalVM Native Image represents the opposite philosophy: compile the application ahead of time into a native executable and remove most JVM warm-up from the request path.
native-image \
  -O2 \
  -H:Name=pricing-api \
  -jar pricing-api.jar

./pricing-api
This improves startup time and can reduce memory footprint, which is attractive for serverless functions, CLI tools, short-lived jobs, and scale-to-zero microservices. Internally, however, the compiler is making decisions before real traffic exists. It cannot rely on the same live type profiles, branch frequencies, or speculative runtime feedback that HotSpot uses. The result is often better cold-start behavior but less adaptive peak optimization. For services that live for hours or days under heavy load, a warmed JVM may still outperform an AOT binary on throughput-sensitive hot paths.
The contrast becomes clearer when comparing deployment modes:
# HotSpot JVM: slower warm-up, stronger runtime adaptation
java -XX:+UseG1GC -jar pricing-api.jar
# Native executable: faster startup, less runtime optimization
./pricing-api
The JVM version carries the cost of class loading, profiling, tiered compilation, and runtime metadata, but it can keep optimizing as the workload evolves. The native executable starts quickly and behaves more predictably early, but it gives up much of the JVM’s ability to reshape code based on production behavior. That matters when choosing a runtime for microservices: a request-heavy pricing engine, stream processor, or Kafka consumer may benefit from HotSpot’s steady-state optimization, while a bursty serverless endpoint may care more about cold-start latency and memory density.
Go and Rust shift the trade-off again. Go offers fast builds, simple deployment, efficient goroutines, and predictable startup, but its runtime optimization model is not equivalent to HotSpot’s speculative JIT. Rust goes further toward static control, delivering excellent predictable performance and low memory overhead, but at the cost of a more explicit programming model and less runtime flexibility.
fn calculate_price(subtotal: i64, tax_bps: i64, shipping: i64) -> i64 {
    subtotal + ((subtotal * tax_bps) / 10_000) + shipping
}
A Rust function like this is compiled ahead of time into native code with no JIT warm-up phase. That predictability is valuable in latency-critical systems, but it also means there is no runtime compiler watching production traffic and adapting call-site decisions after deployment. The practical conclusion is not that JIT, AOT, Go, or Rust is universally superior. HotSpot is strongest when the service is long-lived, hot paths are stable, and peak throughput matters. AOT and statically compiled runtimes are strongest when startup time, memory footprint, binary deployment, and early latency predictability dominate the cost model.
Best Practices Checklist
Treat JVM performance work as an evidence-driven process, not a flag-tuning exercise. The first rule is to benchmark with JMH rather than hand-written loops, because naive benchmarks often measure interpreter execution, tier transitions, dead code elimination, or System.nanoTime() overhead instead of the hot path itself.
@Benchmark
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(3)
public long priceOrder() {
    return pricingService.price(1_000_000L, 425L, 30);
}
This structure gives HotSpot time to collect profiles, compile hot methods, and stabilize generated code before measurement begins. Without warm-up, the benchmark blends cold execution with steady-state execution, which is exactly the mistake that leads teams to underestimate production p99 latency during autoscaling.
JIT visibility should also be part of serious performance testing. Compilation logs are too noisy to leave permanently enabled in most production environments, but they are extremely useful during controlled load tests or short diagnostic windows.
java \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+PrintCompilation \
  -XX:+PrintInlining \
  -jar pricing-api.jar
These diagnostics show whether important methods reach C2, whether hot call sites inline successfully, and whether compiled code is repeatedly invalidated. The trade-off is observability overhead, so compiler logs should complement lower-overhead tools such as JFR and async-profiler rather than replace them.
Tiered compilation should be tuned only when the workload proves it needs it. For latency-sensitive services, experiment in staging with realistic traffic before changing defaults:
java -XX:TieredStopAtLevel=1 -jar api.jar
java -XX:TieredStopAtLevel=4 -jar api.jar
If level 1 improves startup but hurts throughput, the better fix may be pre-warming or readiness gating, not disabling C2. In most systems, HotSpot’s heuristics are good enough until profiling proves otherwise. The safest practice is to trust the JIT by default, measure with realistic warm-up, inspect compiler behavior when latency shifts, and tune only after JFR, async-profiler, and benchmark evidence point to a specific compiler-related bottleneck.
Conclusion
The most important lesson is that HotSpot’s JIT is not a passive optimization layer bolted onto the JVM. It is a dynamic runtime system that observes execution, builds profiles, compiles selectively, speculates aggressively, and retreats safely when assumptions fail. That is why Java performance is rarely a single number. The same method can behave differently during startup, warm-up, steady state, and after deoptimization, and those transitions directly affect latency, CPU usage, and scaling behavior.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -jar service.jar
A command like this is a reminder that performance is happening over time, not just inside source code. When you see methods moving from interpreted execution through C1 and eventually to C2, you are watching the JVM trade early responsiveness for later throughput. That trade-off is usually the right one, which is why the default tiered JIT configuration is appropriate for most long-running services.
java -XX:TieredStopAtLevel=1 -jar service.jar
java -XX:TieredStopAtLevel=4 -jar service.jar
These two modes should not be treated as magic tuning switches. They are diagnostic tools for understanding whether your workload is suffering from startup compilation cost, insufficient steady-state optimization, or unrealistic warm-up assumptions. For low-latency APIs, trading engines, and high-throughput stream processors, deeper tuning, pre-warming, and readiness gating can be justified. For serverless functions, short-lived jobs, and scale-to-zero workloads, AOT options such as GraalVM Native Image may be more appropriate because startup time and memory density can matter more than adaptive peak performance. The practical skill is knowing when to trust HotSpot, when to measure it, and when the runtime model itself no longer matches the deployment model.
