<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>jvm on Red Hat App Services Performance Team</title><link>https://redhatperf.github.io/categories/jvm/</link><description>Recent content in jvm on Red Hat App Services Performance Team</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Wed, 10 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://redhatperf.github.io/categories/jvm/index.xml" rel="self" type="application/rss+xml"/><item><title>Harder, Better, Faster, Stronger... Earlier!</title><link>https://redhatperf.github.io/post/when-the-jit-cant-keep-up/</link><pubDate>Wed, 10 Jun 2026 00:00:00 +0000</pubDate><guid>https://redhatperf.github.io/post/when-the-jit-cant-keep-up/</guid><description>&lt;img src="https://redhatperf.github.io/post/when-the-jit-cant-keep-up/banner.jpg" alt="Featured image of post Harder, Better, Faster, Stronger... Earlier!" />&lt;div class="paragraph">
&lt;p>&lt;em>&lt;a href="https://www.youtube.com/watch?v=gAjR4_CbPpQ">Daft Punk&lt;/a>&lt;/em>&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>Our &lt;a href="https://github.com/quarkusio/spring-quarkus-perf-comparison/tree/38345ca">Quarkus vs Spring CRUD benchmark&lt;/a> shows Quarkus delivering roughly 2x Spring’s throughput on a REST/CRUD workload backed by PostgreSQL. The latest &lt;a href="https://github.com/quarkusio/benchmarks/commit/d26d45d">public results&lt;/a> confirm this.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>But throughput is an average over a measurement window. It doesn’t tell you how long the application took to get there. This post looks at the dimension averages hide: &lt;strong>time to peak performance&lt;/strong>.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>The benchmark runs on our RHEL 9.6 performance lab with JDK 25, 4 pinned cores per application, and 100 concurrent HTTP connections. Each run consists of a &lt;strong>2-minute warmup&lt;/strong> at full load, a &lt;strong>30-second cooldown&lt;/strong>, and a &lt;strong>30-second load test&lt;/strong> where throughput is measured.&lt;/p>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_the_warmup_curve">The warmup curve&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="paragraph">
&lt;p>Hyperfoil records per-second throughput during each phase. Here is the warmup curve for the two baseline configurations, Quarkus and Spring, both with platform threads:&lt;/p>
&lt;/div>
&lt;div class="imageblock">
&lt;div class="content">
&lt;img src="warmup-tps-baseline.png" alt="Warmup curve: quarkus3-jvm vs spring4-jvm"/>
&lt;/div>
&lt;/div>
&lt;div class="paragraph">
&lt;p>Quarkus reaches ~14,700 req/s within 60 seconds. Spring plateaus at ~3,300 req/s during warmup, then jumps to ~7,800 req/s in the load test, once the 30-second cooldown has given the compiler time to catch up.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>The shape matters. Quarkus’s curve has a steep initial ramp and plateaus early. Spring plateaus at less than a quarter of Quarkus’s throughput during warmup, and only recovers after the cooldown period.&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_why_is_spring_slower_to_warm_up">Why is Spring slower to warm up?&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="paragraph">
&lt;p>The HotSpot JVM uses &lt;a href="https://developers.redhat.com/articles/2021/06/23/how-jit-compiler-boosts-java-performance-openjdk">tiered compilation&lt;/a>: code starts in the interpreter, is compiled by C1, and is eventually recompiled by C2, which produces the fastest native code. C2 runs on dedicated background threads. If those threads can’t get CPU time, the application stays on slower code longer.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>&lt;a href="https://man7.org/linux/man-pages/man1/pidstat.1.html">pidstat&lt;/a> with per-thread reporting (&lt;code>-t&lt;/code>) shows &lt;code>%wait&lt;/code>: the percentage of time a thread spends in the run queue, ready to execute but waiting for a CPU. This is collected as part of our &lt;a href="https://github.com/quarkusio/spring-quarkus-perf-comparison/issues/62">active benchmarking practice&lt;/a>.&lt;/p>
&lt;/div>
&lt;div class="imageblock">
&lt;div class="content">
&lt;img src="c2-wait-baseline.png" alt="C2 thread CPU and wait: spring4-jvm vs quarkus3-jvm"/>
&lt;/div>
&lt;/div>
&lt;div class="admonitionblock note">
&lt;table>
&lt;tbody>&lt;tr>
&lt;td class="icon">
&lt;div class="title">Note&lt;/div>
&lt;/td>
&lt;td class="content">
&lt;div class="paragraph">
&lt;p>The JVM runs 2 C2 compiler threads on this configuration. The chart sums both threads: 200% wait means both are fully waiting for a CPU. Green is CPU time doing actual compilation work; pink is time spent in the run queue, ready to run but waiting for a CPU.&lt;/p>
&lt;/div>
&lt;/td>
&lt;/tr>
&lt;/tbody>&lt;/table>
&lt;/div>
&lt;div class="paragraph">
&lt;p>During &lt;strong>startup&lt;/strong> (green region, before any request arrives), both frameworks show &lt;strong>C2 running freely&lt;/strong>: green with no pink. Spring’s C2 threads are at ~200% usr (both maxed out), reflecting more framework code to compile at boot. Quarkus’s C2 threads peak at ~80% usr, reflecting less startup compilation work. Spring’s startup also takes longer (~13 seconds vs ~8 for Quarkus; the &lt;a href="https://github.com/quarkusio/benchmarks/commit/d26d45d">public benchmark results&lt;/a> show &lt;strong>Quarkus starts in roughly half the time&lt;/strong>).&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>Once the &lt;strong>warmup&lt;/strong> load hits, the picture changes.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>&lt;strong>spring4-jvm&lt;/strong> (top panel): the C2 threads are immediately overwhelmed. The pink band fills the warmup phase, with the two threads collectively spending 140-180% of their time waiting. The compiler can barely run for the whole duration of the warmup.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>&lt;strong>quarkus3-jvm&lt;/strong> (bottom panel): C2 threads are also contended during the first half of warmup (100-150% summed wait). But C2 activity drops to near zero by the second half. The compiler finishes its work within the first ~60 seconds.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>Both are starved during warmup. The difference is recovery, and it has two dimensions. First, JDK Flight Recorder’s &lt;code>jdk.CompilerStatistics&lt;/code> event (which reports the cumulative total of all compilations every second) shows that &lt;strong>spring4-jvm needs ~17,600 total compilations to reach peak, while quarkus3-jvm needs ~12,500: about 29% fewer.&lt;/strong> A leaner framework means less work for the compiler. Second, Spring’s threads yield CPU less often, so the compiler gets fewer scheduling gaps to do that larger amount of work.&lt;/p>
&lt;/div>
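&lt;div class="paragraph">
&lt;p>The compilation gap can be quoted against either baseline, and the two percentages differ, so it is worth spelling out the arithmetic. A minimal sketch using the rounded counts above; the class and method names are ours and purely illustrative:&lt;/p>
&lt;/div>

```java
public class CompilationGap {
    // Percentage by which `value` is lower than `baseline`.
    static double percentFewer(double baseline, double value) {
        return (baseline - value) / baseline * 100.0;
    }

    // Percentage by which `value` is higher than `baseline`.
    static double percentMore(double baseline, double value) {
        return (value - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double spring = 17_600;   // approximate total compilations, spring4-jvm
        double quarkus = 12_500;  // approximate total compilations, quarkus3-jvm
        System.out.printf("Quarkus needs %.0f%% fewer compilations than Spring%n",
                percentFewer(spring, quarkus));
        System.out.printf("Spring needs %.0f%% more compilations than Quarkus%n",
                percentMore(quarkus, spring));
    }
}
```

&lt;div class="paragraph">
&lt;p>With these inputs the first line reports 29% and the second 41%: the same gap, measured from opposite ends.&lt;/p>
&lt;/div>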
&lt;div class="paragraph">
&lt;p>Spring Boot’s Tomcat creates a platform thread per HTTP connection. With 100 connections, pidstat shows &lt;strong>114 threads with %CPU &amp;gt; 0&lt;/strong> during warmup. Quarkus shows &lt;strong>39&lt;/strong>. Averaged per worker thread across the warmup:&lt;/p>
&lt;/div>
&lt;table class="tableblock frame-all grid-all stretch">
&lt;colgroup>
&lt;col style="width: 40%;"/>
&lt;col style="width: 20%;"/>
&lt;col style="width: 20%;"/>
&lt;col style="width: 20%;"/>
&lt;/colgroup>
&lt;thead>
&lt;tr>
&lt;th class="tableblock halign-left valign-top">&lt;/th>
&lt;th class="tableblock halign-left valign-top">Avg voluntary cswch/s&lt;/th>
&lt;th class="tableblock halign-left valign-top">Avg involuntary cswch/s&lt;/th>
&lt;th class="tableblock halign-left valign-top">Ratio nvol/vol&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">spring4-jvm&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">105&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">864&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">&lt;strong>8.3&lt;/strong>&lt;/p>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">quarkus3-jvm&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">502&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">43&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">0.1&lt;/p>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;div class="paragraph">
&lt;p>Spring worker threads are involuntarily preempted 20x more often than Quarkus threads. Quarkus threads yield the CPU voluntarily more than 4x as often: &lt;strong>they complete request work faster and return to I/O wait, leaving scheduling gaps for the C2 compiler&lt;/strong>.&lt;/p>
&lt;/div>
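&lt;div class="paragraph">
&lt;p>On Linux, the raw counters behind these columns are exposed per thread in procfs as &lt;code>voluntary_ctxt_switches&lt;/code> and &lt;code>nonvoluntary_ctxt_switches&lt;/code>. The sketch below is an illustration, not the tooling used for the benchmark: it parses those two counters for the calling thread and computes the involuntary/voluntary ratio. The class name, the helper names, and the choice of &lt;code>/proc/thread-self/status&lt;/code> are our assumptions; it reports cumulative counts, not the per-second rates that pidstat prints.&lt;/p>
&lt;/div>

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CtxSwitches {
    // Extracts a numeric counter such as "voluntary_ctxt_switches" from the
    // text of a Linux /proc/.../status file; returns -1 if the key is absent.
    static long counter(String statusText, String key) {
        for (String line : statusText.split("\n")) {
            if (line.startsWith(key + ":")) {
                return Long.parseLong(line.substring(key.length() + 1).trim());
            }
        }
        return -1;
    }

    // Involuntary-to-voluntary ratio, as in the "Ratio nvol/vol" column.
    static double ratio(long involuntary, long voluntary) {
        return voluntary == 0 ? Double.POSITIVE_INFINITY : (double) involuntary / voluntary;
    }

    public static void main(String[] args) throws Exception {
        Path status = Path.of("/proc/thread-self/status"); // Linux only
        if (!Files.isReadable(status)) {
            System.out.println("No procfs here; run this on Linux.");
            return;
        }
        String text = Files.readString(status);
        long vol = counter(text, "voluntary_ctxt_switches");
        long nvol = counter(text, "nonvoluntary_ctxt_switches");
        System.out.printf("voluntary=%d involuntary=%d ratio=%.2f%n", vol, nvol, ratio(nvol, vol));
    }
}
```

&lt;div class="paragraph">
&lt;p>A ratio well above 1 means the thread is mostly being descheduled by the kernel rather than blocking on its own, which is the pattern the Spring worker threads show in the table.&lt;/p>
&lt;/div>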
&lt;/div>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_virtual_threads_change_the_equation">Virtual threads change the equation&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="paragraph">
&lt;p>Virtual threads multiplex onto a small number of carrier threads. With 100 HTTP connections, only ~4 carriers are on-CPU instead of 100+ platform threads. Adding virtual threads to both frameworks:&lt;/p>
&lt;/div>
&lt;div class="imageblock">
&lt;div class="content">
&lt;img src="warmup-tps-spring-vt.png" alt="Spring: platform threads vs virtual threads"/>
&lt;/div>
&lt;/div>
&lt;div class="paragraph">
&lt;p>&lt;strong>Spring with virtual threads reaches ~9,900 req/s and warms up within 30 seconds.&lt;/strong> With platform threads, Spring is still at ~3,300 req/s at the 30-second mark, and is still climbing during the load test, never reaching a stable peak within the entire benchmark window.&lt;/p>
&lt;/div>
&lt;div class="imageblock">
&lt;div class="content">
&lt;img src="warmup-tps-quarkus-vt.png" alt="Quarkus: platform threads vs virtual threads"/>
&lt;/div>
&lt;/div>
&lt;div class="paragraph">
&lt;p>&lt;strong>Quarkus with virtual threads reaches peak in ~15 seconds instead of ~60 seconds&lt;/strong>, but both converge to ~15,000 req/s. The benefit is warmup speed, not peak throughput.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>Virtual threads reduce the on-CPU thread count from 114 to 18 for Spring, and from 39 to 15 for Quarkus. Fewer threads competing for 4 cores means the C2 compiler threads get more CPU time during warmup.&lt;/p>
&lt;/div>
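&lt;div class="paragraph">
&lt;p>To make the multiplexing concrete, here is a small sketch (not the benchmark code) that runs 100 blocking tasks on virtual threads using the standard &lt;code>Executors.newVirtualThreadPerTaskExecutor()&lt;/code>. It assumes JDK 21 or later; the class name, the 50 ms sleep standing in for a blocking database call, and the task count are illustrative choices of ours.&lt;/p>
&lt;/div>

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadSketch {
    // Runs `connections` blocking tasks, one virtual thread each, and returns
    // how many of them actually executed on a virtual thread.
    static int runBlockingTasks(int connections) {
        AtomicInteger ranOnVirtual = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i != connections; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(50); // stands in for a blocking call; parks the virtual thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    if (Thread.currentThread().isVirtual()) {
                        ranOnVirtual.incrementAndGet();
                    }
                });
            }
        } // close() waits for every submitted task to finish
        return ranOnVirtual.get();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("tasks run on virtual threads: " + runBlockingTasks(100));
        System.out.println("available processors: " + cores);
    }
}
```

&lt;div class="paragraph">
&lt;p>By default the JDK sizes the carrier pool to the number of available processors, so the 100 tasks here share a handful of OS threads instead of occupying 100 of them. That is the property that leaves room for the compiler threads.&lt;/p>
&lt;/div>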
&lt;/div>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_what_about_leyden">What about Leyden?&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="paragraph">
&lt;p>&lt;a href="https://openjdk.org/projects/leyden/">Project Leyden&lt;/a> in JDK 25 caches class loading, linking, and method profiling data in an AOT cache (&lt;a href="https://openjdk.org/jeps/483">JEP 483&lt;/a>, &lt;a href="https://openjdk.org/jeps/514">JEP 514&lt;/a>, &lt;a href="https://openjdk.org/jeps/515">JEP 515&lt;/a>). &lt;strong>AOT Code Compilation&lt;/strong> (caching the actual C2-compiled native code) is &lt;a href="https://openjdk.org/jeps/8335368">not yet in JDK 25&lt;/a> and is available on the Leyden premain branch for future releases.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>To be clear: &lt;strong>Leyden is not at fault here.&lt;/strong> The AOT cache in JDK 25 is designed to accelerate startup, and it does: Spring starts in 6.4s with Leyden vs 9.3s without. The warmup behavior we observe is a consequence of how the training run is performed, not a limitation of Leyden itself.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>Our benchmark follows the &lt;a href="https://docs.spring.io/spring-boot/reference/packaging/aot-cache.html">official Spring Boot AOT cache documentation&lt;/a>, which recommends &lt;code>-Dspring.context.exit=onRefresh&lt;/code> for the training run. This exercises class loading and context initialization, but &lt;strong>does not send any HTTP traffic&lt;/strong>. The AOT cache therefore contains profiling data for startup methods, not for the request-handling hot path. Without cached compiled code, &lt;strong>C2 still needs to compile hot methods at runtime&lt;/strong>, and the profiling data from a startup-only training run does not help with that.&lt;/p>
&lt;/div>
&lt;div class="imageblock">
&lt;div class="content">
&lt;img src="warmup-tps-spring-leyden.png" alt="Spring: jvm vs Leyden"/>
&lt;/div>
&lt;/div>
&lt;div class="paragraph">
&lt;p>&lt;strong>spring4-leyden measures 2,921 TPS after the 2-minute warmup, less than half of spring4-jvm (7,850).&lt;/strong> The application has not yet reached peak performance when the load test starts. During warmup, the two curves are close for the first 30 seconds, then spring4-jvm pulls ahead while spring4-leyden stalls at ~2,500-3,000.&lt;/p>
&lt;/div>
&lt;div class="imageblock">
&lt;div class="content">
&lt;img src="warmup-tps-quarkus-leyden.png" alt="Quarkus: jvm vs Leyden"/>
&lt;/div>
&lt;/div>
&lt;div class="paragraph">
&lt;p>&lt;strong>quarkus3-leyden takes ~90 seconds to reach peak, vs ~60 seconds without Leyden.&lt;/strong> The final throughput is also lower (13,060 vs 14,710). Quarkus’s Leyden integration uses &lt;a href="https://quarkus.io/blog/leyden-2/">&lt;code>@QuarkusIntegrationTest&lt;/code>&lt;/a> for training, which exercises actual HTTP endpoints, providing more representative profiling data than a startup-only training run. Yet even with this richer training data, the warmup is slower with Leyden. The AOT method profiling in JDK 25 captures what to compile, but the compiled code still needs to be produced at runtime by C2.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>The C2 compiler thread chart for the Leyden variants shows why:&lt;/p>
&lt;/div>
&lt;div class="imageblock">
&lt;div class="content">
&lt;img src="c2-wait-leyden.png" alt="C2 thread CPU and wait: spring4-leyden vs quarkus3-leyden"/>
&lt;/div>
&lt;/div>
&lt;div class="paragraph">
&lt;p>The STARTUP region is visibly shorter with Leyden, though the load generator adds its own startup overhead before the first request arrives. But the C2 threads are starved even more severely during warmup. &lt;strong>spring4-leyden shows near-200% wait through the entire warmup and into the load test.&lt;/strong> &lt;strong>quarkus3-leyden recovers, but takes ~90 seconds to reach peak instead of ~60 seconds without Leyden: 50% longer.&lt;/strong>&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>As with the non-Leyden case, &lt;strong>virtual threads fix the problem for both frameworks&lt;/strong>: spring4-virtual-leyden delivers 9,264 TPS (vs 2,921 without virtual threads) and quarkus3-virtual-leyden delivers 13,410 TPS (vs 13,060). Fewer threads on CPU means the &lt;strong>C2 compiler gets the scheduling gaps it needs&lt;/strong>, regardless of whether Leyden is enabled.&lt;/p>
&lt;/div>
&lt;div class="sect2">
&lt;h3 id="_confirmation_experiment_xbatch">Confirmation experiment: -Xbatch&lt;/h3>
&lt;div class="paragraph">
&lt;p>spring4-leyden is the most severely affected configuration. The pidstat data shows C2 threads spending most of their time waiting for CPU. To confirm that this CPU starvation is the bottleneck, we ran it with &lt;code>-Xbatch&lt;/code>. This JVM flag forces application threads to block when they trigger a compilation, effectively backpressuring them and freeing CPU for the compiler. It is not a production-ready fix, but it answers a specific question: if the compiler gets enough CPU time, does the throughput recover?&lt;/p>
&lt;/div>
&lt;table class="tableblock frame-all grid-all stretch">
&lt;colgroup>
&lt;col style="width: 50%;"/>
&lt;col style="width: 25%;"/>
&lt;col style="width: 25%;"/>
&lt;/colgroup>
&lt;thead>
&lt;tr>
&lt;th class="tableblock halign-left valign-top">&lt;/th>
&lt;th class="tableblock halign-left valign-top">spring4-leyden&lt;/th>
&lt;th class="tableblock halign-left valign-top">spring4-leyden + &lt;code>-Xbatch&lt;/code>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">Throughput&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">2,921 TPS&lt;/p>&lt;/td>
&lt;td class="tableblock halign-left valign-top">&lt;p class="tableblock">&lt;strong>~7,900 TPS&lt;/strong>&lt;/p>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;div class="paragraph">
&lt;p>&lt;strong>Throughput jumps from 2,921 to ~7,900 TPS.&lt;/strong> The throughput loss comes from the compiler not getting enough CPU.&lt;/p>
&lt;/div>
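&lt;div class="paragraph">
&lt;p>When repeating this kind of experiment, it helps to confirm from inside the JVM that the flag took effect. The sketch below is an illustration rather than part of the benchmark. It assumes a HotSpot JVM and reads two VM options through the &lt;code>HotSpotDiagnosticMXBean&lt;/code>: &lt;code>BackgroundCompilation&lt;/code>, which &lt;code>-Xbatch&lt;/code> turns off, and &lt;code>CICompilerCount&lt;/code>, the total number of JIT compiler threads. The class and helper names are ours.&lt;/p>
&lt;/div>

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CompilerFlags {
    // Reads a HotSpot VM option by name, e.g. "BackgroundCompilation".
    // Throws IllegalArgumentException if the running JVM has no such option.
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // -Xbatch disables background compilation, so this prints "false" under -Xbatch.
        System.out.println("BackgroundCompilation = " + vmOption("BackgroundCompilation"));
        // Total number of JIT compiler threads (C1 plus C2) chosen by the JVM.
        System.out.println("CICompilerCount = " + vmOption("CICompilerCount"));
    }
}
```

&lt;div class="paragraph">
&lt;p>Running it once with and once without &lt;code>-Xbatch&lt;/code> should flip the first value, which is a cheap sanity check before comparing throughput numbers.&lt;/p>
&lt;/div>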
&lt;div class="admonitionblock note">
&lt;table>
&lt;tbody>&lt;tr>
&lt;td class="icon">
&lt;div class="title">Note&lt;/div>
&lt;/td>
&lt;td class="content">
&lt;div class="paragraph">
&lt;p>&lt;strong>Troubleshooting hints:&lt;/strong> To diagnose C2 starvation in your own application, enable &lt;a href="https://dev.java/learn/jvm/jfr/">JDK Flight Recorder&lt;/a> with &lt;code>-XX:StartFlightRecording=settings=profile,dumponexit=true&lt;/code> and look at:&lt;/p>
&lt;/div>
&lt;div class="ulist">
&lt;ul>
&lt;li>
&lt;p>&lt;code>jdk.CompilerStatistics&lt;/code>: track &lt;code>compileCount&lt;/code> and &lt;code>nmethodCodeSize&lt;/code> over time. If the compiled code size is still growing during your measurement window, the compiler hasn’t finished.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>jdk.Compilation&lt;/code>: individual compilations exceeding 100ms (with &lt;code>settings=profile&lt;/code>). &lt;strong>Wall-clock durations of 10+ seconds indicate the compiler thread is being preempted.&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>jdk.CompilerQueueUtilization&lt;/code>: if the C2 &lt;code>queueSize&lt;/code> is non-zero during your measurement window, methods are waiting to be compiled.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div class="paragraph">
&lt;p>Combine with &lt;code>pidstat -t -u -w 1&lt;/code> to see C2 thread &lt;code>%wait&lt;/code> and worker thread involuntary context switches.&lt;/p>
&lt;/div>
&lt;/td>
&lt;/tr>
&lt;/tbody>&lt;/table>
&lt;/div>
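&lt;div class="paragraph">
&lt;p>The first hint can be scripted with the JFR consumer API (&lt;code>jdk.jfr.consumer.RecordingFile&lt;/code>). The following is a rough sketch under a few assumptions: the recording was made with &lt;code>settings=profile&lt;/code>, the &lt;code>jdk.CompilerStatistics&lt;/code> events are read in chronological order, and "still compiling" is approximated by comparing only the last two samples. The class name and the &lt;code>stillGrowing&lt;/code> helper are ours, not part of any tool used in this post.&lt;/p>
&lt;/div>

```java
import java.nio.file.Path;
import java.util.stream.LongStream;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class CompilerStatsReader {
    // True if the compiled code size grew between the last two samples,
    // i.e. the JIT was still producing code at the end of the recording.
    static boolean stillGrowing(long[] codeSizes) {
        int n = codeSizes.length;
        if (n > 1) {
            return codeSizes[n - 1] > codeSizes[n - 2];
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java CompilerStatsReader.java recording.jfr");
            return;
        }
        LongStream.Builder sizes = LongStream.builder();
        try (RecordingFile file = new RecordingFile(Path.of(args[0]))) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                if (!event.getEventType().getName().equals("jdk.CompilerStatistics")) {
                    continue; // skip every other event type
                }
                long count = event.getLong("compileCount");
                long size = event.getLong("nmethodCodeSize");
                System.out.printf("%s compileCount=%d nmethodCodeSize=%d%n",
                        event.getStartTime(), count, size);
                sizes.add(size);
            }
        }
        System.out.println("code size still growing at end: "
                + stillGrowing(sizes.build().toArray()));
    }
}
```

&lt;div class="paragraph">
&lt;p>Run it against the dumped recording with &lt;code>java CompilerStatsReader.java recording.jfr&lt;/code>. If the last line reports growth at the end of your measurement window, treat the measured throughput as a warmup number rather than a steady-state one.&lt;/p>
&lt;/div>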
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_the_os_scheduler_matters">The OS scheduler matters&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="paragraph">
&lt;p>In a preliminary test with &lt;strong>RHEL 10&lt;/strong> (kernel 6.12, EEVDF scheduler), spring4-leyden delivers 4,411 TPS, a &lt;strong>36% improvement&lt;/strong> over the 3,231 TPS on RHEL 9.6 (kernel 5.14, CFS) on the same hardware and topology. &lt;strong>The newer scheduler gives C2 threads more CPU time under contention.&lt;/strong> It does not eliminate the problem, but it reduces the severity.&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_does_this_happen_in_production">Does this happen in production?&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="paragraph">
&lt;p>Java microservices on containerized platforms commonly run with 4-8 CPU cores and thread pools of 200+. The same dynamic applies: &lt;strong>C2 threads are starved, compilations take seconds instead of milliseconds, and response times may never stabilize.&lt;/strong> Pods get killed and restarted before the compiler finishes, a cycle that repeats indefinitely.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>C2 starvation has a second effect: while methods are stuck at tier 3 (C1 with profiling), every thread updates shared &lt;a href="https://wiki.openjdk.org/spaces/HotSpot/pages/13729947/MethodData">MethodData&lt;/a> counters concurrently, causing cache coherency overhead that actively degrades scalability with more threads (&lt;a href="https://bugs.openjdk.org/browse/JDK-8134940">JDK-8134940&lt;/a>, &lt;a href="https://bugs.openjdk.org/browse/JDK-8348027">JDK-8348027&lt;/a>). For a deeper analysis, see &lt;a href="https://redhatperf.github.io/post/method-data-scalability/">Sharing is (S)Caring: How Tiered Compilation Affects Java Application Scalability&lt;/a>.&lt;/p>
&lt;/div>
&lt;div class="paragraph">
&lt;p>The JVM keeps serving requests while broken: liveness probes pass, the application just runs slow code. Without &lt;a href="https://www.brendangregg.com/activebenchmarking.html">active benchmarking&lt;/a>, C2 starvation is invisible.&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_takeaways">Takeaways&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="ulist">
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Throughput averages hide warmup problems.&lt;/strong> Time to peak performance is a separate dimension. An application that delivers good steady-state throughput may take minutes to get there, or never reach it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The C2 compiler needs CPU time.&lt;/strong> When too many threads compete for the same cores, the C2 threads get preempted. The application runs slower code for longer, and the problem compounds through &lt;a href="https://bugs.openjdk.org/browse/JDK-8134940">MethodData contention&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Virtual threads help&lt;/strong> by reducing the number of on-CPU threads, leaving more scheduling time for the compiler. Both frameworks warm up faster with virtual threads.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Leaner request processing helps.&lt;/strong> Quarkus worker threads yield the CPU voluntarily 4x more often than Spring’s, with 20x fewer involuntary preemptions. The compiler gets scheduling gaps to work in.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Leyden does not yet cache compiled code.&lt;/strong> Until AOT Code Compilation (&lt;a href="https://openjdk.org/jeps/8335368">JEP draft 8335368&lt;/a>) ships, Leyden can make the warmup problem worse if C2 is starved. Virtual threads can recover the lost throughput.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The OS scheduler matters.&lt;/strong> RHEL 10’s EEVDF scheduler shows a 36% throughput improvement for the worst case. The scheduling algorithm determines how background compiler threads fare under contention.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;div class="sect1">
&lt;h2 id="_tools_used">Tools used&lt;/h2>
&lt;div class="sectionbody">
&lt;div class="ulist">
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;a href="https://dev.java/learn/jvm/jfr/">JDK Flight Recorder&lt;/a>&lt;/strong> (&lt;code>-XX:StartFlightRecording=settings=profile,dumponexit=true&lt;/code>): &lt;code>jdk.CompilerStatistics&lt;/code> for compiled code size over time, &lt;code>jdk.Compilation&lt;/code> for individual compilation durations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;a href="https://github.com/sysstat/sysstat">pidstat&lt;/a>&lt;/strong> (&lt;code>-t -u -w&lt;/code>): per-thread CPU utilization including &lt;code>%wait&lt;/code> (time in run queue waiting for CPU) and context switches (voluntary / involuntary). Part of the &lt;a href="https://github.com/sysstat/sysstat">sysstat&lt;/a> package.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;a href="https://github.com/Hyperfoil/Hyperfoil">Hyperfoil&lt;/a>&lt;/strong>: load generator (100 concurrent connections, fixed-thread mode); per-second series data used for warmup throughput curves&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/div>
&lt;div class="paragraph">
&lt;p>The benchmark, methodology, and data are public: see &lt;a href="https://github.com/quarkusio/spring-quarkus-perf-comparison/issues/591">issue #591&lt;/a> and &lt;a href="https://github.com/quarkusio/spring-quarkus-perf-comparison/issues/420">issue #420&lt;/a>. The benchmark code used in this analysis is at &lt;a href="https://github.com/quarkusio/spring-quarkus-perf-comparison/tree/38345ca">commit 38345ca&lt;/a>.&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div></description></item></channel></rss>