
[SPARK-56320] Introduce JIT compilation metrics#55159

Open
leixm wants to merge 1 commit into apache:master from leixm:SPARK-56320

Conversation

leixm (Contributor) commented Apr 2, 2026

What changes were proposed in this pull request?

Add three JIT-related executor metrics to Spark's existing executor metrics framework:

JITCompilationTime: Approximate accumulated elapsed time (in milliseconds) that the JIT compiler has spent compiling, sourced from CompilationMXBean.
CodeCacheUsed: Currently used code cache memory, in bytes. On Java 8, read from the single "Code Cache" memory pool; on Java 9+, the sum of the three segmented code cache pools.
CodeCacheMax: Maximum code cache capacity, in bytes, resolved from the same pools as CodeCacheUsed.
These metrics follow the same pattern as the existing GarbageCollectionMetrics — collected via standard JVM MXBeans, exposed in event logs, Spark UI REST API, and the Prometheus endpoint. Java 8 (single Code Cache) and Java 9+ (segmented Code Cache) are handled transparently.
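As a rough illustration (not the PR's actual code), the three values above can be read from the standard `java.lang.management` MXBeans; the class name `JitMetricsSketch` is hypothetical:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Hedged sketch of reading the three metrics from standard JVM MXBeans.
public final class JitMetricsSketch {

    // "Code Cache" is the single pool on Java 8; Java 9+ exposes segmented
    // pools named "CodeHeap 'non-nmethods'", "CodeHeap 'profiled nmethods'",
    // and "CodeHeap 'non-profiled nmethods'".
    private static boolean isCodeCachePool(MemoryPoolMXBean p) {
        String name = p.getName();
        return name.equals("Code Cache") || name.startsWith("CodeHeap");
    }

    // Accumulated JIT compilation time in milliseconds, or -1 if the JVM
    // does not support compilation-time monitoring.
    public static long jitCompilationTime() {
        CompilationMXBean bean = ManagementFactory.getCompilationMXBean();
        return (bean != null && bean.isCompilationTimeMonitoringSupported())
                ? bean.getTotalCompilationTime() : -1L;
    }

    public static long codeCacheUsed() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(JitMetricsSketch::isCodeCachePool)
                .mapToLong(p -> p.getUsage().getUsed()).sum();
    }

    public static long codeCacheMax() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(JitMetricsSketch::isCodeCachePool)
                .mapToLong(p -> p.getUsage().getMax()).sum();
    }
}
```

Summing over all pools whose name starts with "CodeHeap" handles both the default segmented layout and JVMs started with -XX:-SegmentedCodeCache, which fall back to a single pool.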

Why are the changes needed?

Spark already tracks GC metrics (7 metrics) to help diagnose GC-related performance issues. However, JIT compilation — the other critical JVM subsystem that directly impacts performance — has zero observability in Spark today.

This matters for Spark workloads for two reasons:

Code Cache exhaustion causes severe performance degradation. When the Code Cache fills up, the JVM disables the JIT compiler (CodeCache is full. Compiler has been disabled) and previously compiled hot code falls back to interpreted execution, causing sudden and significant performance drops. Spark's Whole-Stage CodeGen dynamically generates classes at runtime, which accelerates Code Cache consumption in long-running executors. Without CodeCacheUsed and CodeCacheMax, operators cannot compute Code Cache utilization or set alerts before exhaustion happens.

JIT compilation steals CPU from user computation. In interactive workloads (e.g., Spark SQL via JDBC/Thrift Server or Spark Connect), JIT compilation overhead can dominate the latency of the first few queries. JITCompilationTime allows users to quantify this "warm-up" cost.

Currently, obtaining this information requires SSH-ing into executor nodes and running jstat/jcmd, or configuring JMX remote access — neither is practical at scale. Integrating these metrics into Spark's metrics framework provides the same level of observability that GC metrics already enjoy: available in the Spark UI, event logs (for post-mortem analysis), and Prometheus (for alerting).

Note that while Code Cache memory is included in the aggregate JVMOffHeapMemory metric, that metric also includes Metaspace and Compressed Class Space. When JVMOffHeapMemory grows, there is no way to determine whether the increase is from Code Cache or Metaspace without these dedicated metrics.
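To make the aggregation point concrete, a per-pool breakdown of the non-heap area (again a sketch, with a hypothetical class name) shows the individual pools that an aggregate like JVMOffHeapMemory lumps together:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch: enumerating non-heap pools individually, since an aggregate
// total cannot attribute growth to Code Cache vs. Metaspace.
public final class NonHeapBreakdown {
    public static Map<String, Long> poolUsage() {
        Map<String, Long> out = new LinkedHashMap<>();
        for (MemoryPoolMXBean p : ManagementFactory.getMemoryPoolMXBeans()) {
            if (p.getType() == MemoryType.NON_HEAP) {
                // Typical pool names: "Metaspace", "Compressed Class Space",
                // and the code cache pool(s).
                out.put(p.getName(), p.getUsage().getUsed());
            }
        }
        return out;
    }
}
```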

Does this PR introduce any user-facing change?

These metrics are exposed through:
Event logs: included in ExecutorMetrics for all executor heartbeats and stage completions.
REST API: available in executor summary and stage detail endpoints (e.g., /api/v1/applications/{appId}/executors).
Prometheus endpoint: exposed at /metrics/executors/prometheus as JITCompilationTime_seconds_total, CodeCacheUsed_bytes, and CodeCacheMax_bytes.

How was this patch tested?

Covered by JsonProtocolSuite and HistoryServerSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor with Claude claude-4.6-opus
