
[SPARK-56320] Introduce JIT compilation metrics#55159

Open
leixm wants to merge 1 commit into apache:master from leixm:SPARK-56320

Conversation

leixm (Contributor) commented Apr 2, 2026

What changes were proposed in this pull request?

Add three JIT-related executor metrics to Spark's existing executor metrics framework:

JITCompilationTime: Approximate accumulated elapsed time (in milliseconds) that the JIT compiler has spent compiling, sourced from CompilationMXBean.
CodeCacheUsed: Currently used code cache memory, in bytes. On Java 8, read from the single "Code Cache" memory pool; on Java 9+, the sum of the three segmented code cache pools.
CodeCacheMax: Maximum code cache capacity, in bytes, resolved from the same pools as CodeCacheUsed.
These metrics follow the same pattern as the existing GarbageCollectionMetrics — collected via standard JVM MXBeans, exposed in event logs, Spark UI REST API, and the Prometheus endpoint. Java 8 (single Code Cache) and Java 9+ (segmented Code Cache) are handled transparently.
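As a rough illustration (not the PR's actual code), the three values above can be read from the standard `java.lang.management` MXBeans; the class name `JitMetricsSketch` is hypothetical:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Hedged sketch of reading the three metrics from standard JVM MXBeans.
public final class JitMetricsSketch {

    // "Code Cache" is the single pool on Java 8; Java 9+ exposes segmented
    // pools named "CodeHeap 'non-nmethods'", "CodeHeap 'profiled nmethods'",
    // and "CodeHeap 'non-profiled nmethods'".
    private static boolean isCodeCachePool(MemoryPoolMXBean p) {
        String name = p.getName();
        return name.equals("Code Cache") || name.startsWith("CodeHeap");
    }

    // Accumulated JIT compilation time in milliseconds, or -1 if the JVM
    // does not support compilation-time monitoring.
    public static long jitCompilationTime() {
        CompilationMXBean bean = ManagementFactory.getCompilationMXBean();
        return (bean != null && bean.isCompilationTimeMonitoringSupported())
                ? bean.getTotalCompilationTime() : -1L;
    }

    public static long codeCacheUsed() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(JitMetricsSketch::isCodeCachePool)
                .mapToLong(p -> p.getUsage().getUsed()).sum();
    }

    public static long codeCacheMax() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(JitMetricsSketch::isCodeCachePool)
                .mapToLong(p -> p.getUsage().getMax()).sum();
    }
}
```

Summing over all pools whose name starts with "CodeHeap" handles both the default segmented layout and JVMs started with -XX:-SegmentedCodeCache, which fall back to a single pool.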

Why are the changes needed?

Spark already tracks GC metrics (7 metrics) to help diagnose GC-related performance issues. However, JIT compilation — the other critical JVM subsystem that directly impacts performance — has zero observability in Spark today.

This matters for Spark workloads for two reasons:

Code Cache exhaustion causes severe performance degradation. When the Code Cache fills up, the JVM disables the JIT compiler (CodeCache is full. Compiler has been disabled) and previously compiled hot code falls back to interpreted execution, causing sudden and significant performance drops. Spark's Whole-Stage CodeGen dynamically generates classes at runtime, which accelerates Code Cache consumption in long-running executors. Without CodeCacheUsed and CodeCacheMax, operators cannot compute Code Cache utilization or set alerts before exhaustion happens.

JIT compilation steals CPU from user computation. In interactive workloads (e.g., Spark SQL via JDBC/Thrift Server or Spark Connect), JIT compilation overhead can dominate the latency of the first few queries. JITCompilationTime allows users to quantify this "warm-up" cost.

Currently, obtaining this information requires SSH-ing into executor nodes and running jstat/jcmd, or configuring JMX remote access — neither is practical at scale. Integrating these metrics into Spark's metrics framework provides the same level of observability that GC metrics already enjoy: available in the Spark UI, event logs (for post-mortem analysis), and Prometheus (for alerting).

Note that while Code Cache memory is included in the aggregate JVMOffHeapMemory metric, that metric also includes Metaspace and Compressed Class Space. When JVMOffHeapMemory grows, there is no way to determine whether the increase is from Code Cache or Metaspace without these dedicated metrics.
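To make the aggregation point concrete, a per-pool breakdown of the non-heap area (again a sketch, with a hypothetical class name) shows the individual pools that an aggregate like JVMOffHeapMemory lumps together:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch: enumerating non-heap pools individually, since an aggregate
// total cannot attribute growth to Code Cache vs. Metaspace.
public final class NonHeapBreakdown {
    public static Map<String, Long> poolUsage() {
        Map<String, Long> out = new LinkedHashMap<>();
        for (MemoryPoolMXBean p : ManagementFactory.getMemoryPoolMXBeans()) {
            if (p.getType() == MemoryType.NON_HEAP) {
                // Typical pool names: "Metaspace", "Compressed Class Space",
                // and the code cache pool(s).
                out.put(p.getName(), p.getUsage().getUsed());
            }
        }
        return out;
    }
}
```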

Does this PR introduce any user-facing change?

These metrics are exposed through:
Event logs: included in ExecutorMetrics for all executor heartbeats and stage completions.
REST API: available in executor summary and stage detail endpoints (e.g., /api/v1/applications/{appId}/executors).
Prometheus endpoint: exposed at /metrics/executors/prometheus as JITCompilationTime_seconds_total, CodeCacheUsed_bytes, and CodeCacheMax_bytes.

How was this patch tested?

Covered by JsonProtocolSuite and HistoryServerSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor with Claude claude-4.6-opus
