[SPARK-56320] Introduce JIT compilation metrics #55159
Open
leixm wants to merge 1 commit into apache:master
What changes were proposed in this pull request?
Add three JIT-related executor metrics to Spark's existing executor metrics framework:
JITCompilationTime: Approximate accumulated elapsed time (in milliseconds) that the JIT compiler has spent compiling, sourced from CompilationMXBean.
CodeCacheUsed: Code cache memory currently used, in bytes. On Java 8, read from the single "Code Cache" memory pool; on Java 9+, the sum of the three segmented code cache pools.
CodeCacheMax: Maximum code cache capacity in bytes. Same pool resolution as CodeCacheUsed.
These metrics follow the same pattern as the existing GarbageCollectionMetrics — collected via standard JVM MXBeans, exposed in event logs, Spark UI REST API, and the Prometheus endpoint. Java 8 (single Code Cache) and Java 9+ (segmented Code Cache) are handled transparently.
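The collection path can be sketched with the standard JDK MXBeans. This is a minimal standalone sketch, not Spark's actual ExecutorMetricType implementation; the class and method names here are illustrative, and the pool names are the HotSpot defaults:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class JitMetricsSketch {

    /** Accumulated JIT compilation time in milliseconds, or -1 if unsupported. */
    static long jitCompilationTimeMs() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1L;
        }
        return jit.getTotalCompilationTime();
    }

    /** Returns {used, max} in bytes, summed over all code cache pools. */
    static long[] codeCacheUsage() {
        long used = 0L;
        long max = 0L;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Java 8 exposes a single "Code Cache" pool; Java 9+ splits it into
            // three "CodeHeap '...'" pools, which are summed here.
            if (name.equals("Code Cache") || name.startsWith("CodeHeap")) {
                used += pool.getUsage().getUsed();
                long poolMax = pool.getUsage().getMax();
                if (poolMax > 0) {  // -1 means the pool's max is undefined
                    max += poolMax;
                }
            }
        }
        return new long[] {used, max};
    }

    public static void main(String[] args) {
        long[] cc = codeCacheUsage();
        System.out.println("JITCompilationTime=" + jitCompilationTimeMs()
            + " CodeCacheUsed=" + cc[0] + " CodeCacheMax=" + cc[1]);
    }
}
```

Matching both `"Code Cache"` and the `"CodeHeap"` prefix is what lets one code path handle Java 8 and Java 9+ transparently.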
Why are the changes needed?
Spark already tracks GC metrics (7 metrics) to help diagnose GC-related performance issues. However, JIT compilation — the other critical JVM subsystem that directly impacts performance — has zero observability in Spark today.
This matters for Spark workloads for two reasons:
Code Cache exhaustion causes severe performance degradation. When the Code Cache fills up, the JVM disables the JIT compiler (CodeCache is full. Compiler has been disabled) and previously compiled hot code falls back to interpreted execution, causing sudden and significant performance drops. Spark's Whole-Stage CodeGen dynamically generates classes at runtime, which accelerates Code Cache consumption in long-running executors. Without CodeCacheUsed and CodeCacheMax, operators cannot compute Code Cache utilization or set alerts before exhaustion happens.
JIT compilation steals CPU from user computation. In interactive workloads (e.g., Spark SQL via JDBC/Thrift Server or Spark Connect), JIT compilation overhead can dominate the latency of the first few queries. JITCompilationTime allows users to quantify this "warm-up" cost.
Currently, obtaining this information requires SSH-ing into executor nodes and running jstat/jcmd, or configuring JMX remote access, neither of which is practical at scale. Integrating these metrics into Spark's metrics framework provides the same level of observability that GC metrics already enjoy: available in the Spark UI, event logs (for post-mortem analysis), and Prometheus (for alerting).
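For reference, the per-host workaround looks roughly like this (illustrative; `<pid>` stands for the executor JVM's process id):

```shell
# Manual, per-host inspection of JIT state; does not scale across a cluster.
jstat -compiler <pid>            # cumulative compile task counts and time
jcmd <pid> Compiler.codecache    # code cache size, used, and free space
```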
Note that while Code Cache memory is included in the aggregate JVMOffHeapMemory metric, that metric also includes Metaspace and Compressed Class Space. When JVMOffHeapMemory grows, there is no way to determine whether the increase is from Code Cache or Metaspace without these dedicated metrics.
Does this PR introduce any user-facing change?
Yes. These metrics are exposed through:
Event logs: included in ExecutorMetrics for all executor heartbeats and stage completions.
REST API: available in executor summary and stage detail endpoints (e.g., /api/v1/applications/{appId}/executors).
Prometheus endpoint: exposed at /metrics/executors/prometheus as JITCompilationTime_seconds_total, CodeCacheUsed_bytes, and CodeCacheMax_bytes.
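As a usage sketch, the Prometheus names above could feed an alerting rule on code cache utilization before exhaustion disables the compiler. This is a hypothetical rule, not part of the PR; the metric names are taken from this description, and any prefix the exporter adds would need to be included in the expression:

```yaml
# Hypothetical Prometheus alerting rule built on the new executor metrics.
groups:
  - name: spark-executor-jit
    rules:
      - alert: ExecutorCodeCacheNearFull
        expr: CodeCacheUsed_bytes / CodeCacheMax_bytes > 0.9
        for: 10m
        annotations:
          summary: "Executor code cache above 90%; JIT may soon be disabled"
```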
How was this patch tested?
JsonProtocolSuite and HistoryServerSuite
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor with Claude claude-4.6-opus