Add Pressure Stall Information (PSI) metrics (reopened #2996) #3068
alpineQ wants to merge 14 commits into open-telemetry:main
# Conflicts: # docs/system/system-metrics.md
Co-authored-by: James Thompson <thompson.tomo@outlook.com>
Co-authored-by: James Thompson <thompson.tomo@outlook.com>
|
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
|
@thompson-tomo @braydonk @trask |
|
@alpineQ can you rebase/merge in master, as the doc templates have been updated? |
|
@thompson-tomo any updates on this? |
thompson-tomo left a comment
Docs and definitions look good to me based on published guidance & clarification.
|
hi @alpineQ, this will need review and approval from @open-telemetry/semconv-system-approvers |
|
@trask do these @open-telemetry/semconv-system-approvers really exist or only you can see them? 🤣 |
|
@alpineQ Apologies for the delayed response. The group has been focused on delivering the first stable release of a subset of system metrics, and unfortunately this PR slipped through the cracks. I’ve also noticed that we’re attempting to add a memory pressure metric for Darwin as well (open-telemetry/opentelemetry-collector-contrib#45154). This made me wonder whether we could agree on a cross-platform, generic naming scheme for pressure metrics (for example, system.cpu.pressure). Since I’m not very familiar with how this concept is handled across other platforms, I’ve added this topic to the agenda for our next SIG meeting (08/01/2026) so we can discuss it together. |
|
@alpineQ in light of open-telemetry/opentelemetry-collector-contrib#45154, it appears memory pressure is also applicable to macOS. Should we split based on resource type, which would mean we end up with:
The io resource would become disk, network, or other, depending on what it refers to. This way these metrics are complementing
We then note in the description that it comes from PSI. |
|
As I've opened the original issue for the collector, I'd like to briefly chime in that from an end user perspective it would make sense to define the metrics similar to what they look like at the source. If it were my call I'd either go with I can see an argument with adding window as part of the name, e.g. stall_type could also be part of the metric name (e.g. Per my understanding, functionally this point is relevant for analytics back ends, where creating statistics across metrics can be handled very differently. For example, when you want to find the maximum across 10s and 60s or the maximum across cpu and memory, having this detail as part of the metric name can apparently complicate the required query language with some back ends. I'm very interested in learning of other arguments, and seeing how this is decided in the end. Thank you everyone who spends time and effort in making this whole thing possible. |
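To make the query-complexity point above concrete, here is a minimal Python sketch. It is not tied to any real back end: the `max_value` helper, the dict-based data points, and the sample values are all made up for illustration, and the `system.linux.psi.pressure.10s` / `.60s` names are a hypothetical "window in the metric name" layout, not something proposed in this PR. The stall type attribute is left out for brevity.

```python
from typing import Any, Dict, Iterable, List, Optional

# A data point is modelled as a plain dict: metric name, attributes, value.
Point = Dict[str, Any]


def max_value(points: List[Point], names: Iterable[str],
              match: Dict[str, str]) -> Optional[float]:
    """Return the largest value among points whose name is in `names` and
    whose attributes contain every key/value pair in `match`."""
    wanted = set(names)
    values = [
        p["value"]
        for p in points
        if p["name"] in wanted
        and all(p["attrs"].get(k) == v for k, v in match.items())
    ]
    return max(values) if values else None


# Layout A: one metric, window carried as an attribute (as in this PR).
by_attribute = [
    {"name": "system.linux.psi.pressure",
     "attrs": {"system.psi.resource": "cpu", "system.psi.window": "10s"}, "value": 4.0},
    {"name": "system.linux.psi.pressure",
     "attrs": {"system.psi.resource": "cpu", "system.psi.window": "60s"}, "value": 2.5},
    {"name": "system.linux.psi.pressure",
     "attrs": {"system.psi.resource": "memory", "system.psi.window": "10s"}, "value": 7.0},
    {"name": "system.linux.psi.pressure",
     "attrs": {"system.psi.resource": "memory", "system.psi.window": "60s"}, "value": 1.0},
]

# Layout B (hypothetical): window folded into the metric name.
by_name = [
    {"name": "system.linux.psi.pressure.10s",
     "attrs": {"system.psi.resource": "cpu"}, "value": 4.0},
    {"name": "system.linux.psi.pressure.60s",
     "attrs": {"system.psi.resource": "cpu"}, "value": 2.5},
]

# Maximum CPU pressure across windows: one metric name in layout A ...
cpu_max_a = max_value(by_attribute, ["system.linux.psi.pressure"],
                      {"system.psi.resource": "cpu"})
# ... but every per-window metric name must be listed in layout B.
cpu_max_b = max_value(by_name,
                      ["system.linux.psi.pressure.10s", "system.linux.psi.pressure.60s"],
                      {"system.psi.resource": "cpu"})
# Maximum across cpu and memory for the 10s window, layout A.
max_10s = max_value(by_attribute, ["system.linux.psi.pressure"],
                    {"system.psi.window": "10s"})
```

Both layouts give the same answer. The difference is that layout B forces the query to enumerate metric names, which is the kind of extra work some back ends make awkward.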
|
This topic was discussed during the System SemConv SIG on 08/01/2026. The resulting naming proposal combines the suggestions above:
|
@rogercoll an interesting question came up when looking at what changes are needed.
We have spoken about metric naming, but what hasn't been considered is attribute naming. What is the recommendation for attribute naming?
Does system.linux.psi.stall_type become:
- system.pressure.linux.stall_type?
Or do we need separate attributes for each resource type? I.e.
- system.memory.pressure.linux.stall_type?
Have raised https://github.com/open-telemetry/semantic-conventions/pull/3261/changes#r2678240965 to discuss this aspect.
I would say not to include the resource type in this case, as
What I would leave for discussion in https://github.com/open-telemetry/semantic-conventions/pull/3261/files#r2678240965 is the OS part in the attribute. (the attribute is attached to a metric which already shares the OS uniqueness) |
Co-authored-by: James Thompson <thompson.tomo@outlook.com>
Co-authored-by: James Thompson <thompson.tomo@outlook.com>
|
I'm sorry but I kind of lost track of what word nitpicking you are trying to implement here. Edits by maintainers are enabled for this fork. If you need more editing freedom, you are free to open a new PR and reuse changes defined here without me |
|
@alpineQ could you regenerate/update the docs based on the latest model changes? The open topic is attribute naming, in particular whether and where the OS name goes. This should be discussed in the linked issue. The options are:
|
|
I tried that to avoid leaving the work unfinished, but got errors referencing undefined new fields—likely due to incomplete renaming—so I gave up. |
thompson-tomo left a comment
@alpineQ this should hopefully allow the docs to update. If so, then go through the process of adding the 2 metrics to the cpu section and then adding the io section.
|
This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.total_time -->

Suggested change (replaces the line above):
<!-- semconv metric.system.memory.linux.pressure.total -->
|
This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.pressure -->

Suggested change (replaces the line above):
<!-- semconv metric.system.memory.linux.pressure.average -->
|
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

### Metric: `system.linux.psi.total_time`

Suggested change (replaces the line above):
### Metric: `system.memory.linux.pressure.total`
|
For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

### Metric: `system.linux.psi.pressure`

Suggested change (replaces the line above):
### Metric: `system.memory.linux.pressure.average`
|
## Linux PSI (Pressure Stall Information) metrics

**Description:** Linux Pressure Stall Information (PSI) metrics captured under the namespace `system.linux.psi`.

PSI is a Linux kernel feature (available since kernel 4.20) that identifies and
quantifies resource contention. It measures the time impact that resource
crunches have on workloads by tracking the percentage of time tasks are stalled
waiting for CPU, memory, or I/O resources.

PSI helps in:

- Sizing workloads to hardware or provisioning hardware according to workload demand
- Detecting productivity losses caused by resource scarcity
- Dynamic system management (load shedding, job migration, strategic pausing)
- Maximizing hardware utilization without sacrificing workload health

For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

Suggested change: remove the lines above.
This should be on the metrics if not already.
|
- [Linux PSI (Pressure Stall Information) metrics](#linux-psi-pressure-stall-information-metrics)
- [Metric: `system.linux.psi.pressure`](#metric-systemlinuxpsipressure)
- [Metric: `system.linux.psi.total_time`](#metric-systemlinuxpsitotal_time)

Suggested change (replaces the entries above):
- [Metric: `system.memory.linux.pressure.average`](#metric-systemmemorylinuxpressureaverage)
- [Metric: `system.memory.linux.pressure.total`](#metric-systemmemorylinuxpressuretotal)
|
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days. |
|
@alpineQ Would you still be able to work on this PR and revisit the suggestions (#3068 (comment))? |
|
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days. |
Closes #2995
Changes
This PR adds support for Linux Pressure Stall Information (PSI) metrics to the system semantic conventions.
PSI is a Linux kernel feature (available since kernel 4.20) that identifies and quantifies resource contention by measuring the time impact that CPU, memory, and I/O resource crunches have on workloads.
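For readers unfamiliar with the source data, here is a minimal Python sketch of how the kernel exposes PSI, assuming the `/proc/pressure/<resource>` text format described in the kernel PSI documentation (one `some` line and, where available, one `full` line, each with `avg10`, `avg60`, `avg300`, and `total` fields). The `parse_psi` and `read_psi` helpers and the sample numbers are illustrative only and are not part of this PR or of any collector implementation.

```python
from typing import Any, Dict, List

# Illustrative contents of a /proc/pressure/<resource> file (values made up).
SAMPLE = """\
some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=7890
"""


def parse_psi(text: str) -> List[Dict[str, Any]]:
    """Parse PSI text into one record per stall type ("some" / "full")."""
    records = []
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        stall_type, fields = parts[0], parts[1:]
        values = dict(field.split("=", 1) for field in fields)
        records.append({
            "stall_type": stall_type,
            # Averages are percentages of time stalled over each window.
            "avg": {
                "10s": float(values["avg10"]),
                "60s": float(values["avg60"]),
                "300s": float(values["avg300"]),
            },
            # Cumulative stall time in microseconds.
            "total_us": int(values["total"]),
        })
    return records


def read_psi(resource: str) -> List[Dict[str, Any]]:
    """Read and parse PSI for "cpu", "memory", or "io" on a Linux host
    with PSI enabled (raises OSError if the file is not available)."""
    with open(f"/proc/pressure/{resource}", encoding="ascii") as handle:
        return parse_psi(handle.read())


records = parse_psi(SAMPLE)
```

Each record's three window averages would correspond to gauge data points, and `total_us` to the cumulative counter, with the resource, stall type, and window carried as attributes.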
New Metrics
- system.linux.psi.pressure (Gauge): Measures resource pressure as a percentage of time that tasks were stalled over a time window (10s, 60s, or 300s)
- system.linux.psi.total_time (Counter): Tracks the total cumulative stall time in microseconds since system boot
New Attributes
- system.psi.resource: The resource type (cpu, memory, io)
- system.psi.stall_type: The stall severity (some for partial stalls, full for complete stalls where all non-idle tasks are blocked)
- system.psi.window: The time window for pressure calculation (10s, 60s, 300s)
Use Cases
PSI metrics enable:
References
Relevant issues and PRs
There are issues on this matter in:
And 2 PRs that I am proposing to address these issues:
Merge requirement checklist
[chore]
Reopened #2996