Add Pressure Stall Information (PSI) metrics (reopened #2996) #3068

Closed
alpineQ wants to merge 14 commits into open-telemetry:main from alpineQ:main

Conversation

@alpineQ

@alpineQ alpineQ commented Nov 11, 2025

Closes #2995

Changes

This PR adds support for Linux Pressure Stall Information (PSI) metrics to the system semantic conventions.

PSI is a Linux kernel feature (available since kernel 4.20) that identifies and quantifies resource contention by measuring the time impact that CPU, memory, and I/O resource crunches have on workloads.

New Metrics

  • system.linux.psi.pressure (Gauge): Measures resource pressure as a percentage of time that tasks were stalled over a time window (10s, 60s, or 300s)
  • system.linux.psi.total_time (Counter): Tracks the total cumulative stall time in microseconds since system boot

New Attributes

  • system.psi.resource: The resource type (cpu, memory, io)
  • system.psi.stall_type: The stall severity (some for partial stalls, full for complete stalls where all non-idle tasks are blocked)
  • system.psi.window: The time window for pressure calculation (10s, 60s, 300s)
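
As a rough illustration of how the two metrics and three attributes listed above could be derived from the kernel's PSI files, here is a minimal Python sketch. It assumes the `/proc/pressure/<resource>` line format described in the kernel PSI documentation (`some avg10=... avg60=... avg300=... total=...`). The `PsiPoint` type, the `parse_psi` helper, and the sample values are hypothetical and not part of this PR.

```python
from typing import NamedTuple


class PsiPoint(NamedTuple):
    """One data point: metric name, value, and attributes (illustrative type)."""
    metric: str
    value: float
    attributes: dict


def parse_psi(resource: str, text: str) -> list[PsiPoint]:
    """Turn the contents of /proc/pressure/<resource> into data points."""
    points = []
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.12 avg60=0.05 avg300=0.01 total=123456"
        stall_type, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        base = {"system.psi.resource": resource, "system.psi.stall_type": stall_type}
        # One gauge point per averaging window.
        for key, window in (("avg10", "10s"), ("avg60", "60s"), ("avg300", "300s")):
            points.append(PsiPoint(
                "system.linux.psi.pressure",
                float(values[key]),
                {**base, "system.psi.window": window},
            ))
        # One counter point for the cumulative stall time in microseconds.
        points.append(PsiPoint("system.linux.psi.total_time", int(values["total"]), dict(base)))
    return points


# Made-up sample contents of /proc/pressure/memory.
SAMPLE = """\
some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=7890
"""

points = parse_psi("memory", SAMPLE)
```

Each input line yields three gauge points (one per window) and one counter point, so the attribute set on the gauge is a superset of the one on the counter.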

Use Cases

PSI metrics enable:

  • Sizing workloads to hardware or provisioning hardware according to workload demand
  • Detecting productivity losses caused by resource scarcity
  • Dynamic system management (load shedding, job migration, strategic pausing)
  • Maximizing hardware utilization without sacrificing workload health

References

Relevant issues and PRs

There are issues on this matter in:

And 2 PRs that I am proposing to address these issues:

Important

Pull request acceptance is subject to the triage process as described in Issue and PR Triage Management.
PRs that do not follow the guidance above may be automatically rejected and closed.

Merge requirement checklist

  • CONTRIBUTING.md guidelines followed.
  • Change log entry added, according to the guidelines in When to add a changelog entry.
    • If your PR does not need a change log, start the PR title with [chore]
  • Links to the prototypes or existing instrumentations (when adding or changing conventions)

Reopened #2996

alpineQ and others added 7 commits November 11, 2025 10:10
@lmolkova lmolkova moved this from Untriaged to Awaiting codeowners approval in Semantic Conventions Triage Nov 20, 2025
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions Bot added the Stale label Nov 26, 2025
@alpineQ
Author

alpineQ commented Nov 27, 2025

@thompson-tomo @braydonk @trask
Issue #2996 was reopened here. If any additional changes are needed, I'm open to suggestions.

@thompson-tomo
Contributor

@alpineQ can you rebase/merge in master, as the doc templates have been updated?

@github-actions github-actions Bot removed the Stale label Nov 28, 2025
@alpineQ
Author

alpineQ commented Dec 1, 2025

@thompson-tomo any updates on this?

Contributor

@thompson-tomo thompson-tomo left a comment

Docs and definitions look good to me based on published guidance & clarification.

@trask
Member

trask commented Dec 1, 2025

hi @alpineQ, this will need review and approval from @open-telemetry/semconv-system-approvers

@alpineQ
Author

alpineQ commented Dec 14, 2025

@trask do these @open-telemetry/semconv-system-approvers really exist, or can only you see them? 🤣

@rogercoll
Contributor

@alpineQ Apologies for the delayed response. The group has been focused on delivering the first stable release of a subset of system metrics, and unfortunately this PR slipped through the cracks.

I’ve also noticed that we’re attempting to add a memory pressure metric for Darwin as well (open-telemetry/opentelemetry-collector-contrib#45154). This made me wonder whether we could agree on a cross-platform, generic naming scheme for pressure metrics (for example, system.cpu.pressure).

Since I’m not very familiar with how this concept is handled across other platforms, I’ve added this topic to the agenda for our next SIG meeting (08/01/2026) so we can discuss it together.

@rogercoll rogercoll removed the Stale label Dec 29, 2025
@thompson-tomo
Contributor

@alpineQ in light of open-telemetry/opentelemetry-collector-contrib#45154, it appears memory pressure is also applicable to macOS.

Should we split based on resource type, which would mean we end up with:

  • system.cpu.pressure.linux.ratio
  • system.cpu.pressure.linux.total_time
  • system.memory.pressure.linux.ratio
  • system.memory.pressure.linux.total_time

io would become disk, network, or something else, depending on what it refers to.

This way these metrics complement

  • system.memory.pressure.darwin.status

We then note in the description that the data comes from PSI.

@jeffland-consist

As I've opened the original issue for the collector, I'd like to briefly chime in that from an end user perspective it would make sense to define the metrics similar to what they look like at the source. If it were my call I'd either go with system.linux.(cpu|memory|io).pressure.(ratio|total_time) and attributes for stall_type and window, or with system.linux.pressure.(ratio|total_time) and attributes for resource, stall_type and window. The former is a little more in line with how existing metrics are formatted, while the latter may be better suited for analytics (see last paragraph).

I can see an argument for adding window as part of the name, e.g. system.linux.cpu.pressure.ratio.10s, as it would be analogous to the existing system.linux.cpu.load_average.5m. However, if my understanding is correct, for load average this was primarily done to adopt the pre-existing thought pattern/vocabulary in that there are three distinct "load averages", and I am not sure if the same way of thought applies to PSI. Naively, it would make more sense to me to have the time window as an attribute with load average too.

stall_type could also be part of the metric name (e.g. system.linux.cpu.pressure.ratio.some), but I'm not enough of a sysadmin to have a strong opinion on that difference.

Per my understanding, functionally this point is relevant for analytics back ends, where creating statistics across metrics can be handled very differently. For example, when you want to find the maximum across 10s and 60s or the maximum across cpu and memory, having this detail as part of the metric name can apparently complicate the required query language with some back ends.

I'm very interested in learning of other arguments, and seeing how this is decided in the end. Thank you everyone who spends time and effort in making this whole thing possible.
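
To make the analytics point above concrete, here is a small Python sketch comparing the two shapes. The data points and values are hypothetical, and plain tuples stand in for whatever a real back end stores. The metric names follow the examples given in this comment.

```python
# Hypothetical data points: (metric name, attributes, value).
as_attribute = [
    ("system.linux.cpu.pressure.ratio", {"stall_type": "some", "window": "10s"}, 4.2),
    ("system.linux.cpu.pressure.ratio", {"stall_type": "some", "window": "60s"}, 1.5),
]
in_name = [
    ("system.linux.cpu.pressure.ratio.10s", {"stall_type": "some"}, 4.2),
    ("system.linux.cpu.pressure.ratio.60s", {"stall_type": "some"}, 1.5),
]

# Window as an attribute: one metric name, aggregate over all of its points.
max_attr = max(
    value for name, attrs, value in as_attribute
    if name == "system.linux.cpu.pressure.ratio"
)

# Window in the name: the query has to enumerate (or pattern-match) metric names.
wanted = {
    "system.linux.cpu.pressure.ratio.10s",
    "system.linux.cpu.pressure.ratio.60s",
}
max_name = max(value for name, attrs, value in in_name if name in wanted)
```

Both queries return the same maximum; the difference is only in how many metric names the query has to know about.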

@rogercoll
Contributor

This topic was discussed during the System SemConv SIG on 08/01/2026. The resulting naming proposal combines the suggestions above:

  1. Split by resource type: As suggested by @thompson-tomo, metrics should start by defining the relevant system area: system.{cpu/memory/disk...}.
  2. Include OS for specific features: Since psi is a Linux-only feature, we should use the OS name to separate the resource (system.memory) from the OS-specific technology, per the design philosophy docs: system.cpu.linux.
  3. Standardize on pressure: To avoid redundancy, we should use either psi or pressure, but not both. The group preferred pressure to allow for future cross-OS terminology: system.cpu.linux.pressure.
  4. Window as part of the metric name (Under evaluation, cc @braydonk): As @jeffland-consist pointed out, the window should be part of the metric rather than an attribute. This aligns with general guidelines stating that aggregations over all the attributes... SHOULD be meaningful. This is analogous to: system.linux.cpu.load_average.5m:
    • system.cpu.linux.pressure_average.{10s/1m/5m}
    • system.cpu.linux.pressure.total
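
The naming scheme proposed above can be sketched as a small Python helper. This is only an illustration of the pattern as listed in this comment, which is still under evaluation. The function name and the mapping from the kernel's `avg10`/`avg60`/`avg300` fields to the `10s`/`1m`/`5m` suffixes are assumptions.

```python
# Kernel PSI field -> window suffix used in the proposal (10s/1m/5m); assumed mapping.
WINDOW_SUFFIX = {"avg10": "10s", "avg60": "1m", "avg300": "5m"}


def proposed_metric_name(resource: str, field: str) -> str:
    """Build a metric name following the proposal: system.<resource>.linux.<...>."""
    prefix = f"system.{resource}.linux"
    if field == "total":
        return f"{prefix}.pressure.total"
    return f"{prefix}.pressure_average.{WINDOW_SUFFIX[field]}"


names = [proposed_metric_name("cpu", f) for f in ("avg10", "avg60", "avg300", "total")]
```

With this shape, `stall_type` remains the only attribute on each metric, so summing or averaging across attribute values does not mix different windows.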

Contributor

@thompson-tomo thompson-tomo left a comment

@rogercoll an interesting question came up when looking at what changes are needed.

We have spoken about metric naming, but what hasn't been considered is attribute naming. What is the recommendation for attribute naming?

Does system.linux.psi.stall_type become:

  • system.pressure.linux.stall_type?

Or do we need separate attributes for each resource type? I.e.

  • system.memory.pressure.linux.stall_type?

Have raised https://github.com/open-telemetry/semantic-conventions/pull/3261/changes#r2678240965 to discuss this aspect.

Comment thread model/system/metrics.yaml
Comment thread model/system/registry.yaml Outdated
Comment thread model/system/registry.yaml Outdated
@github-project-automation github-project-automation Bot moved this from Awaiting codeowners approval to Blocked in Semantic Conventions Triage Jan 10, 2026
@rogercoll
Contributor

rogercoll commented Jan 12, 2026

> @rogercoll an interesting question came up when looking at what changes are needed.
>
> We have spoken about metric naming, but what hasn't been considered is attribute naming. What is the recommendation for attribute naming?
>
> Does system.linux.psi.stall_type become:
>
> * system.pressure.linux.stall_type?
>
> Or do we need separate attributes for each resource type? I.e.
>
> * system.memory.pressure.linux.stall_type?
>
> Have raised https://github.com/open-telemetry/semantic-conventions/pull/3261/changes#r2678240965 to discuss this aspect.

I would say not to include the resource type in this case, as the possible stall_type values are shared across resources, i.e.:

system.pressure.stall_type

What I would leave for discussion in https://github.com/open-telemetry/semantic-conventions/pull/3261/files#r2678240965 is the OS part of the attribute name (the attribute is attached to a metric whose name already identifies the OS).

alpineQ and others added 2 commits January 18, 2026 14:17
Co-authored-by: James Thompson <thompson.tomo@outlook.com>
Co-authored-by: James Thompson <thompson.tomo@outlook.com>
@alpineQ
Author

alpineQ commented Jan 18, 2026

I'm sorry but I kind of lost track of what word nitpicking you are trying to implement here. Edits by maintainers are enabled for this fork. If you need more editing freedom, you are free to open a new PR and reuse changes defined here without me

@thompson-tomo
Contributor

thompson-tomo commented Jan 19, 2026

@alpineQ could you regenerate/update the docs based on the latest model changes?

The open topic is attribute naming, in particular whether and where the OS name goes. This should be discussed in the linked issue.

The options are:

  • system.pressure.stall_type
  • system.pressure.linux.stall_type
  • system.linux.pressure.stall_type

@alpineQ
Author

alpineQ commented Jan 19, 2026

I tried that to avoid leaving the work unfinished, but got errors referencing undefined new fields—likely due to incomplete renaming—so I gave up.

Contributor

@thompson-tomo thompson-tomo left a comment

@alpineQ this should hopefully allow the docs to update. If so, then go through the process of adding the 2 metrics to the cpu section and then adding the io section.


This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.total_time -->
Contributor

Suggested change
- <!-- semconv metric.system.linux.psi.total_time -->
+ <!-- semconv metric.system.memory.linux.pressure.total -->


This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.pressure -->
Contributor

Suggested change
- <!-- semconv metric.system.linux.psi.pressure -->
+ <!-- semconv metric.system.memory.linux.pressure.average -->

<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

### Metric: `system.linux.psi.total_time`
Contributor

Suggested change
- ### Metric: `system.linux.psi.total_time`
+ ### Metric: `system.memory.linux.pressure.total`


For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

### Metric: `system.linux.psi.pressure`
Contributor

Suggested change
- ### Metric: `system.linux.psi.pressure`
+ ### Metric: `system.memory.linux.pressure.average`

Comment on lines +1204 to +1221

## Linux PSI (Pressure Stall Information) metrics

**Description:** Linux Pressure Stall Information (PSI) metrics captured under the namespace `system.linux.psi`.

PSI is a Linux kernel feature (available since kernel 4.20) that identifies and
quantifies resource contention. It measures the time impact that resource
crunches have on workloads by tracking the percentage of time tasks are stalled
waiting for CPU, memory, or I/O resources.

PSI helps in:

- Sizing workloads to hardware or provisioning hardware according to workload demand
- Detecting productivity losses caused by resource scarcity
- Dynamic system management (load shedding, job migration, strategic pausing)
- Maximizing hardware utilization without sacrificing workload health

For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
Contributor

Suggested change
## Linux PSI (Pressure Stall Information) metrics
**Description:** Linux Pressure Stall Information (PSI) metrics captured under the namespace `system.linux.psi`.
PSI is a Linux kernel feature (available since kernel 4.20) that identifies and
quantifies resource contention. It measures the time impact that resource
crunches have on workloads by tracking the percentage of time tasks are stalled
waiting for CPU, memory, or I/O resources.
PSI helps in:
- Sizing workloads to hardware or provisioning hardware according to workload demand
- Detecting productivity losses caused by resource scarcity
- Dynamic system management (load shedding, job migration, strategic pausing)
- Maximizing hardware utilization without sacrificing workload health
For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

This should be on the metrics if not already.

Comment on lines +63 to +65
- [Linux PSI (Pressure Stall Information) metrics](#linux-psi-pressure-stall-information-metrics)
- [Metric: `system.linux.psi.pressure`](#metric-systemlinuxpsipressure)
- [Metric: `system.linux.psi.total_time`](#metric-systemlinuxpsitotal_time)
Contributor

Suggested change
- - [Linux PSI (Pressure Stall Information) metrics](#linux-psi-pressure-stall-information-metrics)
- - [Metric: `system.linux.psi.pressure`](#metric-systemlinuxpsipressure)
- - [Metric: `system.linux.psi.total_time`](#metric-systemlinuxpsitotal_time)
+ - [Metric: `system.memory.linux.pressure.average`](#metric-systemmemorylinuxpressureaverage)
+ - [Metric: `system.memory.linux.pressure.total`](#metric-systemmemorylinuxpressuretotal)

@github-actions

github-actions Bot commented Feb 3, 2026

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

@github-actions github-actions Bot added the Stale label Feb 3, 2026
@rogercoll
Copy link
Copy Markdown
Contributor

@alpineQ Would you still be able to work on this PR and revisit the suggestions (#3068 (comment))?

@github-actions github-actions Bot removed the Stale label Feb 6, 2026
@github-actions

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

Development

Successfully merging this pull request may close these issues.

Add Pressure Stall Information (PSI) metrics

6 participants