Commit bd693a6

sradco and Cursor AI Agent committed

docs/ci/e2e: add alert management documentation, CI workflow, and e2e tests

Signed-off-by: avlitman <alitman@redhat.com>
Signed-off-by: Shirly Radco <sradco@redhat.com>
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
Co-authored-by: Cursor AI Agent <cursor-ai@users.noreply.github.com>

1 parent e1a7bdb commit bd693a6

6 files changed, 995 additions & 0 deletions

File tree

.github/workflows/unit-tests.yaml

Lines changed: 21 additions & 0 deletions
```yaml
name: Unit Tests

on:
  pull_request:
    branches:
      - add-alert-management-api-base

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod

      - name: Run tests
        run: go test -count=1 $(go list ./... | grep -v /test/e2e)
```

docs/alert-management.md

Lines changed: 41 additions & 0 deletions
## Alert Management Notes

This document covers alert management behavior and prerequisites for the monitoring plugin.

### User workload monitoring prerequisites

To include **user workload** alerts and rules in `/api/v1/alerting/alerts` and `/api/v1/alerting/rules`, the user workload monitoring stack must be enabled. Follow the OpenShift documentation for enabling and configuring UWM:

https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.20/html/configuring_user_workload_monitoring/configuring-alerts-and-notifications-uwm

#### How the plugin reads user workload alerts/rules

The plugin prefers **Thanos tenancy** for user workload alerts/rules (RBAC-scoped, requires a namespace parameter). When the client does not provide a `namespace` filter, the plugin discovers candidate namespaces and queries Thanos tenancy per namespace, using the end-user bearer token.

Routes in `openshift-user-workload-monitoring` are treated as **fallbacks** (and are also used for some health checks and pending-state retrieval).

If you want to create the user workload Prometheus route (optional), you can expose the service:

```shell
oc -n openshift-user-workload-monitoring expose svc/prometheus-user-workload-web --name=prometheus-user-workload-web --port=web
```

If the route is missing or unreachable but tenancy is healthy, the plugin should still return user workload data and suppress route warnings.
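The per-namespace tenancy fan-out described above can be sketched as follows. This is an illustrative sketch: the base URL, the `namespace` query parameter name, and the function name are assumptions, not the plugin's verbatim endpoints.

```go
package main

import (
	"fmt"
	"net/url"
)

// tenancyAlertURLs builds one tenancy-scoped alerts URL per candidate
// namespace, for use when the client supplies no namespace filter.
// Each resulting request would carry the end-user bearer token so that
// RBAC is enforced per namespace.
func tenancyAlertURLs(base string, namespaces []string) []string {
	urls := make([]string, 0, len(namespaces))
	for _, ns := range namespaces {
		urls = append(urls, fmt.Sprintf("%s/api/v1/alerts?namespace=%s", base, url.QueryEscape(ns)))
	}
	return urls
}

func main() {
	for _, u := range tenancyAlertURLs("https://thanos-querier.openshift-monitoring.svc:9092", []string{"ns-a", "ns-b"}) {
		fmt.Println(u)
	}
}
```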
#### Alert states

- `/api/v1/alerting/alerts?state=pending`: pending alerts come from Prometheus.
- `/api/v1/alerting/alerts?state=firing`: firing alerts come from Alertmanager when available.
- `/api/v1/alerting/alerts?state=silenced`: silenced alerts come from Alertmanager (requires an Alertmanager endpoint).

### Alertmanager routing choices

OpenShift supports routing user workload alerts to:

- The **platform Alertmanager** (default instance)
- A **separate Alertmanager** for user workloads
- **External Alertmanager** instances

This is a cluster configuration choice and does not change the plugin API shape. The plugin reads alerts from Alertmanager (for firing/silenced) and Prometheus (for pending), then merges platform and user workload results when available.

The plugin intentionally reads only from the in-cluster Alertmanager endpoints. Supporting multiple external Alertmanagers would introduce ambiguous alert state and silencing outcomes, because each instance can apply different routing, inhibition, and silence configurations.

docs/alert-rule-classification.md

Lines changed: 213 additions & 0 deletions
# Alert Rule Classification - Design and Usage

## Overview
The backend classifies Prometheus alerting rules into a “component” and an “impact layer”. It:
- Computes an `openshift_io_alert_rule_id` per alerting rule.
- Determines component/layer based on matcher logic and rule labels.
- Allows users to override classification via a single, fixed-name ConfigMap per namespace.
- Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`.

This document explains how classification works, how to override it, and how to test it.

## Terminology
- `openshift_io_alert_rule_id`: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` + base64url(nopad(sha256(payload))). Independent of the `PrometheusRule` name.
- component: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, a namespace, etc.).
- layer: Impact scope. Allowed values:
  - `cluster`
  - `namespace`

Notes:
- **Stability**:
  - The id is **always derived from the rule spec**. If the rule definition changes (expr/for/business labels/name), the id may change.
  - For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition.
  - For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id.
- Layer values are validated as `cluster|namespace` when set. To remove an override, clear the field (via API `null` or by removing the ConfigMap entry); empty/invalid values are ignored at read time.

## Rule ID computation (openshift_io_alert_rule_id)
Location: `pkg/alert_rule/alert_rule.go`

The backend computes a specHash-like value from:
- `kind`/`name`: `alert:<name>` for alerting rules or `record:<name>` for recording rules
- `expr`: trimmed, with consecutive whitespace collapsed
- `for`: trimmed (duration string as written in the rule)
- `labels`: only non-system labels
  - excludes labels with the `openshift_io_` prefix and the `alertname` label
  - drops empty values
  - keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`)
  - sorted by key and joined as `key=value` lines

Annotations are intentionally ignored to reduce id churn on documentation-only changes.
## Classification Logic (How component/layer are determined)
Location: `pkg/alertcomponent/matcher.go`

1) The code adapts `cluster-health-analyzer` matchers:
   - CVO-related alerts (update/upgrade) → component/layer based on known patterns
   - Compute / node-related alerts
   - Core control plane components (mapped to layer `cluster`)
   - Workload/namespace-level alerts (mapped to layer `namespace`)

2) Fallback:
   - If the computed component is empty or “Others”, we set:
     - `component = other`
     - `layer` derived from source:
       - `openshift_io_alert_source=platform` → `cluster`
       - `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster`
       - `prometheus` label starting with `openshift-monitoring/` → `cluster`
       - otherwise → `namespace`

3) Result:
   - Each alerting rule is assigned a `(component, layer)` tuple following the above logic.
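The fallback in step 2 can be sketched as a small function; the function name is illustrative, but the three label checks and the default are taken directly from the rules above.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackLayer sketches the layer derivation used when the matcher
// yields an empty or "Others" component: platform-sourced rules map to
// the cluster layer, everything else defaults to namespace.
func fallbackLayer(labels map[string]string) string {
	switch {
	case labels["openshift_io_alert_source"] == "platform":
		return "cluster"
	case labels["openshift_io_prometheus_rule_namespace"] == "openshift-monitoring":
		return "cluster"
	case strings.HasPrefix(labels["prometheus"], "openshift-monitoring/"):
		return "cluster"
	default:
		return "namespace"
	}
}

func main() {
	fmt.Println(fallbackLayer(map[string]string{"openshift_io_alert_source": "platform"})) // cluster
	fmt.Println(fallbackLayer(map[string]string{"prometheus": "team-x/prom"}))             // namespace
}
```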
## Developer Overrides via Rule Labels (Recommended)
If you want explicit component/layer values and do not want to rely on the matcher, set these labels on each rule in your `PrometheusRule`:
- `openshift_io_alert_rule_component`
- `openshift_io_alert_rule_layer`

Both are validated the same way as API overrides:
- `component`: 1-253 chars, alphanumeric plus `._-`, must start and end with an alphanumeric character
- `layer`: `cluster` or `namespace`

When these labels are present and valid, they override matcher-derived values.

## User Overrides (ConfigMap)
Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go`

- The backend stores overrides in the plugin namespace, sharded by target rule namespace:
  - Name: `alert-classification-overrides-<rule-namespace>`
  - Namespace: the monitoring plugin's namespace
  - Required label:
    - `monitoring.openshift.io/type=alert-classification-overrides`
  - Recommended label:
    - `app.kubernetes.io/managed-by=openshift-console`

- Data layout:
  - Key: base64url(nopad(UTF-8 bytes of `<openshift_io_alert_rule_id>`))
    - This keeps ConfigMap keys opaque and avoids relying on any particular id character set.
  - Value: JSON object with a `classification` field that holds component/layer.
    - Optional metadata fields such as `alertName`, `prometheusRuleName`, and `prometheusRuleNamespace` may be included for readability; they are ignored by the backend.
- Dynamic overrides:
  - `openshift_io_alert_rule_component_from`: derive the component from an alert label key.
  - `openshift_io_alert_rule_layer_from`: derive the layer from an alert label key.

Example:
```json
{
  "alertName": "ClusterOperatorDown",
  "prometheusRuleName": "cluster-version",
  "prometheusRuleNamespace": "openshift-cluster-version",
  "classification": {
    "openshift_io_alert_rule_component_from": "name",
    "openshift_io_alert_rule_layer": "cluster"
  }
}
```

Notes:
- Overrides are only read when the required `monitoring.openshift.io/type` label is present.
- Invalid component/layer values are ignored for that entry.
- `*_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`).
- If a `*_from` label is present but the alert does not carry that label, or the derived value is invalid, the backend falls back to static values (if present) or defaults.
- If both component and layer are empty, the entry is removed.
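The opaque data key is just the rule id run through unpadded base64url, as described above; a minimal round-trip sketch (helper names are illustrative):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// overrideKey encodes a rule id into the opaque ConfigMap data key:
// base64url over the UTF-8 bytes, without padding.
func overrideKey(ruleID string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(ruleID))
}

// decodeKey reverses overrideKey, recovering the rule id.
func decodeKey(key string) (string, error) {
	b, err := base64.RawURLEncoding.DecodeString(key)
	return string(b), err
}

func main() {
	k := overrideKey("rid_example")
	fmt.Println(k)
	id, _ := decodeKey(k)
	fmt.Println(id) // rid_example
}
```

Because the encoding is reversible, tools inspecting the ConfigMap can always map a key back to its rule id without any extra lookup.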
## Alerts API Enrichment
Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go`

- Endpoint: `GET /api/v1/alerting/alerts` (Prometheus-compatible schema)
- The backend fetches active alerts and enriches each alert with:
  - `openshift_io_alert_rule_id`
  - `openshift_io_alert_component`
  - `openshift_io_alert_layer`
  - `prometheusRuleName`: name of the PrometheusRule resource the alert originates from
  - `prometheusRuleNamespace`: namespace of that PrometheusRule resource
  - `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR)
- Prometheus compatibility:
  - The base response matches Prometheus `/api/v1/alerts`.
  - Additional fields are additive and safe for clients like Perses.

## Prometheus/Thanos Sources
Location: `pkg/k8s/prometheus_alerts.go`

- Order of candidates:
  1) Thanos Route `thanos-querier` at `/api` + `/v1/alerts` (oauth-proxied)
  2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts`
  3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts`
  4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback)
  5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts`

- TLS and Auth:
  - Bearer token: service account token from the in-cluster config.
  - CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`.

RBAC:
- Read routes in `openshift-monitoring`.
- Access `prometheuses/api` as needed for oauth-proxied endpoints.
## Updating Rules Classification
APIs:
- Single update:
  - Method: `PATCH /api/v1/alerting/rules/{ruleId}`
  - Request body:
    ```json
    {
      "classification": {
        "openshift_io_alert_rule_component": "team-x",
        "openshift_io_alert_rule_layer": "namespace",
        "openshift_io_alert_rule_component_from": "name",
        "openshift_io_alert_rule_layer_from": "layer"
      }
    }
    ```
  - `openshift_io_alert_rule_layer`: `cluster` or `namespace`
  - To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`).
  - Response:
    - 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success.
    - Standard error body on failure (400 validation, 404 not found, etc.)
- Bulk update:
  - Method: `PATCH /api/v1/alerting/rules`
  - Request body:
    ```json
    {
      "ruleIds": ["<id-a>", "<id-b>"],
      "classification": {
        "openshift_io_alert_rule_component": "etcd",
        "openshift_io_alert_rule_layer": "cluster"
      }
    }
    ```
  - Response:
    - 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures.
Direct K8s (supported for power users/GitOps):
- PATCH/PUT the ConfigMap `alert-classification-overrides-<rule-namespace>` in the monitoring plugin namespace (respect `resourceVersion`).
- Each entry is keyed by base64url(`<openshift_io_alert_rule_id>`) with a JSON payload that contains a `classification` object (`openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`).
- The UI should check update permissions with a SelfSubjectAccessReview before showing an editor.

Notes:
- These endpoints are intended for updating **classification only** (component/layer overrides), with permissions enforced based on the rule's ownership (platform, user workload, operator-managed, GitOps-managed).
- To update other rule fields (expr/labels/annotations/etc.), use `PATCH /api/v1/alerting/rules/{ruleId}`. Clients that need to update both should issue two requests. The combined operation is not atomic.
- In the ConfigMap override entries, classification is nested under `classification` and validated as component/layer to keep it separate from generic label updates.

## Security Notes
- Persist only minimal classification metadata in the fixed-name ConfigMap.

## Testing and Ops
Unit tests:
- `pkg/management/get_alerts_test.go`
  - Overrides from a labeled ConfigMap, fallback behavior, label validation.

## Future Work
- Optional CRD to formalize the schema (adds overhead; the ConfigMap is sufficient today).
- Optional composite update API if we need to update rule fields and classification atomically.
- De-duplication/merge logic when aggregating alerts across sources.
