Commit 76d0f7e

SamBarker and claude committed
Strengthen observability section: lifecycle state is public API
Replace vague "implementation concern" deferral with concrete commitments:
- Management endpoint must expose per-cluster state and failure reason
- Metrics must capture current state, time in state, and transition count (with suggested names following Prometheus/Micrometer conventions)
- Proposal is intentionally non-exhaustive; implementations may add more

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
1 parent cbc15c5 commit 76d0f7e

1 file changed

Lines changed: 10 additions & 1 deletion

proposals/016-virtual-cluster-lifecycle.md

@@ -113,7 +113,16 @@ Graceful draining reduces unnecessary client errors during planned shutdowns, an
 
 ### Observability
 
-Cluster lifecycle state should be observable — through management endpoints, logging, or metrics — so that operators and tooling can determine which clusters are accepting connections, which have failed, and why. The specific reporting mechanism is an implementation concern and not prescribed by this proposal.
+Cluster lifecycle state is public API. Two mechanisms must be provided:
+
+**Management endpoint**: a queryable endpoint returning the current state and failure reason (where applicable) of each virtual cluster, for on-demand inspection by operators and tooling.
+
+**Metrics**: at a minimum, metrics should capture:
+- Current state of each virtual cluster (e.g. `kroxylicious_virtual_cluster_state`)
+- Time spent in the current state (e.g. `kroxylicious_virtual_cluster_state_duration_seconds`) — enables alerting on clusters stuck in `failed` or `initializing`
+- Total state transitions per cluster (e.g. `kroxylicious_virtual_cluster_transitions_total`) — enables detection of instability or flapping
+
+Implementations may expose additional metrics. Metric names and endpoint paths are confirmed in the implementation and documented as public API at that point.
 
 
 ## Affected/not affected projects
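
Reviewer sketch (not part of the commit): the added Observability text can be read as one small per-cluster tracker that backs both the management endpoint and the three suggested metrics. The Java below is a minimal illustration under stated assumptions, not the Kroxylicious implementation. `ClusterState`, `ClusterLifecycleTracker`, the `virtual_cluster` label, and every state other than `initializing` and `failed` are hypothetical names. It renders Prometheus-style text by hand instead of using Micrometer so that it needs only the JDK, and it encodes current state as one 0/1 sample per state, which is one common choice and not something the proposal prescribes.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.LongSupplier;

// Hypothetical state set: only `initializing` and `failed` are named in the diff above.
enum ClusterState { INITIALIZING, SERVING, DRAINING, FAILED, STOPPED }

// Tracks, for one virtual cluster, the data the proposal asks to expose.
final class ClusterLifecycleTracker {
    private final String clusterName;
    private final LongSupplier nanoClock;   // injectable so tests can control time
    private ClusterState state = ClusterState.INITIALIZING;
    private String failureReason;           // non-null only while FAILED
    private long enteredAtNanos;
    private long transitions;               // entering the initial state is not counted

    ClusterLifecycleTracker(String clusterName, LongSupplier nanoClock) {
        this.clusterName = clusterName;
        this.nanoClock = nanoClock;
        this.enteredAtNanos = nanoClock.getAsLong();
    }

    synchronized void transitionTo(ClusterState next, String reason) {
        state = next;
        failureReason = (next == ClusterState.FAILED) ? reason : null;
        enteredAtNanos = nanoClock.getAsLong();
        transitions++;
    }

    // What a management endpoint could report for this cluster.
    synchronized ClusterState state() { return state; }
    synchronized Optional<String> failureReason() { return Optional.ofNullable(failureReason); }

    synchronized long transitions() { return transitions; }

    synchronized double secondsInState() {
        return (nanoClock.getAsLong() - enteredAtNanos) / 1e9;
    }

    // Renders the three suggested metrics as Prometheus-style text lines.
    // Label values are not escaped here; a real exporter would need to do that.
    synchronized String toPrometheusText() {
        String label = "virtual_cluster=\"" + clusterName + "\""; // label name is an assumption
        StringBuilder out = new StringBuilder();
        for (ClusterState s : ClusterState.values()) {
            // One 0/1 sample per state; exactly one sample is 1 at any time.
            out.append(String.format(Locale.ROOT,
                    "kroxylicious_virtual_cluster_state{%s,state=\"%s\"} %d\n",
                    label, s.name().toLowerCase(Locale.ROOT), s == state ? 1 : 0));
        }
        out.append(String.format(Locale.ROOT,
                "kroxylicious_virtual_cluster_state_duration_seconds{%s} %.3f\n",
                label, secondsInState()));
        out.append(String.format(Locale.ROOT,
                "kroxylicious_virtual_cluster_transitions_total{%s} %d\n",
                label, transitions));
        return out.toString();
    }
}
```

A caller would construct one tracker per virtual cluster, for example `new ClusterLifecycleTracker("demo", System::nanoTime)`, call `transitionTo` on each lifecycle change, and serve `state()` and `failureReason()` from the management endpoint. Passing the clock in keeps the time-in-state value testable without sleeping.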
