observability: alert + prom rule + tile + catalog for instant_payment_probe_outcome_total (rule 25) #62
Merged
Rule-25 observability for the Layer-3 payment prober (the money heartbeat,
worker/internal/jobs/payment_probe.go — forum verdict §4). Ships in lockstep
with the worker PR that adds the metric.
- newrelic/alerts/payment-probe-fail.json — P1 page on
instant_payment_probe_outcome_total{result="fail"} > 0 in 10m (paid revenue
path down). result="degraded" EXCLUDED so the prober never false-pages
before the operator lights PAYMENT_PROBE_ENABLED + the test webhook secret.
- k8s/prometheus-rules.yaml — instant-worker-payment-probe group / PaymentProbeFail
(mirror of the NR alert).
- newrelic/dashboards/instanode-reliability.json — three tiles: outcomes per
leg, fails billboard (must be 0), P95 latency per leg.
- observability/METRICS-CATALOG.md — rows for the outcome counter + latency
histogram (both lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true).
Operator-apply (infra has no auto-apply). Awaiting operator
PAYMENT_PROBE_ENABLED=true (+ RAZORPAY_TEST_WEBHOOK_SECRET for the upgrade leg)
before any series materialises.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2f389f8 to 618e44b
Rule-25 observability for the Layer-3 payment prober (the money heartbeat, worker internal/jobs/payment_probe.go; forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4). Ships in lockstep with the worker PR InstaNode-dev/worker#98 that adds the instant_payment_probe_outcome_total metric, and common PR InstaNode-dev/common#48 (the NR event).
What
- newrelic/alerts/payment-probe-fail.json: P1 page on instant_payment_probe_outcome_total{result="fail"} > 0 in 10m (paid revenue path down). result="degraded" is EXCLUDED so the prober never false-pages before the operator lights PAYMENT_PROBE_ENABLED + the test webhook secret.
- k8s/prometheus-rules.yaml: new instant-worker-payment-probe group / PaymentProbeFail (a distinct, self-contained section, with no overlap with the concurrent postgres-lockdown infra work).
- newrelic/dashboards/instanode-reliability.json: three tiles: outcomes per leg, fails billboard (must be 0), P95 latency per leg.
- observability/METRICS-CATALOG.md: rows for the outcome counter + latency histogram (both lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true).
Verification
json.load); prometheus-rules YAML valid + passes the CI yamllint config.
Operator-apply
Infra has no auto-apply. Awaiting operator PAYMENT_PROBE_ENABLED=true (+ RAZORPAY_TEST_WEBHOOK_SECRET for the upgrade leg) before any series materialises, then apply via newrelic/apply.sh + the prometheus-rules ConfigMap.
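For reviewers, here is a minimal sketch of how the page condition described under What can be read: any result="fail" outcome inside the 10-minute window pages, while result="degraded" never does. This is an illustrative model only, not the shipped NR condition or Prometheus rule. The Outcome type, the failsInWindow and shouldPage helpers, the "upgrade" leg value, and the timestamps are all assumptions made up for the example.

```go
package main

import "fmt"

// Outcome is one recorded probe result. The "fail" and "degraded" values come
// from the PR text; the struct itself and the Leg values are hypothetical.
type Outcome struct {
	AtUnix int64  // when the probe finished, in Unix seconds
	Leg    string // hypothetical leg name, e.g. "upgrade"
	Result string
}

// failsInWindow counts "fail" outcomes in (nowUnix-windowSec, nowUnix].
// Every other result, including "degraded", is ignored.
func failsInWindow(outcomes []Outcome, nowUnix, windowSec int64) int {
	n := 0
	for _, o := range outcomes {
		inWindow := o.AtUnix > nowUnix-windowSec && o.AtUnix <= nowUnix
		if inWindow && o.Result == "fail" {
			n++
		}
	}
	return n
}

// shouldPage models the "> 0 in 10m" condition: any fail in the window pages.
func shouldPage(outcomes []Outcome, nowUnix, windowSec int64) bool {
	return failsInWindow(outcomes, nowUnix, windowSec) > 0
}

func main() {
	const now, tenMin = int64(10_000), int64(600)

	degradedOnly := []Outcome{{AtUnix: now - 60, Leg: "upgrade", Result: "degraded"}}
	recentFail := []Outcome{
		{AtUnix: now - 60, Leg: "upgrade", Result: "degraded"},
		{AtUnix: now - 30, Leg: "upgrade", Result: "fail"},
	}
	staleFail := []Outcome{{AtUnix: now - 900, Leg: "upgrade", Result: "fail"}}

	fmt.Println(shouldPage(degradedOnly, now, tenMin)) // degraded alone does not page
	fmt.Println(shouldPage(recentFail, now, tenMin))   // a fail inside the window pages
	fmt.Println(shouldPage(staleFail, now, tenMin))    // a fail older than the window does not
}
```

Run as-is, this prints false, true, false. Excluding "degraded" at the filter, rather than downgrading its severity, is what keeps an unconfigured prober from paging.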
🤖 Generated with Claude Code