observability: alert + prom rule + tile + catalog for instant_payment_probe_outcome_total (rule 25) #62

Merged
mastermanas805 merged 1 commit into master from feat/payment-probe-observability
Jun 6, 2026

Conversation

@mastermanas805
Member

Rule-25 observability for the Layer-3 payment prober (the money heartbeat,
worker internal/jobs/payment_probe.go — forum verdict
docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4). Ships in lockstep with the worker
PR InstaNode-dev/worker#98 that adds the instant_payment_probe_outcome_total
metric, and common PR InstaNode-dev/common#48 (the NR event).

What

  • newrelic/alerts/payment-probe-fail.json — P1 page on
    instant_payment_probe_outcome_total{result="fail"} > 0 in 10m (paid
    revenue path down). result="degraded" EXCLUDED so the prober never
    false-pages before the operator lights PAYMENT_PROBE_ENABLED + the test
    webhook secret.
  • k8s/prometheus-rules.yaml — new instant-worker-payment-probe group /
    PaymentProbeFail (a distinct, self-contained section — no overlap with the
    concurrent postgres-lockdown infra work).
  • newrelic/dashboards/instanode-reliability.json — three tiles: outcomes per
    leg, fails billboard (must be 0), P95 latency per leg.
  • observability/METRICS-CATALOG.md — rows for the outcome counter + latency
    histogram (both lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true).
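The paging condition above can be sketched in plain Python. This is a rough illustration only, not the NR alert or the Prometheus rule itself: it assumes "> 0 in 10m" means the increase of the result="fail" series over the trailing 10 minutes, and the `Sample` shape and function names are hypothetical.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 10 * 60  # the "in 10m" window from the alert description


@dataclass(frozen=True)
class Sample:
    """One scrape of the counter for one result label (hypothetical shape)."""
    timestamp: float  # seconds
    result: str       # e.g. "fail" or "degraded"
    value: float      # cumulative counter value


def fail_increase(samples, now, window=WINDOW_SECONDS):
    """Increase of the result="fail" series over the trailing window.

    Only result="fail" samples are considered, so result="degraded"
    can never contribute to a page. Counter resets are not handled here.
    """
    in_window = sorted(
        (s for s in samples
         if s.result == "fail" and now - window <= s.timestamp <= now),
        key=lambda s: s.timestamp,
    )
    if len(in_window) < 2:
        return 0.0  # need two points in the window to measure an increase
    return max(0.0, in_window[-1].value - in_window[0].value)


def should_page(samples, now):
    """True when at least one new fail was counted inside the window."""
    return fail_increase(samples, now) > 0
```

With this reading, a prober that only ever reports result="degraded" (flag or webhook secret not yet set) never pages, which matches the stated intent of the exclusion.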

Verification

  • All JSON valid (json.load); prometheus-rules YAML valid + passes the CI
    yamllint config.
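For reference, the JSON half of that check can be reproduced with a few lines of standard-library Python. This is a minimal sketch and not necessarily the exact command that was run; the helper name `invalid_json_files` is made up for illustration. The YAML half is left out because it needs yamllint, which is not in the standard library.

```python
import json
from pathlib import Path


def invalid_json_files(paths):
    """Return {path: error message} for every file that json.load rejects."""
    errors = {}
    for path in paths:
        try:
            with Path(path).open(encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, ValueError) as exc:
            # json.JSONDecodeError and UnicodeDecodeError are both ValueError
            # subclasses; OSError covers missing or unreadable files.
            errors[str(path)] = str(exc)
    return errors
```

Run from the repository root, `invalid_json_files(["newrelic/alerts/payment-probe-fail.json", "newrelic/dashboards/instanode-reliability.json"])` should return an empty dict when both files parse.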

Operator-apply

infra has no auto-apply. Awaiting operator PAYMENT_PROBE_ENABLED=true (+
RAZORPAY_TEST_WEBHOOK_SECRET for the upgrade leg) before any series
materialises, then apply via newrelic/apply.sh + the prometheus-rules
ConfigMap.

🤖 Generated with Claude Code

@mastermanas805 mastermanas805 enabled auto-merge (squash) June 6, 2026 14:50
…_probe_outcome_total (rule 25)

Rule-25 observability for the Layer-3 payment prober (the money heartbeat,
worker/internal/jobs/payment_probe.go — forum verdict §4). Ships in lockstep
with the worker PR that adds the metric.

- newrelic/alerts/payment-probe-fail.json — P1 page on
  instant_payment_probe_outcome_total{result="fail"} > 0 in 10m (paid revenue
  path down). result="degraded" EXCLUDED so the prober never false-pages
  before the operator lights PAYMENT_PROBE_ENABLED + the test webhook secret.
- k8s/prometheus-rules.yaml — instant-worker-payment-probe group / PaymentProbeFail
  (mirror of the NR alert).
- newrelic/dashboards/instanode-reliability.json — three tiles: outcomes per
  leg, fails billboard (must be 0), P95 latency per leg.
- observability/METRICS-CATALOG.md — rows for the outcome counter + latency
  histogram (both lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true).

Operator-apply (infra has no auto-apply). Awaiting operator
PAYMENT_PROBE_ENABLED=true (+ RAZORPAY_TEST_WEBHOOK_SECRET for the upgrade leg)
before any series materialises.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mastermanas805 mastermanas805 force-pushed the feat/payment-probe-observability branch from 2f389f8 to 618e44b June 6, 2026 14:51
@mastermanas805 mastermanas805 merged commit e308143 into master Jun 6, 2026
3 checks passed