PAYMENTS-11567 Resque latency metrics#31
Conversation
|
bugbot review |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 7be85b6. Configure here.
There was a problem hiding this comment.
Pull request overview
Adds opt-in, per-job Resque latency/performance histograms to bc-prometheus-ruby, implemented via a worker-side producer (parent-process Resque worker instrumentation that pushes envelopes) and an aggregator-side type collector that records Prometheus histograms.
Changes:
- Introduces
Integrations::Resque::JobMetricsplus payload DTOs (ActiveJobPayload,VanillaResquePayload,JobPayload) to record queue latency (ActiveJob-shaped payloads only) and perform duration (all jobs) and pushtype: 'resque_job'envelopes. - Adds a new aggregator
TypeCollectors::ResqueJobregistered by the Resque instrumentor to exposeresque_job_queue_latency_secondsandresque_job_perform_duration_seconds. - Documents and wires feature gating via
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED(configuration + README + CHANGELOG) and adds RSpec coverage for the new components.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| CHANGELOG.md | Notes the new opt-in per-job Resque histograms for the pending release. |
| README.md | Documents the new opt-in env var and describes the two new per-job histograms and their behavior. |
| lib/bigcommerce/prometheus.rb | Requires the new Resque per-job payload/metrics and type-collector files. |
| lib/bigcommerce/prometheus/configuration.rb | Adds resque_per_job_metrics_enabled configuration sourced from PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED. |
| lib/bigcommerce/prometheus/instrumentors/resque.rb | Registers the new TypeCollectors::ResqueJob on the exporter server. |
| lib/bigcommerce/prometheus/integrations/resque.rb | Starts per-job Resque job metrics (gated) alongside existing Resque instrumentation. |
| lib/bigcommerce/prometheus/integrations/resque/active_job_payload.rb | Implements ActiveJob-shaped payload parsing to extract job_class and queue-latency anchor time. |
| lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb | Implements producer hooks and envelope pushes for queue latency and perform duration (with env gating + Resque::Worker prepend). |
| lib/bigcommerce/prometheus/integrations/resque/job_payload.rb | Classifies Resque payload shape and returns an appropriate payload DTO without raising. |
| lib/bigcommerce/prometheus/integrations/resque/vanilla_resque_payload.rb | Implements vanilla Resque payload DTO (class label only; no queue-latency anchor). |
| lib/bigcommerce/prometheus/type_collectors/resque_job.rb | Adds the aggregator type collector for type: 'resque_job' and records the two histograms. |
| spec/bigcommerce/prometheus/integrations/resque/active_job_payload_spec.rb | Tests ActiveJob payload parsing and anchor selection behavior. |
| spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb | Tests envelope shape, clamping, error rescue, and env-var gating / idempotent hook installation. |
| spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb | Tests payload classification between ActiveJob-shaped vs vanilla (including malformed payloads). |
| spec/bigcommerce/prometheus/integrations/resque/vanilla_resque_payload_spec.rb | Tests vanilla payload labeling and nil anchor behavior. |
| spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb | Tests collector type routing contract, histogram construction, and label merge behavior through Base#collect. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ee51981 to
7be85b6
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 7be85b6. Configure here.
7be85b6 to
39b0fbd
Compare
… on worker processing time
39b0fbd to
1832211
Compare
j05h
left a comment
There was a problem hiding this comment.
Couple of minor comments, but generally looks good. Thanks for the PR @WillemHoman
|
|
||
| def parse_time(value) | ||
| return value if value.is_a?(Time) | ||
| return nil if value.nil? || value.to_s.empty? |
There was a problem hiding this comment.
| return nil if value.nil? || value.to_s.empty? | |
| return nil if value.to_s.empty? |
🍺 not strictly necessary as nil.to_s == ""
| inner = activejob_inner(payload) | ||
| inner ? ActiveJobPayload.new(inner) : VanillaResquePayload.new(payload) | ||
| rescue StandardError | ||
| VanillaResquePayload.new({}) |
There was a problem hiding this comment.
Should be logged here or we're hiding it.

What? Why?
Adds opt-in latency metrics for Resque jobs.
resque_job_queue_latency_seconds{job_class}: how long the job spent on the queue prior to being picked up for processingresque_job_perform_duration_seconds{job_class}: how long it takes to process the jobRequires Resque 1.27 or newer running in the default fork-per-job mode.
Workers with
FORK_PER_JOB=falseand platforms without fork are not instrumented; nothing is recorded and a warning is logged at worker boot.Off by default. Opt in per service by setting
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1.Supports both ActiveJob and vanilla Resque jobs.
resque_job_perform_duration_secondscovers every job.resque_job_queue_latency_secondsis supported for ActiveJob-enqueued jobs only, because it reads fields from the ActiveJob payload — see below.The metrics introduced
resque_job_queue_latency_seconds
Queue latency measures the time from when the job became eligible to run until a worker picked it up.
Buckets:
[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300].Latency is clamped to
0.0minimum so that clock skew between the enqueuer and the worker host cannot produce a negative observation.This metric relies on the job having the ActiveJob payload shape.
i.e.
payload['args'][0](a Hash) must containjob_class(string) — used as the metric label.scheduled_at(ISO 8601 string, optional) — preferred anchor when present.enqueued_at(ISO 8601 string) — anchor whenscheduled_atis absent.A job is eligible to be picked up by a worker at
scheduled_atwhen scheduled with a delay.Otherwise a job is immediately eligible upon enqueuing, and so
enqueued_atis used.More precisely
scheduled_atis populated and is the time the job is eligible to be picked up by the workerscheduled_atisnil. In this caseenqueued_atis used as it as the job is eligible to be picked up by the worker as soon as it is enqueued.This holds for both initial attempts and retries.
The one caveat is for versions of Rails prior to 7.1.
scheduled_atis only serialized by ActiveJob from Rails 7.1; on older Rails the payload carries onlyenqueued_at, so intentional scheduling/backoff delay counts toward queue latency.For vanilla Resque jobs without these fields, the metric silently no-ops for them.
Sample payloads for both shapes are in Payload parsing below.
The gem detects the payload by the fields above rather than by matching the
JobWrapperclass name so that non ActiveJob Resque tasks can also opt in by populating the necessary fields.aConverting vanilla jobs to support resque_job_queue_latency_seconds
Vanilla Resque jobs that need queue latency should be either be
Details for both routes are at the end of this doc: Converting to ActiveJob and Wrapper class alternative.
Retries
ActiveJob re-stamps both
enqueued_atandscheduled_aton every retry (including each re-queue).This means that
resque_job_queue_latency_secondsonly reflects the latency for each individual retry.To measure the entire duration from initial enqueue until dequeue of the final attempt, a separate userland metric is needed.
For example, domain events should have an
occurred_attimestamp as per the docs.resque_job_perform_duration_seconds
How long does the processing of the job take.
This is the total child-process lifetime from fork to
Process.waitpidreturn.Buckets:
[0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60].This works for both vanilla Resque jobs and ActiveJobs and doesn't rely on any payload fields.
Why this is implemented in bc-prometheus-ruby instead of each service
In-child metric collection requires a synchronous flush before
exit!which is Resque's default child-exit.The flush is bounded by the
bc-prometheus-rubyworker thread's sleep cadence (0.5 s by default) — which makes it slow relative to fast publishes.For example this metric was initially introduced in https://github.com/bigcommerce/bigpay/pull/10597 however had to be reverted as it increased Resque job run-time from 20ms to 500ms.
In contrast, the parent process is long-lived, so the bc-prom worker thread drains naturally between jobs without any synchronous wait.
Moving the instrumentation to the parent eliminates the per-job latency tax entirely.
This also ensures consistent metric names and labels.
Opt-in env var
This is opt-in because per-job series can be high cardinality when a service has many Resque job classes.
This is enabled by
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED(default0).It is read once at boot via the existing
Bigcommerce::Prometheus.configure.This mirrors the existing
PROMETHEUS_ENABLEDopt-in pattern.Implementation
bc-prometheus-rubyfollows a producer/aggregator split.Worker processes push metric observations as JSON envelopes to a long-running aggregator server, which accumulates them and exposes
/metricsfor Prometheus to scrape.The new per-job metrics follow the same pattern:
type: 'resque_job'envelopes.TypeCollectors::ResqueJob) on the server side receives those envelopes and records them into two Prometheus histograms.Producer —
Bigcommerce::Prometheus::Integrations::Resque::JobMetricsRecords the metrics and pushes them to the aggregator.
Payload parsing
Resque payloads arrive in one of two shapes: ActiveJob-enqueued or vanilla.
An ActiveJob enqueued via
.perform_lateris stored with ActiveJob'sJobWrapperas the Resque class, and the job's own details serialized intoargs[0]:{ "class": "ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper", "args": [ { "job_class": "BigPay::SomePublishJob", "job_id": "07b08b09-8a1e-4f4d-9e0e-1f2a3b4c5d6e", "queue_name": "scheduled_action", "arguments": [12345], "enqueued_at": "2026-06-11T01:23:45.123456789Z", "scheduled_at": null } ] }A vanilla job enqueued via
Resque.enqueue(BigPay::Kount::ProcessNotificationJob, 12345)is stored with its own class at the top level and raw positional args:{ "class": "BigPay::Kount::ProcessNotificationJob", "args": [12345] }Payload parsing models each shape as its own DTO behind a common interface of
#job_classand#anchor_time:ActiveJobPayload— built from the innerargs[0]hash. From the first example:job_classis"BigPay::SomePublishJob", andanchor_timeparses fromscheduled_at, falling back toenqueued_at.VanillaResquePayload— built from the whole payload. From the second example:job_classis"BigPay::Kount::ProcessNotificationJob".anchor_timeis always nil, because the payload carries no timestamps. Malformed payloads yield'unknown'.JobPayload.for(resque_job)decides which to build, once per job.The criterion: a payload is ActiveJob-shaped when
args[0]is a Hash carrying a truthyjob_class; everything else is treated as vanilla.In the examples above, the first payload meets the criterion and the second does not.
Both DTOs eagerly extract everything in
initializeand hold no reference to theResque::Jobafterwards.The recording methods take the prebuilt payload rather than reparsing the job on each call.
Consequences:
resque_job_perform_duration_secondsemits a meaningfuljob_classlabel for vanilla Resque jobs too.However
resque_job_queue_latency_secondsno-ops for vanilla Resque jobs, since there is no#anchor_time.How the metrics are recorded
WorkerInstrumentationis a module prepended ontoResque::Workerthat wrapsperform_with_fork.It builds a payload object once per job via
JobPayload.for, records queue latency beforesuper, and records perform duration inensure:Class-method API
JobMetrics.start(client:)— no-op unless the env var is on. PrependsWorkerInstrumentationontoResque::Worker. Idempotent.JobMetrics.record_queue_latency(payload)— records how long the job sat on the queue: the seconds from the payload's#anchor_timeto now. Feedsresque_job_queue_latency_seconds, labelled by the payload's#job_class. Does nothing when the payload has no anchor.JobMetrics.record_perform_duration(payload, duration)— records how long the job took to process; the caller measures and supplies the duration. Feedsresque_job_perform_duration_seconds, labelled by the payload's#job_class.On the wire, each recording is sent to the aggregator with an envelope of
type: 'resque_job'.Both
record_*methods rescueStandardErrorand log a warning — metric push failures never propagate into the publish/perform path.Aggregator — type collectors
Everything above is the producer side, running in the worker process. The aggregator is the other half: the long-running server process that receives the pushed envelopes and exposes the accumulated metrics at
/metricsfor Prometheus to scrape.Two type collectors, one per envelope shape, are registered side-by-side in
Instrumentors::Resque#start.The upstream
PrometheusExporter::Server::Collectorroutes each envelope to whichever collector'stypematchesenvelope['type']— no in-collector dispatch needed.Bigcommerce::Prometheus::TypeCollectors::Resque: is the pre-existing aggregator and continues to own the aggregate worker/queue gauges (resque_workers_total,jobs_failed_total,jobs_pending_total,jobs_processed_total,queues_total,queue_sizes) fed byCollectors::Resque#collect.Bigcommerce::Prometheus::TypeCollectors::ResqueJob: the new aggregator owns the two new histograms:resque_job_queue_latency_secondswith buckets tuned for queue dwell ([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300]).resque_job_perform_duration_secondswith buckets tuned for per-job work ([0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]).Wiring
Integrations::Resque.start(client:)now also callsJobMetrics.start(client:).If the env var is off, that call returns immediately and nothing is hooked. If on, the hooks install and the metrics flow.
Integrations::Resque.startis invoked inside aResque.before_first_forkblock (registered byInstrumentors::Resque#setup_middleware).Converting to ActiveJob
Converting a vanilla job is mostly mechanical, with three things to plan for:
self.performshim.Wrapper class alternative
An alternative to converting a vanilla job to ActiveJob: a small service-local wrapper class that produces the ActiveJob payload shape at enqueue time.
The wrapper stamps the fields the gem reads into
args[0]and delegatesperformto the target job. The target job class is untouched. Callers opt in by enqueueing through the wrapper.This produces a payload that meets the classification criterion in Payload parsing:
args[0]is a Hash with a truthyjob_class.queue_latencyanchors on the stampedenqueued_at, and both metrics label with the target job's class name.Trade-offs versus converting to ActiveJob:
job_class,enqueued_at,scheduled_at.Testing
Note
Medium Risk
Prepends onto
Resque::Workeron every enabled worker and adds per-job_classcardinality; failures are rescued so jobs should not break, but misconfiguration could increase metrics load.Overview
Adds opt-in per-Resque-job Prometheus histograms, gated by
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED/resque_per_job_metrics_enabled(default off).When enabled, the parent worker process prepends instrumentation around
Resque::Worker#perform_with_forkand pushestype: 'resque_job'envelopes to the exporter:resque_job_queue_latency_seconds{job_class}(ActiveJob-shaped payloads only; anchors onscheduled_atthenenqueued_at, clamped for clock skew) andresque_job_perform_duration_seconds{job_class}(full fork→waitpid child lifetime for all jobs). Payload shape is detected viaJobPayload.for(ActiveJob inner hash vs vanilla Resque).A new
TypeCollectors::ResqueJobaggregates those observations into histograms; it is registered on the Resque instrumentor server alongside existing collectors.Integrations::Resque.startnow also startsJobMetrics. README, CHANGELOG, and specs cover the feature.Reviewed by Cursor Bugbot for commit 1832211. Bugbot is set up for automated code reviews on this repo. Configure here.