PAYMENTS-11567 Resque latency metrics #31

Open
WillemHoman wants to merge 1 commit into main from PAYMENTS-11567-resque_latency_metrics

@WillemHoman WillemHoman commented Jun 11, 2026

What? Why?

Adds opt-in latency metrics for Resque jobs.

  • resque_job_queue_latency_seconds{job_class} : how long the job spent on the queue prior to being picked up for processing
  • resque_job_perform_duration_seconds{job_class} : how long it takes to process the job

Requires Resque 1.27 or newer running in the default fork-per-job mode.
Workers with FORK_PER_JOB=false and platforms without fork are not instrumented; nothing is recorded and a warning is logged at worker boot.

Off by default. Opt in per service by setting PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1.

Supports both ActiveJob and vanilla Resque jobs. resque_job_perform_duration_seconds covers every job. resque_job_queue_latency_seconds is supported for ActiveJob-enqueued jobs only, because it reads fields from the ActiveJob payload — see below.

The metrics introduced

resque_job_queue_latency_seconds

Queue latency measures the time from when the job became eligible to run until a worker picked it up.
Buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300].
Latency is clamped to 0.0 minimum so that clock skew between the enqueuer and the worker host cannot produce a negative observation.

This metric relies on the job having the ActiveJob payload shape, i.e. payload['args'][0] (a Hash) must contain:

  • job_class (string) — used as the metric label.
  • scheduled_at (ISO 8601 string, optional) — preferred anchor when present.
  • enqueued_at (ISO 8601 string) — anchor when scheduled_at is absent.

A job is eligible to be picked up by a worker at scheduled_at when scheduled with a delay.
Otherwise a job is immediately eligible upon enqueuing, and so enqueued_at is used.
More precisely:

  • If the job is scheduled with a delay, scheduled_at is populated and is the time at which the job becomes eligible to be picked up by a worker.
  • If the job is eligible to be picked up immediately, scheduled_at is nil. In this case enqueued_at is used, because the job is eligible as soon as it is enqueued.
    This holds for both initial attempts and retries.
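
The anchor selection and clamping described above can be sketched as follows. This is a minimal illustration, not the gem's code: the helper name queue_latency_seconds and its injectable now: parameter are assumptions made for the example. Only the scheduled_at / enqueued_at field names and the 0.0 clamp come from the description.

```ruby
require 'time'

# Hypothetical helper (not part of bc-prometheus-ruby).
# `inner` stands for payload['args'][0] of an ActiveJob-shaped payload.
def queue_latency_seconds(inner, now: Time.now)
  # Prefer scheduled_at when present, otherwise fall back to enqueued_at.
  raw = inner['scheduled_at'] || inner['enqueued_at']
  return nil if raw.to_s.empty? # no anchor: nothing to record

  anchor = Time.iso8601(raw.to_s)
  # Clamp so clock skew between enqueuer and worker cannot go negative.
  [now - anchor, 0.0].max
rescue ArgumentError
  nil # unparseable timestamp: treat as "no anchor"
end
```

Passing now: explicitly keeps the sketch deterministic in tests. A caller recording live observations would rely on the default.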

The one caveat is for versions of Rails prior to 7.1.
scheduled_at is only serialized by ActiveJob from Rails 7.1; on older Rails the payload carries only enqueued_at, so intentional scheduling/backoff delay counts toward queue latency.

For vanilla Resque jobs without these fields, the metric silently no-ops.

Sample payloads for both shapes are in Payload parsing below.

The gem detects the payload by the fields above rather than by matching the JobWrapper class name, so that non-ActiveJob Resque jobs can also opt in by populating the necessary fields.

Converting vanilla jobs to support resque_job_queue_latency_seconds

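Vanilla Resque jobs that need queue latency should either be converted to ActiveJob or be enqueued through a wrapper class that produces the ActiveJob payload shape (see Converting to ActiveJob and Wrapper class alternative below).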

Retries

ActiveJob re-stamps both enqueued_at and scheduled_at on every retry (including each re-queue).
This means that resque_job_queue_latency_seconds only reflects the latency for each individual retry.
To measure the entire duration from initial enqueue until dequeue of the final attempt, a separate userland metric is needed.
For example, domain events should have an occurred_at timestamp as per the docs.

occurred_at is populated with the time at which the event occurred, typically the current time

resque_job_perform_duration_seconds

How long the job takes to process.
This is the total child-process lifetime from fork to Process.waitpid return.
Buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60].

This works for both vanilla Resque jobs and ActiveJobs and doesn't rely on any payload fields.

Why this is implemented in bc-prometheus-ruby instead of each service

In-child metric collection requires a synchronous flush before exit!, which is how Resque's child process exits by default.
The flush is bounded by the bc-prometheus-ruby worker thread's sleep cadence (0.5 s by default), which makes it slow relative to fast publishes.
For example, this metric was initially introduced in https://github.com/bigcommerce/bigpay/pull/10597 but had to be reverted because it increased Resque job run time from 20 ms to 500 ms.
In contrast, the parent process is long-lived, so the bc-prom worker thread drains naturally between jobs without any synchronous wait.
Moving the instrumentation to the parent eliminates the per-job latency tax entirely.

This also ensures consistent metric names and labels.

Opt-in env var

This is opt-in because per-job series can be high cardinality when a service has many Resque job classes.
This is enabled by PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED (default 0).
It is read once at boot via the existing Bigcommerce::Prometheus.configure.
This mirrors the existing PROMETHEUS_ENABLED opt-in pattern.
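
As a rough illustration of the gating, the check could look like the sketch below. The method name per_job_metrics_enabled? and the strict comparison against "1" are assumptions for the example. The gem's actual parsing inside Bigcommerce::Prometheus.configure may accept other values.

```ruby
# Hypothetical flag reader; the gem's real parsing rules may differ.
# `env` is injectable so the check can be exercised without touching ENV.
def per_job_metrics_enabled?(env = ENV)
  # Unset or "0" means off (the documented default); "1" means on.
  env.fetch('PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED', '0').to_s.strip == '1'
end
```

Calling this once at boot and caching the result matches the "read once" behaviour described above, so flipping the variable on a running worker would have no effect until restart.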

Implementation

bc-prometheus-ruby follows a producer/aggregator split.
Worker processes push metric observations as JSON envelopes to a long-running aggregator server, which accumulates them and exposes /metrics for Prometheus to scrape.
The new per-job metrics follow the same pattern:

  • a producer hook installed in the worker process captures queue latency and perform duration and pushes type: 'resque_job' envelopes.
  • a new aggregator type collector (TypeCollectors::ResqueJob) on the server side receives those envelopes and records them into two Prometheus histograms.

Producer — Bigcommerce::Prometheus::Integrations::Resque::JobMetrics

Records the metrics and pushes them to the aggregator.

Payload parsing

Resque payloads arrive in one of two shapes: ActiveJob-enqueued or vanilla.

An ActiveJob enqueued via .perform_later is stored with ActiveJob's JobWrapper as the Resque class, and the job's own details serialized into args[0]:

{
  "class": "ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper",
  "args": [
    {
      "job_class": "BigPay::SomePublishJob",
      "job_id": "07b08b09-8a1e-4f4d-9e0e-1f2a3b4c5d6e",
      "queue_name": "scheduled_action",
      "arguments": [12345],
      "enqueued_at": "2026-06-11T01:23:45.123456789Z",
      "scheduled_at": null
    }
  ]
}

A vanilla job enqueued via Resque.enqueue(BigPay::Kount::ProcessNotificationJob, 12345) is stored with its own class at the top level and raw positional args:

{
  "class": "BigPay::Kount::ProcessNotificationJob",
  "args": [12345]
}

Payload parsing models each shape as its own DTO behind a common interface of #job_class and #anchor_time:

  • ActiveJobPayload — built from the inner args[0] hash. From the first example: job_class is "BigPay::SomePublishJob", and anchor_time parses from scheduled_at, falling back to enqueued_at.
  • VanillaResquePayload — built from the whole payload. From the second example: job_class is "BigPay::Kount::ProcessNotificationJob". anchor_time is always nil, because the payload carries no timestamps. Malformed payloads yield 'unknown'.

JobPayload.for(resque_job) decides which to build, once per job.
The criterion: a payload is ActiveJob-shaped when args[0] is a Hash carrying a truthy job_class; everything else is treated as vanilla.
In the examples above, the first payload meets the criterion and the second does not.
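
The classification criterion can be expressed in a few lines. The sketch below is illustrative only: payload_shape is a hypothetical name and returns symbols rather than building the DTOs, so it should not be read as the body of JobPayload.for.

```ruby
# Hypothetical classifier mirroring the stated criterion; not the gem's code.
def payload_shape(payload)
  args = payload.is_a?(Hash) ? payload['args'] : nil
  first = args.is_a?(Array) ? args.first : nil
  # ActiveJob-shaped: args[0] is a Hash carrying a truthy job_class.
  first.is_a?(Hash) && first['job_class'] ? :active_job : :vanilla
end

active_job_payload = {
  'class' => 'ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper',
  'args' => [
    { 'job_class' => 'BigPay::SomePublishJob',
      'enqueued_at' => '2026-06-11T01:23:45.123456789Z',
      'scheduled_at' => nil }
  ]
}
vanilla_payload = { 'class' => 'BigPay::Kount::ProcessNotificationJob', 'args' => [12345] }

payload_shape(active_job_payload) # => :active_job
payload_shape(vanilla_payload)    # => :vanilla
```

Anything that is not a Hash with an Array under args falls through to :vanilla, which matches the "everything else is treated as vanilla" rule.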

Both DTOs eagerly extract everything in initialize and hold no reference to the Resque::Job afterwards.
The recording methods take the prebuilt payload rather than reparsing the job on each call.

Consequences: resque_job_perform_duration_seconds emits a meaningful job_class label for vanilla Resque jobs too.
However resque_job_queue_latency_seconds no-ops for vanilla Resque jobs, since there is no #anchor_time.

How the metrics are recorded

WorkerInstrumentation is a module prepended onto Resque::Worker that wraps perform_with_fork.
It builds a payload object once per job via JobPayload.for, records queue latency before super, and records perform duration in ensure:

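module WorkerInstrumentation
  def perform_with_fork(job, &block)
    payload = JobPayload.for(job)
    JobMetrics.record_queue_latency(payload)
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  ensure
    # record_perform_duration takes the elapsed seconds, not the start time.
    if started_at
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
      JobMetrics.record_perform_duration(payload, duration)
    end
  end
end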

Class-method API

  • JobMetrics.start(client:) — no-op unless the env var is on. Prepends WorkerInstrumentation onto Resque::Worker. Idempotent.
  • JobMetrics.record_queue_latency(payload) — records how long the job sat on the queue: the seconds from the payload's #anchor_time to now. Feeds resque_job_queue_latency_seconds, labelled by the payload's #job_class. Does nothing when the payload has no anchor.
  • JobMetrics.record_perform_duration(payload, duration) — records how long the job took to process; the caller measures and supplies the duration. Feeds resque_job_perform_duration_seconds, labelled by the payload's #job_class.
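
To make the prepend-and-ensure pattern concrete, here is a self-contained sketch that runs without Resque. FakeWorker, TimingInstrumentation and install_instrumentation are hypothetical stand-ins for Resque::Worker, WorkerInstrumentation and JobMetrics.start. The explicit ancestors check is one possible way to make installation idempotent, not necessarily the one the gem uses.

```ruby
# Stand-in for Resque::Worker; the real method forks a child process.
class FakeWorker
  def perform_with_fork(job)
    job.call
  end
end

# Hypothetical instrumentation module, prepended so `super` reaches the worker.
module TimingInstrumentation
  def self.durations
    @durations ||= []
  end

  def perform_with_fork(job)
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  ensure
    # Runs on success and on failure, so every attempt is measured.
    if started_at
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
      TimingInstrumentation.durations << elapsed
    end
  end
end

# Returns true when the hook was installed, false when it was already present.
def install_instrumentation(worker_class)
  return false if worker_class.ancestors.include?(TimingInstrumentation)

  worker_class.prepend(TimingInstrumentation)
  true
end
```

Because the measurement lives in ensure, the wrapped method still returns the value of super and still re-raises any error from the job.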

On the wire, each recording is sent to the aggregator with an envelope of type: 'resque_job'.

Both record_* methods rescue StandardError and log a warning — metric push failures never propagate into the publish/perform path.
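
The rescue-and-warn behaviour can be sketched in isolation. record_safely below is a hypothetical helper, and the log message format is an assumption. The point is only that a failing push is logged and swallowed instead of reaching the job.

```ruby
require 'logger'

# Hypothetical wrapper: runs the block, logs and swallows StandardError.
# Returns true when the block succeeded, false when it failed.
def record_safely(logger, metric_name)
  yield
  true
rescue StandardError => e
  logger.warn("#{metric_name} push failed: #{e.class}: #{e.message}")
  false
end
```

Only StandardError is rescued, so signals and other non-standard exceptions would still propagate as usual.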

Aggregator — type collectors

Everything above is the producer side, running in the worker process. The aggregator is the other half: the long-running server process that receives the pushed envelopes and exposes the accumulated metrics at /metrics for Prometheus to scrape.

Two type collectors, one per envelope shape, are registered side-by-side in Instrumentors::Resque#start.
The upstream PrometheusExporter::Server::Collector routes each envelope to whichever collector's type matches envelope['type'] — no in-collector dispatch needed.

  • Bigcommerce::Prometheus::TypeCollectors::Resque: the pre-existing aggregator, which continues to own the aggregate worker/queue gauges (resque_workers_total, jobs_failed_total, jobs_pending_total, jobs_processed_total, queues_total, queue_sizes) fed by Collectors::Resque#collect.
  • Bigcommerce::Prometheus::TypeCollectors::ResqueJob: the new aggregator, which owns the two new histograms:
    • resque_job_queue_latency_seconds with buckets tuned for queue dwell ([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300]).
    • resque_job_perform_duration_seconds with buckets tuned for per-job work ([0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]).
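
For readers less familiar with Prometheus histograms, the sketch below shows how observations map onto cumulative le buckets. It is a plain-Ruby illustration with a hypothetical cumulative_bucket_counts helper. The real collector delegates this bookkeeping to the exporter library's histogram rather than counting by hand.

```ruby
QUEUE_LATENCY_BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300].freeze

# Hypothetical helper: each bucket counts every observation <= its upper bound,
# and the implicit +Inf bucket counts all observations.
def cumulative_bucket_counts(observations, buckets)
  counts = buckets.to_h { |le| [le, observations.count { |value| value <= le }] }
  counts['+Inf'] = observations.size
  counts
end

# Four made-up observations between 2.5 s and 5 s.
counts = cumulative_bucket_counts([3.0, 4.2, 4.8, 5.0], QUEUE_LATENCY_BUCKETS)
counts[2.5]    # => 0
counts[5]      # => 4
counts['+Inf'] # => 4
```

Four observations between 2.5 s and 5 s give the same bucket pattern as the first job class in the Testing output: zero up to le="2.5", then four from le="5" upwards.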

Wiring

Integrations::Resque.start(client:) now also calls JobMetrics.start(client:).
If the env var is off, that call returns immediately and nothing is hooked. If on, the hooks install and the metrics flow.

Integrations::Resque.start is invoked inside a Resque.before_first_fork block (registered by Instrumentors::Resque#setup_middleware).

Converting to ActiveJob

Converting a vanilla job is mostly mechanical, with three things to plan for:

  • In-flight old-shape payloads in Redis will fail against the converted class at deploy time. Drain the queue first, or keep a temporary self.perform shim.
  • resque-scheduler YAML entries can't enqueue ActiveJobs natively. Each scheduled entry needs a small shim class.
  • Arguments pass through ActiveJob serialization. ActiveRecord records become GlobalIDs.

Wrapper class alternative

An alternative to converting a vanilla job to ActiveJob: a small service-local wrapper class that produces the ActiveJob payload shape at enqueue time.

The wrapper stamps the fields the gem reads into args[0] and delegates perform to the target job. The target job class is untouched. Callers opt in by enqueueing through the wrapper.

class InstrumentedEnqueue
  def self.enqueue(job_class, *args)
    Resque.enqueue_to(
      Resque.queue_from_class(job_class),
      self,
      {
        'job_class' => job_class.name,
        'enqueued_at' => Time.now.utc.iso8601(9),
        'arguments' => args
      }
    )
  end

  def self.perform(payload)
    payload['job_class'].constantize.perform(*payload['arguments'])
  end
end

# the call site changes from
Resque.enqueue(BigPay::Kount::ProcessNotificationJob, 12345)
# to
InstrumentedEnqueue.enqueue(BigPay::Kount::ProcessNotificationJob, 12345)

This produces a payload that meets the classification criterion in Payload parsing: args[0] is a Hash with a truthy job_class. queue_latency anchors on the stamped enqueued_at, and both metrics label with the target job's class name.

Trade-offs versus converting to ActiveJob:

  • Per-call-site opt-in. Only enqueues that go through the wrapper get the metric. Un-migrated call sites coexist safely — their payloads simply skip the metric — so there is no deploy-ordering risk.
  • The key names must match the gem's contract exactly: job_class, enqueued_at, scheduled_at.
  • resque-scheduler YAML entries don't fit this pattern. Recurring scheduled jobs need a different approach.

Testing

# HELP ruby_resque_job_queue_latency_seconds Seconds between when a Resque job was due to run (scheduled_at if set, falling back to enqueued_at) and when a worker process picked it up. Recorded per attempt; retries-with-backoff anchor on scheduled_at, excluding the intentional backoff wait. Opt-in via PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED.
# TYPE ruby_resque_job_queue_latency_seconds histogram
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="+Inf"} 4
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="300"} 4
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="120"} 4
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="60"} 4
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="30"} 4
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="5"} 4
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="2.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.01"} 0
ruby_resque_job_queue_latency_seconds_count{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 4
ruby_resque_job_queue_latency_seconds_sum{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 17.02194
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="+Inf"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="300"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="120"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="60"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="30"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="5"} 2
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="2.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_queue_latency_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.01"} 0
ruby_resque_job_queue_latency_seconds_count{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 2
ruby_resque_job_queue_latency_seconds_sum{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 7.9414940000000005

# HELP ruby_resque_job_perform_duration_seconds Total Resque child process lifetime (fork to waitpid). Includes fork overhead, Redis reconnect, after_fork hooks, perform, and exit. Used as the per-job throughput signal at the worker-pod level. Opt-in via PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED.
# TYPE ruby_resque_job_perform_duration_seconds histogram
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="+Inf"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="60"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="30"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="10"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="5"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="2"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="1"} 1
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_perform_duration_seconds_count{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 2
ruby_resque_job_perform_duration_seconds_sum{job_class="BigPay::DomainEventing::Payment::Webhook::PublishWebhookTransactionCreatedEventJob"} 2.5267620000522584
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="+Inf"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="60"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="30"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="10"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="5"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="2"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="1"} 2
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.5"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.25"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.1"} 0
ruby_resque_job_perform_duration_seconds_bucket{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob",le="0.05"} 0
ruby_resque_job_perform_duration_seconds_count{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 2
ruby_resque_job_perform_duration_seconds_sum{job_class="BigPay::DomainEventing::Payment::Transaction::PublishUnstablePaymentTransactionCreatedEventJob"} 1.4121759999543428


Note

Medium Risk
Prepends onto Resque::Worker on every enabled worker and adds per-job_class cardinality; failures are rescued so jobs should not break, but misconfiguration could increase metrics load.

Overview
Adds opt-in per-Resque-job Prometheus histograms, gated by PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED / resque_per_job_metrics_enabled (default off).

When enabled, the parent worker process prepends instrumentation around Resque::Worker#perform_with_fork and pushes type: 'resque_job' envelopes to the exporter: resque_job_queue_latency_seconds{job_class} (ActiveJob-shaped payloads only; anchors on scheduled_at then enqueued_at, clamped for clock skew) and resque_job_perform_duration_seconds{job_class} (full fork→waitpid child lifetime for all jobs). Payload shape is detected via JobPayload.for (ActiveJob inner hash vs vanilla Resque).

A new TypeCollectors::ResqueJob aggregates those observations into histograms; it is registered on the Resque instrumentor server alongside existing collectors. Integrations::Resque.start now also starts JobMetrics. README, CHANGELOG, and specs cover the feature.

Reviewed by Cursor Bugbot for commit 1832211. Bugbot is set up for automated code reviews on this repo. Configure here.

@WillemHoman

Author

bugbot review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 7be85b6. Configure here.

Copilot AI left a comment

Pull request overview

Adds opt-in, per-job Resque latency/performance histograms to bc-prometheus-ruby, implemented via a worker-side producer (parent-process Resque worker instrumentation that pushes envelopes) and an aggregator-side type collector that records Prometheus histograms.

Changes:

  • Introduces Integrations::Resque::JobMetrics plus payload DTOs (ActiveJobPayload, VanillaResquePayload, JobPayload) to record queue latency (ActiveJob-shaped payloads only) and perform duration (all jobs) and push type: 'resque_job' envelopes.
  • Adds a new aggregator TypeCollectors::ResqueJob registered by the Resque instrumentor to expose resque_job_queue_latency_seconds and resque_job_perform_duration_seconds.
  • Documents and wires feature gating via PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED (configuration + README + CHANGELOG) and adds RSpec coverage for the new components.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| CHANGELOG.md | Notes the new opt-in per-job Resque histograms for the pending release. |
| README.md | Documents the new opt-in env var and describes the two new per-job histograms and their behavior. |
| lib/bigcommerce/prometheus.rb | Requires the new Resque per-job payload/metrics and type-collector files. |
| lib/bigcommerce/prometheus/configuration.rb | Adds resque_per_job_metrics_enabled configuration sourced from PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED. |
| lib/bigcommerce/prometheus/instrumentors/resque.rb | Registers the new TypeCollectors::ResqueJob on the exporter server. |
| lib/bigcommerce/prometheus/integrations/resque.rb | Starts per-job Resque job metrics (gated) alongside existing Resque instrumentation. |
| lib/bigcommerce/prometheus/integrations/resque/active_job_payload.rb | Implements ActiveJob-shaped payload parsing to extract job_class and queue-latency anchor time. |
| lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb | Implements producer hooks and envelope pushes for queue latency and perform duration (with env gating + Resque::Worker prepend). |
| lib/bigcommerce/prometheus/integrations/resque/job_payload.rb | Classifies Resque payload shape and returns an appropriate payload DTO without raising. |
| lib/bigcommerce/prometheus/integrations/resque/vanilla_resque_payload.rb | Implements vanilla Resque payload DTO (class label only; no queue-latency anchor). |
| lib/bigcommerce/prometheus/type_collectors/resque_job.rb | Adds the aggregator type collector for type: 'resque_job' and records the two histograms. |
| spec/bigcommerce/prometheus/integrations/resque/active_job_payload_spec.rb | Tests ActiveJob payload parsing and anchor selection behavior. |
| spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb | Tests envelope shape, clamping, error rescue, and env-var gating / idempotent hook installation. |
| spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb | Tests payload classification between ActiveJob-shaped vs vanilla (including malformed payloads). |
| spec/bigcommerce/prometheus/integrations/resque/vanilla_resque_payload_spec.rb | Tests vanilla payload labeling and nil anchor behavior. |
| spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb | Tests collector type routing contract, histogram construction, and label merge behavior through Base#collect. |

Comment thread README.md
Comment thread lib/bigcommerce/prometheus/type_collectors/resque_job.rb
@WillemHoman WillemHoman marked this pull request as ready for review June 11, 2026 05:34
@WillemHoman WillemHoman requested a review from a team as a code owner June 11, 2026 05:34
@WillemHoman WillemHoman force-pushed the PAYMENTS-11567-resque_latency_metrics branch from ee51981 to 7be85b6 Compare June 11, 2026 07:41

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 7be85b6. Configure here.

Comment thread lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb
@WillemHoman WillemHoman force-pushed the PAYMENTS-11567-resque_latency_metrics branch from 7be85b6 to 39b0fbd Compare June 11, 2026 09:36
@WillemHoman WillemHoman force-pushed the PAYMENTS-11567-resque_latency_metrics branch from 39b0fbd to 1832211 Compare June 11, 2026 09:47

@j05h j05h left a comment

Contributor

Couple of minor comments, but generally looks good. Thanks for the PR @WillemHoman


def parse_time(value)
  return value if value.is_a?(Time)
  return nil if value.nil? || value.to_s.empty?

Contributor

Suggested change
- return nil if value.nil? || value.to_s.empty?
+ return nil if value.to_s.empty?

🍺 not strictly necessary as nil.to_s == ""

  inner = activejob_inner(payload)
  inner ? ActiveJobPayload.new(inner) : VanillaResquePayload.new(payload)
rescue StandardError
  VanillaResquePayload.new({})

Contributor

Should be logged here or we're hiding it.
