feat(transport): add HTTP retry with exponential backoff#1520
Open
If sentry__retry_new() fails, retry_func was still set on the transport, causing can_retry to return true and envelopes to be dropped instead of cached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
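A minimal sketch of the pattern this commit fixes; all names here are illustrative stand-ins, not the actual sentry-native symbols. The point is that `retry_func` is installed only after the retry state is successfully created, so a capability check stays false when creation fails and envelopes fall through to the offline cache instead of being dropped:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative transport with an optional retry capability. */
typedef struct transport_s {
    void *retry_state;
    bool (*retry_func)(struct transport_s *);
} transport_t;

/* can_retry is derived from whether retry_func was installed. */
static bool transport_can_retry(const transport_t *t)
{
    return t->retry_func != NULL;
}

/* Stand-in retry callback for the sketch. */
static bool dummy_retry(transport_t *t)
{
    (void)t;
    return true;
}

static void transport_set_retry(
    transport_t *t, void *retry_state, bool (*retry_func)(transport_t *))
{
    t->retry_state = retry_state;
    /* only install retry_func when the retry state actually exists,
     * so can_retry cannot report true after a failed setup */
    t->retry_func = retry_state ? retry_func : NULL;
}
```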
Envelopes without an event_id (e.g. sessions) were silently dropped by sentry__retry_write_envelope. This was a workaround (cd57ff4) for the old retry_write_envelope approach that regenerated a random UUID on each attempt, orphaning files on disk. The rewrite to rename-based counter bumps in handle_result (0f37177) made this safe: the UUID is parsed from the filename and preserved across renames, so a random UUID assigned at initial write stays stable through all retry cycles. Generate a random UUIDv4 for nil event_id envelopes instead of skipping them. Extract unreachable_dsn to module level. Add UUID consistency assertions to existing retry tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clear the scheduled flag before scanning so that a concurrent retry_enqueue always sees 0 and successfully arms a new poll via CAS. Previously, the flag was cleared after the scan returned, creating a window where enqueue could see 1, skip scheduling, and then the poll would clear the flag — stranding the newly enqueued item. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
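The flag ordering above can be sketched with C11 atomics; names are illustrative, not the actual sentry-native symbols. An enqueue arms a poll only via CAS(0 → 1), and the poll task clears the flag before scanning so a racing enqueue always observes 0:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int g_scheduled; /* 0 = idle, 1 = poll scheduled */

/* Producer side: arm a poll only if none is currently armed. */
static bool retry_enqueue_arm(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&g_scheduled, &expected, 1);
}

/* Poll side: clear the flag BEFORE scanning the cache. If this ran
 * after the scan instead, an enqueue during the scan would see 1,
 * skip arming, and its newly written item would be stranded until
 * some later, unrelated enqueue. */
static void retry_poll_begin(void)
{
    atomic_store(&g_scheduled, 0);
    /* ... scan the cache directory here ... */
}
```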
…n_write_cache Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_result Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry_count >= 0 branch passed the full source filename to %.36s, which would grab the timestamp prefix instead of the UUID for retry- format filenames. Extract the cache name (last 45 chars) before either branch so both use the correct UUID. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
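A sketch of the extraction step, with illustrative names. Both filename formats end in `<uuid>.envelope` (36 + 9 = 45 chars), so taking the last 45 characters first makes a subsequent `%.36s` read the UUID rather than the `<ts>-<n>-` prefix of retry-format names:

```c
#include <string.h>

#define CACHE_NAME_LEN 45 /* 36-char UUID + strlen(".envelope") */

/* Return a pointer to the trailing "<uuid>.envelope" portion of a
 * cache filename; for plain "<uuid>.envelope" names this is the
 * whole string. */
static const char *extract_cache_name(const char *filename)
{
    size_t len = strlen(filename);
    return len <= CACHE_NAME_LEN ? filename
                                 : filename + (len - CACHE_NAME_LEN);
}
```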
Use a mutex (sealed_lock) to serialize the SEALED check in retry_enqueue with the SEALED set in retry_dump_queue. Store the envelope address as uintptr_t so retry_dump_cb can skip envelopes already written by retry_enqueue without risking accidental dereferencing. The address is safe to compare because the task holds a ref that keeps the envelope alive during foreach_matching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…move_cache Move sentry__retry_parse_filename and sentry__retry_make_path into sentry_database as sentry__parse_cache_filename and sentry__run_make_cache_path. This consolidates cache filename format knowledge in one module and replaces the fragile `src_len > 45` heuristic in sentry__run_move_cache with proper parsing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SENTRY_RETRY_ATTEMPTS constant was bumped from 5 to 6 in 81d0f68 but the public API documentation was not updated to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a SENTRY_POLL_SHUTDOWN sentinel so that a concurrent retry_poll_task cannot resubmit the delayed poll that shutdown just dropped. The CAS(SCHEDULED→IDLE) in retry_poll_task is a no-op when scheduled is SHUTDOWN, and the subsequent CAS(IDLE→SCHEDULED) also fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
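The sentinel mechanics can be sketched as a three-state flag; names are illustrative stand-ins. Once the flag is set to SHUTDOWN, both CAS steps in the poll task fail, so a racing poll task can neither return the flag to IDLE nor re-arm the delayed poll that shutdown dropped:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { POLL_IDLE = 0, POLL_SCHEDULED = 1, POLL_SHUTDOWN = 2 };

static atomic_int g_poll_state;

/* Returns true if a new delayed poll should be submitted. */
static bool poll_task_finish_and_rearm(void)
{
    int expected = POLL_SCHEDULED;
    /* CAS(SCHEDULED -> IDLE): a no-op when the state is SHUTDOWN */
    atomic_compare_exchange_strong(&g_poll_state, &expected, POLL_IDLE);
    expected = POLL_IDLE;
    /* CAS(IDLE -> SCHEDULED): also fails under SHUTDOWN, so no new
     * poll gets submitted after shutdown has claimed the flag */
    return atomic_compare_exchange_strong(
        &g_poll_state, &expected, POLL_SCHEDULED);
}
```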
On Windows, WinHTTP TCP connect to an unreachable host takes ~2s, which can exceed the shutdown timeout. Add a cancel_client callback that closes just the WinHTTP request handle, unblocking the worker thread so it can process the failure and shut down cleanly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and worker Use InterlockedExchangePointer to atomically swap client->request to NULL in cancel, shutdown, and worker exit cleanup. Whichever thread wins the swap closes the handle; the loser gets NULL and skips. The worker also snapshots client->request into a local variable right after WinHttpOpenRequest and uses the local for all subsequent API calls, so it never reads NULL from the struct if cancel fires mid-function. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
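A portable sketch of the swap-to-NULL close protocol using C11 `atomic_exchange`; the actual code uses `InterlockedExchangePointer` on a WinHTTP `HINTERNET`, and the type names here are illustrative. Cancel, shutdown, and worker-exit cleanup all take ownership through one atomic swap: the thread that receives the non-NULL handle closes it, the losers receive NULL and skip, so the handle is closed exactly once however the threads race:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    _Atomic(void *) request; /* stand-in for the HINTERNET */
} winhttp_client_t;

/* Atomically claim the request handle; exactly one caller gets the
 * non-NULL pointer and is responsible for closing it. */
static void *client_take_request(winhttp_client_t *c)
{
    return atomic_exchange(&c->request, NULL);
}
```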
…ncel Replace the unconditional cancel_client call before bgworker_shutdown with an on_timeout callback that fires only when the shutdown timeout expires. This avoids aborting in-flight requests that would have completed within the timeout, while still unblocking the worker when it's stuck (e.g. WinHTTP connect to unreachable host). The callback closes session/connect handles to cancel pending WinHTTP operations, then the shutdown loop falls through to the !running check which joins the worker thread. This ensures handle_result runs and the retry counter is properly bumped on disk. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd_task The local `HINTERNET request = client->request` snapshot was only needed for the cancel_client approach. Since shutdown_client only fires at the timeout point, mid-function reads of client->request are safe. Keep only the InterlockedExchangePointer in the exit block to prevent double-close. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move sentry__atomic_store(&bgw->running, 0) from before the on_timeout callback to the else branch (detach path). This lets the worker's shutdown_task set running=0 naturally after finishing in-flight work, making the dump_queue safety-net reachable if the callback fails to unblock the worker within another 250ms cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iness Tests that directly call sentry__retry_send were racing with the transport's background retry worker polling the same cache directory. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e data loss After retry_enqueue writes an envelope and stores its address in sealed_envelope, the envelope is freed when the bgworker task completes. If a subsequent envelope is allocated at the same address and is still pending during shutdown, retry_dump_cb would incorrectly skip it, losing the envelope. Clear sealed_envelope after the first match so later iterations cannot false-match a reused address. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
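The one-shot skip can be sketched as follows, with illustrative names. The stored address is kept as `uintptr_t` (compared, never dereferenced), and clearing it on the first match means a later envelope allocated at the same reused address cannot false-match and be lost:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t sealed_envelope; /* address of the already-written envelope */
} retry_dump_state_t;

/* Called once per envelope during the shutdown dump; returns true if
 * this envelope was already written by retry_enqueue and should be
 * skipped. The match is one-shot: the stored address is cleared so a
 * reused allocation cannot match on a later iteration. */
static bool dump_should_skip(retry_dump_state_t *state, const void *envelope)
{
    if (state->sealed_envelope
        && state->sealed_envelope == (uintptr_t)envelope) {
        state->sealed_envelope = 0; /* guard against address reuse */
        return true;
    }
    return false;
}
```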
Move cache_keep logic into http_send_task so envelopes are cached only when sending actually fails, rather than unconditionally in process_old_runs. For non-HTTP transports, process_old_runs still handles caching since there is no http_send_task.

- Add cache_keep/run fields to http_transport_state_t
- Cache in http_send_task when send fails and retry is unavailable
- Simplify can_cache in process_old_runs to check transport capability
- Fix NULL transport dereference in sentry__transport_flush
- Update sentry_options_set_cache_keep/http_retry docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move envelope caching from process_old_runs (synchronous, before send) into http_send_task (on HTTP failure). This ensures envelopes are only cached when the send actually fails, rather than unconditionally. Add transport cleanup_func to submit cache cleanup as a FIFO task on the bg worker, guaranteeing it runs after pending send tasks from process_old_runs. Add explicit flush argument to sentry_example for deterministic test behavior across platforms with varying HTTP timeout characteristics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Its only production use was the can_cache check in process_old_runs, which was removed in the previous commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When bgworker_shutdown_cb times out, the thread is detached and sentry_close frees options while the cleanup task may still be queued. Incref options when submitting and use sentry_options_free as the task cleanup function to ensure safe lifetime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.
Note
Adds an explicit 15s connect timeout to both curl and WinHTTP, matching the value used by Crashpad. This reduces the time the transport worker is blocked on unreachable hosts, previously ~75s on curl (OS TCP SYN retry limit) and 60s on WinHTTP.
Failed envelopes are stored as `<db>/cache/<ts>-<n>-<uuid>.envelope` and retried on startup after a 100ms throttle, and then with exponential backoff (15min, 30min, 1h, 2h, 4h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as `<db>/cache/<uuid>.envelope` instead of being discarded.

```mermaid
flowchart TD
    startup --> sretry{retry?}
    sretry -->|yes| throttle
    throttle -. 100ms .-> send
    send -->|success| discard
    send -->|fail| retry{retry?}
    retry -->|no| cache{cache?}
    cache -->|no| discard
    cache -->|yes| cache-dump[<db>/cache/<br/><uuid>.envelope]
    retry -->|yes| cache-retry[<db>/cache/<br/><ts>-<n>-<uuid>.envelope]
    cache-retry --> backoff
    backoff -. 2ⁿ×15min .-> send
```

Builds upon:
See also:
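The backoff schedule described above (a 15-minute base doubled per attempt over six attempts: 15min, 30min, 1h, 2h, 4h, 8h) can be sketched as follows; constant and function names are illustrative, not the SDK's actual symbols:

```c
#include <stdint.h>

#define RETRY_BASE_MS (15ull * 60ull * 1000ull) /* 15 minutes */
#define RETRY_ATTEMPTS 6

/* Delay before retry attempt n (0-based), or 0 when attempts are
 * exhausted and the envelope should be dumped to the offline cache
 * (if enabled) or discarded. */
static uint64_t retry_backoff_ms(unsigned attempt)
{
    if (attempt >= RETRY_ATTEMPTS) {
        return 0;
    }
    return RETRY_BASE_MS << attempt; /* 2^n x 15min */
}
```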