chore(tests): disk_v2 data-loss research scratchbook #25524
This stack of pull requests is managed by Graphite.
check-spelling found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
1ceda6a to b0a9285
340ba97 to 7c97a68
Antithesis research artifacts for disk buffer v2: SUT analysis, 29-property catalog (incl. the silent data-loss cluster), property relationships, evaluation synthesis, per-property evidence files, and the run log. Internal references genericized.
7c97a68 to c03d71f
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c03d71fc64
- **Tuning for bug-finding:** set a small `max_data_file_size` (e.g. 1MB) and a small `max_size` to maximize file-rotation frequency and reach the rotation/partial-write window faster; optionally set `flush_interval` low to widen the durably-written set, or high to widen the loss window; test both.
Use only supported disk-buffer config knobs
This tuning advice points the harness at `max_data_file_size` and `flush_interval`, but the Vector disk-buffer config deserializer only accepts `type`, `max_size`, and `when_full` and returns `unknown_field` for anything else (`lib/vector-buffers/src/config.rs`). If a tester follows this topology and adds those fields to `vector.yaml`, the SUT fails config loading instead of exercising the intended rotation/fsync scenarios; those knobs would need to be changed through an instrumented or direct buffer build rather than through Vector configuration.
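To make the failure mode concrete, here is a minimal std-only Rust sketch of a pre-flight key check a harness could run before emitting `vector.yaml`. It assumes the accepted key set is exactly the three named in the comment above. `validate_disk_buffer_keys` is a hypothetical helper and does not call Vector's real deserializer.

```rust
/// Keys the review comment says the disk-buffer deserializer accepts.
/// Assumption taken from the comment, not read from Vector's source.
const ALLOWED_DISK_BUFFER_KEYS: [&str; 3] = ["type", "max_size", "when_full"];

/// Hypothetical pre-flight check: returns an error naming the first key that
/// Vector would reject with `unknown_field`, or Ok(()) if every key is supported.
fn validate_disk_buffer_keys(keys: &[&str]) -> Result<(), String> {
    for key in keys {
        if !ALLOWED_DISK_BUFFER_KEYS.contains(key) {
            return Err(format!("unknown field `{key}` in disk buffer config"));
        }
    }
    Ok(())
}

fn main() {
    // Supported knobs only: config loading should proceed.
    assert!(validate_disk_buffer_keys(&["type", "max_size", "when_full"]).is_ok());

    // The tuning advice's extra knob would be rejected at load time.
    let err = validate_disk_buffer_keys(&["type", "max_size", "max_data_file_size"]).unwrap_err();
    println!("{err}");
}
```

Failing fast in the harness keeps a misconfigured run from being mistaken for a run that exercised rotation or fsync behavior.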
- `source`: an e2e-ack-capable source the workload can push to. Prefer `datadog_agent` or `http_server` with `acknowledgements: true` (needed for `every-written-event-eventually-delivered` and the durable-survival properties). Keep one source.
- `sink`: an `http` sink with `buffer: { type: disk, max_size: <~256MB+>,
Enable acknowledgements on the sink
This topology puts `acknowledgements: true` on the source but leaves the HTTP sink config without `acknowledgements`. The current config model documents source-level acknowledgements as deprecated in favor of global/sink-level settings, and `runs.md` in this same scratchbook records that source-only acks were observed at acceptance/buffering rather than e2e delivery; if someone follows this setup, the durability and `every-written-event-eventually-delivered` properties can measure source acceptance instead of downstream delivery. Please move or add the ack setting to the sink/global config used by this topology.
**Workload status (2026-05-29 data-loss battery):** the `disk_v2_lossfinder` exerciser (`lib/vector-buffers/examples/disk_v2_lossfinder.rs`, harness `tests/antithesis/config-lossfinder` + `test/v1/diskbuf_loss`) implements a no-silent-loss oracle across a 7-scenario RNG fault menu, giving workload
Commit or relabel the referenced harness
These notes state that `lib/vector-buffers/examples/disk_v2_lossfinder.rs`, `tests/antithesis/config-lossfinder`, and `test/v1/diskbuf_loss` implement the data-loss battery, but I checked the repo tree and none of those paths exist in this commit. A reader trying to reproduce the advertised coverage from this scratchbook has no runnable harness to invoke, so either the files need to be included or this section should clearly mark them as external/uncommitted artifacts with their actual location.
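The lossfinder harness is not part of this commit. The following is therefore only a hedged illustration of what a no-silent-loss oracle checks: every event ID the writer saw acknowledged must eventually show up downstream. `LossOracle` and its methods are hypothetical names, not the lossfinder's API. Duplicates are tolerated on the assumption that delivery is at-least-once.

```rust
use std::collections::BTreeSet;

/// Hypothetical oracle: acked IDs missing downstream after a full drain
/// count as silent loss; repeated deliveries are tolerated (at-least-once).
#[derive(Default)]
struct LossOracle {
    acked: BTreeSet<u64>,
    delivered: BTreeSet<u64>,
}

impl LossOracle {
    /// Record an ID only after the source returned a successful ack.
    fn record_acked(&mut self, id: u64) {
        self.acked.insert(id);
    }

    /// Record an ID observed at the downstream receiver (repeats are fine).
    fn record_delivered(&mut self, id: u64) {
        self.delivered.insert(id);
    }

    /// Acked IDs that never arrived downstream, in ascending order.
    fn silently_lost(&self) -> Vec<u64> {
        self.acked.difference(&self.delivered).copied().collect()
    }
}

fn main() {
    let mut oracle = LossOracle::default();
    for id in 0..5 {
        oracle.record_acked(id);
    }
    // Simulated downstream view after a crash/restart: 3 is missing, 1 repeats.
    for id in [0, 1, 1, 2, 4] {
        oracle.record_delivered(id);
    }
    println!("silently lost: {:?}", oracle.silently_lost());
}
```

The oracle is only meaningful if `record_acked` is driven by an end-to-end acknowledgement. If the ack merely reflects source acceptance, a lost event can look as if it was never acked.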
4. Immediately scrape the `buffer_events_received_total` or `buffer_byte_size` gauge from the `prometheus_exporter`.
Scrape an actual buffer-size gauge
This plan tells the workload to assert that `buffer_events_received_total` is near zero after a drain/restart, but the buffer-size gauges exposed by Vector are `buffer_size_events`/`buffer_size_bytes` (with legacy `buffer_events`/`buffer_byte_size`), while the similarly named receive metric is a monotonic total. A harness following this name will either fail to find the metric or check cumulative received events, causing false failures after any successful write instead of validating the drained buffer state.
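A minimal sketch of the drain check, assuming the workload reads the exporter's Prometheus text-format scrape as a string. `sum_metric` is a hypothetical helper that sums every series of one gauge. Pass the metric name exactly as it appears in the scrape, because the exporter may add a namespace prefix. The parser is deliberately simplified: it assumes label values contain no `}` characters.

```rust
/// Sum every sample of `metric` in a Prometheus text-format scrape.
/// Handles `name value` and `name{labels} value [timestamp]` lines and
/// skips `# HELP` / `# TYPE` comments. Returns None if the metric is absent.
fn sum_metric(scrape: &str, metric: &str) -> Option<f64> {
    let mut total = 0.0;
    let mut found = false;
    for raw in scrape.lines() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The metric name ends at the first '{' or whitespace.
        let name_end = line
            .find(|c: char| c == '{' || c.is_whitespace())
            .unwrap_or(line.len());
        if &line[..name_end] != metric {
            continue;
        }
        // Skip the label set, if any; the value is the next token.
        let mut rest = &line[name_end..];
        if rest.starts_with('{') {
            match rest.find('}') {
                Some(close) => rest = &rest[close + 1..],
                None => continue, // malformed label set
            }
        }
        let value = rest
            .split_whitespace()
            .next()
            .and_then(|token| token.parse::<f64>().ok());
        if let Some(v) = value {
            total += v;
            found = true;
        }
    }
    found.then_some(total)
}

fn main() {
    // Illustrative scrape excerpt; labels and values are made up for the demo.
    let scrape = "\
# TYPE buffer_size_events gauge
buffer_size_events{component_id=\"out\"} 0
# TYPE buffer_size_bytes gauge
buffer_size_bytes{component_id=\"out\"} 0
";
    // After a drain, a gauge near zero indicates an empty buffer; a monotonic
    // *_total counter keeps growing and cannot express this.
    let events = sum_metric(scrape, "buffer_size_events");
    let bytes = sum_metric(scrape, "buffer_size_bytes");
    println!("events={events:?} bytes={bytes:?}");
}
```

Returning `None` for a missing metric lets the harness report "metric not found" separately from "buffer not drained". A typo in the metric name then surfaces as its own distinct failure.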
- Which e2e-ack-capable source is easiest to drive from the workload: `http_server`, `datadog_agent`, or `socket`? (Affects workload protocol.)
Do not list socket as e2e-ack capable
This open question groups `socket` with e2e-ack-capable sources, but `SocketConfig::can_acknowledge()` returns `false` in `src/sources/socket/mod.rs`, unlike the HTTP and Datadog Agent sources. If the harness chooses `socket` for the durability properties, Vector will not propagate end-to-end acknowledgements from that source, so the crash-survival and eventual-delivery oracles can no longer distinguish buffered acceptance from downstream delivery.
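As a hedged illustration of the guard this comment implies, a harness could refuse to run the durability properties unless the chosen source can acknowledge. The `CandidateSource` enum and its capability table are hypothetical. They only restate what the comment reports (socket cannot acknowledge; the HTTP and Datadog Agent sources can) and do not call into Vector.

```rust
/// Candidate sources from the open question (hypothetical harness-side enum).
#[derive(Debug, Clone, Copy)]
enum CandidateSource {
    HttpServer,
    DatadogAgent,
    Socket,
}

impl CandidateSource {
    /// Restates the review comment: socket cannot acknowledge; the others can.
    fn can_acknowledge(self) -> bool {
        !matches!(self, CandidateSource::Socket)
    }
}

/// Hypothetical gate: durability and eventual-delivery oracles need an
/// ack-capable source, otherwise a "success" only proves buffered acceptance.
fn check_source_for_durability(source: CandidateSource) -> Result<(), String> {
    if source.can_acknowledge() {
        Ok(())
    } else {
        Err(format!("{source:?} cannot propagate end-to-end acknowledgements"))
    }
}

fn main() {
    let candidates = [
        CandidateSource::HttpServer,
        CandidateSource::DatadogAgent,
        CandidateSource::Socket,
    ];
    for source in candidates {
        println!("{source:?}: {:?}", check_source_for_durability(source));
    }
}
```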

Summary
Antithesis research scratchbook for disk buffer v2. The artifacts in this PR are created by the `antithesis-research` skill and act as a knowledge base for the other skills in https://github.com/antithesishq/antithesis-skills.
How did you test this PR?
N/A: documentation only (`tests/antithesis/scratchbook/`).
Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
References