
persist: replace file-per-entry with WAL and refactor into generic indexedWAL#3044

Open
wen-coding wants to merge 2 commits into main from wen/use_wal_for_persistence

Conversation

@wen-coding (Contributor) commented Mar 9, 2026

Summary

Replace file-per-block and file-per-commitQC persistence with sei-db/wal, extract common WAL mechanics into a generic indexedWAL[T], and expose per-lane truncate-then-append APIs so the caller (avail/state) controls all parallelism.

New: wal.go — generic indexedWAL[T]

Generic wrapper around sei-db/wal with monotonic index tracking, typed serialization via codec[T], and lifecycle methods: Write, ReadAll, TruncateBefore (with verify callback and bounds validation), TruncateAll, FirstIdx, Count, Close. Enforces INVARIANT: firstIdx <= nextIdx. Opens with AllowEmpty: true (critical for correct empty-log arithmetic). ReadAll includes a post-replay count check to detect silent data loss. Fsync enabled for additional durability beyond the prune anchor.

blocks.go — per-lane WAL persistence

  • One WAL per lane in blocks/<hex_lane_id>/ subdirectories with independent per-lane truncation.
  • PersistBlock creates lane WALs lazily via getOrCreateLane (double-checked locking under utils.RWMutex) and enforces strict contiguous block numbers.
  • MaybePruneAndPersistLane (new): truncate-then-append API for a single lane. Truncates the lane WAL below the anchor's block range, then appends proposals in order, calling afterEach per entry. Does not spawn goroutines — the caller schedules parallelism.
  • truncateLaneWAL (private): derives the per-lane prune cursor from the CommitQC, verifying block numbers before truncating via a defense-in-depth callback. Handles TruncateAll when the anchor advanced past all persisted blocks and re-anchors nextBlockNum.
  • loadAll detects gaps at replay time.
  • close() unexported — only used by tests and constructor error cleanup; uses errors.Join.
  • Removed: public DeleteBefore, DeleteBeforeLane, LaneIDsWithWAL.

commitqcs.go — single CommitQC WAL

  • Single WAL in commitqcs/, linear RoadIndex-to-WAL-index mapping via FirstIdx().
  • PersistCommitQC silently ignores duplicates (idx < next) for idempotent startup; rejects gaps (idx > next).
  • MaybePruneAndPersist (new): truncate-then-append API. Truncates the WAL below the anchor's road index (handling TruncateAll + cursor advance for anchor-past-all), re-persists the anchor QC for crash recovery, then appends new QCs with afterEach callback.
  • deleteBefore unexported — only called internally by MaybePruneAndPersist.
  • Close() is idempotent.
  • loadAllCommitQCs detects gaps at replay time.
  • Removed: public DeleteBefore, CommitQCPersistBatch struct.

state.go — orchestration and parallelism

  • NewState startup: prunes stale WAL entries via MaybePruneAndPersistLane (per committee lane) and MaybePruneAndPersist with anchor-only (nil proposals), replacing the old DeleteBefore calls.
  • runPersist write order:
    1. Prune anchor persisted sequentially — establishes crash-recovery watermark.
    2. scope.Parallel fans out: one task for MaybePruneAndPersist (commit-QCs), one task per committee lane for MaybePruneAndPersistLane (blocks). No early cancellation; first error returned after all tasks finish. Each path invokes markCommitQCsPersisted / markBlockPersisted per entry so voting unblocks ASAP.
  • anchorQC typed as utils.Option[*types.CommitQC] — drives both commit-QC and block WAL truncation.
  • blocksByLane initialized with all static committee lanes (ensuring truncation runs even for lanes with no new blocks).
  • No goroutines inside the persist package; all concurrency managed by state.go.

inner.go

  • TODO added for dynamic committee support (new members need entries in blocks, votes, nextBlockToPersist, persistedBlockStart).
  • nextBlockToPersist doc clarifies per-entry notification frequency.

Design decisions

  • No-op persisters: when stateDir is None, disk I/O is skipped but cursors advance via afterEach callbacks, preventing runPersist from spinning.
  • Stale lane removal deferred: with epoch-based fixed committees, lane cleanup is unnecessary until dynamic committee changes are supported.
  • Go 1.22+ loop variable semantics: go.mod requires Go 1.25+, so per-iteration loop variables are guaranteed — no explicit v := v in closures.

Concurrency design

Both BlockPersister and CommitQCPersister are internally thread-safe. Mutable state is protected with utils.Mutex/utils.RWMutex.

BlockPersister — two-level locking:

  • utils.RWMutex[map[LaneID]*laneWAL] on the lanes map: getOrCreateLane uses double-checked locking; close() takes write Lock.
  • utils.Mutex[*laneWALState] per lane: serializes writes and truncations. PersistBlock releases the map RLock before acquiring the per-lane lock, so writes to different lanes are fully parallel.

The caller (state.go) must not interleave truncation and PersistBlock on the same lane from different goroutines; MaybePruneAndPersistLane provides the safe truncate-then-append API, and scope.Parallel in runPersist schedules one task per lane.

CommitQCPersister — single lock: utils.Mutex[*commitQCState] protecting the WAL handle and cursor. deleteBefore calls persistCommitQC to re-persist the anchor QC without releasing the lock.

Test plan

  • All autobahn tests pass (avail, consensus, persist, data, types)
  • Race detector clean (-race) on persist and avail packages
  • gofmt, go vet, golangci-lint: 0 issues
  • TestState (no-op persist) and TestStateWithPersistence (disk persist with prune cycles)
  • TestStateRestartFromPersisted (full restart from WAL data)
  • Edge cases: anchor-past-all truncation, crash recovery, gap detection, concurrent multi-lane writes, no-op persister cursor advancement

@github-actions

github-actions bot commented Mar 9, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build      Format     Lint       Breaking   Updated (UTC)
✅ passed   ✅ passed   ✅ passed   ✅ passed   Mar 23, 2026, 8:42 PM

@wen-coding changed the title from "persist: refactor block/commitqc WAL into generic indexedWAL" to "persist: replace file-per-entry with WAL and refactor into generic indexedWAL" on Mar 9, 2026
@codecov

codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 71.76080% with 85 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.56%. Comparing base (9dc8d6e) to head (6b14936).

Files with missing lines                                Patch %   Lines
...mint/internal/autobahn/consensus/persist/blocks.go   71.85%    25 Missing and 13 partials ⚠️
...dermint/internal/autobahn/consensus/persist/wal.go   63.93%    12 Missing and 10 partials ⚠️
...t/internal/autobahn/consensus/persist/commitqcs.go   75.32%    12 Missing and 7 partials ⚠️
sei-tendermint/internal/autobahn/avail/state.go         78.57%    4 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3044      +/-   ##
==========================================
- Coverage   58.59%   58.56%   -0.03%     
==========================================
  Files        2096     2092       -4     
  Lines      173372   173013     -359     
==========================================
- Hits       101583   101323     -260     
+ Misses      62746    62668      -78     
+ Partials     9043     9022      -21     
Flag Coverage Δ
sei-chain-pr 77.81% <71.76%> (?)
sei-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines                                Coverage Δ
sei-tendermint/internal/autobahn/avail/inner.go         97.36% <ø> (ø)
sei-tendermint/internal/autobahn/avail/state.go         77.41% <78.57%> (+1.73%) ⬆️
...t/internal/autobahn/consensus/persist/commitqcs.go   75.55% <75.32%> (-3.17%) ⬇️
...dermint/internal/autobahn/consensus/persist/wal.go   63.93% <63.93%> (ø)
...mint/internal/autobahn/consensus/persist/blocks.go   70.27% <71.85%> (-6.88%) ⬇️

... and 5 files with indirect coverage changes


Comment on lines +155 to +168
for lane, first := range laneFirsts {
	lw, ok := bp.lanes[lane]
	if !ok {
		continue // no WAL yet; PersistBlock will create one lazily
	}
	firstBN, ok := lw.firstBlockNum().Get()
	if !ok || first <= firstBN {
		continue
	}
	walIdx := lw.firstIdx + uint64(first-firstBN)
	if err := lw.TruncateBefore(walIdx); err != nil {
		return fmt.Errorf("truncate lane %s WAL before block %d: %w", lane, first, err)
	}
}

Check warning (Code scanning / CodeQL): Iteration over map may be a possible source of non-determinism
@wen-coding requested a review from pompon0 on March 10, 2026 23:52
dbwal.Config{
WriteBufferSize: 0, // synchronous writes
WriteBatchSize: 1, // no batching
FsyncEnabled: true,
@yzang2019 (Contributor) commented Mar 11, 2026

Be aware of the latency impact here: if this write is on the critical path, it will introduce noticeable latency. I'd recommend synchronous writes + no fsync for perf. fsync does provide stronger guarantees, but the chance of all validators losing power at the same time is pretty rare.

Contributor Author

That's reasonable, changed

Contributor Author

@yzang2019 Talked with @pompon0 about this, since lane block QC is only f+1, theoretically one OS crash can screw us. We are changing all lanes and commitqc writes to happen concurrently, so latency is less of a concern. Is it okay if I change the Fsync back to true here?

// Used when all entries are stale (e.g. the prune anchor advanced past
// everything persisted).
//
// TODO: sei-db/wal doesn't expose tidwall/wal's AllowEmpty option, so there's
Contributor

Is the plan to expose that in a separate PR? It should be pretty simple to add though?

Contributor Author

PR 3049 merged and switched to TruncateAll() here.

if err := w.wal.Write(entry); err != nil {
return err
}
if w.firstIdx == 0 {
Contributor

Recommend using Count() == 0 instead of relying on firstIdx == 0 as a sentinel for "WAL is empty" incase the assumption of wal starting index from 0 is not valid in the future if we switch the wal library

Contributor Author

done

return nil, nil
}
entries := make([]T, 0, w.Count())
err := w.wal.Replay(w.firstIdx, w.nextIdx-1, func(_ uint64, entry T) error {
Contributor

There's no validation that len(entries) == w.Count() after a successful replay. If Replay succeeds but returns fewer entries than expected (e.g., the underlying WAL silently truncated its tail on open due to corruption), ReadAll would return a short slice with no error.

Contributor Author

done

return nil
}
for _, lw := range bp.lanes {
if err := lw.Close(); err != nil {
Contributor

If one lane's WAL fails to close, the remaining lanes are never closed. Use errors.Join to accumulate errors and close all lanes

Contributor Author

changed

@wen-coding force-pushed the wen/use_wal_for_persistence branch from c0727e6 to 422aa8f on March 11, 2026 04:29
if first >= lw.nextBlockNum {
	// Anchor advanced past all persisted blocks for this lane.
	if err := lw.TruncateAll(); err != nil {
Contributor

is truncation synchronous? How expensive it is?

Contributor Author

TruncateAll is synchronous: it removes the segment files and then changes some internal pointers, so it's not too expensive. I don't expect this to happen very often, though. In practice, if every validator keeps emitting blocks, there should always be 1 or 2 lane blocks which are generated but not in AppQC yet. Unless they do "generate a block then wait for a while, generate another block then wait for a while", are you imagining that as an attack?

Contributor

I'm interested in a happy path. It would be nice to make sure that removing blocks does not affect the latency of persisting blocks. What I want to avoid is that we do synchronously: remove block -> fsync -> insert block -> fsync (if we have a steady stream of 1 insertion/removal of block per batch) in which case the latency of insertion is 2x larger than it should be (2 fsyncs).

@wen-coding force-pushed the wen/use_wal_for_persistence branch from aeeddbf to 8de2a83 on March 16, 2026 17:04
@wen-coding requested a review from pompon0 on March 17, 2026 17:45
@wen-coding force-pushed the wen/use_wal_for_persistence branch from b6be8e4 to 273ab80 on March 17, 2026 17:47
@wen-coding requested a review from pompon0 on March 19, 2026 16:07
}
}
last := proposals[len(proposals)-1]
s.markBlockPersisted(lane, last.Msg().Block().Header().BlockNumber()+1)
Contributor

the whole point of persisting blocks one by one (although we will need to eventually benchmark that) is so that validator can sign a LaneVote ASAP. Hence markBlockPersisted should be moved inside the for loop. For consistency, same could be done to the CommitQC persisting loop, although there might be no real gains there.

Contributor Author

That's a good point, changed.

for _, qc := range batch.commitQCs {
if err := pers.commitQCs.PersistCommitQC(qc); err != nil {
return fmt.Errorf("persist commitqc %d: %w", qc.Index(), err)
if err := pers.commitQCs.DeleteBefore(anchor.CommitQC.Proposal().Index(), utils.Some(anchor.CommitQC)); err != nil {
@pompon0 (Contributor) commented Mar 20, 2026

nit: I know we are kind of optimizing in the void without benchmarks, but there are still some disk writes here which are unnecessarily sequential: commitQCs.DeleteBefore can be parallel to blocks.DeleteBefore. CommitQCs.persist can be parallel to blocks.DeleteBefore. In fact each DeleteBefore/persist for each WAL is independent and only Anchor write needs to happen sequentially to everything else. Also we might want to persist first, then do DeleteBefore. Also DeleteBefore doesn't require fsync (I'm not sure how WAL behaves there).

@wen-coding (Contributor Author) commented Mar 20, 2026

In blocks, we are mainly doing DeleteBefore before persisting for the "after the anchor, the commitQC must be contiguous check" as a defense in depth. If you think that is unnecessary and guaranteed by caller, I can remove it. But then one screw up and our indices calculation would be wrong. (Of course I can also maintain a map of WAL index to commitQC RoadIndex and discard the late commitQC which tried to fill a hole.)
What do you think?

Contributor Author

To answer your other question, DeleteBefore uses TruncateFront, it always fsyncs the critical new segment file via atomicWrite.

Contributor Author

Reorganized the code so we do these in parallel now:

  1. truncate and add blocks per lane
  2. truncate and add commitqcs

blocks []*types.Signed[*types.LaneProposal]
commitQCs []*types.CommitQC
pruneAnchor utils.Option[*PruneAnchor]
laneFirsts map[types.LaneID]types.BlockNumber
Contributor

nit: laneFirsts are implied by the anchor, they do not have to be computed in the collectPersistBatch function.

Contributor Author

done

type BlockPersister struct {
dir string // full path to the blocks/ subdirectory; empty when noop
noop bool
mu sync.RWMutex
Contributor

nit: please use utils.RWMutex to make it explicit what does the mutex protect.

Contributor

in particular dir is immutable, so there is no need to wrap it with the mutex.

Contributor Author

done

// is derived: nextBlockNum - Count().
// mu serializes all writes and truncations on this lane.
type laneWAL struct {
mu sync.Mutex
Contributor

nit: please use utils.Mutex for explicit mutex protection. In particular, since the whole laneWAL needs to be mutex protected you can write:

type BlockPersister struct {
	dir   Option[string]
	lanes utils.RWMutex[map[LaneID]*utils.Mutex[*laneWAL]]
}

Contributor Author

Hmm, this looks a bit weird, since laneWAL never changes after initialization, we are protecting the members of laneWAL, not the laneWAL value itself.

I changed to using utils.Mutex inside laneWAL.

func (bp *BlockPersister) getOrCreateLane(lane types.LaneID) (*laneWAL, error) {
dir, ok := bp.dir.Get()
if !ok {
return nil, nil // no-op persister (persistence disabled)
Contributor

nit: can we return an error instead, and just make each call check noop mode directly? I would like us to avoid passing nil as a valid result.

Contributor Author

done

type CommitQCPersister struct {
dir string // full path to the commitqcs/ subdirectory; empty when noop
noop bool
mu sync.Mutex
Contributor

nit: ditto; here you probably will need an inner type to wrap it in utils.Mutex.

Contributor Author

done

// is derived directly from the anchor, making the safety invariant
// explicit: we only truncate entries the on-disk anchor covers.
if anchor, ok := batch.pruneAnchor.Get(); ok {
if err := pers.pruneAnchor.Persist(PruneAnchorConv.Encode(anchor)); err != nil {
Contributor

afaiu pruneAnchor will still use the 2-file persister. Is this supposed to stay like this? If so, then we perhaps need to productionize the 2-file persister more - like adding a checksum in the file would be nice.

Contributor Author

I actually did that on the first draft, but felt WAL is not the best fit here. We only need one most recent snapshot at all times, there is no history at all.
If you agree with that conclusion, I could of course productionize the 2-file persister more, but I feel that should be its own PR, because this PR is mostly about WAL replacing local files.

@wen-coding force-pushed the wen/use_wal_for_persistence branch 6 times, most recently from 34fac21 to 734094e, on March 23, 2026 19:39
Replace file-per-block and file-per-commitQC persistence with sei-db/wal,
extract common WAL mechanics into a generic indexedWAL[T], and expose
per-lane truncate-then-append APIs so avail/state controls all parallelism.

- wal.go: generic indexedWAL[T] with codec[T], monotonic index tracking,
  Write/ReadAll/TruncateBefore/TruncateAll/Close. AllowEmpty=true, fsync
  enabled, post-replay count check for silent data loss detection.

- blocks.go: one WAL per lane in blocks/<hex_lane_id>/ subdirectories.
  PersistBlock with lazy lane creation (double-checked locking) and strict
  contiguity. PersistLaneAfterAnchor for truncate-then-append per lane.
  truncateLaneWAL derives prune cursors from CommitQC with defense-in-depth
  verification. Removed public DeleteBefore/LaneIDsWithWAL.

- commitqcs.go: single WAL in commitqcs/, linear RoadIndex-to-WAL-index
  mapping. PersistAfterAnchor for truncate-then-append with internal
  crash-recovery (re-persists anchor QC after TruncateAll). deleteBefore
  unexported. Removed public DeleteBefore/CommitQCPersistBatch.

- state.go: runPersist fans out via scope.Parallel — one task for
  commit-QCs, one per committee lane for blocks. Anchor persisted
  sequentially first. anchorQC typed as utils.Option[*types.CommitQC].
  blocksByLane initialized with all static committee lanes. No goroutines
  inside persist package.

- Both persisters are internally thread-safe (utils.Mutex/RWMutex).
  afterEach callbacks are infallible (func(T), not func(T) error).
  No-op persisters skip disk I/O but advance cursors via afterEach.

Made-with: Cursor
@wen-coding force-pushed the wen/use_wal_for_persistence branch from 734094e to 6024bba on March 23, 2026 19:54
The old comment warned about interleaving "internal truncation and
PersistBlock" as separate caller-facing operations. Since callers now
use MaybePruneAndPersistLane exclusively, restate the constraint as:
don't call it concurrently for the same lane.

Made-with: Cursor