Skip to content

feat(ocap-kernel): leverage libp2p v3 features in remote communications#914

Closed
sirtimid wants to merge 2 commits intomainfrom
sirtimid/fix-flaky-multi-peer-reconnection-test
Closed

feat(ocap-kernel): leverage libp2p v3 features in remote communications#914
sirtimid wants to merge 2 commits intomainfrom
sirtimid/fix-flaky-multi-peer-reconnection-test

Conversation

@sirtimid
Copy link
Copy Markdown
Contributor

@sirtimid sirtimid commented Apr 2, 2026

Closes #903

Summary

Adopts several libp2p v3 features to improve error handling, observability, and robustness of the remote communications layer:

  • Async protocol handlers with auto-abort — inbound connection handler is now async, so handler rejections automatically abort the stream instead of silently dropping errors
  • Fine-grained stream close events — v3 streams emit close events with { error, local } properties for better diagnostics distinguishing local vs remote and clean vs error closes
  • Stream inactivity timeout — configurable streamInactivityTimeoutMs (default 120s) detects dead streams via bidirectional silence, with a 5s minimum floor to prevent self-DoS
  • peer:disconnect safety net — listens for peer-level disconnect events as a fallback when readChannel cleanup misses a disconnection
  • StreamResetError detection — recognizes remote stream resets without permanently marking the peer as intentionally closed (prevents malicious peers from suppressing reconnection)
  • Graceful closeChannel rewrite — uses channel.stream.close() directly instead of the fragile unwrap() pattern, with abort fallback
  • Graceful shutdownstop() sends FIN to remote peers so they see clean stream ends rather than StreamResetError
  • Channel type gains stream — direct access to stream lifecycle APIs, avoiding unwrap()
  • TooManyOutboundProtocolStreamsError handling — re-throws immediately since trying other addresses won't help
  • writeWithTimeout stream status guard — fast-fails on already-closed streams before attempting a write

Also fixes:

  • Flaky multi-peer reconnection E2E test — root cause was the fast test backoff (10-50ms) exhausting the default connection rate limit (10/min) in <1s. Fixed with maxConnectionAttemptsPerMinute: 500, parallel restarts, and ackTimeoutMs: 5000
  • RemoteHandle.giveUp() method — transport-level give-up now properly cleans up (stops ACK retransmission + rejects pending messages + rejects pending URL redemptions)
  • #handleAckTimeout double-invoke — delegates to giveUp() instead of duplicating rejection logic

Test plan

New unit tests cover: stream status fast-fail (it.each over 4 statuses), peer:disconnect handler (registration, forwarding, relay filtering, shutdown guard), stream close event diagnostics (4 scenarios via it.each), TooManyOutboundProtocolStreamsError rethrow, StreamResetError reconnection behavior, closeChannel graceful/abort/double-fail paths, stop() stream closing, RemoteHandle.giveUp(), and streamInactivityTimeoutMs query string round-trip. New E2E test file (libp2p-v3-features.test.ts) covers stream inactivity timeout recovery end-to-end.

🤖 Generated with Claude Code

sirtimid and others added 2 commits April 2, 2026 15:26
The "handles multiple simultaneous reconnections to different peers"
test was flaky because the fast test backoff (10-50ms) exhausted the
default connection rate limit (10/min) in <1s, blocking reconnection
for ~60s — right at the edge of the redemption timeout.

Fix the test with three changes:
- Add maxConnectionAttemptsPerMinute: 500 to prevent rate limiting
- Restart both peers in parallel (Promise.all) to halve wall-clock time
- Use ackTimeoutMs: 5000 for a 20s redemption window

Also add RemoteHandle.giveUp() method so the transport-level give-up
(via onRemoteGiveUp) properly cleans up: stops ACK retransmission,
rejects pending messages, and rejects pending URL redemptions. The
previous code path only rejected redemptions, missing timer cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- StreamResetError no longer marks peer as intentionally closed —
  prevents a malicious peer from permanently suppressing reconnection
- Add signal.aborted guard to peer:disconnect handler to avoid
  spurious reconnection during shutdown
- Add MIN_STREAM_INACTIVITY_TIMEOUT_MS (5s) floor to prevent self-DoS
  via extremely small timeout values
- Clean up double-invoke in #handleAckTimeout — delegate to giveUp()
- Add unit test for RemoteHandle.giveUp()
- Add test for stop() calling channel.stream.close()
- Add test for closeChannel when abort() throws
- Add streamInactivityTimeoutMs to comms-query-string round-trip test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sirtimid sirtimid requested a review from a team as a code owner April 2, 2026 13:34
@sirtimid sirtimid changed the title fix(ocap-kernel): fix flaky multi-peer reconnection E2E test feat(ocap-kernel): leverage libp2p v3 features in remote communications Apr 2, 2026
@sirtimid sirtimid closed this Apr 2, 2026
@sirtimid sirtimid deleted the sirtimid/fix-flaky-multi-peer-reconnection-test branch April 2, 2026 13:36
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Coverage Report

Status Category Percentage Covered / Total
🔵 Lines 78.35%
⬆️ +0.06%
8519 / 10872
🔵 Statements 78.17%
⬆️ +0.06%
8655 / 11071
🔵 Functions 75.89%
⬆️ +0.04%
1987 / 2618
🔵 Branches 76.16%
⬆️ +0.09%
3662 / 4808
File Coverage
File Stmts Branches Functions Lines Uncovered Lines
Changed Files
packages/kernel-browser-runtime/src/utils/comms-query-string.ts 100%
🟰 ±0%
100%
🟰 ±0%
100%
🟰 ±0%
100%
🟰 ±0%
packages/ocap-kernel/src/remotes/types.ts 100%
🟰 ±0%
100%
🟰 ±0%
100%
🟰 ±0%
100%
🟰 ±0%
packages/ocap-kernel/src/remotes/kernel/RemoteHandle.ts 89%
⬆️ +0.42%
82.88%
🟰 ±0%
87.75%
⬆️ +0.25%
89.27%
⬆️ +0.42%
354, 373-414, 467, 510, 520-522, 563-576, 915, 988, 1039
packages/ocap-kernel/src/remotes/kernel/RemoteManager.ts 97.8%
⬆️ +0.03%
100%
🟰 ±0%
95.65%
🟰 ±0%
97.8%
⬆️ +0.03%
185, 371-373
packages/ocap-kernel/src/remotes/platform/channel-utils.ts 100%
🟰 ±0%
85.71%
⬆️ +5.71%
100%
🟰 ±0%
100%
🟰 ±0%
packages/ocap-kernel/src/remotes/platform/connection-factory.ts 94.56%
⬆️ +0.71%
88.75%
⬆️ +0.60%
97.29%
⬆️ +0.33%
94.5%
⬆️ +0.72%
112-114, 124, 365, 550-551, 556-560, 619, 632
packages/ocap-kernel/src/remotes/platform/constants.ts 100%
🟰 ±0%
100%
🟰 ±0%
100%
🟰 ±0%
100%
🟰 ±0%
packages/ocap-kernel/src/remotes/platform/transport.ts 82.47%
⬆️ +0.58%
81.9%
⬆️ +2.13%
74.28%
⬆️ +0.09%
82.47%
⬆️ +0.58%
134, 137, 158, 202-203, 208-212, 254-263, 322, 356-374, 398, 481, 525-528, 552, 576-581, 584-585, 589-592, 633, 663, 682-684, 693, 706, 786, 815
Generated in workflow #4165 for commit 6ade570 by the Vitest Coverage Report Action

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

// Wait longer than the inactivity timeout (2s + buffer).
// The stream should auto-abort due to inactivityTimeout,
// triggering connection loss handling and reconnection.
await delay(4_000);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E2E test delay shorter than clamped inactivity timeout

Medium Severity

The test sets streamInactivityTimeoutMs: 2_000 and waits delay(4_000), expecting the stream to have auto-aborted. However, registerChannel clamps the value via Math.max(streamInactivityTimeoutMs, MIN_STREAM_INACTIVITY_TIMEOUT_MS), which evaluates to Math.max(2000, 5000) = 5000. Since 4s < 5s, the stream hasn't timed out yet, so the test succeeds on the existing connection without ever triggering inactivity-based reconnection. The test passes but doesn't exercise the intended code path.

Additional Locations (2)
Fix in Cursor Fix in Web

);
this.#rejectAllPending(`not acknowledged after ${MAX_RETRIES} retries`);
this.rejectPendingRedemptions(
`Remote connection lost after ${MAX_RETRIES} failed retries`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double giveUp() call in ACK timeout path

Low Severity

#handleAckTimeout calls this.giveUp(reason) on line 384, then this.#onGiveUp?.(this.#peerId) on line 385 triggers #handleRemoteGiveUp, which calls remote.giveUp(reason) again with a different reason string. The second invocation is a no-op since collections were already cleared, but it contradicts the comment stating "notify RemoteManager which calls giveUp()" — implying RemoteManager is the intended caller. The redundant path may confuse future maintainers.

Additional Locations (1)
Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: leverage libp2p v3 features in remote communications

1 participant