
Fix async release before HTLC decode #4542

Open
tnull wants to merge 1 commit into lightningdevkit:main from tnull:2026-04-async-held-htlc-ordering

Conversation

@tnull
Contributor

@tnull tnull commented Apr 6, 2026

Discovered this while working on #4463. Would be interested to hear alternative ideas for how to fix this:

Handle `ReleaseHeldHtlc` messages that arrive before the sender-side LSP
has even queued the held HTLC for onion decoding. Unlike #4106, which
covers releases arriving after the HTLC is in `decode_update_add_htlcs`
but before it reaches `pending_intercepted_htlcs`, this preserves
releases that arrive one step earlier and would otherwise be dropped as
HTLC not found.

Co-Authored-By: HAL 9000

Signed-off-by: Elias Rohrer <dev@tnull.de>
@tnull tnull requested a review from valentinewallace April 6, 2026 15:36
@ldk-reviews-bot

ldk-reviews-bot commented Apr 6, 2026

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull added this to the 0.3 milestone Apr 6, 2026
Comment on lines +7579 to 7604
if self
	.pending_released_async_htlcs
	.lock()
	.unwrap()
	.remove(&intercept_id)
{
	self.release_held_htlc(intercept_id, pending_add);
} else {
	let mut held_htlcs = self.pending_intercepted_htlcs.lock().unwrap();
	match held_htlcs.entry(intercept_id) {
		hash_map::Entry::Vacant(entry) => {
			log_debug!(
				logger,
				"Intercepted held HTLC with id {intercept_id}, holding until the recipient is online"
			);
			entry.insert(pending_add);
		},
		hash_map::Entry::Occupied(_) => {
			debug_assert!(false, "Should never have two HTLCs with the same channel id and htlc id");
			log_error!(logger, "Duplicate intercept id for HTLC");
			fail_htlc_continue_to_next!(
				LocalHTLCFailureReason::TemporaryNodeFailure
			);
		},
	}
}

There is a TOCTOU race between this code and handle_release_held_htlc. The pending_released_async_htlcs lock is acquired and immediately dropped (it is only a temporary in the if condition) before pending_intercepted_htlcs is locked in the else branch. Meanwhile, handle_release_held_htlc acquires the locks in the opposite order: it holds pending_intercepted_htlcs while inserting into pending_released_async_htlcs.

Concrete interleaving:

  1. Process handler (here): pending_released_async_htlcs.remove() → false (drops lock)
  2. Release handler: pending_intercepted_htlcs.lock() → HTLC not found
  3. Release handler: inserts into pending_released_async_htlcs, returns (drops both locks)
  4. Process handler (here): pending_intercepted_htlcs.lock() → inserts HTLC

Result: HTLC stuck in pending_intercepted_htlcs and a stale entry in pending_released_async_htlcs. The HTLC falls back to timeout behavior (same as pre-PR), so this isn't a regression, but the fix is incomplete for this narrow window.

Fixing this is tricky because nesting pending_released → pending_intercepted here would deadlock with the opposite order in handle_release_held_htlc. One option would be to adopt a consistent lock ordering across both paths, or protect both structures under a single lock.
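A minimal sketch of the single-lock option (all names here are illustrative, not LDK's actual fields or types): guarding both collections with one Mutex makes each path's check-then-insert atomic, so the interleaving above cannot occur and there is no lock-ordering question at all.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Hypothetical state holder: both structures live under one lock, so the
// remove-or-insert in each path below is a single critical section.
struct HeldHtlcState {
    pending_released: HashSet<u64>,
    pending_intercepted: HashMap<u64, &'static str>,
}

struct Manager {
    held: Mutex<HeldHtlcState>,
}

impl Manager {
    fn new() -> Self {
        Manager {
            held: Mutex::new(HeldHtlcState {
                pending_released: HashSet::new(),
                pending_intercepted: HashMap::new(),
            }),
        }
    }

    // Process-handler path: if a release already arrived, forward now (returns
    // true); otherwise park the HTLC, all under the same lock.
    fn process_held_htlc(&self, intercept_id: u64, htlc: &'static str) -> bool {
        let mut held = self.held.lock().unwrap();
        if held.pending_released.remove(&intercept_id) {
            true
        } else {
            held.pending_intercepted.insert(intercept_id, htlc);
            false
        }
    }

    // Release-handler path: if the HTLC is parked, take it; otherwise record
    // the early release. Same lock, so neither side can miss the other's write.
    fn release_held_htlc(&self, intercept_id: u64) -> Option<&'static str> {
        let mut held = self.held.lock().unwrap();
        if let Some(htlc) = held.pending_intercepted.remove(&intercept_id) {
            Some(htlc)
        } else {
            held.pending_released.insert(intercept_id);
            None
        }
    }
}
```

The trade-off is coarser locking: both paths now contend on one Mutex, which may matter if the structures are touched on hot paths.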

Comment on lines 17297 to 17316
		None => {
			log_trace!(
				self.logger,
				"Failed to release HTLC with intercept_id {}: HTLC not found",
				intercept_id
			);
			return;
		},
	}
};
let next_hop_scid = match htlc.forward_info.routing {
	PendingHTLCRouting::Forward { ref mut hold_htlc, short_channel_id, .. } => {
		debug_assert!(hold_htlc.is_some());
		*hold_htlc = None;
		short_channel_id
	},
	_ => {
		debug_assert!(false, "HTLC intercepts can only be forwards");
		// Let the HTLC be auto-failed before it expires.
		return;
	},
};

let logger = WithContext::from(
	&self.logger,
	Some(htlc.prev_counterparty_node_id),
	Some(htlc.prev_channel_id),
	Some(htlc.forward_info.payment_hash),
);
log_trace!(logger, "Releasing held htlc with intercept_id {}", intercept_id);

let prev_chan_public = {
	let per_peer_state = self.per_peer_state.read().unwrap();
	let peer_state = per_peer_state
		.get(&htlc.prev_counterparty_node_id)
		.map(|mtx| mtx.lock().unwrap());
	let chan_state = peer_state
		.as_ref()
		.map(|state| state.channel_by_id.get(&htlc.prev_channel_id))
		.flatten();
	if let Some(chan_state) = chan_state {
		chan_state.context().should_announce()
	} else {
		// If the inbound channel has closed since the HTLC was held, we really
		// shouldn't forward it - forwarding it now would result in, at best,
		// having to claim the HTLC on chain. Instead, drop the HTLC and let the
		// counterparty claim their money on chain.
		return;
	}
};

let should_intercept = self
	.do_funded_channel_callback(next_hop_scid, |chan| {
		self.forward_needs_intercept_to_known_chan(prev_chan_public, chan)
	})
	.unwrap_or_else(|| self.forward_needs_intercept_to_unknown_chan(next_hop_scid));

if should_intercept {
	let intercept_id = InterceptId::from_htlc_id_and_chan_id(
		htlc.prev_htlc_id,
		&htlc.prev_channel_id,
		&htlc.prev_counterparty_node_id,
	);
	let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();
	match pending_intercepts.entry(intercept_id) {
		hash_map::Entry::Vacant(entry) => {
			if let Ok(intercept_ev) =
				create_htlc_intercepted_event(intercept_id, &htlc)
			{
				let mut pending_releases =
					self.pending_released_async_htlcs.lock().unwrap();
				if pending_releases.len() < MAX_PENDING_ASYNC_EARLY_RELEASES
					|| pending_releases.contains(&intercept_id)
				{
					self.pending_events.lock().unwrap().push_back((intercept_ev, None));
					entry.insert(htlc);
					pending_releases.insert(intercept_id);
					log_trace!(
						self.logger,
						"Queued release request for HTLC with intercept_id {} until it is fully held",
						intercept_id
					);
				} else {
					debug_assert!(false);
					log_trace!(
						self.logger,
						"Dropping early release request for intercept_id {} because the pending queue is full",
						intercept_id
					);
					// Let the HTLC be auto-failed before it expires.
					return;
				}
			}
		},
		hash_map::Entry::Occupied(_) => {
			log_error!(
				logger,
				"Failed to forward incoming HTLC: detected duplicate intercepted payment",
			);
			debug_assert!(
				false,
				"Should never have two HTLCs with the same channel id and htlc id",
			);
			// Let the HTLC be auto-failed before it expires.
			return;

Entries inserted here are only ever removed by process_pending_update_add_htlcs (line 7583). If the corresponding HTLC never arrives (e.g., sender crashed before committing), or the TOCTOU race described above occurs, the entry stays in the set indefinitely (until node restart, since it is not persisted). Over time, stale entries can accumulate up to MAX_PENDING_ASYNC_EARLY_RELEASES, after which legitimate early releases are silently dropped.

Consider adding a TTL or periodic cleanup sweep (e.g., during the existing held-HTLC timeout check) to evict entries older than some threshold, so a long-running node doesn't gradually lose the ability to handle early releases.
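A sketch of the TTL approach (the names, the `u64` id type, and the 10-minute threshold are illustrative, not LDK's): record an insertion timestamp per early release and evict stale entries from the existing periodic tick.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical TTL after which an early release is assumed to belong to an
// HTLC that will never arrive.
const EARLY_RELEASE_TTL: Duration = Duration::from_secs(60 * 10);

struct EarlyReleases {
    // intercept_id -> when the early release was recorded
    entries: HashMap<u64, Instant>,
}

impl EarlyReleases {
    fn new() -> Self {
        EarlyReleases { entries: HashMap::new() }
    }

    fn insert(&mut self, intercept_id: u64, now: Instant) {
        self.entries.insert(intercept_id, now);
    }

    // Called from the periodic held-HTLC timeout check: drop every entry
    // older than the TTL so the set cannot fill up with stale ids.
    fn sweep(&mut self, now: Instant) {
        self.entries
            .retain(|_, inserted| now.duration_since(*inserted) < EARLY_RELEASE_TTL);
    }
}
```

Since releases that outlive the TTL would have hit the held-HTLC timeout anyway, evicting them should not change observable behavior beyond freeing capacity.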

Comment on lines +6054 to +6060
if should_intercept {
	let intercept_id = InterceptId::from_htlc_id_and_chan_id(
		htlc.prev_htlc_id,
		&htlc.prev_channel_id,
		&htlc.prev_counterparty_node_id,
	);
	let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();

Minor: intercept_id is already passed as a parameter to this method but is recomputed here from the HTLC fields, shadowing the parameter. The values will always match (callers derive it the same way), but using the parameter directly would be simpler and avoid redundant hashing.

Suggested change

-if should_intercept {
-	let intercept_id = InterceptId::from_htlc_id_and_chan_id(
-		htlc.prev_htlc_id,
-		&htlc.prev_channel_id,
-		&htlc.prev_counterparty_node_id,
-	);
-	let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();
+if should_intercept {
+	let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();

@ldk-claude-review-bot
Collaborator

Review Summary

Three issues found, one of which is a concurrency concern:

Inline comments posted

  1. lightning/src/ln/channelmanager.rs:7579-7604 — TOCTOU race between process_pending_update_add_htlcs and handle_release_held_htlc. The two paths acquire pending_released_async_htlcs and pending_intercepted_htlcs in a non-atomic, opposite order, creating a window where both sides miss each other's writes. The HTLC ends up stuck in pending_intercepted_htlcs (same outcome as pre-PR), so this is an incomplete fix rather than a regression. Fixing properly requires consistent lock ordering or a single shared lock.

  2. lightning/src/ln/channelmanager.rs:17297-17316 — No cleanup of stale entries in pending_released_async_htlcs. Entries for HTLCs that never arrive (or that hit the race above) accumulate until restart. Bounded by MAX_PENDING_ASYNC_EARLY_RELEASES (4096), but once full, legitimate early releases are silently dropped. A periodic sweep would prevent long-running nodes from degrading.

  3. lightning/src/ln/channelmanager.rs:6054-6060 — release_held_htlc recomputes intercept_id from HTLC fields despite already receiving it as a parameter. The parameter is shadowed. Harmless but unnecessarily redundant.

Contributor Author

@tnull tnull left a comment


I'll defer addressing Claude's review comments and generally iterating on the code here until this gets a concept ACK. Maybe there's a much better way to address this race/edge case?

@valentinewallace
Contributor

I think we can avoid the new hashset -- check out the top commit on this branch https://github.com/valentinewallace/rust-lightning/tree/2026-04-async-held-htlc-ordering-wip

There's an extra check in process_pending_decode_update_add_htlcs that might be somewhat unintuitive. Claude basically pointed out a race condition:

  1. Thread A receives RAA, updates Channel::pending_inbound_htlc entry to Committed, adds the update_add to monitor_pending_update_add_htlcs with the hold_htlc flag still set
  2. Thread B releases+clears monitor_pending_update_adds, so it exists in-memory in the ChannelManager but isn't yet in decode_update_add_htlcs
  3. Thread A receives release_held_htlc, updates the Committed pending_inbound_htlc, can't find the htlc in decode_update_add_htlcs yet
  4. Thread B finally pushes the htlc into decode_update_add_htlcs with the hold_htlc flag still set

Contributor

@valentinewallace valentinewallace left a comment


feel free to re-request if there's feedback on the suggested approach

@tnull
Contributor Author

tnull commented Apr 7, 2026

I think we can avoid the new hashset -- check out the top commit on this branch https://github.com/valentinewallace/rust-lightning/tree/2026-04-async-held-htlc-ordering-wip

Ah, that indeed looks much cleaner. Do you just want to open a PR for this, maybe?

@TheBlueMatt
Collaborator

Hmm, I kinda wonder if we shouldn't mandate in the spec that the release message not fly until the HTLC is fully locked in to the HTLC holder? There may still be a race in LDK around the HTLC getting marked ready to forward and the BP firing to handle it, but I have to imagine we won't be the only ones with this kind of issue?

@valentinewallace
Contributor

Hmm, I kinda wonder if we shouldn't mandate in the spec that the release message not fly until the HTLC is fully locked in to the HTLC holder? There may still be a race in LDK around the HTLC getting marked ready to forward and the BP firing to handle it, but I have to imagine we won't be the only ones with this kind of issue?

How can the sender know that the LSP has processed the final revoke_and_ack for the held HTLC, such that it's safe to send the held_htlc_available OM?

@TheBlueMatt
Collaborator

Mmm, right, yea, you can't know they actually processed it. Indeed we'd need to support it in any case but that probably implies we should definitely wait until we get a ways into the state machine cause we don't want to send update_add_htlc + commitment_signed and miss that they haven't received/processed that yet either :)

@valentinewallace
Contributor

Mmm, right, yea, you can't know they actually processed it. Indeed we'd need to support it in any case but that probably implies we should definitely wait until we get a ways into the state machine cause we don't want to send update_add_htlc + commitment_signed and miss that they haven't received/processed that yet either :)

Ok cool, that should be covered by the spec already because as an often-offline sender we don't even have reply paths to use to send held_htlc_available until we receive our counterparty's RAA. Will proceed with the suggested patch.
