Fix async release before HTLC decode #4542
Conversation
Handle `ReleaseHeldHtlc` messages that arrive before the sender-side LSP has even queued the held HTLC for onion decoding. Unlike lightningdevkit#4106, which covers releases arriving after the HTLC is in `decode_update_add_htlcs` but before it reaches `pending_intercepted_htlcs`, this preserves releases that arrive one step earlier and would otherwise be dropped as "HTLC not found".

Co-Authored-By: HAL 9000
Signed-off-by: Elias Rohrer <dev@tnull.de>
```rust
if self
	.pending_released_async_htlcs
	.lock()
	.unwrap()
	.remove(&intercept_id)
{
	self.release_held_htlc(intercept_id, pending_add);
} else {
	let mut held_htlcs = self.pending_intercepted_htlcs.lock().unwrap();
	match held_htlcs.entry(intercept_id) {
		hash_map::Entry::Vacant(entry) => {
			log_debug!(
				logger,
				"Intercepted held HTLC with id {intercept_id}, holding until the recipient is online"
			);
			entry.insert(pending_add);
		},
		hash_map::Entry::Occupied(_) => {
			debug_assert!(false, "Should never have two HTLCs with the same channel id and htlc id");
			log_error!(logger, "Duplicate intercept id for HTLC");
			fail_htlc_continue_to_next!(LocalHTLCFailureReason::TemporaryNodeFailure);
		},
	}
}
```
There is a TOCTOU race between this code and handle_release_held_htlc. The pending_released_async_htlcs lock is acquired and dropped (as a temporary in the `if` condition) before pending_intercepted_htlcs is locked in the else branch. Meanwhile, handle_release_held_htlc acquires the locks in the opposite order: it holds pending_intercepted_htlcs while inserting into pending_released_async_htlcs.
Concrete interleaving:
1. Process handler (here): `pending_released_async_htlcs.remove()` → false (drops lock)
2. Release handler: `pending_intercepted_htlcs.lock()` → HTLC not found
3. Release handler: inserts into `pending_released_async_htlcs`, returns (drops both locks)
4. Process handler (here): `pending_intercepted_htlcs.lock()` → inserts HTLC
Result: HTLC stuck in pending_intercepted_htlcs and a stale entry in pending_released_async_htlcs. The HTLC falls back to timeout behavior (same as pre-PR), so this isn't a regression, but the fix is incomplete for this narrow window.
Fixing this is tricky because nesting pending_released → pending_intercepted here would deadlock with the opposite order in handle_release_held_htlc. One option would be to adopt a consistent lock ordering across both paths, or protect both structures under a single lock.
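To illustrate the single-lock option: the sketch below guards both structures behind one Mutex so each handler sees a consistent snapshot and the window between "check the release set" and "insert the held HTLC" disappears. Types here are placeholders (u64 intercept ids, String HTLCs), not LDK's actual InterceptId/HTLC types, and the method names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Hypothetical combined state: one Mutex over both maps removes the TOCTOU
// window, since neither handler can observe the other mid-update.
struct HeldHtlcState {
	pending_intercepted: HashMap<u64, String>, // intercept_id -> held HTLC (placeholder)
	pending_early_releases: HashSet<u64>,
}

struct Manager {
	held: Mutex<HeldHtlcState>,
}

impl Manager {
	// Process-handler path: atomically consume an early release if one was
	// recorded, otherwise queue the HTLC as held. Returns true if the HTLC
	// should be released immediately.
	fn process_held_htlc(&self, intercept_id: u64, htlc: String) -> bool {
		let mut st = self.held.lock().unwrap();
		if st.pending_early_releases.remove(&intercept_id) {
			true
		} else {
			st.pending_intercepted.insert(intercept_id, htlc);
			false
		}
	}

	// Release-handler path: forward the HTLC if it is already held, otherwise
	// record the release so the process handler picks it up later.
	fn release_held_htlc(&self, intercept_id: u64) -> Option<String> {
		let mut st = self.held.lock().unwrap();
		if let Some(htlc) = st.pending_intercepted.remove(&intercept_id) {
			Some(htlc)
		} else {
			st.pending_early_releases.insert(intercept_id);
			None
		}
	}
}
```

With this shape, the problematic interleaving above cannot occur: whichever handler takes the lock first commits its full check-and-insert before the other runs.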
```rust
		None => {
			log_trace!(
				self.logger,
				"Failed to release HTLC with intercept_id {}: HTLC not found",
				intercept_id
			);
			return;
		},
	}
};
let next_hop_scid = match htlc.forward_info.routing {
	PendingHTLCRouting::Forward { ref mut hold_htlc, short_channel_id, .. } => {
		debug_assert!(hold_htlc.is_some());
		*hold_htlc = None;
		short_channel_id
	},
	_ => {
		debug_assert!(false, "HTLC intercepts can only be forwards");
		// Let the HTLC be auto-failed before it expires.
		return;
	},
};

let logger = WithContext::from(
	&self.logger,
	Some(htlc.prev_counterparty_node_id),
	Some(htlc.prev_channel_id),
	Some(htlc.forward_info.payment_hash),
);
log_trace!(logger, "Releasing held htlc with intercept_id {}", intercept_id);

let prev_chan_public = {
	let per_peer_state = self.per_peer_state.read().unwrap();
	let peer_state = per_peer_state
		.get(&htlc.prev_counterparty_node_id)
		.map(|mtx| mtx.lock().unwrap());
	let chan_state = peer_state
		.as_ref()
		.map(|state| state.channel_by_id.get(&htlc.prev_channel_id))
		.flatten();
	if let Some(chan_state) = chan_state {
		chan_state.context().should_announce()
	} else {
		// If the inbound channel has closed since the HTLC was held, we really
		// shouldn't forward it - forwarding it now would result in, at best,
		// having to claim the HTLC on chain. Instead, drop the HTLC and let the
		// counterparty claim their money on chain.
		return;
	}
};

let should_intercept = self
	.do_funded_channel_callback(next_hop_scid, |chan| {
		self.forward_needs_intercept_to_known_chan(prev_chan_public, chan)
	})
	.unwrap_or_else(|| self.forward_needs_intercept_to_unknown_chan(next_hop_scid));

if should_intercept {
	let intercept_id = InterceptId::from_htlc_id_and_chan_id(
		htlc.prev_htlc_id,
		&htlc.prev_channel_id,
		&htlc.prev_counterparty_node_id,
	);
	let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();
	match pending_intercepts.entry(intercept_id) {
		hash_map::Entry::Vacant(entry) => {
			if let Ok(intercept_ev) =
				create_htlc_intercepted_event(intercept_id, &htlc)
			{
				let mut pending_releases =
					self.pending_released_async_htlcs.lock().unwrap();
				if pending_releases.len() < MAX_PENDING_ASYNC_EARLY_RELEASES
					|| pending_releases.contains(&intercept_id)
				{
					self.pending_events.lock().unwrap().push_back((intercept_ev, None));
					entry.insert(htlc);
					pending_releases.insert(intercept_id);
					log_trace!(
						self.logger,
						"Queued release request for HTLC with intercept_id {} until it is fully held",
						intercept_id
					);
				} else {
					debug_assert!(false);
					log_trace!(
						self.logger,
						"Dropping early release request for intercept_id {} because the pending queue is full",
						intercept_id
					);
					// Let the HTLC be auto-failed before it expires.
					return;
				}
			}
		},
		hash_map::Entry::Occupied(_) => {
			log_error!(
				logger,
				"Failed to forward incoming HTLC: detected duplicate intercepted payment",
			);
			debug_assert!(
				false,
				"Should never have two HTLCs with the same channel id and htlc id",
			);
			// Let the HTLC be auto-failed before it expires.
			return;
```
Entries inserted here are only ever removed by process_pending_update_add_htlcs (line 7583). If the corresponding HTLC never arrives (e.g., sender crashed before committing), or the TOCTOU race described above occurs, the entry stays in the set indefinitely (until node restart, since it is not persisted). Over time, stale entries can accumulate up to MAX_PENDING_ASYNC_EARLY_RELEASES, after which legitimate early releases are silently dropped.
Consider adding a TTL or periodic cleanup sweep (e.g., during the existing held-HTLC timeout check) to evict entries older than some threshold, so a long-running node doesn't gradually lose the ability to handle early releases.
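A TTL sweep could look roughly like the sketch below: store an insertion timestamp alongside each entry and evict expired ones during the periodic check. The EarlyReleases type, u64 ids, and method names are hypothetical placeholders, not LDK's actual pending_released_async_htlcs representation.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical TTL-tracked early-release set: each recorded release carries
// the time it was inserted, so a periodic sweep can drop stale entries before
// they pin the MAX_PENDING_ASYNC_EARLY_RELEASES budget forever.
struct EarlyReleases {
	entries: HashMap<u64, Instant>, // intercept_id -> time the release was recorded
	ttl: Duration,
}

impl EarlyReleases {
	fn insert(&mut self, intercept_id: u64, now: Instant) {
		self.entries.insert(intercept_id, now);
	}

	// Evict entries older than `ttl`; intended to run from the existing
	// held-HTLC timeout check rather than on its own timer.
	fn sweep(&mut self, now: Instant) {
		let ttl = self.ttl;
		self.entries.retain(|_, inserted| now.duration_since(*inserted) < ttl);
	}
}
```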
```rust
if should_intercept {
	let intercept_id = InterceptId::from_htlc_id_and_chan_id(
		htlc.prev_htlc_id,
		&htlc.prev_channel_id,
		&htlc.prev_counterparty_node_id,
	);
	let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();
```
Minor: intercept_id is already passed as a parameter to this method but is recomputed here from the HTLC fields, shadowing the parameter. The values will always match (callers derive it the same way), but using the parameter directly would be simpler and avoid redundant hashing.
Suggested change:

```diff
 if should_intercept {
-	let intercept_id = InterceptId::from_htlc_id_and_chan_id(
-		htlc.prev_htlc_id,
-		&htlc.prev_channel_id,
-		&htlc.prev_counterparty_node_id,
-	);
 	let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();
```
Review Summary

Three issues found, one of which is a concurrency concern. Inline comments posted.
tnull left a comment:
I'll defer addressing Claude's review comments and generally iterating on the code here until this gets a concept ACK. Maybe there is a much better way to address this race/edge case?
I think we can avoid the new hashset -- check out the top commit on this branch https://github.com/valentinewallace/rust-lightning/tree/2026-04-async-held-htlc-ordering-wip. There's an extra check in
valentinewallace left a comment:
feel free to re-request if there's feedback on the suggested approach
Ah, that indeed looks much cleaner. Do you just want to open a PR for this, maybe?
Hmm, I kinda wonder if we shouldn't mandate in the spec that the release message not fly until the HTLC is fully locked in to the HTLC holder? There may still be a race in LDK around the HTLC getting marked ready to forward and the BP firing to handle it, but I have to imagine we won't be the only ones with this kind of issue?
How can the sender know that the LSP has processed the final
Mmm, right, yea, you can't know they actually processed it. Indeed we'd need to support it in any case but that probably implies we should definitely wait until we get a ways into the state machine cause we don't want to send
Ok cool, that should be covered by the spec already because as an often-offline sender we don't even have reply paths to use to send
Discovered this while working on #4463. Would be interested to hear alternative ideas on how to fix this: