Fix async release before HTLC decode by tnull · Pull Request #4542 · lightningdevkit/rust-lightning

tnull · 2026-04-06T15:36:35Z

Discovered this while working on #4463. Would be interested to hear alternative ideas how to fix this:

Handle `ReleaseHeldHtlc` messages that arrive before the sender-side LSP
has even queued the held HTLC for onion decoding. Unlike #4106, which
covers releases arriving after the HTLC is in `decode_update_add_htlcs`
but before it reaches `pending_intercepted_htlcs`, this preserves
releases that arrive one step earlier and would otherwise be dropped as
HTLC not found.

Co-Authored-By: HAL 9000

Handle `ReleaseHeldHtlc` messages that arrive before the sender-side LSP has even queued the held HTLC for onion decoding. Unlike lightningdevkit#4106, which covers releases arriving after the HTLC is in `decode_update_add_htlcs` but before it reaches `pending_intercepted_htlcs`, this preserves releases that arrive one step earlier and would otherwise be dropped as HTLC not found. Co-Authored-By: HAL 9000 Signed-off-by: Elias Rohrer <dev@tnull.de>

ldk-reviews-bot · 2026-04-06T15:36:38Z

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

ldk-claude-review-bot · 2026-04-06T15:44:35Z

lightning/src/ln/channelmanager.rs

+							if self
+								.pending_released_async_htlcs
+								.lock()
+								.unwrap()
+								.remove(&intercept_id)
+							{
+								self.release_held_htlc(intercept_id, pending_add);
+							} else {
+								let mut held_htlcs = self.pending_intercepted_htlcs.lock().unwrap();
+								match held_htlcs.entry(intercept_id) {
+									hash_map::Entry::Vacant(entry) => {
+										log_debug!(
+											logger,
+											"Intercepted held HTLC with id {intercept_id}, holding until the recipient is online"
+										);
+										entry.insert(pending_add);
+									},
+									hash_map::Entry::Occupied(_) => {
+										debug_assert!(false, "Should never have two HTLCs with the same channel id and htlc id");
+										log_error!(logger, "Duplicate intercept id for HTLC");
+										fail_htlc_continue_to_next!(
+											LocalHTLCFailureReason::TemporaryNodeFailure
+										);
+									},
+								}
 							}


There is a TOCTOU race between this code and handle_release_held_htlc. The pending_released_async_htlcs lock is acquired and dropped (temporary in the if condition) before pending_intercepted_htlcs is locked in the else branch. Meanwhile, handle_release_held_htlc acquires locks in the opposite order: it holds pending_intercepted_htlcs while inserting into pending_released_async_htlcs.

Concrete interleaving:

Process handler (here): pending_released_async_htlcs.remove() → false (drops lock)

Release handler: pending_intercepted_htlcs.lock() → HTLC not found

Release handler: inserts into pending_released_async_htlcs, returns (drops both locks)

Process handler (here): pending_intercepted_htlcs.lock() → inserts HTLC

Result: HTLC stuck in pending_intercepted_htlcs and a stale entry in pending_released_async_htlcs. The HTLC falls back to timeout behavior (same as pre-PR), so this isn't a regression, but the fix is incomplete for this narrow window.

Fixing this is tricky because nesting pending_released → pending_intercepted here would deadlock with the opposite order in handle_release_held_htlc. One option would be to adopt a consistent lock ordering across both paths, or protect both structures under a single lock.

ldk-claude-review-bot · 2026-04-06T15:44:44Z

lightning/src/ln/channelmanager.rs

 						None => {
-							log_trace!(
-								self.logger,
-								"Failed to release HTLC with intercept_id {}: HTLC not found",
-								intercept_id
-							);
-							return;
-						},
-					}
-				};
-				let next_hop_scid = match htlc.forward_info.routing {
-					PendingHTLCRouting::Forward { ref mut hold_htlc, short_channel_id, .. } => {
-						debug_assert!(hold_htlc.is_some());
-						*hold_htlc = None;
-						short_channel_id
-					},
-					_ => {
-						debug_assert!(false, "HTLC intercepts can only be forwards");
-						// Let the HTLC be auto-failed before it expires.
-						return;
-					},
-				};
-
-				let logger = WithContext::from(
-					&self.logger,
-					Some(htlc.prev_counterparty_node_id),
-					Some(htlc.prev_channel_id),
-					Some(htlc.forward_info.payment_hash),
-				);
-				log_trace!(logger, "Releasing held htlc with intercept_id {}", intercept_id);
-
-				let prev_chan_public = {
-					let per_peer_state = self.per_peer_state.read().unwrap();
-					let peer_state = per_peer_state
-						.get(&htlc.prev_counterparty_node_id)
-						.map(|mtx| mtx.lock().unwrap());
-					let chan_state = peer_state
-						.as_ref()
-						.map(|state| state.channel_by_id.get(&htlc.prev_channel_id))
-						.flatten();
-					if let Some(chan_state) = chan_state {
-						chan_state.context().should_announce()
-					} else {
-						// If the inbound channel has closed since the HTLC was held, we really
-						// shouldn't forward it - forwarding it now would result in, at best,
-						// having to claim the HTLC on chain. Instead, drop the HTLC and let the
-						// counterparty claim their money on chain.
-						return;
-					}
-				};
-
-				let should_intercept = self
-					.do_funded_channel_callback(next_hop_scid, |chan| {
-						self.forward_needs_intercept_to_known_chan(prev_chan_public, chan)
-					})
-					.unwrap_or_else(|| self.forward_needs_intercept_to_unknown_chan(next_hop_scid));
-
-				if should_intercept {
-					let intercept_id = InterceptId::from_htlc_id_and_chan_id(
-						htlc.prev_htlc_id,
-						&htlc.prev_channel_id,
-						&htlc.prev_counterparty_node_id,
-					);
-					let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();
-					match pending_intercepts.entry(intercept_id) {
-						hash_map::Entry::Vacant(entry) => {
-							if let Ok(intercept_ev) =
-								create_htlc_intercepted_event(intercept_id, &htlc)
+							let mut pending_releases =
+								self.pending_released_async_htlcs.lock().unwrap();
+							if pending_releases.len() < MAX_PENDING_ASYNC_EARLY_RELEASES
+								|| pending_releases.contains(&intercept_id)
 							{
-								self.pending_events.lock().unwrap().push_back((intercept_ev, None));
-								entry.insert(htlc);
+								pending_releases.insert(intercept_id);
+								log_trace!(
+									self.logger,
+									"Queued release request for HTLC with intercept_id {} until it is fully held",
+									intercept_id
+								);
 							} else {
-								debug_assert!(false);
-								// Let the HTLC be auto-failed before it expires.
-								return;
+								log_trace!(
+									self.logger,
+									"Dropping early release request for intercept_id {} because the pending queue is full",
+									intercept_id
+								);
 							}
-						},
-						hash_map::Entry::Occupied(_) => {
-							log_error!(
-								logger,
-								"Failed to forward incoming HTLC: detected duplicate intercepted payment",
-							);
-							debug_assert!(
-								false,
-								"Should never have two HTLCs with the same channel id and htlc id",
-							);
-							// Let the HTLC be auto-failed before it expires.
 							return;


Entries inserted here are only ever removed by process_pending_update_add_htlcs (line 7583). If the corresponding HTLC never arrives (e.g., sender crashed before committing), or the TOCTOU race described above occurs, the entry stays in the set indefinitely (until node restart, since it is not persisted). Over time, stale entries can accumulate up to MAX_PENDING_ASYNC_EARLY_RELEASES, after which legitimate early releases are silently dropped.

Consider adding a TTL or periodic cleanup sweep (e.g., during the existing held-HTLC timeout check) to evict entries older than some threshold, so a long-running node doesn't gradually lose the ability to handle early releases.

ldk-claude-review-bot · 2026-04-06T15:44:51Z

lightning/src/ln/channelmanager.rs

+		if should_intercept {
+			let intercept_id = InterceptId::from_htlc_id_and_chan_id(
+				htlc.prev_htlc_id,
+				&htlc.prev_channel_id,
+				&htlc.prev_counterparty_node_id,
+			);
+			let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();


Minor: intercept_id is already passed as a parameter to this method but is recomputed here from the HTLC fields, shadowing the parameter. The values will always match (callers derive it the same way), but using the parameter directly would be simpler and avoid redundant hashing.

Suggested change

if should_intercept {

let intercept_id = InterceptId::from_htlc_id_and_chan_id(

htlc.prev_htlc_id,

&htlc.prev_channel_id,

&htlc.prev_counterparty_node_id,

);

let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();

if should_intercept {

let mut pending_intercepts = self.pending_intercepted_htlcs.lock().unwrap();

ldk-claude-review-bot · 2026-04-06T15:45:16Z

Review Summary

Three issues found, one of which is a concurrency concern:

Inline comments posted

lightning/src/ln/channelmanager.rs:7579-7604 — TOCTOU race between process_pending_update_add_htlcs and handle_release_held_htlc. The two paths acquire pending_released_async_htlcs and pending_intercepted_htlcs in a non-atomic, opposite order, creating a window where both sides miss each other's writes. The HTLC ends up stuck in pending_intercepted_htlcs (same outcome as pre-PR), so this is an incomplete fix rather than a regression. Fixing properly requires consistent lock ordering or a single shared lock.
lightning/src/ln/channelmanager.rs:17297-17316 — No cleanup of stale entries in pending_released_async_htlcs. Entries for HTLCs that never arrive (or that hit the race above) accumulate until restart. Bounded by MAX_PENDING_ASYNC_EARLY_RELEASES (4096), but once full, legitimate early releases are silently dropped. A periodic sweep would prevent long-running nodes from degrading.
lightning/src/ln/channelmanager.rs:6054-6060 — release_held_htlc recomputes intercept_id from HTLC fields despite already receiving it as a parameter. The parameter is shadowed. Harmless but unnecessarily redundant.

tnull

I'll defer addressing Claude's review comments and generally iterating on the code here until this got a concept ACK. Maybe there is much better way to address this race/edge case?

valentinewallace · 2026-04-07T17:39:49Z

I think we can avoid the new hashset -- check out the top commit on this branch https://github.com/valentinewallace/rust-lightning/tree/2026-04-async-held-htlc-ordering-wip

There's an extra check in process_pending_decode_update_add_htlcs that might be somewhat unintuitive. Claude basically pointed out a race condition:

Thread A receives RAA, updates Channel::pending_inbound_htlc entry to Committed, adds the update_add to monitor_pending_update_add_htlcs with the hold_htlc flag still set
Thread B releases+clears monitor_pending_update_adds, so it exists in-memory in the ChannelManager but isn't yet in decode_update_add_htlcs
Thread A receives release_held_htlc, updates the Committed pending_inbound_htlc, can't find the htlc in decode_update_add_htlcs yet
Thread B finally pushes the htlc into decode_update_add_htlcs with the hold_htlc flag still set

valentinewallace

feel free to re-request if there's feedback on the suggested approach

tnull · 2026-04-07T17:55:19Z

I think we can avoid the new hashset -- check out the top commit on this branch https://github.com/valentinewallace/rust-lightning/tree/2026-04-async-held-htlc-ordering-wip

Ah, that indeed looks much cleaner. Do you just want to open a PR for this, maybe?

TheBlueMatt · 2026-04-07T20:02:27Z

Hmm, I kinda wonder if we shouldn't mandate in the spec that the release message not fly until the HTLC is fully locked in to the HTLC holder? There may still be a race in LDK around the HTLC getting marked ready to forward and the BP firing to handle it, but I have to imagine we won't be the only ones with this kind of issue?

valentinewallace · 2026-04-07T20:31:20Z

Hmm, I kinda wonder if we shouldn't mandate in the spec that the release message not fly until the HTLC is fully locked in to the HTLC holder? There may still be a race in LDK around the HTLC getting marked ready to forward and the BP firing to handle it, but I have to imagine we won't be the only ones with this kind of issue?

How can the sender know that the LSP has processed the final revoke_and_ack for the held HTLC, such that it's safe to send the held_htlc_available OM?

TheBlueMatt · 2026-04-07T21:36:32Z

Mmm, right, yea, you can't know they actually processed it. Indeed we'd need to support it in any case but that probably implies we should definitely wait until we get a ways into the state machine cause we don't want to send update_add_htlc + commitment_signed and miss that they havne't received/processed that yet either :)

valentinewallace · 2026-04-07T21:44:18Z

Mmm, right, yea, you can't know they actually processed it. Indeed we'd need to support it in any case but that probably implies we should definitely wait until we get a ways into the state machine cause we don't want to send update_add_htlc + commitment_signed and miss that they havne't received/processed that yet either :)

Ok cool, that should be covered by the spec already because as an often-offline sender we don't even have reply paths to use to send held_htlc_available until we receive our counterparty's RAA. Will proceed with the suggested patch.

tnull requested a review from valentinewallace April 6, 2026 15:36

tnull added this to the 0.3 milestone Apr 6, 2026

ldk-claude-review-bot reviewed Apr 6, 2026

View reviewed changes

tnull mentioned this pull request Apr 6, 2026

Add BOLT12 support to LSPS2 via custom Router implementation #4463

Open

tnull commented Apr 7, 2026

View reviewed changes

valentinewallace reviewed Apr 7, 2026

View reviewed changes

Conversation

tnull commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ldk-reviews-bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ldk-claude-review-bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

ldk-claude-review-bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

ldk-claude-review-bot Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

ldk-claude-review-bot commented Apr 6, 2026

Review Summary

Inline comments posted

Uh oh!

tnull left a comment

Choose a reason for hiding this comment

Uh oh!

valentinewallace commented Apr 7, 2026

Uh oh!

valentinewallace left a comment

Choose a reason for hiding this comment

Uh oh!

tnull commented Apr 7, 2026

Uh oh!

TheBlueMatt commented Apr 7, 2026

Uh oh!

valentinewallace commented Apr 7, 2026

Uh oh!

TheBlueMatt commented Apr 7, 2026

Uh oh!

valentinewallace commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tnull commented Apr 6, 2026 •

edited

Loading

ldk-reviews-bot commented Apr 6, 2026 •

edited

Loading