Skip to content

Fix #6169: elect call offerer using same-namespace session IDs#6375

Draft
alauzon wants to merge 1 commit into
nextcloud:masterfrom
alauzon:fix/6169-call-offerer-namespace
Draft

Fix #6169: elect call offerer using same-namespace session IDs#6375
alauzon wants to merge 1 commit into
nextcloud:masterfrom
alauzon:fix/6169-call-offerer-namespace

Conversation

@alauzon

@alauzon alauzon commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Fix #6169: elect call offerer using same-namespace session IDs

Fixes #6169.

Problem

In High-Performance Backend (external signaling) mode, 2-party calls can deadlock:
neither client creates the WebRTC offer, no media flows.

Root cause: the offerer is elected by a lexicographic comparison of the local and remote
session IDs (remoteSession.compareTo(localSession) < 0). This yields exactly one offerer
only if both peers compare the same ordered pair of strings. But at the
PeerConnectionWrapper construction site, the remote sessionId is a signaling-layer
session ID while the local session passed in (callSession) is the Nextcloud session ID.
The two peers therefore compare unrelated strings and can both compute the same role:

  • both "not offerer" → nobody offers → deadlock, or
  • both "offerer" → offer collision.

The requestoffer fallback only fires in MCU mode, so a direct (no-MCU) 2-party call deadlocks
outright. Because the session IDs are effectively random per call, the failure is intermittent.

Fix

Pass the local signaling session to the wrapper (the same webSocketClient.sessionId that is
already used for currentSessionId a few lines above), so both peers compare IDs from the same
namespace. In internal signaling mode webSocketClient == null, so callSession is used as
before (unchanged behaviour).

// before
callSession,
// after
if (webSocketClient != null) webSocketClient!!.sessionId else callSession,

The election is also extracted into PeerConnectionWrapper.isOfferer() to make it unit-testable.

How it was verified

Measured directly via a debug build logging the two compared session IDs, on a test NC instance
using the same HPB (no MCU):

  • Before the fixlocalSession is a Nextcloud session ID (omX5MFK/…, base64), remoteSession
    is an HPB session ID (73gsILAp…-g_…). A↔A automated calls connected 1 / 3 (deadlock 2/3).
  • After the fixlocalSession is now an HPB session ID, same namespace as remoteSession; both
    peers compare the same pair. A↔A automated calls connected 6 / 6 (0 deadlock).

New unit tests (PeerConnectionWrapperOffererRoleTest) pin the contract (exactly one offerer for a
consistent pair) and the cross-namespace failure mode as a regression guard. They are
mutation-verified (breaking isOfferer turns them red).

Not covered / scope

  • Web and desktop clients use a different offerer-election path and are unaffected.
  • This changes only the offerer election; media/ICE setup is unchanged.

In HPB (external signaling) mode the remote participant sessionId passed to
PeerConnectionWrapper is a signaling-layer session ID, while the local session
used for the offerer election (callSession) is the Nextcloud session ID. The two
peers then compare unrelated strings and can both compute the same role: both
"not offerer" (no offer is created — the call deadlocks) or both "offerer" (offer
collision). The requestoffer fallback only applies in MCU mode, so direct 2-party
calls deadlock outright.

Pass the local signaling session (webSocketClient.sessionId when external
signaling is used, callSession otherwise — matching the existing currentSessionId
resolution) so both peers compare the same namespace.

Also extract the election into PeerConnectionWrapper.isOfferer() and add
PeerConnectionWrapperOffererRoleTest covering the contract and the cross-namespace
failure mode.

Signed-off-by: Alain Lauzon <alauzon@alainlauzon.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alauzon

alauzon commented Jun 21, 2026

Copy link
Copy Markdown
Contributor Author

Since this fixes a call-breaking regression (2-party Android calls deadlock intermittently in HPB mode, no media at all), would it be possible to backport it to the 24.0.2 patch (#111) rather than waiting for 24.1.0?

The change is small and self-contained (just the offerer election at the PeerConnectionWrapper construction site + a unit test), it's mergeable with no conflicts, and it's validated with automated 2-party Android↔Android calls on the same HPB (no MCU): ~1/3 connected before, 6/6 after. Happy to rebase onto the stable branch if that helps.

@mahibi

mahibi commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Thank you for contributing @alauzon i will review this soon!

@alauzon

alauzon commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Please hold this — I'm converting it back to draft.

On real-device testing (two phones on different networks), with this change the call now correctly rings and an offer is created where previously the offerer election deadlocked — so the cross-namespace offerer bug itself is addressed. However, after the callee answers, the call does not establish media (it keeps "connecting"). So there is a separate problem in the post-answer ICE/media negotiation that this change does not resolve, and I don't want to give the impression #6169 is fully fixed by this commit alone.

My earlier "connects 6/6" validation was on emulators sharing the same host network, which isn't representative of two real-world NAT'd devices — that's on me. I'll characterize the remaining media-connection issue (TURN relay / answer handling) on real devices before marking this ready again.

@alauzon alauzon marked this pull request as draft June 22, 2026 10:05
@mahibi

mahibi commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Thanks to let me know and for the further testing @alauzon 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Call deadlocks in HPB mode when local & remote session IDs are compared across namespaces

2 participants