Skip to content

Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878

Merged
erikjohnston merged 5 commits into
mainfrom
erikj/fix_flakey_fed_device_lists
Jun 16, 2026
Merged

Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878
erikjohnston merged 5 commits into
mainfrom
erikj/fix_flakey_fed_device_lists

Conversation

@erikjohnston

@erikjohnston erikjohnston commented Jun 8, 2026

Copy link
Copy Markdown
Member

🤖 Generated with Claude Code

Background

TestPartialStateJoin/Outgoing_device_list_updates (in tests/msc3902) flakes in downstream CI (seen in Synapse), failing during test cleanup with:

federation_room_join_partial_state_test.go:77: Received unexpected PDU: {…"membership":"leave"…}

The "unexpected" PDU is the joining user's own leave, sent legitimately as the test tears down — so this is a race in the test helper, not a homeserver bug.

Root cause

WithWaitForLeave registers a PDU handler whose callback pushes matching leaves onto leaveChannel, then decides whether to block by peeking at room.CurrentState:

removePDUHandler := s.AddPDUHandler(func(e) bool { /* push leaveChannel; return true */ })
defer removePDUHandler()

leaveAction()

if room.CurrentState("m.room.member", userID) == "leave" {  // "already seen" shortcut
    // return without consuming the channel
} else {
    <-leaveChannel // …or 1s timeout
}

When a transaction arrives, HandleTransactionRequests calls room.AddEvent (which updates CurrentState) before invoking the PDU callback. So this interleaving is possible:

  1. fed goroutine: room.AddEvent(leave)CurrentState(user) == "leave", callback not yet run
  2. test goroutine: shortcut sees "leave" → returns → defer removePDUHandler() deletes the handler
  3. fed goroutine: PDU callback runs → no handler matches → t.Errorf("Received unexpected PDU")

The helper used two different signals for the same event — the callback channel and CurrentState — and tore its handler down based on the one that fires first. It's flaky because it only fails when the scheduler switches between
AddEvent and the callback (more likely under CI load).

Fix

[...] to outline how this PR fixes the problem [...]

We first check whether userInRoom according to their own server. It they aren't participating in the room, their server shouldn't send us any more PDU's over federation which means we won't run into the t.Errorf("Received unexpected PDU") problem. Note: This logic is slightly flawed as some server implementation could have some queued events that it will still send.

We also check whether our engineered server is in the room as we might never have joined in the first place if the test failed early.

This means, there isn't a logical reason for our engineered server to receive an event over federation and cause the t.Errorf("Received unexpected PDU") error. Although it's still technically possible (as outlined above) for this problem to happen. Ideally, it would be better to write some better code here so this isn't possible at all.

-- @MadLittleMods, #878 (comment)

@erikjohnston erikjohnston force-pushed the erikj/fix_flakey_fed_device_lists branch from 27a2bb1 to 59d4118 Compare June 8, 2026 18:40
@erikjohnston erikjohnston marked this pull request as ready for review June 8, 2026 20:11
@erikjohnston erikjohnston requested review from a team as code owners June 8, 2026 20:11
…ave-handler race

`WithWaitForLeave` registered a PDU handler (whose callback feeds
`leaveChannel`) but then decided whether to wait by peeking at
`room.CurrentState`. Because an incoming PDU is added to the room
(`HandleTransactionRequests` calls `room.AddEvent`, updating
`CurrentState`) *before* the PDU callback runs, the test goroutine could
observe the updated state, take the "already seen" shortcut and return,
firing the deferred `removePDUHandler()` in the window before the
callback ran. The now-handler-less callback then flagged the (expected)
leave as an unexpected PDU, failing the test.

Wait on the channel fed by the handler instead of polling
`room.CurrentState`, so the handler stays registered until its own
callback has run.

Every caller passes a joined user and an action that performs the leave, so the
leave always arrives fresh via that callback. Make this contract
explicit: snapshot membership before the action and fail if the user has
already left (rather than silently treating it as "already seen"), and
fail on timeout, since neither should happen in correct usage. The
action still runs in the already-left case so any cleanup it performs is
preserved.

The `HandleTransactionRequests` ordering (callback after `room.AddEvent`) is left
unchanged, as other tests rely on it (e.g.
`federation_redaction_test.go` reads `serverRoom.Timeline` immediately
after a callback-driven waiter).
@erikjohnston erikjohnston force-pushed the erikj/fix_flakey_fed_device_lists branch from 59d4118 to b744d2d Compare June 9, 2026 08:43
Comment thread tests/msc3902/federation_room_join_partial_state_test.go
Comment thread tests/msc3902/federation_room_join_partial_state_test.go Outdated
delete(s.eduHandlers, eduHandlerKey)
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, can you paste the full test run flake?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/element-hq/synapse/actions/runs/27136501582/job/80090514774?pr=19832

📦 github.com/matrix-org/complement/tests/msc3902
2026/06/08 12:22:24 config: &{BaseImageURI:localhost/complement-synapse DebugLoggingEnabled:false AlwaysPrintServerLogs:false EnvVarsPropagatePrefix:PASS_ SpawnHSTimeout:30s ContainerCPUCores:0 ContainerMemoryBytes:0 KeepBlueprints:[] HostMounts:[] BaseImageURIs:map[] PackageNamespace:msc3902 CACertificate:0xc000964c08 CAPrivateKey:0xc000804380 BestEffort:false HostnameRunningComplement:host.docker.internal EnableDirtyRuns:true HSPortBindingIP:127.0.0.1 PostTestScript:}
❌ TestPartialStateJoin (1m56.99s)
❌ TestPartialStateJoin/Outgoing_device_list_updates (19.95s)
❌ TestPartialStateJoin/Outgoing_device_list_updates/Device_list_updates_reach_incorrectly_kicked_servers_once_partial_state_join_completes (4.2s)
      client.go:860: [CSAPI] POST hs1/_matrix/client/v3/register => 200 OK (15.620036ms)
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/capabilities => 200 OK (3.801544ms)
      server.go:210: Creating room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 with version 10
      federation_room_join_partial_state_test.go:4499: Registered state_ids handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
      federation_room_join_partial_state_test.go:4540: Registered /state handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
  2026/06/08 12:23:47 Received send-join of event $kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE
      client.go:860: [CSAPI] POST hs1/_matrix/client/v3/join/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 => 200 OK (80.730602ms)
      federation_room_join_partial_state_test.go:4393: /join request completed
      federation_room_join_partial_state_test.go:4478: Incoming state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U] in room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
  2026/06/08 12:23:48 Received send-join of event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0
      federation_room_join_partial_state_test.go:2405: Server.MustJoinRoom joined room ID !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (49.114711ms)
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (9.370922ms)
      federation_room_join_partial_state_test.go:2415: Alice successfully received event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0 via /sync
      client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (4.413737ms)
      federation_room_join_partial_state_test.go:2418: @user-35-t26alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2421: @charlie, @derek and @elsie received device list update.
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (6.652265ms)
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (48.310036ms)
      federation_room_join_partial_state_test.go:2439: Alice successfully received event $SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc via /sync
      client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.036587ms)
      federation_room_join_partial_state_test.go:2497: @user-35-t26alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":56,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2500: @charlie and @derek received device list update.
      federation_room_join_partial_state_test.go:4485: Replying to /state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U]
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (69.344577ms)
      federation_room_join_partial_state_test.go:2513: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":57,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2517: @elsie received missed device list update.
      client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.751307ms)
      federation_room_join_partial_state_test.go:2520: @user-35-t26alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[56],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[57],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2523: @charlie, @derek and @elsie received device list update.
      federation_room_join_partial_state_test.go:4416: Cleaning up after test...
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (3.937128ms)
      federation_room_join_partial_state_test.go:4418: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
      client.go:860: [CSAPI] POST hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/leave => 200 OK (25.974991ms)
      federation_room_join_partial_state_test.go:156: host.docker.internal:45563 saw @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
      federation_room_join_partial_state_test.go:152: host.docker.internal:33565 has already seen @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
      federation_room_join_partial_state_test.go:77: Received unexpected PDU: {"auth_events":["$613ehdee8-7NYiEo9wLU_E89o8bDbIvOvNdMfB8c_aw","$kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE","$jnQmT8p0NUxGgUChevtHR2Qu027MlckWRQLfTsnaRXs"],"content":{"membership":"leave"},"depth":11,"hashes":{"sha256":"oaWwqeddMuOgqMSQs/IDHmDxHcxwNdjTTVJrhcHk0JM"},"origin_server_ts":1780921429379,"prev_events":["$SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc"],"room_id":"!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563","sender":"@user-35-t26alice:hs1","signatures":{"hs1":{"ed25519:a_zXSM":"y5zkhzV66kRDjAI1A/BUHRHnz7Ww1dki3S3o3nza3tBOnNOBA2E/C7Lf7WdOfiPTR9VLstNNufapfqaDYWhECA"}},"state_key":"@user-35-t26alice:hs1","type":"m.room.member"}

Co-authored-by: Eric Eastwood <erice@element.io>
We need to always run the `leaveAction`, as the rest of the code assumes
it will always run. We also need to handle the case the user isn't in
the room anymore, as that is a valid situation too.
@erikjohnston

Copy link
Copy Markdown
Member Author

Right, this is now cleaned up a bit. The problem was that a late assumption I made, that the user in question was always joined to the room, isn't always true.

The existing checks also try and infer via various mechanisms when we expect the server to receive the leave PDU. This boils down to checking both a) the user is still in the room, and b) the remote server in question is in the room. I found it easier to reason about to check these two conditions explicitly.

@erikjohnston erikjohnston requested a review from a team June 10, 2026 15:42
Comment thread tests/msc3902/federation_room_join_partial_state_test.go
Comment thread tests/msc3902/federation_room_join_partial_state_test.go
// room, before performing the `leaveAction` (to avoid races). If they are
// in the room we need to wait until the server sees the leave, but if they
// aren't no leave event will be sent out.
userInRoom := userIsJoinedTo(t, user, room.RoomID)

@MadLittleMods MadLittleMods Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{}, client.SyncJoinedTo(user.UserID, room.RoomID))

And since we seem to only use this for the negative case and return, I'm guessing that using the typical flow and failing the test is also fine (related to the points below about programming errors)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))

That only tests whether a new join has happened, not whether the user is joined at all

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That only tests whether a new join has happened, not whether the user is joined at all

I accidentally copied an incremental sync here but it will work just fine to tell you whether you're joined to a room if you use an initial sync.

But overall, I think this review point was more in the context when I didn't understand what we're trying to accomplish with WithWaitForLeave. It's a general clean-up function (that could be run at any stage of the test in the case of failures), not something used to actually wait for someone to leave as part of the test. Probably mostly a case of this being a bad name and awkward usage.

Comment thread tests/msc3902/federation_room_join_partial_state_test.go Outdated
Comment thread tests/msc3902/federation_room_join_partial_state_test.go
Co-authored-by: Eric Eastwood <erice@element.io>
@MadLittleMods MadLittleMods removed their request for review June 11, 2026 16:50
// room, before performing the `leaveAction` (to avoid races). If they are
// in the room we need to wait until the server sees the leave, but if they
// aren't no leave event will be sent out.
userInRoom := userIsJoinedTo(t, user, room.RoomID)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That only tests whether a new join has happened, not whether the user is joined at all

I accidentally copied an incremental sync here but it will work just fine to tell you whether you're joined to a room if you use an initial sync.

But overall, I think this review point was more in the context when I didn't understand what we're trying to accomplish with WithWaitForLeave. It's a general clean-up function (that could be run at any stage of the test in the case of failures), not something used to actually wait for someone to leave as part of the test. Probably mostly a case of this being a bad name and awkward usage.

Comment thread tests/msc3902/federation_room_join_partial_state_test.go
Comment thread tests/msc3902/federation_room_join_partial_state_test.go
Comment thread tests/msc3902/federation_room_join_partial_state_test.go
// the timeout.
t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID)
return
}

@MadLittleMods MadLittleMods Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of the faster joins test flakes are due to the homeserver under test failing to contact Complement homeservers after they have been torn down. When this happens, subsequent tests can fail if they use a Complement homeserver that happens to have the same hostname:port as one which the homeserver under test has previously marked as offline.

-- #626

Is there anything about a federation request where our engineered homeserver can reject requests meant for some old homeserver. I'd assume keys used in the federation requests would be unique to each engineered Complement homeserver that we create and should fail authentication. Perhaps this is missing? For example, why isn't this already rejecting the stray requests:

// Check federation signature
fedReq, errResp := fclient.VerifyHTTPRequest(
req, time.Now(), srv.serverName, nil, srv.keyRing,
)
if fedReq == nil {
log.Printf(
"complement: Transaction '%s': HTTP Code %d. Invalid http request: %s",
transactionID, errResp.Code, errResp.JSON,
)
w.WriteHeader(errResp.Code)
b, _ := json.Marshal(errResp.JSON)
w.Write(b)
return
}

If this kind of thing is possible, it would eliminate the need for all of this cleanup logic while still being able to share the real homeserver deployment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, I haven't really looked whether there are architectural changes we could make to obviate the need for this cleanup code.

I think landing this as is makes sense to reduce the number of flakey tests we see though

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urghh, it can make sense to merge as this pattern is already here and we can follow-up with a better strategy ⏩ . I find this existing approach so bad that I'm trying to avoid us taking this any further and trying to reason about it. Crap on top of crap.

@MadLittleMods MadLittleMods Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a working solution to this in #880

As explained in the PR, I couldn't quite do what I was thinking here but there is still a good general solution to this problem by avoiding port re-use which lead to the same server names being used.

We can still merge this PR if you'd like but #880 would also solve the problem. I plan to strip all of this logic out. Just waiting on this to merge or not.

// the timeout.
t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID)
return
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urghh, it can make sense to merge as this pattern is already here and we can follow-up with a better strategy ⏩ . I find this existing approach so bad that I'm trying to avoid us taking this any further and trying to reason about it. Crap on top of crap.

Comment thread tests/msc3902/federation_room_join_partial_state_test.go Outdated
@erikjohnston erikjohnston merged commit edfe298 into main Jun 16, 2026
4 checks passed
@erikjohnston erikjohnston deleted the erikj/fix_flakey_fed_device_lists branch June 16, 2026 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants