Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878
Conversation
27a2bb1 to
59d4118
Compare
…ave-handler race `WithWaitForLeave` registered a PDU handler (whose callback feeds `leaveChannel`) but then decided whether to wait by peeking at `room.CurrentState`. Because an incoming PDU is added to the room (`HandleTransactionRequests` calls `room.AddEvent`, updating `CurrentState`) *before* the PDU callback runs, the test goroutine could observe the updated state, take the "already seen" shortcut and return, firing the deferred `removePDUHandler()` in the window before the callback ran. The now-handler-less callback then flagged the (expected) leave as an unexpected PDU, failing the test. Wait on the channel fed by the handler instead of polling `room.CurrentState`, so the handler stays registered until its own callback has run. Every caller passes a joined user and an action that performs the leave, so the leave always arrives fresh via that callback. Make this contract explicit: snapshot membership before the action and fail if the user has already left (rather than silently treating it as "already seen"), and fail on timeout, since neither should happen in correct usage. The action still runs in the already-left case so any cleanup it performs is preserved. The `HandleTransactionRequests` ordering (callback after `room.AddEvent`) is left unchanged, as other tests rely on it (e.g. `federation_redaction_test.go` reads `serverRoom.Timeline` immediately after a callback-driven waiter).
59d4118 to
b744d2d
Compare
| delete(s.eduHandlers, eduHandlerKey) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
For reference, can you paste the full test run flake?
There was a problem hiding this comment.
https://github.com/element-hq/synapse/actions/runs/27136501582/job/80090514774?pr=19832
📦 github.com/matrix-org/complement/tests/msc3902
2026/06/08 12:22:24 config: &{BaseImageURI:localhost/complement-synapse DebugLoggingEnabled:false AlwaysPrintServerLogs:false EnvVarsPropagatePrefix:PASS_ SpawnHSTimeout:30s ContainerCPUCores:0 ContainerMemoryBytes:0 KeepBlueprints:[] HostMounts:[] BaseImageURIs:map[] PackageNamespace:msc3902 CACertificate:0xc000964c08 CAPrivateKey:0xc000804380 BestEffort:false HostnameRunningComplement:host.docker.internal EnableDirtyRuns:true HSPortBindingIP:127.0.0.1 PostTestScript:}
❌ TestPartialStateJoin (1m56.99s)
❌ TestPartialStateJoin/Outgoing_device_list_updates (19.95s)
❌ TestPartialStateJoin/Outgoing_device_list_updates/Device_list_updates_reach_incorrectly_kicked_servers_once_partial_state_join_completes (4.2s)
client.go:860: [CSAPI] POST hs1/_matrix/client/v3/register => 200 OK (15.620036ms)
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/capabilities => 200 OK (3.801544ms)
server.go:210: Creating room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 with version 10
federation_room_join_partial_state_test.go:4499: Registered state_ids handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
federation_room_join_partial_state_test.go:4540: Registered /state handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
2026/06/08 12:23:47 Received send-join of event $kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE
client.go:860: [CSAPI] POST hs1/_matrix/client/v3/join/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 => 200 OK (80.730602ms)
federation_room_join_partial_state_test.go:4393: /join request completed
federation_room_join_partial_state_test.go:4478: Incoming state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U] in room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
2026/06/08 12:23:48 Received send-join of event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0
federation_room_join_partial_state_test.go:2405: Server.MustJoinRoom joined room ID !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (49.114711ms)
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (9.370922ms)
federation_room_join_partial_state_test.go:2415: Alice successfully received event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0 via /sync
client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (4.413737ms)
federation_room_join_partial_state_test.go:2418: @user-35-t26alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2421: @charlie, @derek and @elsie received device list update.
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (6.652265ms)
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (48.310036ms)
federation_room_join_partial_state_test.go:2439: Alice successfully received event $SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc via /sync
client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.036587ms)
federation_room_join_partial_state_test.go:2497: @user-35-t26alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":56,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2500: @charlie and @derek received device list update.
federation_room_join_partial_state_test.go:4485: Replying to /state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U]
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (69.344577ms)
federation_room_join_partial_state_test.go:2513: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":57,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2517: @elsie received missed device list update.
client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.751307ms)
federation_room_join_partial_state_test.go:2520: @user-35-t26alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[56],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[57],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2523: @charlie, @derek and @elsie received device list update.
federation_room_join_partial_state_test.go:4416: Cleaning up after test...
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (3.937128ms)
federation_room_join_partial_state_test.go:4418: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
client.go:860: [CSAPI] POST hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/leave => 200 OK (25.974991ms)
federation_room_join_partial_state_test.go:156: host.docker.internal:45563 saw @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
federation_room_join_partial_state_test.go:152: host.docker.internal:33565 has already seen @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
federation_room_join_partial_state_test.go:77: Received unexpected PDU: {"auth_events":["$613ehdee8-7NYiEo9wLU_E89o8bDbIvOvNdMfB8c_aw","$kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE","$jnQmT8p0NUxGgUChevtHR2Qu027MlckWRQLfTsnaRXs"],"content":{"membership":"leave"},"depth":11,"hashes":{"sha256":"oaWwqeddMuOgqMSQs/IDHmDxHcxwNdjTTVJrhcHk0JM"},"origin_server_ts":1780921429379,"prev_events":["$SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc"],"room_id":"!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563","sender":"@user-35-t26alice:hs1","signatures":{"hs1":{"ed25519:a_zXSM":"y5zkhzV66kRDjAI1A/BUHRHnz7Ww1dki3S3o3nza3tBOnNOBA2E/C7Lf7WdOfiPTR9VLstNNufapfqaDYWhECA"}},"state_key":"@user-35-t26alice:hs1","type":"m.room.member"}
Co-authored-by: Eric Eastwood <erice@element.io>
We need to always run the `leaveAction`, as the rest of the code assumes it will always run. We also need to handle the case the user isn't in the room anymore, as that is a valid situation too.
|
Right, this is now cleaned up a bit. The problem was that a late assumption I made, that the user in question was always joined to the room, isn't always true. The existing checks also try and infer via various mechanisms when we expect the server to receive the leave PDU. This boils down to checking both a) the user is still in the room, and b) the remote server in question is in the room. I found it easier to reason about to check these two conditions explicitly. |
| // room, before performing the `leaveAction` (to avoid races). If they are | ||
| // in the room we need to wait until the server sees the leave, but if they | ||
| // aren't no leave event will be sent out. | ||
| userInRoom := userIsJoinedTo(t, user, room.RoomID) |
There was a problem hiding this comment.
In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{}, client.SyncJoinedTo(user.UserID, room.RoomID))
And since we seem to only use this for the negative case and return, I'm guessing that using the typical flow and failing the test is also fine (related to the points below about programming errors)
There was a problem hiding this comment.
In Complement, the way we typically do this is using
user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))
That only tests whether a new join has happened, not whether the user is joined at all
There was a problem hiding this comment.
That only tests whether a new join has happened, not whether the user is joined at all
I accidentally copied an incremental sync here but it will work just fine to tell you whether you're joined to a room if you use an initial sync.
But overall, I think this review point was more in the context when I didn't understand what we're trying to accomplish with WithWaitForLeave. It's a general clean-up function (that could be run at any stage of the test in the case of failures), not something used to actually wait for someone to leave as part of the test. Probably mostly a case of this being a bad name and awkward usage.
Co-authored-by: Eric Eastwood <erice@element.io>
| // room, before performing the `leaveAction` (to avoid races). If they are | ||
| // in the room we need to wait until the server sees the leave, but if they | ||
| // aren't no leave event will be sent out. | ||
| userInRoom := userIsJoinedTo(t, user, room.RoomID) |
There was a problem hiding this comment.
That only tests whether a new join has happened, not whether the user is joined at all
I accidentally copied an incremental sync here but it will work just fine to tell you whether you're joined to a room if you use an initial sync.
But overall, I think this review point was more in the context when I didn't understand what we're trying to accomplish with WithWaitForLeave. It's a general clean-up function (that could be run at any stage of the test in the case of failures), not something used to actually wait for someone to leave as part of the test. Probably mostly a case of this being a bad name and awkward usage.
| // the timeout. | ||
| t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID) | ||
| return | ||
| } |
There was a problem hiding this comment.
Many of the faster joins test flakes are due to the homeserver under test failing to contact Complement homeservers after they have been torn down. When this happens, subsequent tests can fail if they use a Complement homeserver that happens to have the same hostname:port as one which the homeserver under test has previously marked as offline.
-- #626
Is there anything about a federation request where our engineered homeserver can reject requests meant for some old homeserver. I'd assume keys used in the federation requests would be unique to each engineered Complement homeserver that we create and should fail authentication. Perhaps this is missing? For example, why isn't this already rejecting the stray requests:
complement/federation/handle.go
Lines 472 to 486 in bc2b638
If this kind of thing is possible, it would eliminate the need for all of this cleanup logic while still being able to share the real homeserver deployment
There was a problem hiding this comment.
Possibly, I haven't really looked whether there are architectural changes we could make to obviate the need for this cleanup code.
I think landing this as is makes sense to reduce the number of flakey tests we see though
There was a problem hiding this comment.
urghh, it can make sense to merge as this pattern is already here and we can follow-up with a better strategy ⏩ . I find this existing approach so bad that I'm trying to avoid us taking this any further and trying to reason about it. Crap on top of crap.
There was a problem hiding this comment.
I've got a working solution to this in #880
As explained in the PR, I couldn't quite do what I was thinking here but there is still a good general solution to this problem by avoiding port re-use which lead to the same server names being used.
We can still merge this PR if you'd like but #880 would also solve the problem. I plan to strip all of this logic out. Just waiting on this to merge or not.
| // the timeout. | ||
| t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID) | ||
| return | ||
| } |
There was a problem hiding this comment.
urghh, it can make sense to merge as this pattern is already here and we can follow-up with a better strategy ⏩ . I find this existing approach so bad that I'm trying to avoid us taking this any further and trying to reason about it. Crap on top of crap.
🤖 Generated with Claude Code
Background
TestPartialStateJoin/Outgoing_device_list_updates(intests/msc3902) flakes in downstream CI (seen in Synapse), failing during test cleanup with:The "unexpected" PDU is the joining user's own
leave, sent legitimately as the test tears down — so this is a race in the test helper, not a homeserver bug.Root cause
WithWaitForLeaveregisters a PDU handler whose callback pushes matching leaves ontoleaveChannel, then decides whether to block by peeking atroom.CurrentState:When a transaction arrives,
HandleTransactionRequestscallsroom.AddEvent(which updatesCurrentState) before invoking the PDU callback. So this interleaving is possible:room.AddEvent(leave)→CurrentState(user) == "leave", callback not yet run"leave"→ returns →defer removePDUHandler()deletes the handlert.Errorf("Received unexpected PDU")The helper used two different signals for the same event — the callback channel and
CurrentState— and tore its handler down based on the one that fires first. It's flaky because it only fails when the scheduler switches betweenAddEventand the callback (more likely under CI load).Fix