Skip to content

CBG-5375 fix FlushChannelCache test data races#8290

Merged
gregns1 merged 3 commits into
mainfrom
CBG-5375
May 27, 2026
Merged

CBG-5375 fix FlushChannelCache test data races#8290
gregns1 merged 3 commits into
mainfrom
CBG-5375

Conversation

@torcolvin
Copy link
Copy Markdown
Collaborator

@torcolvin torcolvin commented May 22, 2026

CBG-5375 fix FlushChannelCache test data races

  • (Refactor only): Moved RestartListener to db/util_testing.go so it could directly take testing.TB. This function is moved verbatim except removing the unnecessary sleep.
  • (Refactor only): Moved FlushChannelCache to Database level, since this does not flush a collection level channel cache.
  • (Guts of change): Stop the DCP feed before flushing the cache to avoid the data race.

The data race here occurs when flushing the cache resets c.initialSequence and processPrincipalDocument hits

if sequence <= c.initialSequence {
which accesses c.initialSequence. This is reproducible with SG_TEST_CACHING_FEED_DELAY=10ms or some other non zero values. When I tried to fix one of the tests, I discovered there are issues with all tests that called FlushChannelCache.

Initially I thought that a WaitForPendingSequences after creating a user would address the problem, but it is more subtle than this.

Given a user "bernard" with access to "ch1", "ch2", "ch3"

If a user is created directly using an authenticator object, using:

	a := testDb.Authenticator(ctx)
	testUser, err := a.NewUser(username, "letmein", channels.BaseSetOf(t, "ch1", "ch2", "ch3"))
	assert.NoError(t, err)
	assert.NoError(t, a.Save(testUser))

the output would look like this:

{
  "name": "user_TestMultichannelChangesQueryBackfillWithLimit",
  "admin_channels": { "ch1": 1, "ch2": 1, "ch3": 1 },
  "all_channels": { "ch3": 1, "!": 1, "ch1": 1, "ch2": 1 },
  "sequence": 0,
  "updated_at": "0001-01-01T00:00:00Z",
  "created_at": "0001-01-01T00:00:00Z",
  "passwordhash_bcrypt": "JDJhJDA0JE04RUc4bkRDeFhCZlgybi44YjQxTmU0WENtemd0by5QaDYwWm5Edm1QbTdYenM2dTYwL1lp",
  "jwt_last_updated": "0001-01-01T00:00:00Z",
  "rolesSince": {},
  "session_uuid": "5d5ba6c2-fc76-48e7-90e9-46bf9a2ef54e"
}

This isn't allocated a sequence correctly, so waiting for a sequence isn't right because it would never come so a WaitForPendingSequence wouldn't happen.

If the instead the user is created using rt.CreateUser("bernard", []string{"ch1", "ch2", "ch3"} then the object will look like this, with sequence=1 and cas=1:

{
  "name": "bernard",
  "admin_channels": { "ch2": 1, "ch3": 1, "ch1": 1 },
  "all_channels": { "!": 1 },
  "sequence": 1,
  "channel_inval_seq": 1,
  "updated_at": "2026-05-22T14:17:55.275095Z",
  "created_at": "2026-05-22T14:17:55.263436Z",
  "passwordhash_bcrypt": "JDJhJDA0JEV6TThjLnd1OXpRUmZrTEl3M09ESi5PRmI3MVBBRElGU000NjU3WE9uLjRtdFlqN3VTcDVh",
  "jwt_last_updated": "0001-01-01T00:00:00Z",
  "rolesSince": {},
  "session_uuid": "1fdbcac5-9682-4280-a3d6-09289258fae1"
}

Naively, you might think that you could WaitForPendingSequence and have the DCP waiter find the principal document. However, this principal document gets mutated on the first login of bernard to have cas=2 but sequence=1, in order to turn channel_inval_seq to all_channels.

{
  "name": "bernard",
  "admin_channels": { "ch1": 1, "ch2": 1, "ch3": 1 },
  "all_channels": { "ch2": 1, "ch3": 1, "ch1": 1, "!": 1 },
  "sequence": 1,
  "updated_at": "2026-05-22T14:17:55.275095Z",
  "created_at": "2026-05-22T14:17:55.263436Z",
  "passwordhash_bcrypt": "JDJhJDA0JEV6TThjLnd1OXpRUmZrTEl3M09ESi5PRmI3MVBBRElGU000NjU3WE9uLjRtdFlqN3VTcDVh",
  "jwt_last_updated": "0001-01-01T00:00:00Z",
  "rolesSince": {},
  "session_uuid": "1fdbcac5-9682-4280-a3d6-09289258fae1"
}

This means that we can't wait for sequence 1, since we already saw this with cas=1, but we can't be sure that cas=2 document will hit. So WaitForPendingSequences doesn't solve this problem. The DCPReceivedCount stat isn't incremented for principal documents

c.db.DbStats.Database().DCPReceivedCount.Add(1)
so this also doesn't work.

Since I don't think it is necessary to change the code for the authenticator, I solve both problems by stopping the DCP feed before flushing the cache, so that we know that second DCP entry isn't active. In these tests, it is expected that all the DCP events were received. This is a test-only change that preserves the functionality of the tests without modifying any other code.

Pre-review checklist

  • Removed debug logging (fmt.Print, log.Print, ...)
  • Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
  • Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Integration Tests

@torcolvin torcolvin self-assigned this May 22, 2026
Copilot AI review requested due to automatic review settings May 22, 2026 14:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Sync Gateway’s test utilities and tests to eliminate data races when flushing the channel/change cache by stopping the caching DCP feed before clearing caches, and centralizes the flush helper at the database level.

Changes:

  • Introduces DatabaseContext.RestartChangeListener / DatabaseContext.FlushChannelCache test helpers in db/util_testing.go, and removes the older RestartListener/collection-level flush helpers.
  • Updates multiple REST/replicator tests to call the new database-level FlushChannelCache(t) helper.
  • Skips TestChangesLoopingWhenLowSequence with a temporary t.Skip("FIXME") marker.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
rest/replicatortest/replicator_test.go Switches cache flush usage to database-level test helper.
rest/changestest/changes_api_test.go Uses new FlushChannelCache(t) helper in many tests; adds a skip for one test.
rest/blip_api_crud_test.go Uses new FlushChannelCache(t) helper.
rest/api_test.go Switches listener restart to RestartChangeListener(t, false).
db/util_testing.go Adds test helper APIs to stop/restart listener and flush caches safely.
db/database.go Removes the old testing-only RestartListener method.
db/database_collection.go Removes the collection-level FlushChannelCache helper.

Comment thread db/util_testing.go
Comment thread rest/changestest/changes_api_test.go
@torcolvin torcolvin requested a review from bbrks May 22, 2026 14:58
@torcolvin torcolvin requested review from gregns1 and removed request for bbrks May 22, 2026 15:23
@torcolvin torcolvin assigned gregns1 and unassigned torcolvin and bbrks May 22, 2026
@torcolvin torcolvin force-pushed the CBG-5375 branch 2 times, most recently from 01d5646 to 6b174bd Compare May 22, 2026 16:56
Copy link
Copy Markdown
Contributor

@gregns1 gregns1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one small change then ready I think

Comment thread rest/api_test.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment I think needs updating now?

@gregns1 gregns1 assigned torcolvin and unassigned gregns1 May 27, 2026
@torcolvin torcolvin requested a review from gregns1 May 27, 2026 14:41
@torcolvin torcolvin assigned gregns1 and unassigned torcolvin May 27, 2026
@gregns1 gregns1 enabled auto-merge (squash) May 27, 2026 14:43
@gregns1 gregns1 merged commit 848bf30 into main May 27, 2026
65 of 68 checks passed
@gregns1 gregns1 deleted the CBG-5375 branch May 27, 2026 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants