Skip to content

use fabric to read/write _replicator docs#6031

Closed
rnewson wants to merge 1 commit into
mainfrom
replicator-quorum-ops
Closed

use fabric to read/write _replicator docs#6031
rnewson wants to merge 1 commit into
mainfrom
replicator-quorum-ops

Conversation

@rnewson

@rnewson rnewson commented Jun 10, 2026

Copy link
Copy Markdown
Member

closes #6029

Overview

serialize_worker_startup=true fixes the "spurious conflict" update case in most circumstances. A prior attempt to address that problem in the replicator now conflicts with that work. The replicator can update replicator docs whenever the state changes. For jobs that crash quickly, stored conflicts could be generated. We changed the replicator to only write to the local copy on the "owner" node, to avoid this. That is fundamentally the same trick that serialize_worker_startup=true does. However these two mechanisms are not coordinated and can easily choose different primary nodes.

This PR returns couch replicator to use fabric:open_doc and fabric:update_doc to align its changes with serialize_worker_startup=true behaviour.

Testing recommendations

covered by existing tests

Related Issues or Pull Requests

#6029

Checklist

  • This is my own work, I did not use AI, LLM's or similar technology
  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

@rnewson rnewson force-pushed the replicator-quorum-ops branch from f58ad1b to 8e34eee Compare June 10, 2026 09:10
ioq:maybe_set_io_priority({system, DbName}),
defer_call(fun() ->
try
fabric:update_doc(DbName, Doc, [?CTX])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With serialize_worker_startup=false currently we're using at cloudant, wouldn't this make the conflict generation worse?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, true. I've made the fabric calls conditional on 1) a sharded database and 2) sws must be set to true, otherwise we just do the local operation.

Ret;
{'DOWN', Ref, process, Pid, {exit_throw, Reason}} ->
throw(Reason);
{'DOWN', Ref, process_, Pid, {exit_error, Reason}} ->

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo :process_ will deadlock on exits

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops. fixed


save_rep_doc(<<"shards/", _/binary>> = ShardDbName, Doc) ->
DbName = mem3:dbname(ShardDbName),
ioq:maybe_set_io_priority({system, DbName}),

@nickva nickva Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't do anything, since we're setting a pdict value on a process

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


open_rep_doc(<<"shards/", _/binary>> = ShardDbName, DocId) ->
DbName = mem3:dbname(ShardDbName),
ioq:maybe_set_io_priority({system, DbName}),

@nickva nickva Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ioq setting here will fail. It's a pdict setting and since defer spawns a process it won't be applied

couch_db:close(Db)
end.

defer_call(Fun) ->

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not really deferring anything? It's isolating call. We should have a fabric util function somewhere that may do this already

Else
end.

save_rep_doc(<<"shards/", _/binary>> = ShardDbName, Doc) ->

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this call happens from a multidb changes listener (I pretty sure it is) we'd have a case when the changes replication scanner on a local set of _replicator shards will block on a clustered call. It's a cleaner design but may have performance implications, which is why I think? we moved to a node-local update mechanism from a clustered one in the past. (That's why in my PR I kept the local write as is and spawned an async internal-replicator-like push to other shards, just more eager than the actual internal replicator)

@nickva nickva left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left a few notes after a quick skim. Most are nits and typos. The main things I worry about are behavior with sws=false (may generate more conflicts) and performance bottlenecks from running in the main changes callback for the cluster. Another concern is any new failures from the fabric layer (timeouts, etc) on any of the possible replication docs in a cluster could crash the replicator, but at the same time we wouldn't want to necessarily catch and ignore errors as we might leave docs in an inconsistent state.

{'DOWN', Ref, process_, Pid, {exit_error, Reason}} ->
error(Reason);
{'DOWN', Ref, process, Pid, {exit_exit, Reason}} ->
exit(Reason)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to be careful here. Any errors like timeout, cluster internal errors etc might kill the multdb changes feed and kill the replicator app

@rnewson rnewson force-pushed the replicator-quorum-ops branch from 8e34eee to 88a58e1 Compare June 10, 2026 19:52
@rnewson rnewson force-pushed the replicator-quorum-ops branch from 88a58e1 to 28abd18 Compare June 11, 2026 18:45
@rnewson

rnewson commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

obsoleted by #6034

@rnewson rnewson closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: _replicator docs 409s with serialize_worker_startup=true

2 participants