use fabric to read/write _replicator docs by rnewson · Pull Request #6031 · apache/couchdb

rnewson · 2026-06-10T09:09:51Z

Overview

serialize_worker_startup=true fixes the "spurious conflict" update case in most circumstances. A prior attempt to address that problem in the replicator now conflicts with that work. The replicator can update replicator docs whenever the state changes. For jobs that crash quickly, stored conflicts could be generated. We changed the replicator to only write to the local copy on the "owner" node, to avoid this. That is fundamentally the same trick that serialize_worker_startup=true does. However these two mechanisms are not coordinated and can easily choose different primary nodes.

This PR returns couch replicator to use fabric:open_doc and fabric:update_doc to align its changes with serialize_worker_startup=true behaviour.

Testing recommendations

covered by existing tests

Related Issues or Pull Requests

#6029

Checklist

This is my own work, I did not use AI, LLM's or similar technology
Code is written and works correctly
Changes are covered by tests
Any new configurable parameters are documented in rel/overlay/etc/default.ini
Documentation changes were made in the src/docs folder
Documentation changes were backported (separated PR) to affected branches

nickva · 2026-06-10T14:28:36Z

+    ioq:maybe_set_io_priority({system, DbName}),
+    defer_call(fun() ->
+        try
+            fabric:update_doc(DbName, Doc, [?CTX])


With serialize_worker_startup=false currently we're using at cloudant, wouldn't this make the conflict generation worse?

hm, true. I've made the fabric calls conditional on 1) a sharded database and 2) sws must be set to true, otherwise we just do the local operation.

nickva · 2026-06-10T14:34:05Z

+            Ret;
+        {'DOWN', Ref, process, Pid, {exit_throw, Reason}} ->
+            throw(Reason);
+        {'DOWN', Ref, process_, Pid, {exit_error, Reason}} ->


typo :process_ will deadlock on exits

oops. fixed

nickva · 2026-06-10T14:47:30Z


+save_rep_doc(<<"shards/", _/binary>> = ShardDbName, Doc) ->
+    DbName = mem3:dbname(ShardDbName),
+    ioq:maybe_set_io_priority({system, DbName}),


This doesn't do anything, since we're setting a pdict value on a process

nickva · 2026-06-10T14:53:20Z


+open_rep_doc(<<"shards/", _/binary>> = ShardDbName, DocId) ->
+    DbName = mem3:dbname(ShardDbName),
+    ioq:maybe_set_io_priority({system, DbName}),


ioq setting here will fail. It's a pdict setting and since defer spawns a process it won't be applied

nickva · 2026-06-10T14:57:46Z

        couch_db:close(Db)
    end.

+defer_call(Fun) ->


We're not really deferring anything? It's isolating call. We should have a fabric util function somewhere that may do this already

nickva · 2026-06-10T15:07:17Z

            Else
    end.

+save_rep_doc(<<"shards/", _/binary>> = ShardDbName, Doc) ->


If this call happens from a multidb changes listener (I pretty sure it is) we'd have a case when the changes replication scanner on a local set of _replicator shards will block on a clustered call. It's a cleaner design but may have performance implications, which is why I think? we moved to a node-local update mechanism from a clustered one in the past. (That's why in my PR I kept the local write as is and spawned an async internal-replicator-like push to other shards, just more eager than the actual internal replicator)

nickva

I just left a few notes after a quick skim. Most are nits and typos. The main things I worry about are behavior with sws=false (may generate more conflicts) and performance bottlenecks from running in the main changes callback for the cluster. Another concern is any new failures from the fabric layer (timeouts, etc) on any of the possible replication docs in a cluster could crash the replicator, but at the same time we wouldn't want to necessarily catch and ignore errors as we might leave docs in an inconsistent state.

nickva · 2026-06-10T15:25:46Z

+        {'DOWN', Ref, process_, Pid, {exit_error, Reason}} ->
+            error(Reason);
+        {'DOWN', Ref, process, Pid, {exit_exit, Reason}} ->
+            exit(Reason)


We have to be careful here. Any errors like timeout, cluster internal errors etc might kill the multdb changes feed and kill the replicator app

closes #6029

rnewson · 2026-06-12T07:22:50Z

obsoleted by #6034

rnewson force-pushed the replicator-quorum-ops branch from f58ad1b to 8e34eee Compare June 10, 2026 09:10

nickva reviewed Jun 10, 2026

View reviewed changes

rnewson force-pushed the replicator-quorum-ops branch from 8e34eee to 88a58e1 Compare June 10, 2026 19:52

use fabric to read/write _replicator docs if sws=true

28abd18

closes #6029

rnewson force-pushed the replicator-quorum-ops branch from 88a58e1 to 28abd18 Compare June 11, 2026 18:45

rnewson closed this Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use fabric to read/write _replicator docs#6031

use fabric to read/write _replicator docs#6031
rnewson wants to merge 1 commit into
mainfrom
replicator-quorum-ops

rnewson commented Jun 10, 2026

Uh oh!

nickva Jun 10, 2026

Uh oh!

rnewson Jun 10, 2026

Uh oh!

nickva Jun 10, 2026

Uh oh!

rnewson Jun 10, 2026

Uh oh!

nickva Jun 10, 2026 •

edited

Loading

Uh oh!

rnewson Jun 10, 2026

Uh oh!

nickva Jun 10, 2026 •

edited

Loading

Uh oh!

nickva Jun 10, 2026

Uh oh!

nickva Jun 10, 2026

Uh oh!

nickva left a comment •

edited

Loading

Uh oh!

nickva Jun 10, 2026

Uh oh!

rnewson commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rnewson commented Jun 10, 2026

Overview

Testing recommendations

Related Issues or Pull Requests

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nickva Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nickva Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nickva left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rnewson commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nickva Jun 10, 2026 •

edited

Loading

nickva Jun 10, 2026 •

edited

Loading

nickva left a comment •

edited

Loading