use fabric to read/write _replicator docs#6031
Conversation
f58ad1b to
8e34eee
Compare
| ioq:maybe_set_io_priority({system, DbName}), | ||
| defer_call(fun() -> | ||
| try | ||
| fabric:update_doc(DbName, Doc, [?CTX]) |
There was a problem hiding this comment.
With serialize_worker_startup=false currently we're using at cloudant, wouldn't this make the conflict generation worse?
There was a problem hiding this comment.
hm, true. I've made the fabric calls conditional on 1) a sharded database and 2) sws must be set to true, otherwise we just do the local operation.
| Ret; | ||
| {'DOWN', Ref, process, Pid, {exit_throw, Reason}} -> | ||
| throw(Reason); | ||
| {'DOWN', Ref, process_, Pid, {exit_error, Reason}} -> |
There was a problem hiding this comment.
typo :process_ will deadlock on exits
|
|
||
| save_rep_doc(<<"shards/", _/binary>> = ShardDbName, Doc) -> | ||
| DbName = mem3:dbname(ShardDbName), | ||
| ioq:maybe_set_io_priority({system, DbName}), |
There was a problem hiding this comment.
This doesn't do anything, since we're setting a pdict value on a process
|
|
||
| open_rep_doc(<<"shards/", _/binary>> = ShardDbName, DocId) -> | ||
| DbName = mem3:dbname(ShardDbName), | ||
| ioq:maybe_set_io_priority({system, DbName}), |
There was a problem hiding this comment.
ioq setting here will fail. It's a pdict setting and since defer spawns a process it won't be applied
| couch_db:close(Db) | ||
| end. | ||
|
|
||
| defer_call(Fun) -> |
There was a problem hiding this comment.
We're not really deferring anything? It's isolating call. We should have a fabric util function somewhere that may do this already
| Else | ||
| end. | ||
|
|
||
| save_rep_doc(<<"shards/", _/binary>> = ShardDbName, Doc) -> |
There was a problem hiding this comment.
If this call happens from a multidb changes listener (I pretty sure it is) we'd have a case when the changes replication scanner on a local set of _replicator shards will block on a clustered call. It's a cleaner design but may have performance implications, which is why I think? we moved to a node-local update mechanism from a clustered one in the past. (That's why in my PR I kept the local write as is and spawned an async internal-replicator-like push to other shards, just more eager than the actual internal replicator)
There was a problem hiding this comment.
I just left a few notes after a quick skim. Most are nits and typos. The main things I worry about are behavior with sws=false (may generate more conflicts) and performance bottlenecks from running in the main changes callback for the cluster. Another concern is any new failures from the fabric layer (timeouts, etc) on any of the possible replication docs in a cluster could crash the replicator, but at the same time we wouldn't want to necessarily catch and ignore errors as we might leave docs in an inconsistent state.
| {'DOWN', Ref, process_, Pid, {exit_error, Reason}} -> | ||
| error(Reason); | ||
| {'DOWN', Ref, process, Pid, {exit_exit, Reason}} -> | ||
| exit(Reason) |
There was a problem hiding this comment.
We have to be careful here. Any errors like timeout, cluster internal errors etc might kill the multdb changes feed and kill the replicator app
8e34eee to
88a58e1
Compare
88a58e1 to
28abd18
Compare
|
obsoleted by #6034 |
closes #6029
Overview
serialize_worker_startup=true fixes the "spurious conflict" update case in most circumstances. A prior attempt to address that problem in the replicator now conflicts with that work. The replicator can update replicator docs whenever the state changes. For jobs that crash quickly, stored conflicts could be generated. We changed the replicator to only write to the local copy on the "owner" node, to avoid this. That is fundamentally the same trick that serialize_worker_startup=true does. However these two mechanisms are not coordinated and can easily choose different primary nodes.
This PR returns couch replicator to use fabric:open_doc and fabric:update_doc to align its changes with serialize_worker_startup=true behaviour.
Testing recommendations
covered by existing tests
Related Issues or Pull Requests
#6029
Checklist
rel/overlay/etc/default.inisrc/docsfolder