1 change: 1 addition & 0 deletions .gitignore
@@ -16,6 +16,7 @@
/output
/test/output
build/
.cache

# Ignore hidden files
.*
1 change: 1 addition & 0 deletions README.md
@@ -38,6 +38,7 @@ You can use it to:
* [bthread or not](docs/cn/bthread_or_not.md)
* [thread-local](docs/cn/thread_local.md)
* [Execution Queue](docs/cn/execution_queue.md)
* [Active Task (experimental)](docs/cn/bthread_active_task.md)
* Client
* [Basics](docs/en/client.md)
* [Error code](docs/en/error_code.md)
1 change: 1 addition & 0 deletions README_cn.md
@@ -38,6 +38,7 @@
* [bthread or not](docs/cn/bthread_or_not.md)
* [thread-local](docs/cn/thread_local.md)
* [Execution Queue](docs/cn/execution_queue.md)
* [Active Task (experimental)](docs/cn/bthread_active_task.md)
* [bthread tracer](docs/cn/bthread_tracer.md)
* Client
* [Basics](docs/cn/client.md)
386 changes: 386 additions & 0 deletions docs/cn/bthread_active_task.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/cn/io.md
@@ -12,9 +12,9 @@ Linux generally uses non-blocking IO to increase IO concurrency. When IO concurrency is low, no

A "message" is a bounded binary string read from a connection, which may be a request from an upstream client or a response from a downstream server. brpc uses one or more [EventDispatcher](https://github.com/apache/brpc/blob/master/src/brpc/event_dispatcher.h)s (EDISP for short) to wait for events on any fd. Unlike the common "IO thread" design, EDISP is not responsible for reading. The problem with IO threads is that one thread can only read one fd at a time; when several busy fds gather in one IO thread, some reads get delayed. Multi-tenancy, complicated load balancing and [Streaming RPC](streaming_rpc.md) worsen the problem. Under high workloads, an occasional slow read can drag down all fds in the same IO thread, hurting availability significantly.

Because of a [bug](https://web.archive.org/web/20150423184820/https://patchwork.kernel.org/patch/1970231/) in epoll (still present when brpc was developed) and the relatively high overhead of epoll_ctl, EDISP uses edge-triggered mode. On receiving an event, EDISP atomically adds 1 to an atomic variable; only when the value before the addition was 0 does it start a bthread to handle the data on the fd. Behind the scenes, EDISP yields its pthread worker to the new bthread so that the latter gets better cache locality and can read the fd as soon as possible, while the bthread EDISP runs in is stolen to another pthread and keeps running — this is bthread's work-stealing scheduling. To understand exactly how that atomic variable works, read [atomic instructions](atomic_instructions.md) first, then check [Socket::StartInputEvent](https://github.com/apache/brpc/blob/master/src/brpc/socket.cpp). These methods make the contention caused by reading the same fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
Because of a [bug](https://web.archive.org/web/20150423184820/https://patchwork.kernel.org/patch/1970231/) in epoll (still present when brpc was developed) and the relatively high overhead of epoll_ctl, EDISP uses edge-triggered mode. On receiving an event, EDISP atomically adds 1 to an atomic variable, and data handling for the fd is triggered only when the value before the addition was 0. In the default configuration (`usercode_in_coroutine=false` and `EventDispatcherUnsched()` returns `false`), EDISP starts the handler via `bthread_start_urgent` and yields the current worker to the new task, giving it better cache locality so that it can read the fd as soon as possible. If `EventDispatcherUnsched()` is `true`, `bthread_start_background` is used instead and EDISP does not actively yield its scheduling slot. If `usercode_in_coroutine=true`, the handler runs inline in the current coroutine without creating another bthread. To understand exactly how that atomic variable works, read [atomic instructions](atomic_instructions.md) first, then check [Socket::StartInputEvent](https://github.com/apache/brpc/blob/master/src/brpc/socket.cpp). These methods make the contention caused by reading the same fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
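The per-fd counter protocol above can be sketched standalone. The following is an illustrative model using `std::atomic`, not brpc's actual `Socket::StartInputEvent`; the type and member names are made up:

```cpp
#include <atomic>
#include <cassert>

// Illustrative model of the per-fd event counter: each epoll event
// increments the counter, and only the caller that bumps it from 0
// starts a handler. The handler keeps consuming until the counter
// drops back to 0, so events arriving during processing are neither
// lost nor start a second concurrent handler.
struct FakeSocket {
    std::atomic<int> nevent{0};
    int events_consumed = 0;

    // Returns true iff this event must start a new handler bthread.
    // fetch_add is the only synchronization needed: wait-free.
    bool OnEventNeedsHandler() {
        return nevent.fetch_add(1, std::memory_order_acq_rel) == 0;
    }

    // The handler: consume pending events until the counter hits 0.
    void RunHandler() {
        int pending = nevent.load(std::memory_order_acquire);
        do {
            events_consumed += pending;  // "read the fd" for these events
            // Subtract what we handled; if more events arrived meanwhile,
            // the remainder is > 0 and we loop instead of exiting.
            pending = nevent.fetch_sub(pending, std::memory_order_acq_rel)
                      - pending;
        } while (pending > 0);
    }
};
```

Events arriving while the handler runs merely increment the counter; they are coalesced into the running handler's loop rather than spawning another one.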

[InputMessenger](https://github.com/apache/brpc/blob/master/src/brpc/input_messenger.h) cuts messages off the fd and handles them via user callbacks that understand different formats. `Parse` generally cuts a message out of the binary stream and has a relatively stable running time; `Process` parses the message further (e.g. deserializes it into protobuf) and calls the user callback, whose running time varies. If n (n > 1) messages are read from an fd at once, InputMessenger starts n-1 bthreads to handle the first n-1 messages respectively and Processes the last one in place. InputMessenger tries protocols one by one; since a connection usually carries only one message format, it remembers the last choice to avoid retrying every protocol each time.
[InputMessenger](https://github.com/apache/brpc/blob/master/src/brpc/input_messenger.h) cuts messages off the fd and handles them via user callbacks that understand different formats. `Parse` generally cuts a message out of the binary stream and has a relatively stable running time; `Process` parses the message further (e.g. deserializes it into protobuf) and calls the user callback, whose running time varies. In the default configuration (TCP connections or RDMA event-driven mode, with `usercode_in_coroutine=false`), if n (n > 1) messages are read from an fd at once, InputMessenger starts n-1 bthreads to handle the first n-1 messages respectively and Processes the last one in place. In RDMA polling mode (`rdma_use_polling=true`), the last message is also dispatched to a new bthread to avoid blocking the polling bthread (unless `usercode_in_coroutine` or `rdma_disable_bthread` is enabled). InputMessenger tries protocols one by one; since a connection usually carries only one message format, it remembers the last choice to avoid retrying every protocol each time.

As can be seen, messages across fds and within a single fd are both processed concurrently in brpc, which makes brpc very good at reading large messages and able to handle messages from different sources promptly under high workloads, reducing long tails.

26 changes: 24 additions & 2 deletions docs/cn/rdma.md
@@ -47,7 +47,28 @@ RDMA requires that the memory used for sending and receiving data be registered (memory register

RDMA is a hardware-related communication technology with many unique concepts such as device, port, GID, LID and MaxSge. These parameters are read from the NIC at initialization, and a default choice is made (see src/brpc/rdma/rdma_helper.cpp). When the default choice is not what the user expects, it can be overridden via flags.

RDMA supports two modes, event-driven and polling; event-driven is the default, and polling mode can be enabled via rdma_use_polling. In polling mode, the number of pollers (rdma_poller_num) and whether pollers voluntarily yield the CPU (rdma_poller_yield) can also be configured, and a callback can be set to be invoked on every poll, which is useful together with io_uring/spdk. When working with drivers such as spdk — which supports only polling mode and must run single-threaded (run-to-completion: a task must not be migrated to another thread while executing) — rdma_edisp_unsched should be set to true so that the event dispatcher permanently occupies one worker thread and no other task can be scheduled onto it.
RDMA supports two modes, event-driven and polling; event-driven is the default, and polling mode can be enabled via rdma_use_polling. In polling mode, the number of pollers (rdma_poller_num) and whether pollers voluntarily yield the CPU (rdma_poller_yield) can also be configured, and a callback can be set to be invoked on every poll, which is useful together with io_uring/spdk.

`event_dispatcher_edisp_unsched` is a global flag that affects EventDispatcher scheduling in both normal mode (TCP) and RDMA mode. For backward compatibility, `rdma_edisp_unsched` is kept, but it is deprecated and will be removed in a future release.

Historical notes:
1. In an earlier implementation, the RDMA path had an `if` condition bug where branch behavior contradicted the flag's semantics (`unsched=true` took the schedulable branch). This has been fixed to match the flag's semantics.
2. `event_dispatcher_edisp_unsched` replaces `rdma_edisp_unsched`; the old flag is kept only for backward compatibility. Their semantics are identical: `true` means unschedulable, `false` means schedulable.

The effective unsched condition is unified as:
`event_dispatcher_edisp_unsched || rdma_edisp_unsched`

User-provided flags are no longer rewritten at startup; runtime behavior strictly follows the configured values.

Recommended usage:
1. New deployments: set only `event_dispatcher_edisp_unsched`.
2. Existing deployments: treat `rdma_edisp_unsched` as a transitional compatibility flag and migrate to `event_dispatcher_edisp_unsched` over time.
3. Avoid "conflicting values" in scripts; under the unified OR semantics, either flag being `true` makes the EventDispatcher unschedulable.

Behavior examples:
1. Only `-rdma_edisp_unsched=true`: `rdma_edisp_unsched=true`, `event_dispatcher_edisp_unsched=false`; both TCP and RDMA are unschedulable.
2. Only `-event_dispatcher_edisp_unsched=true`: `rdma_edisp_unsched=false`, `event_dispatcher_edisp_unsched=true`; both TCP and RDMA are unschedulable.
3. Both `-rdma_edisp_unsched=true -event_dispatcher_edisp_unsched=true`: both flags are `true`; both TCP and RDMA are unschedulable.

# Parameters

@@ -73,5 +94,6 @@
* rdma_use_polling: Whether to use RDMA polling mode, default is false.
* rdma_poller_num: The number of pollers in polling mode, default is 1.
* rdma_poller_yield: Whether pollers in polling mode voluntarily yield the CPU, default is false.
* rdma_edisp_unsched: Prevents the event dispatcher from being scheduled, default is false.
* event_dispatcher_edisp_unsched: Global switch controlling whether the EventDispatcher is unschedulable (true means unschedulable), default is false.
* rdma_edisp_unsched: Deprecated compatibility flag (planned for removal in a future release). It still participates in the unified unsched condition, default is false.
* rdma_disable_bthread: Disables bthread, default is false.
25 changes: 25 additions & 0 deletions docs/cn/threading_overview.md
@@ -43,3 +43,28 @@
Flow control in asynchronous programming is full of traps even for experts. Any suspending operation, such as sleeping for a while or waiting for something to finish, means the user must explicitly save state and restore it in a callback. Asynchronous code usually has to be written as a state machine. With few suspension points this is a bit cumbersome but still manageable. The real problem is when suspension happens inside a conditional, a loop, or a sub-function: writing such a state machine so that many people can understand and maintain it is nearly impossible, yet this is common in distributed systems, where one node often interacts with several other nodes at the same time. Moreover, when a wakeup can be triggered by multiple events (e.g. the fd has data or a timeout fires), the suspend/resume sequence is prone to race conditions and demands strong multithreaded programming skills. Syntactic sugar (such as lambdas) can make the code less "cumbersome", but cannot reduce the difficulty.

Shared pointers are pervasive in asynchronous programming. This looks convenient, but makes memory ownership elusive: if memory leaks, it is hard to locate what was never released; on a segment fault, it is unclear what was released once too often. User code that relies heavily on reference counting is hard to keep at high quality, and teams can spend a long time chasing memory problems. If reference counts must also be maintained manually, quality is even harder to keep and maintainers will be reluctant to improve the code. The lack of a context also prevents [RAII](http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization) from fully playing its role: sometimes one must lock outside a callback and unlock inside it, which is very error-prone in practice.

## butex wait/wake ordering rules (practical)

When using `butex_wait`/`butex_wake*` directly, always follow these rules:

1. The waker writes the result/state first, then calls `butex_wake*`.
2. The waiter rechecks the predicate after every return from `butex_wait`.

`butex_wait` returning `0` only means "woken from the butex wait queue"; it does not mean "the business condition is satisfied".

A common pattern:

```cpp
// Waker side
state.store(new_value, butil::memory_order_release);
bthread::butex_wake(&state);

// Waiter side
while (state.load(butil::memory_order_acquire) == expected_value) {
    if (bthread::butex_wait(&state, expected_value, NULL) < 0 &&
        errno != EWOULDBLOCK && errno != EINTR) {
        // handle timeout/interruption/stop errors
    }
}
```
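The two rules are not butex-specific; the same discipline applies to any futex-like primitive. Below is a runnable standalone analogue using `std::condition_variable` instead of butex (butex itself requires a bthread runtime to execute; `run_demo` and the variable names are illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Standalone analogue of the two rules above, using std:: primitives.
// Rule 1: the waker publishes the state before waking.
// Rule 2: the waiter rechecks the predicate after every wakeup,
// since being woken alone does not prove the condition holds.
int run_demo() {
    std::mutex mu;
    std::condition_variable cv;
    int state = 0;  // the "business condition": wait until state == 1

    std::thread waker([&] {
        std::lock_guard<std::mutex> lk(mu);
        state = 1;        // rule 1: write the result first...
        cv.notify_one();  // ...then wake
    });

    std::unique_lock<std::mutex> lk(mu);
    while (state != 1) {  // rule 2: loop on the predicate
        cv.wait(lk);
    }
    waker.join();
    return state;
}
```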
4 changes: 2 additions & 2 deletions docs/en/io.md
@@ -12,9 +12,9 @@ Non-blocking IO is usually used for increasing IO concurrency in Linux. When the

A message is bounded binary data read from a connection, which may be a request from upstream clients or a response from downstream servers. brpc uses one or several [EventDispatcher](https://github.com/apache/brpc/blob/master/src/brpc/event_dispatcher.cpp)s (referred to as EDISP) to wait for events from file descriptors. Unlike the common "IO threads", EDISP is not responsible for reading or writing. The problem of IO threads is that one thread can only read one fd at a given time; other reads may be delayed when many fds in one IO thread are busy. Multi-tenancy, complicated load balancing and [Streaming RPC](streaming_rpc.md) worsen the problem. Under high workloads, regular long delays on one fd may slow down all fds in the IO thread, causing more long tails.

Because of a [bug](https://web.archive.org/web/20150423184820/https://patchwork.kernel.org/patch/1970231/) of epoll (at the time of developing brpc) and the overhead of epoll_ctl, edge-triggered mode is used in EDISP. After receiving an event, an atomic variable associated with the fd is atomically incremented. A bthread is started to handle the data from the fd only if the variable was zero before the increment. The pthread worker in which EDISP runs is yielded to the newly created bthread to make it start reading ASAP with better cache locality, while the bthread in which EDISP runs is stolen to another pthread and keeps running; this mechanism is the work stealing used by bthreads. To understand exactly how that atomic variable works, you can read [atomic instructions](atomic_instructions.md) first, then check [Socket::StartInputEvent](https://github.com/apache/brpc/blob/master/src/brpc/socket.cpp). These methods make contentions on dispatching events of one fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
Because of a [bug](https://web.archive.org/web/20150423184820/https://patchwork.kernel.org/patch/1970231/) of epoll (at the time of developing brpc) and the overhead of epoll_ctl, edge-triggered mode is used in EDISP. After receiving an event, an atomic variable associated with the fd is atomically incremented; data handling is triggered only when the value was zero before the increment. In the default path (`usercode_in_coroutine=false` and `EventDispatcherUnsched()` is `false`), EDISP starts processing via `bthread_start_urgent` and yields the current worker to the new task for faster reads and better cache locality. If `EventDispatcherUnsched()` is `true`, it switches to `bthread_start_background`, so EDISP does not actively yield its scheduling slot on this event. If `usercode_in_coroutine=true`, the processing logic runs inline in the current coroutine without creating another bthread. To understand exactly how that atomic variable works, you can read [atomic instructions](atomic_instructions.md) first, then check [Socket::StartInputEvent](https://github.com/apache/brpc/blob/master/src/brpc/socket.cpp). These methods make contentions on dispatching events of one fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).

[InputMessenger](https://github.com/apache/brpc/blob/master/src/brpc/input_messenger.h) cuts messages and uses customizable callbacks to handle different formats of data. The `Parse` callback cuts messages from binary data and has relatively stable running time; `Process` parses messages further (such as parsing by protobuf) and calls users' callbacks, which vary in running time. If n (n > 1) messages are read from the fd, InputMessenger launches n-1 bthreads to handle the first n-1 messages respectively, and processes the last message in place. InputMessenger tries protocols one by one; since one connection often carries only one type of message, InputMessenger remembers the current protocol to avoid retrying all protocols next time.
[InputMessenger](https://github.com/apache/brpc/blob/master/src/brpc/input_messenger.h) cuts messages and uses customizable callbacks to handle different formats of data. The `Parse` callback cuts messages from binary data and has relatively stable running time; `Process` parses messages further (such as parsing by protobuf) and calls users' callbacks, which vary in running time. In the default path (TCP or RDMA event-driven mode, and `usercode_in_coroutine=false`), if n (n > 1) messages are read from the fd, InputMessenger launches n-1 bthreads to handle the first n-1 messages respectively, and processes the last message in place. In RDMA polling mode (`rdma_use_polling=true`), the last message is also dispatched to a new bthread to avoid blocking the polling bthread (unless `usercode_in_coroutine` or `rdma_disable_bthread` is enabled). InputMessenger tries protocols one by one; since one connection often carries only one type of message, InputMessenger remembers the current protocol to avoid retrying all protocols next time.
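The dispatch policy described above can be sketched as a standalone decision function (an illustrative model, not the real InputMessenger code; the `Mode` enum and all names are made up):

```cpp
#include <cassert>

// Illustrative model of where each of n parsed messages runs,
// depending on the configuration described in the text.
enum class Mode { kDefault, kRdmaPolling, kUsercodeInCoroutine };

struct DispatchPlan {
    int in_new_bthreads = 0;  // messages handed to freshly started bthreads
    int in_place = 0;         // messages Process()ed without a new bthread
};

DispatchPlan PlanDispatch(int n_messages, Mode mode) {
    DispatchPlan p;
    if (n_messages <= 0) return p;
    switch (mode) {
    case Mode::kDefault:
        // First n-1 messages each get a bthread; the last runs in place.
        p.in_new_bthreads = n_messages - 1;
        p.in_place = 1;
        break;
    case Mode::kRdmaPolling:
        // Never block the polling bthread: every message is dispatched.
        p.in_new_bthreads = n_messages;
        break;
    case Mode::kUsercodeInCoroutine:
        // Everything runs inline in the current coroutine.
        p.in_place = n_messages;
        break;
    }
    return p;
}
```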

It can be seen that messages from different fds, or even the same fd, are processed concurrently in brpc, which makes brpc good at handling large messages and at reducing long tails when processing messages from different sources under high workloads.

24 changes: 23 additions & 1 deletion docs/en/rdma.md
@@ -47,6 +47,27 @@ The application can manage memory by itself and send data with IOBuf::append_use

RDMA is hardware-related. It has many unique concepts such as device, port, GID, LID, MaxSge and so on. These parameters are read from NICs at initialization, and brpc makes a default choice (see src/brpc/rdma/rdma_helper.cpp). When the default choice is not what is expected, it can be overridden via flags.

`event_dispatcher_edisp_unsched` is a global flag and affects EventDispatcher scheduling in both normal mode (TCP) and RDMA mode. For backward compatibility, `rdma_edisp_unsched` is still kept, but it is deprecated and will be removed in a future release.

Historical notes:
1. In an earlier implementation, the RDMA path had an `if` condition bug where branch behavior did not match flag semantics (`unsched=true` could still enter the schedulable branch). This has been fixed.
2. `event_dispatcher_edisp_unsched` is intended to replace `rdma_edisp_unsched`; the old flag remains only for compatibility. Their semantics are consistent: `true` means unschedulable, `false` means schedulable.

The effective unsched condition is unified as:
`event_dispatcher_edisp_unsched || rdma_edisp_unsched`

Flags provided by the user are no longer rewritten at startup; runtime behavior follows the configured values directly.

Recommended usage:
1. New deployment: set only `event_dispatcher_edisp_unsched`.
2. Existing deployment: keep `rdma_edisp_unsched` temporarily, but migrate to `event_dispatcher_edisp_unsched`.
3. Avoid conflicting values in scripts; with unified OR semantics, either flag being `true` makes EventDispatcher unschedulable.

Examples:
1. Only `-rdma_edisp_unsched=true`: `rdma_edisp_unsched=true`, `event_dispatcher_edisp_unsched=false`; both TCP and RDMA are unschedulable.
2. Only `-event_dispatcher_edisp_unsched=true`: `rdma_edisp_unsched=false`, `event_dispatcher_edisp_unsched=true`; both TCP and RDMA are unschedulable.
3. Both `-rdma_edisp_unsched=true -event_dispatcher_edisp_unsched=true`: both flags are `true`; both TCP and RDMA are unschedulable.
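The unified condition and the three examples above can be sketched as a tiny function (plain parameters stand in for the two gflags; `EffectiveUnsched` is an illustrative name, not brpc's actual accessor):

```cpp
#include <cassert>

// Sketch of the unified unsched condition. Either flag being true
// makes the EventDispatcher unschedulable, for both TCP and RDMA.
bool EffectiveUnsched(bool event_dispatcher_edisp_unsched,
                      bool rdma_edisp_unsched) {
    return event_dispatcher_edisp_unsched || rdma_edisp_unsched;
}
```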

# Parameters

Configurable parameters:
@@ -71,5 +92,6 @@
* rdma_use_polling: Whether to use RDMA polling mode, default is false.
* rdma_poller_num: The number of pollers in polling mode, default is 1.
* rdma_poller_yield: Whether pollers in polling mode voluntarily relinquish the CPU, default is false.
* rdma_edisp_unsched: Prevents the event dispatcher from being scheduled, default is false.
* event_dispatcher_edisp_unsched: Global switch for EventDispatcher scheduling (true means unschedulable), default is false.
* rdma_edisp_unsched: Deprecated compatibility flag (planned removal in a future release). It still participates in unified unsched condition, default is false.
* rdma_disable_bthread: Disables bthread, default is false.