MDEV-34358 Encryption threads consume CPU when no work available by Thirunarayanan · Pull Request #5040 · MariaDB/server

Thirunarayanan · 2026-05-04T19:15:15Z

Problem:

1. Encryption threads busy-wait when no work is available. When reaching
fil_system.space_list.end(), fil_crypt_return_iops() is called with
wake=true, causing pthread_cond_broadcast() to wake all threads
unnecessarily, leading to CPU waste.

2. Tablespaces with CLOSING/STOPPING flags (set during DDL operations)
are skipped during iteration. Since DDL completion doesn't wake
encryption threads, these spaces may never be encrypted if threads
sleep indefinitely.

3. For default_encrypt_list iteration, when spaces exist but none are
acquirable, threads need to wake others for
cooperative retry, but this case was not distinguished from
fil_system.space_list.end().

4. IOPS are allocated before searching for tablespaces, wasting resources
during iteration when no I/O occurs.

Solution:
========
1. Implement timed wait with exponential backoff (5s -> 10s -> 20s -> 40s
-> 60s, max 5 attempts) for fil_system.space_list iteration. After
~135 seconds, switch to indefinite wait. This periodically rechecks
for spaces that become available after DDL completes.

2. Use indefinite wait for default_encrypt_list iteration since other
threads will retry and wake when needed.

3. fil_space_t::next(): Add default_encrypt_list flag to distinguish between
the two iteration modes. Wake other threads only
when this flag is true (spaces exist but unacquirable).

Move IOPS allocation from before tablespace search to after finding a
space that needs rotation.

5. Handle wake logic explicitly at call site based on default_encrypt_list flag.

rotate_thread_t changes
====================
- default_encrypt_list (bool): Indicates if default_encrypt_list has unacquirable spaces
- timed_wait_count (uint8_t): Counts consecutive timeouts for exponential backoff
- sleep_timeout_ms (uint16_t): Current timeout in ms (5s -> 60s max)
- wait_for_work(): Implements timed/indefinite wait based on iteration mode
- increase_sleep_timeout(): Doubles timeout up to 60s
- reset_sleep_timeout(): Resets timeout to 5s and clears count

Added debug-only status variables Innodb_encryption_timed_waits
and Innodb_encryption_indefinite_waits to track encryption thread
wait types, enabling verification that threads use timed wait with
exponential backoff instead of busy-waiting when idle.

The counters are incremented in rotate_thread_t::wait_for_work()
and exposed via SHOW STATUS in debug builds only. Also added a
debug sync point 'rotate_only_2_timed_waits' to reduce the timed
wait threshold from 5 to 2 for faster testing of the indefinite
wait transition.

Problem: -------- 1. Encryption threads busy-wait when no work is available. When reaching fil_system.space_list.end(), fil_crypt_return_iops() is called with wake=true, causing pthread_cond_broadcast() to wake all threads unnecessarily, leading to CPU waste. 2. Tablespaces with CLOSING/STOPPING flags (set during DDL operations) are skipped during iteration. Since DDL completion doesn't wake encryption threads, these spaces may never be encrypted if threads sleep indefinitely. 3. For default_encrypt_list iteration, when spaces exist but none are acquirable, threads need to wake others for cooperative retry, but this case was not distinguished from fil_system.space_list.end(). 4. IOPS are allocated before searching for tablespaces, wasting resources during iteration when no I/O occurs. Solution: --------- 1. Implement timed wait with exponential backoff (5s -> 10s -> 20s -> 40s -> 60s, max 5 attempts) for fil_system.space_list iteration. After ~135 seconds, switch to indefinite wait. This periodically rechecks for spaces that become available after DDL completes. 2. Use indefinite wait for default_encrypt_list iteration since other threads will retry and wake when needed. 3. fil_space_t::next(): Add default_encrypt_list flag to distinguish between the two iteration modes. Wake other threads only when this flag is true (spaces exist but unacquirable). 4. Move IOPS allocation from before tablespace search to after finding a space that needs rotation. 5. Handle wake logic explicitly at call site based on default_encrypt_list flag. rotate_thread_t changes ------------------------ - default_encrypt_list (bool): Indicates if default_encrypt_list has unacquirable spaces - timed_wait_count (uint8_t): Counts consecutive timeouts for exponential backoff - sleep_timeout_ms (uint16_t): Current timeout in ms (5s -> 60s max) - wait_for_work(): Implements timed/indefinite wait based on iteration mode - increase_sleep_timeout(): Doubles timeout up to 60s - reset_sleep_timeout(): Resets timeout to 5s and clears count

…it behavior Added debug-only status variables Innodb_encryption_timed_waits and Innodb_encryption_indefinite_waits to track encryption thread wait types, enabling verification that threads use timed wait with exponential backoff instead of busy-waiting when idle. The counters are incremented in rotate_thread_t::wait_for_work() and exposed via SHOW STATUS in debug builds only. Also added a debug sync point 'rotate_only_2_timed_waits' to reduce the timed wait threshold from 5 to 2 for faster testing of the indefinite wait transition.

CLAassistant · 2026-05-04T19:15:21Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

vuvova · 2026-05-04T20:53:39Z

/gemini review

gemini-code-assist

Code Review

This pull request implements an exponential backoff mechanism for InnoDB encryption threads to reduce CPU usage when tablespaces are temporarily unavailable, such as during DDL operations. It introduces a versioning system for encryption settings to trigger iteration restarts and adds debug status variables to track wait behavior. The review feedback highlights several improvement opportunities: addressing dead code wrapped in #if 0, correcting the timeout threshold logic to match the intended behavior described in comments, removing redundant counter resets, and increasing the bit-width of the settings version counter to prevent potential overflow.

gemini-code-assist · 2026-05-04T20:58:43Z

+Incremented each time innodb_encrypt_tables or
+innodb_encryption_rotate_key_age is modified to signal
+encryption threads to restart iteration */
+static Atomic_counter<uint16_t> fil_crypt_settings_version;


Using uint16_t for a version counter is risky as it can wrap around relatively quickly (65535 changes). While unlikely to cause an issue in typical operation, using uint32_t or uint64_t is safer and generally has the same performance characteristics for atomic operations on modern architectures.

static Atomic_counter<uint32_t> fil_crypt_settings_version;

If user changes innodb_encrypt_tables variable every hour then this variable may ran out in 7.5 years. If we restart the server in the mean time then this variable becomes 0. This datatype is enough for practical usage. I don't think sane people is going to toggle encryption for every hour.

dr-m

I think that we need to consider two paths forward. I see that this pull request is currently targeting the 10.6 branch, to which only minimal changes should be done, given that the branch will soon reach its end of life.

There is also the hang MDEV-37946, which I expect we should be hitting when testing this extensively enough. I think that the hang needs to be fixed as part of a minimal fix; it should be fairly easy.

For newer branches, I would like to see a more comprehensive solution, which I am outlining below.

I’d like to see some testing of this subsystem. I have some doubts of the efficacy of the current multi-threaded implementation. We currently have multiple threads that can increase the size of buf_pool.flush_list, It could be the case that multiple threads are concurrently calling buf_flush_list_space(), which could repeatedly invalidate buf_pool.flush_hp.

As a starting point, I think that we have to test if there is any benefit of having multiple "encryption threads" which perform dummy writes to data pages. The name "encryption thread" is misleading, because the actual encryption will take place in the single buf_flush_page_cleaner() thread, in the function buf_page_encrypt(). Yes, this CPU intensive operation will be executed in a single thread!

How can it possibly be helpful to have multiple "producers" (fil_crypt_thread) that are dirtying the buffer pool, when we have a single "consumer" (buf_flush_page_cleaner) that is doing some very CPU intensive work (buf_page_encrypt() and computing CRC-32C)?

Could we implement tighter coupling between the current "encryption" (page-dirtying) and the buf_flush_page_cleaner()? That thread is already regularly scanning potentially the entire buffer pool. So, why not implement an "encryption step" that would process a number of pages that were missed by buf_flush_list_holding_mutex() and buf_flush_LRU()? Such a design could trivially follow the innodb_io_capacity budget and scale back the key rotation activity when there are bursts of application workload (some threads are blocked in buf_flush_wait() or buf_LRU_get_free_block()).

When a clean page that we wish to re-encrypt is in the buffer pool, the buf_flush_page_cleaner() thread can simply write a dummy log record and move the page to the buf_pool.flush_list.

When a page will need to be read into the buffer pool, the current implementation is invoking synchronous buf_page_read() via buf_page_get_gen(). It would be much better to submit asynchronous reads, implementing the logic with a future implementation of MDEV-11378 in mind. The read completion callback would have to detect that this is a special read for encryption, and it would write the dummy record so that the page will be re-encrypted.

The relevant parts of fil_space_rotate_state_t and key_state_t may have to be made part of fil_space_t or fil_space_crypt_t.

A final part of this would be to repurpose the parameter innodb_encrypt_threads to refer to worker tasks that the buf_flush_page_cleaner() would employ to encrypt pages or compute CRC-32C. The function buf_page_t::flush() would be split. Only initial part would be invoked by the buf_flush_page_cleaner() thread; the rest would be invoked by worker tasks, which would process a number of buf_page_t* from their queue.

Problem: ======= When innodb_encrypt_tables or innodb_encryption_rotate_key_age is changed during encryption thread iteration, threads continue with stale configuration values, potentially missing tablespaces that should be encrypted or rotated under the new settings. Solution: ======== Added atomic version counter fil_crypt_settings_version that is incremented whenever innodb_encrypt_tables or innodb_encryption_rotate_key_age changes. Encryption threads capture the version at iteration start and check for changes during iteration. If config changed, threads immediately restart iteration from the beginning to ensure complete coverage with new settings. fil_crypt_settings_version: Atomic counter to track the innodb_encrypt_tables or innodb_encryption_rotate_key_age changes rotate_thread_t::settings_version: To compare the existing fil_crypt_settings_version to restart the encryption from the beginning

Thirunarayanan · 2026-05-05T06:36:12Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces an exponential backoff mechanism for InnoDB encryption threads to optimize CPU usage and a versioning system to manage configuration changes, including updates to status variable tests. The review feedback identifies several improvements: using 32-bit integers for timeout and version counters to prevent overflow and wrap-around, correcting the wait logic for the default encryption list, resetting timeout states when work is found, and addressing a busy-wait loop in IOPS allocation.

gemini-code-assist · 2026-05-05T06:42:43Z

+  Used only when iterating fil_system.space_list to periodically recheck for
+  spaces that become available after DDL operations complete. Reset to 5000ms
+  when woken by signal or when work is found. */
+  uint16_t sleep_timeout_ms= 5000;


The variable sleep_timeout_ms should be changed to uint32_t to avoid potential overflow issues during calculations in increase_sleep_timeout(). When sleep_timeout_ms is 40,000, sleep_timeout_ms * 2 results in 80,000, which exceeds the uint16_t range and causes incorrect behavior when passed to std::min<uint16_t>.

uint32_t sleep_timeout_ms= 5000;

gemini-code-assist · 2026-05-05T06:42:43Z

+  timed wait expires without being woken by a signal, indicating no work
+  became available during the timeout period. */
+  void increase_sleep_timeout() {
+	sleep_timeout_ms = std::min<uint16_t>(sleep_timeout_ms * 2, 60000);


As noted previously, sleep_timeout_ms * 2 can exceed 65,535. If sleep_timeout_ms is changed to uint32_t, the explicit template parameter for std::min can be removed or updated to ensure correct comparison.

sleep_timeout_ms = std::min(sleep_timeout_ms * 2, 60000U);

gemini-code-assist · 2026-05-05T06:42:43Z

+Incremented each time innodb_encrypt_tables or
+innodb_encryption_rotate_key_age is modified to signal
+encryption threads to restart iteration */
+static Atomic_counter<uint16_t> fil_crypt_settings_version;


Using uint16_t for a version counter is risky in a long-running database server as it can wrap around relatively quickly (65,535 changes). While the logic only checks for inequality, using uint32_t is safer and more standard for versioning to avoid any collision during a single iteration.

static Atomic_counter<uint32_t> fil_crypt_settings_version;

gemini-code-assist · 2026-05-05T06:42:43Z

+  /** Config version when thread started current iteration.
+  Used to detect innodb_encrypt_tables changes during iteration
+  and restart from beginning to ensure complete encryption coverage. */
+  uint16_t settings_version;


Update this to uint32_t to match the suggested change for the global fil_crypt_settings_version.

uint32_t settings_version;

gemini-code-assist · 2026-05-05T06:42:43Z

+			return;
+		}
+
+		if (space == fil_system.space_list.end()) {


There is a discrepancy between the code and the intended design described in the PR and the comments (lines 1131-1133). The code currently performs a timed wait when space == .end(), regardless of whether it's iterating the default_encrypt_list. To follow the stated design of using an indefinite wait for the default list (where other threads are expected to wake this one), the condition should check the default_encrypt_list flag.

if (space == fil_system.space_list.end() && !default_encrypt_list) {

Are you kidding me, right?
Please check the callers for wait_for_work()

if (thr.default_encrypt_list) { pthread_cond_broadcast( &fil_crypt_threads_cond); } goto wait_for_work; }

gemini-code-assist · 2026-05-05T06:42:43Z

 				goto wait_for_work;
 			}

 			/* we found a space to rotate */


When a tablespace is successfully found for rotation, the sleep timeout and timed wait count should be reset. This ensures that the exponential backoff starts fresh the next time the thread becomes idle, rather than potentially jumping straight to an indefinite wait due to stale state from previous idle periods.

/* we found a space to rotate */ thr.reset_sleep_timeout();

gemini-code-assist · 2026-05-05T06:42:43Z

+			do {
+				if (thr.should_shutdown()) {
+					thr.space->release();
+					thr.space = fil_system.space_list.end();
+					goto func_exit;
+				}
+			} while (!fil_crypt_alloc_iops(&thr));
+


This loop for IOPS allocation is a busy-wait that consumes 100% CPU if fil_crypt_alloc_iops returns false (e.g., when the global encryption IOPS limit is reached). While this pattern existed in the previous version, it remains a performance bottleneck. Consider adding a short sleep or yielding the thread if IOPS are unavailable.

Thirunarayanan added 2 commits May 5, 2026 00:26

Thirunarayanan requested a review from dr-m May 4, 2026 19:15

gemini-code-assist Bot reviewed May 4, 2026

View reviewed changes

dr-m reviewed May 5, 2026

View reviewed changes

Comment thread storage/innobase/fil/fil0crypt.cc Outdated

Comment thread storage/innobase/fil/fil0crypt.cc

Thirunarayanan force-pushed the MDEV-34358 branch from 41afe38 to 3b3b9a2 Compare May 5, 2026 05:32

gemini-code-assist Bot reviewed May 5, 2026

View reviewed changes

gkodinov added the MariaDB Corporation label May 5, 2026

- Addressing the review comments

44bf327

Uh oh!

Conversation

Thirunarayanan commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem:

Uh oh!

CLAassistant commented May 4, 2026

Uh oh!

vuvova commented May 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Thirunarayanan May 5, 2026

Choose a reason for hiding this comment

Uh oh!

dr-m left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Thirunarayanan commented May 5, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Thirunarayanan May 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

Thirunarayanan commented May 4, 2026 •

edited

Loading