Skip to content

MDEV-34358 Encryption threads consume CPU when no work available#5040

Open
Thirunarayanan wants to merge 4 commits into10.6from
MDEV-34358
Open

MDEV-34358 Encryption threads consume CPU when no work available#5040
Thirunarayanan wants to merge 4 commits into10.6from
MDEV-34358

Conversation

@Thirunarayanan
Copy link
Copy Markdown
Member

@Thirunarayanan Thirunarayanan commented May 4, 2026

Problem:

1. Encryption threads busy-wait when no work is available. When reaching
fil_system.space_list.end(), fil_crypt_return_iops() is called with
wake=true, causing pthread_cond_broadcast() to wake all threads
unnecessarily, leading to CPU waste.

2. Tablespaces with CLOSING/STOPPING flags (set during DDL operations)
are skipped during iteration. Since DDL completion doesn't wake
encryption threads, these spaces may never be encrypted if threads
sleep indefinitely.

3. For default_encrypt_list iteration, when spaces exist but none are
acquirable, threads need to wake others for
cooperative retry, but this case was not distinguished from
fil_system.space_list.end().

4. IOPS are allocated before searching for tablespaces, wasting resources
during iteration when no I/O occurs.

Solution:
========
1. Implement timed wait with exponential backoff (5s -> 10s -> 20s -> 40s
-> 60s, max 5 attempts) for fil_system.space_list iteration. After
~135 seconds, switch to indefinite wait. This periodically rechecks
for spaces that become available after DDL completes.

2. Use indefinite wait for default_encrypt_list iteration since other
threads will retry and wake when needed.

3. fil_space_t::next(): Add default_encrypt_list flag to distinguish between
the two iteration modes. Wake other threads only
when this flag is true (spaces exist but unacquirable).
  1. Move IOPS allocation from before tablespace search to after finding a
    space that needs rotation.
5. Handle wake logic explicitly at call site based on default_encrypt_list flag.

rotate_thread_t changes
====================
- default_encrypt_list (bool): Indicates if default_encrypt_list has unacquirable spaces
- timed_wait_count (uint8_t): Counts consecutive timeouts for exponential backoff
- sleep_timeout_ms (uint16_t): Current timeout in ms (5s -> 60s max)
- wait_for_work(): Implements timed/indefinite wait based on iteration mode
- increase_sleep_timeout(): Doubles timeout up to 60s
- reset_sleep_timeout(): Resets timeout to 5s and clears count

Added debug-only status variables Innodb_encryption_timed_waits
and Innodb_encryption_indefinite_waits to track encryption thread
wait types, enabling verification that threads use timed wait with
exponential backoff instead of busy-waiting when idle.

The counters are incremented in rotate_thread_t::wait_for_work()
and exposed via SHOW STATUS in debug builds only. Also added a
debug sync point 'rotate_only_2_timed_waits' to reduce the timed
wait threshold from 5 to 2 for faster testing of the indefinite
wait transition.

Problem:
--------
1. Encryption threads busy-wait when no work is available. When reaching
fil_system.space_list.end(), fil_crypt_return_iops() is called with
wake=true, causing pthread_cond_broadcast() to wake all threads
unnecessarily, leading to CPU waste.

2. Tablespaces with CLOSING/STOPPING flags (set during DDL operations)
are skipped during iteration. Since DDL completion doesn't wake
encryption threads, these spaces may never be encrypted if threads
sleep indefinitely.

3. For default_encrypt_list iteration, when spaces exist but none are
acquirable, threads need to wake others for
cooperative retry, but this case was not distinguished from
fil_system.space_list.end().

4. IOPS are allocated before searching for tablespaces, wasting resources
during iteration when no I/O occurs.

Solution:
---------
1. Implement timed wait with exponential backoff (5s -> 10s -> 20s -> 40s
-> 60s, max 5 attempts) for fil_system.space_list iteration. After
~135 seconds, switch to indefinite wait. This periodically rechecks
for spaces that become available after DDL completes.

2. Use indefinite wait for default_encrypt_list iteration since other
threads will retry and wake when needed.

3. fil_space_t::next(): Add default_encrypt_list flag to distinguish between
the two iteration modes. Wake other threads only
when this flag is true (spaces exist but unacquirable).

4. Move IOPS allocation from before tablespace search to after finding a
space that needs rotation.

5. Handle wake logic explicitly at call site based on default_encrypt_list flag.

rotate_thread_t changes
------------------------
- default_encrypt_list (bool): Indicates if default_encrypt_list has unacquirable spaces
- timed_wait_count (uint8_t): Counts consecutive timeouts for exponential backoff
- sleep_timeout_ms (uint16_t): Current timeout in ms (5s -> 60s max)
- wait_for_work(): Implements timed/indefinite wait based on iteration mode
- increase_sleep_timeout(): Doubles timeout up to 60s
- reset_sleep_timeout(): Resets timeout to 5s and clears count
…it behavior

Added debug-only status variables Innodb_encryption_timed_waits
and Innodb_encryption_indefinite_waits to track encryption thread
wait types, enabling verification that threads use timed wait with
exponential backoff instead of busy-waiting when idle.

The counters are incremented in rotate_thread_t::wait_for_work()
and exposed via SHOW STATUS in debug builds only. Also added a
debug sync point 'rotate_only_2_timed_waits' to reduce the timed
wait threshold from 5 to 2 for faster testing of the indefinite
wait transition.
@Thirunarayanan Thirunarayanan requested a review from dr-m May 4, 2026 19:15
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@vuvova
Copy link
Copy Markdown
Member

vuvova commented May 4, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements an exponential backoff mechanism for InnoDB encryption threads to reduce CPU usage when tablespaces are temporarily unavailable, such as during DDL operations. It introduces a versioning system for encryption settings to trigger iteration restarts and adds debug status variables to track wait behavior. The review feedback highlights several improvement opportunities: addressing dead code wrapped in #if 0, correcting the timeout threshold logic to match the intended behavior described in comments, removing redundant counter resets, and increasing the bit-width of the settings version counter to prevent potential overflow.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Incremented each time innodb_encrypt_tables or
innodb_encryption_rotate_key_age is modified to signal
encryption threads to restart iteration */
static Atomic_counter<uint16_t> fil_crypt_settings_version;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using uint16_t for a version counter is risky as it can wrap around relatively quickly (65535 changes). While unlikely to cause an issue in typical operation, using uint32_t or uint64_t is safer and generally has the same performance characteristics for atomic operations on modern architectures.

static Atomic_counter<uint32_t> fil_crypt_settings_version;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user changes innodb_encrypt_tables variable every hour then this variable may ran out in 7.5 years. If we restart the server in the mean time then this variable becomes 0. This datatype is enough for practical usage. I don't think sane people is going to toggle encryption for every hour.

Copy link
Copy Markdown
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to consider two paths forward. I see that this pull request is currently targeting the 10.6 branch, to which only minimal changes should be done, given that the branch will soon reach its end of life.

There is also the hang MDEV-37946, which I expect we should be hitting when testing this extensively enough. I think that the hang needs to be fixed as part of a minimal fix; it should be fairly easy.

For newer branches, I would like to see a more comprehensive solution, which I am outlining below.

I’d like to see some testing of this subsystem. I have some doubts of the efficacy of the current multi-threaded implementation. We currently have multiple threads that can increase the size of buf_pool.flush_list, It could be the case that multiple threads are concurrently calling buf_flush_list_space(), which could repeatedly invalidate buf_pool.flush_hp.

As a starting point, I think that we have to test if there is any benefit of having multiple "encryption threads" which perform dummy writes to data pages. The name "encryption thread" is misleading, because the actual encryption will take place in the single buf_flush_page_cleaner() thread, in the function buf_page_encrypt(). Yes, this CPU intensive operation will be executed in a single thread!

How can it possibly be helpful to have multiple "producers" (fil_crypt_thread) that are dirtying the buffer pool, when we have a single "consumer" (buf_flush_page_cleaner) that is doing some very CPU intensive work (buf_page_encrypt() and computing CRC-32C)?

Could we implement tighter coupling between the current "encryption" (page-dirtying) and the buf_flush_page_cleaner()? That thread is already regularly scanning potentially the entire buffer pool. So, why not implement an "encryption step" that would process a number of pages that were missed by buf_flush_list_holding_mutex() and buf_flush_LRU()? Such a design could trivially follow the innodb_io_capacity budget and scale back the key rotation activity when there are bursts of application workload (some threads are blocked in buf_flush_wait() or buf_LRU_get_free_block()).

When a clean page that we wish to re-encrypt is in the buffer pool, the buf_flush_page_cleaner() thread can simply write a dummy log record and move the page to the buf_pool.flush_list.

When a page will need to be read into the buffer pool, the current implementation is invoking synchronous buf_page_read() via buf_page_get_gen(). It would be much better to submit asynchronous reads, implementing the logic with a future implementation of MDEV-11378 in mind. The read completion callback would have to detect that this is a special read for encryption, and it would write the dummy record so that the page will be re-encrypted.

The relevant parts of fil_space_rotate_state_t and key_state_t may have to be made part of fil_space_t or fil_space_crypt_t.

A final part of this would be to repurpose the parameter innodb_encrypt_threads to refer to worker tasks that the buf_flush_page_cleaner() would employ to encrypt pages or compute CRC-32C. The function buf_page_t::flush() would be split. Only initial part would be invoked by the buf_flush_page_cleaner() thread; the rest would be invoked by worker tasks, which would process a number of buf_page_t* from their queue.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc
Problem:
=======
When innodb_encrypt_tables or innodb_encryption_rotate_key_age is
changed during encryption thread iteration, threads continue with
stale configuration values, potentially missing tablespaces that
should be encrypted or rotated under the new settings.

Solution:
========
Added atomic version counter fil_crypt_settings_version that is
incremented whenever innodb_encrypt_tables or
innodb_encryption_rotate_key_age changes. Encryption threads capture
the version at iteration start and check for changes during iteration.
If config changed, threads immediately restart iteration from the
beginning to ensure complete coverage with new settings.

fil_crypt_settings_version: Atomic counter to track the
innodb_encrypt_tables or innodb_encryption_rotate_key_age changes

rotate_thread_t::settings_version: To compare the
existing fil_crypt_settings_version to restart the
encryption from the beginning
@Thirunarayanan
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an exponential backoff mechanism for InnoDB encryption threads to optimize CPU usage and a versioning system to manage configuration changes, including updates to status variable tests. The review feedback identifies several improvements: using 32-bit integers for timeout and version counters to prevent overflow and wrap-around, correcting the wait logic for the default encryption list, resetting timeout states when work is found, and addressing a busy-wait loop in IOPS allocation.

Used only when iterating fil_system.space_list to periodically recheck for
spaces that become available after DDL operations complete. Reset to 5000ms
when woken by signal or when work is found. */
uint16_t sleep_timeout_ms= 5000;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The variable sleep_timeout_ms should be changed to uint32_t to avoid potential overflow issues during calculations in increase_sleep_timeout(). When sleep_timeout_ms is 40,000, sleep_timeout_ms * 2 results in 80,000, which exceeds the uint16_t range and causes incorrect behavior when passed to std::min<uint16_t>.

  uint32_t sleep_timeout_ms= 5000;

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
timed wait expires without being woken by a signal, indicating no work
became available during the timeout period. */
void increase_sleep_timeout() {
sleep_timeout_ms = std::min<uint16_t>(sleep_timeout_ms * 2, 60000);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

As noted previously, sleep_timeout_ms * 2 can exceed 65,535. If sleep_timeout_ms is changed to uint32_t, the explicit template parameter for std::min can be removed or updated to ensure correct comparison.

	sleep_timeout_ms = std::min(sleep_timeout_ms * 2, 60000U);

Incremented each time innodb_encrypt_tables or
innodb_encryption_rotate_key_age is modified to signal
encryption threads to restart iteration */
static Atomic_counter<uint16_t> fil_crypt_settings_version;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using uint16_t for a version counter is risky in a long-running database server as it can wrap around relatively quickly (65,535 changes). While the logic only checks for inequality, using uint32_t is safer and more standard for versioning to avoid any collision during a single iteration.

static Atomic_counter<uint32_t> fil_crypt_settings_version;

/** Config version when thread started current iteration.
Used to detect innodb_encrypt_tables changes during iteration
and restart from beginning to ensure complete encryption coverage. */
uint16_t settings_version;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update this to uint32_t to match the suggested change for the global fil_crypt_settings_version.

  uint32_t settings_version;

return;
}

if (space == fil_system.space_list.end()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a discrepancy between the code and the intended design described in the PR and the comments (lines 1131-1133). The code currently performs a timed wait when space == .end(), regardless of whether it's iterating the default_encrypt_list. To follow the stated design of using an indefinite wait for the default list (where other threads are expected to wake this one), the condition should check the default_encrypt_list flag.

		if (space == fil_system.space_list.end() && !default_encrypt_list) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you kidding me, right?
Please check the callers for wait_for_work()

                                if (thr.default_encrypt_list) {
                                        pthread_cond_broadcast(
                                                &fil_crypt_threads_cond);
                                }
                
                                goto wait_for_work;
                        }

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
goto wait_for_work;
}

/* we found a space to rotate */
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When a tablespace is successfully found for rotation, the sleep timeout and timed wait count should be reset. This ensures that the exponential backoff starts fresh the next time the thread becomes idle, rather than potentially jumping straight to an indefinite wait due to stale state from previous idle periods.

			/* we found a space to rotate */
			thr.reset_sleep_timeout();

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment on lines +2213 to +2220
do {
if (thr.should_shutdown()) {
thr.space->release();
thr.space = fil_system.space_list.end();
goto func_exit;
}
} while (!fil_crypt_alloc_iops(&thr));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This loop for IOPS allocation is a busy-wait that consumes 100% CPU if fil_crypt_alloc_iops returns false (e.g., when the global encryption IOPS limit is reached). While this pattern existed in the previous version, it remains a performance bottleneck. Consider adding a short sleep or yielding the thread if IOPS are unavailable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants