Experiment with loom #820

Draft
CvvT wants to merge 11 commits into main from weiteng/loom
@CvvT CvvT commented Apr 30, 2026

I have been experimenting with loom over the past few days to help debug some race conditions that occur in the CI tests but are extremely difficult to reproduce locally. I tried running the tests under stress, under qemu without kvm, and under rr with --chaos, but none of these reproduced the bug.

If loom's memory model and my modeling of futex are correct, loom found a potential race in the FutexManager. I will submit a separate PR for the fix. It looks like loom does not support SeqCst for atomic operations, and it has some other issues as well. To mitigate the former, we need to use fence(SeqCst).
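A minimal sketch of that mitigation, written against std atomics rather than loom's and using hypothetical names (`publish`, `observe`, `FLAG`) that are not from this PR: instead of relying on a SeqCst store or load, pair a weaker access with an explicit `fence(Ordering::SeqCst)`.

```rust
use std::sync::atomic::{fence, AtomicBool, Ordering};

// Hypothetical flag standing in for something like an observer's done flag.
static FLAG: AtomicBool = AtomicBool::new(false);

/// Release store followed by a SeqCst fence, used in place of a SeqCst store.
fn publish(flag: &AtomicBool) {
    flag.store(true, Ordering::Release);
    fence(Ordering::SeqCst);
}

/// SeqCst fence followed by an Acquire load, used in place of a SeqCst load.
fn observe(flag: &AtomicBool) -> bool {
    fence(Ordering::SeqCst);
    flag.load(Ordering::Acquire)
}

fn main() {
    publish(&FLAG);
    // Single-threaded sanity check only; the fences matter across threads.
    assert!(observe(&FLAG));
}
```

Whether this placement of fences matches what the real code paths need is an assumption; the point is only that the fence is an explicit call that can be modeled separately from the access itself.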

@CvvT CvvT added the must-not-merge:undergoing-restructuring Known deeper set of changes are happening on this PR before it is mergeable again label Apr 30, 2026
@CvvT CvvT added must-not-merge:prototype An experimental/proof-of-concept PR that must not be merged. and removed must-not-merge:undergoing-restructuring Known deeper set of changes are happening on this PR before it is mergeable again labels Apr 30, 2026
github-actions Bot commented May 8, 2026

🤖 SemverChecks 🤖 ⚠️ Potential breaking API changes detected ⚠️

--- failure inherent_method_const_removed: pub method is no longer const ---

Description:
A publicly-visible method or associated fn is no longer `const` and can no longer be used in a `const` context.
        ref: https://doc.rust-lang.org/reference/const_eval.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/inherent_method_const_removed.ron

Failed in:
  Mutex::new in /home/runner/work/litebox/litebox/litebox/src/sync/mutex.rs:221
  RwLock::new in /home/runner/work/litebox/litebox/litebox/src/sync/rwlock.rs:633

--- failure trait_method_added: pub trait method added ---

Description:
A non-sealed public trait added a new method without a default implementation, which breaks downstream implementations of the trait
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-new-item-no-default
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/trait_method_added.ron

Failed in:
  trait method litebox::platform::RawMutex::new in file /home/runner/work/litebox/litebox/litebox/src/platform/mod.rs:314

--- failure trait_removed_associated_constant: trait's associated constant was removed ---

Description:
A public trait's associated constant was removed or renamed.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/trait_removed_associated_constant.ron

Failed in:
  associated constant RawMutex::INIT, previously at /home/runner/work/litebox/litebox/target/semver-checks/git-main/253fe734b7a7f2eae09af8a918412a791020489d/litebox/src/platform/mod.rs:293

CvvT commented May 8, 2026

According to Intel's manual, atomic operations on memory should observe the latest value.

IA-32 Architecture Compatibility
Beginning with the P6 family processors, when the LOCK prefix is prefixed to an instruction and the memory area
being accessed is cached internally in the processor, the LOCK# signal is generally not asserted. Instead, only the
processor’s cache is locked. Here, the processor’s cache coherency mechanism ensures that the operation is
carried out atomically with regards to memory. See “Effects of a Locked Operation on Internal Processor Caches”
in Chapter 11 of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more
information on locking of caches.

CvvT commented May 8, 2026

loom detected two potential issues:

  1. Polling wait and wake:

The initial thread state is RUNNING_IN_HOST, and the following execution is one of the expected paths.

Thread 1                             |       Thread 2
call wait_on_event                                  
  register observer
  try_op returns TryAgain
                                             call notify_observers
                                               set observer done flag to True (Ordering::Release)
                                               call wake
                                                 skip updating thread state as it is still RUNNING_IN_HOST
                                                 skip wakeup
                                                                    
  call wait_until
      set thread state to WAITING
      check observer's done flag (True && Ordering::SeqCst)
      skip block and return

However, in loom's memory model, the check of the observer's done flag in thread 1 may still observe a stale value (False). My guess is that the store to the done flag with Ordering::Release in thread 2 can be reordered to take effect after the thread-state check that skips the update (which uses Ordering::Release and Ordering::Relaxed), and therefore after the check in thread 1. This looks plausible, but given the multiple issues in loom, I'm not sure whether it is a true bug.

  2. Another similar issue occurs in futex.

The fix for both is to change the ordering used when setting the flag from Release to SeqCst. We did observe some flaky tests previously, but they have not shown up recently. If we encounter them again, I will submit the fix.
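For illustration, the wait/wake handshake in issue 1 can be reduced to a store-buffering pattern. This is a rough sketch with std atomics and hypothetical names (`run_once`, `done`, `waiting`), not the code in this PR. It models the thread state as a single boolean and assumes all four accesses are SeqCst, which goes beyond the single ordering change proposed above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs one round of the handshake and returns
/// (waiter_saw_done, waker_saw_waiting).
fn run_once() -> (bool, bool) {
    let done = Arc::new(AtomicBool::new(false)); // observer's done flag
    let waiting = Arc::new(AtomicBool::new(false)); // true models state == WAITING

    let (d, w) = (Arc::clone(&done), Arc::clone(&waiting));
    let waiter = thread::spawn(move || {
        w.store(true, Ordering::SeqCst); // set thread state to WAITING
        d.load(Ordering::SeqCst) // check the done flag before blocking
    });

    let (d, w) = (Arc::clone(&done), Arc::clone(&waiting));
    let waker = thread::spawn(move || {
        d.store(true, Ordering::SeqCst); // proposed fix: SeqCst instead of Release
        w.load(Ordering::SeqCst) // decide whether a wakeup is needed
    });

    (waiter.join().unwrap(), waker.join().unwrap())
}

fn main() {
    for _ in 0..1000 {
        let (saw_done, saw_waiting) = run_once();
        // With a single total order over all four SeqCst accesses,
        // the two threads cannot both miss each other's store.
        assert!(saw_done || saw_waiting);
    }
}
```

With Release stores and Acquire loads instead, the language memory model would permit both loads to return false, which corresponds to the lost wakeup in the trace above.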
