-
-
Notifications
You must be signed in to change notification settings - Fork 67
Fix data race on nativeWindowsToRedraw across threads #630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -254,9 +254,14 @@ namespace minsky | |
|
|
||
| void drawNativeWindows() | ||
| { | ||
| decltype(nativeWindowsToRedraw) windows; | ||
| { | ||
| const lock_guard<mutex> lock(nativeWindowsToRedrawMutex); | ||
| windows.swap(nativeWindowsToRedraw); | ||
| } | ||
| const lock_guard<recursive_mutex> lock(minskyCmdMutex); | ||
| const Timer timer(timers["draw"]); | ||
| for (auto i: nativeWindowsToRedraw) | ||
| for (auto i: windows) | ||
| try | ||
| { | ||
| i->draw(); | ||
|
|
@@ -268,7 +273,6 @@ namespace minsky | |
| break; | ||
| } | ||
| catch (...) {break;} | ||
| nativeWindowsToRedraw.clear(); | ||
| } | ||
|
|
||
| // arrange for native window drawing to happen on node's main thread, required for MacOSX. | ||
|
|
@@ -284,12 +288,11 @@ namespace minsky | |
| const Timer timer(timers["draw"]); | ||
| decltype(nativeWindowsToRedraw) windows; | ||
| { | ||
| const lock_guard<recursive_mutex> lock(minskyCmdMutex); | ||
| const lock_guard<mutex> lock(nativeWindowsToRedrawMutex); | ||
|
Owner
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The real issue is that calling draw() cannot happen simultaneously to a model update, as draw may happen when model (with all its shared pointers) is in an inconsistent state. The point of minskyCmdMutex is to prevent that occurrence. |
||
| windows.swap(nativeWindowsToRedraw); | ||
| } | ||
| for (auto i: windows) | ||
| macOSXRedraw(*i,minskyCmdMutex); | ||
| nativeWindowsToRedraw.clear(); | ||
| drawLaunched=false; | ||
| } | ||
|
|
||
|
|
@@ -336,8 +339,12 @@ namespace minsky | |
| catch (...) | ||
| {flags&=~reset_needed;} | ||
| #ifdef MAC_OSX_TK | ||
| if (!drawLaunched && nativeWindowsToRedraw.size()) | ||
| macOSXLaunchDrawNativeWindows(); | ||
| if (!drawLaunched) | ||
| { | ||
| const lock_guard<mutex> lock(nativeWindowsToRedrawMutex); | ||
| if (nativeWindowsToRedraw.size()) | ||
|
Owner
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not a problem in the original code, as nativeWindowsToRedraw is only changed on the minsky thread. |
||
| macOSXLaunchDrawNativeWindows(); | ||
| } | ||
| #else | ||
| drawNativeWindows(); | ||
| #endif | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -73,6 +73,7 @@ namespace minsky | |
|
|
||
| RenderNativeWindow::~RenderNativeWindow() | ||
| { | ||
| const lock_guard<mutex> lock(minsky().nativeWindowsToRedrawMutex); | ||
| minsky().nativeWindowsToRedraw.erase(this); | ||
| } | ||
|
|
||
|
|
@@ -106,6 +107,7 @@ void macOSXRedraw(RenderNativeWindow& window,std::recursive_mutex& cmdMutex) | |
| void RenderNativeWindow::requestRedraw() | ||
| { | ||
| if (!winInfoPtr.get()) return; | ||
| const lock_guard<mutex> lock(minsky().nativeWindowsToRedrawMutex); | ||
| minsky().nativeWindowsToRedraw.insert(this); | ||
|
Comment on lines
107
to
111
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Possible fix void RenderNativeWindow::requestRedraw()
{
- if (!winInfoPtr.get()) return;
+ {
+ const lock_guard lock(drawMutex);
+ if (!winInfoPtr) return;
+ }
const lock_guard<mutex> lock(minsky().nativeWindowsToRedrawMutex);
minsky().nativeWindowsToRedraw.insert(this);
}🤖 Prompt for AI Agents
Owner
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think CodeRabbit is right here. Checking for existence of winInfoPtr is simply a performance shortcut to avoid unnecessary work. In the actual redraw method, if winInfoPtr is null, then the method is aborted.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The only remaining caveat is that, strictly speaking, reading a Acknowledged — I'll drop this concern. ✏️ Learnings added
|
||
| } | ||
|
|
||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The snapshot can still outlive the window objects.
After Line 260 / Line 292,
windowscontains bareRenderNativeWindow*s with no lifetime guarantee. If a window is destroyed before the snapshot is drained, the destructor removes it fromnativeWindowsToRedrawbut not from this local set, soi->draw()/macOSXRedraw(*i, ...)can dereference freed memory. Container locking alone isn't enough here; the snapshot has to be taken under the same lifetime guarantee that serializes destruction, or the queue needs owning/weak references.Also applies to: 285-295
🧰 Tools
🪛 Clang (14.0.6)
[error] 255-255: method 'drawNativeWindows' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
🤖 Prompt for AI Agents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original code, minskyCmdMutex prevents the RenderNativeWindow object from being destroyed whilst drawing happens. In the new code, that interlock is broken, and what CodeRabbit mentions can happen.
In Windows and X11 builds (ie Linux), calling draw on the minsky thread is possible, because a) the Windows implementation renders to a backing buffer, and blitted later to the actual window on that windows thread (in response to a WM_PAINT event), and b) X11 calls are properly thread safe, they can be called on any thread due to X11's client server architecture.
MacOSX is different, as Carbon API calls must be called on the Windows thread, hence the hack of passing through the mutexes down to the object handling the redraw.
This exact change was proposed by an AI agent (Sonnet 4.6, IIRC) only a week or so ago, and after discussion with it, it withdrew the change. I'm still not convinced that it is necessary, and as CodeRabbit points out, actually introduces problems if not done correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@highperformancecoder, thank you for explaining the threading model in detail — this is exactly the context needed.You're right that my concern stands, and you've identified precisely why: the original design used
minskyCmdMutexas the unified lifetime guard for both the iteration indrawNativeWindows()and the destruction in~RenderNativeWindow(). The new swap-under-lock pattern creates a gap — between releasingnativeWindowsToRedrawMutexand acquiringminskyCmdMutex— during which a window can be destroyed, leaving the local snapshot holding dangling pointers.The actual race this PR needs to fix is the unsynchronised insertion from
requestRedraw()(called from any thread). The minimal-risk correction would be to keepminskyCmdMutexas the primary guard and simply reorder the locking indrawNativeWindows()so the swap happens inside theminskyCmdMutexscope:This way destruction is still serialised by
minskyCmdMutex(as before), andnativeWindowsToRedrawMutexonly needs to protect the brief swap window against concurrentrequestRedraw()insertions. The macOS path would need the same reordering. Given the history you mentioned (a prior AI agent proposing and then withdrawing this same restructuring), it may be worth reverting the swap-before-lock pattern entirely and only addingnativeWindowsToRedrawMutexinrequestRedraw()and~RenderNativeWindow(), leavingdrawNativeWindows()structurally as it was.✏️ Learnings added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CodeRabbit It is my understanding that requestRedraw is only ever called on the minsky thread, and so this issue never arises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Let me check all call sites of
requestRedraw()to verify this claim.🏁 Script executed:
Length of output: 38866
@highperformancecoder, that's an important clarification. Looking at the call sites in the search results, I can see the codebase is already deliberate about this — notablymodel/minsky.ccLine 1276:This shows there is already an explicit guard against calling
requestRedraw()from the RK/simulation thread. All other call sites (mouse events, keyboard events, model operations) appear to flow through the minsky command thread.However, this does conflict with the PR's own stated rationale: "requestRedraw() inserts from any thread (via NativeSurface callback)". Could you clarify whether that description was accurate at the time of writing, or was it mistaken? Specifically, the
PlotWidget::requestRedraw()call atmodel/plotWidget.ccLine 535 (inside what looks like an accumulator/rate-limiter path) — is that also guaranteed to only execute on the minsky thread?If your current assessment is correct — that
requestRedraw()is always on the minsky thread — then the real data race this PR aimed to fix does not actually exist, and more importantly, the swap-before-lock pattern introduced here (releasingminskyCmdMutexbefore the draw loop) creates a new potential use-after-free that did not exist before. That would make the safest path a full revert of this PR.✏️ Learnings added
🧠 Learnings used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I didn't write the PR rationale. @primesoftnz did, so perhaps ey can comment. I feel like this is possibly an AI hallucination, and your analysis backs that up. So modulo a comment from @primesoftnz, I plan to reject this PR in the coming days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.