The printer task on test jobs can cause crashes (observed on a CUDA.jl buildkite job):
nested task error: UndefRefError: access to undefined reference
[1] getindex
@ ./essentials.jl:920 [inlined]
[2] _sort!(v::Vector{String}, ::Base.Sort.InsertionSortAlg, o::Base.Order.By{...}, ...)
@ Base.Sort ./sort.jl:842
...
[11] update_status
@ ParallelTestRunner.jl:1001
[12] (printer task closure)
@ ParallelTestRunner.jl:1099
Root cause
running_tests is a Dict{String,Float64} that worker tasks mutate under test_lock (writes at lines 1141-1144 and 1212-1214). Since #119 flipped worker tasks from @async to Threads.@spawn, those writes happen on parallel OS threads while the printer is on @async — so true preemption is now possible.
update_status reads running_tests four times without taking test_lock:
- 993:
isempty(running_tests)
- 1001:
sort(collect(keys(running_tests)), by = x -> running_tests[x]) — crashes here
- 1023:
for (test, start_time) in running_tests
- 1030:
haskey(running_tests, test)
At line 1001, collect(keys(d)) preallocates Vector{String}(undef, length(d)) and then copyto!s by iterating the KeySet. If a worker thread does delete!(running_tests, test) mid-iteration, copyto! short-circuits (iterator runs out early), leaving trailing #undef slots in the result vector. _sort!'s first getindex on such a slot raises UndefRefError.
This is the multithreading scenario @maleadt flagged in the #122 review — at that time workers were @async so reads didn't strictly need locking. #119 changed the model but didn't add the corresponding read-locks for running_tests (it did add one new unsynchronized read of results at line 1032 too).
Reproducer
using Base.Threads # julia -t 4
errors = Dict{Type,Int}()
for _ in 1:50
d = Dict{String,Float64}()
for i in 1:500; d["k$i"] = float(i); end
mut = @spawn for i in 1:500; yield(); haskey(d,"k$i") && delete!(d,"k$i"); end
for _ in 1:5000
try
sort!(collect(keys(d)), by = x -> get(d, x, 0.0))
catch e
errors[typeof(e)] = get(errors, typeof(e), 0) + 1
break
end
end
fetch(mut)
end
errors # => Dict(UndefRefError => 50)
Suggested fix
Snapshot once at the top of update_status under test_lock so all four read sites see a consistent view:
function update_status()
snapshot = Base.@lock test_lock copy(running_tests)
isempty(snapshot) && return
...
test_list = sort(collect(keys(snapshot)), by = x -> snapshot[x])
...
for (test, start_time) in snapshot ...
...
haskey(snapshot, test) && continue
end
While here, the any(r -> test == r.test, results) at line 1032 is similarly unsynchronized against worker pushes — a Base.@lock results_lock snapshot in the same spot would close that one too.
This issue was created and diagnosed with the help of Claude.
The printer task on test jobs can cause crashes (observed on a CUDA.jl buildkite job):
Root cause
running_testsis aDict{String,Float64}that worker tasks mutate undertest_lock(writes at lines 1141-1144 and 1212-1214). Since #119 flipped worker tasks from@asynctoThreads.@spawn, those writes happen on parallel OS threads while the printer is on@async— so true preemption is now possible.update_statusreadsrunning_testsfour times without takingtest_lock:isempty(running_tests)sort(collect(keys(running_tests)), by = x -> running_tests[x])— crashes herefor (test, start_time) in running_testshaskey(running_tests, test)At line 1001,
collect(keys(d))preallocatesVector{String}(undef, length(d))and thencopyto!s by iterating theKeySet. If a worker thread doesdelete!(running_tests, test)mid-iteration,copyto!short-circuits (iterator runs out early), leaving trailing#undefslots in the result vector._sort!'s firstgetindexon such a slot raisesUndefRefError.This is the multithreading scenario @maleadt flagged in the #122 review — at that time workers were
@asyncso reads didn't strictly need locking. #119 changed the model but didn't add the corresponding read-locks forrunning_tests(it did add one new unsynchronized read ofresultsat line 1032 too).Reproducer
Suggested fix
Snapshot once at the top of
update_statusundertest_lockso all four read sites see a consistent view:While here, the
any(r -> test == r.test, results)at line 1032 is similarly unsynchronized against worker pushes — aBase.@lock results_locksnapshot in the same spot would close that one too.This issue was created and diagnosed with the help of Claude.