Skip to content

UndefRefError: update_status races on running_tests #131

@AntonOresten

Description

@AntonOresten

The printer task on test jobs can cause crashes (observed on a CUDA.jl buildkite job):

nested task error: UndefRefError: access to undefined reference
 [1] getindex
   @ ./essentials.jl:920 [inlined]
 [2] _sort!(v::Vector{String}, ::Base.Sort.InsertionSortAlg, o::Base.Order.By{...}, ...)
   @ Base.Sort ./sort.jl:842
 ...
[11] update_status
   @ ParallelTestRunner.jl:1001
[12] (printer task closure)
   @ ParallelTestRunner.jl:1099

Root cause

running_tests is a Dict{String,Float64} that worker tasks mutate under test_lock (writes at lines 1141-1144 and 1212-1214). Since #119 flipped worker tasks from @async to Threads.@spawn, those writes happen on parallel OS threads while the printer is on @async — so true preemption is now possible.

update_status reads running_tests four times without taking test_lock:

  • 993: isempty(running_tests)
  • 1001: sort(collect(keys(running_tests)), by = x -> running_tests[x]) — crashes here
  • 1023: for (test, start_time) in running_tests
  • 1030: haskey(running_tests, test)

At line 1001, collect(keys(d)) preallocates Vector{String}(undef, length(d)) and then copyto!s by iterating the KeySet. If a worker thread does delete!(running_tests, test) mid-iteration, copyto! short-circuits (iterator runs out early), leaving trailing #undef slots in the result vector. _sort!'s first getindex on such a slot raises UndefRefError.

This is the multithreading scenario @maleadt flagged in the #122 review — at that time workers were @async so reads didn't strictly need locking. #119 changed the model but didn't add the corresponding read-locks for running_tests (it did add one new unsynchronized read of results at line 1032 too).

Reproducer

using Base.Threads  # julia -t 4
errors = Dict{Type,Int}()
for _ in 1:50
    d = Dict{String,Float64}()
    for i in 1:500; d["k$i"] = float(i); end
    mut = @spawn for i in 1:500; yield(); haskey(d,"k$i") && delete!(d,"k$i"); end
    for _ in 1:5000
        try
            sort!(collect(keys(d)), by = x -> get(d, x, 0.0))
        catch e
            errors[typeof(e)] = get(errors, typeof(e), 0) + 1
            break
        end
    end
    fetch(mut)
end
errors  # => Dict(UndefRefError => 50)

Suggested fix

Snapshot once at the top of update_status under test_lock so all four read sites see a consistent view:

function update_status()
    snapshot = Base.@lock test_lock copy(running_tests)
    isempty(snapshot) && return
    ...
    test_list = sort(collect(keys(snapshot)), by = x -> snapshot[x])
    ...
    for (test, start_time) in snapshot ...
    ...
    haskey(snapshot, test) && continue
end

While here, the any(r -> test == r.test, results) at line 1032 is similarly unsynchronized against worker pushes — a Base.@lock results_lock snapshot in the same spot would close that one too.

This issue was created and diagnosed with the help of Claude.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions