While investigating some strange slowdowns I've discovered a behaviour of the GC that I do not understand.
Here is an MWE. It's a bit involved: I have 500,000 short vectors from which I take a large sample (50 million, with replacement). There is also an alternative variant where each individual vector is replaced by a struct containing two references to that vector.
When I run an operation over the vector of vectors, everything behaves as I would expect: the GC is exercised, but nothing extreme. When it's a vector of structs, and in addition the actual work is done in a single @spawn which is immediately fetched, I sometimes see up to 20 seconds of GC activity. This is on master; on 1.10.2 it's up to 50 seconds, and with more vectors the GC may run for several minutes.
I see no intuitive reason for this behaviour, which only appears with fetch(@spawn work(v)), not with a plain work(v). My machine is a 12-core AMD (24 virtual cores) with 64 GB of RAM. If I run with -t 12 the strange behaviour does not seem to happen. It does however happen with -t 24 --gcthreads=4, with GC time up to 40 seconds.
using Base.Threads: @spawn
using InteractiveUtils: versioninfo
using StatsBase: sample

# A struct holding two references to the same vector.
struct Astruct
    v1::Vector{Float64}
    v2::Vector{Float64}
    Astruct(v) = new(v, v)
end

Base.getindex(as::Astruct, i) = as.v1[i]

# Sum the first element of every vector (or Astruct) in the sample.
function work(vvec)
    x = 0.0
    for v in vvec
        x += v[1]
    end
    return x
end

# par = true runs the same work on a spawned task and fetches the result.
fun(v, par) = par ? fetch(@spawn work(v))::Float64 : work(v)

test(v, N) = @time "serial" fun(sample(v, N), false)
partest(v, N) = @time "parallel" fun(sample(v, N), true)

versioninfo()

const N = 500_000
const v = [rand(10) for _ in 1:N]
const as = Astruct.(v)

println("***** Vector *****")
GC.gc()
for _ in 1:10
    test(v, 100N)
end
GC.gc()
for _ in 1:10
    partest(v, 100N)
end

println("\n***** Astruct *****")
GC.gc()
for _ in 1:10
    test(as, 100N)
end
GC.gc()
for _ in 1:10
    partest(as, 100N)
end
Output:
Julia Version 1.12.0-DEV.383
Commit 903ffff167* (2024-04-22 04:40 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen Threadripper PRO 5945WX 12-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
***** Vector *****
serial: 0.790318 seconds (3 allocations: 381.470 MiB, 0.55% gc time)
serial: 0.820720 seconds (3 allocations: 381.470 MiB, 4.01% gc time)
serial: 0.823082 seconds (3 allocations: 381.470 MiB, 3.90% gc time)
serial: 0.812152 seconds (3 allocations: 381.470 MiB, 3.73% gc time)
serial: 0.811470 seconds (3 allocations: 381.470 MiB, 3.76% gc time)
serial: 0.803246 seconds (3 allocations: 381.470 MiB, 3.78% gc time)
serial: 0.815217 seconds (3 allocations: 381.470 MiB, 3.75% gc time)
serial: 0.815175 seconds (3 allocations: 381.470 MiB, 3.79% gc time)
serial: 0.807833 seconds (3 allocations: 381.470 MiB, 3.76% gc time)
serial: 0.821069 seconds (3 allocations: 381.470 MiB, 3.67% gc time)
parallel: 0.847207 seconds (3.33 k allocations: 381.640 MiB, 0.28% gc time, 0.93% compilation time)
parallel: 0.850981 seconds (9 allocations: 381.470 MiB, 1.21% gc time)
parallel: 0.886037 seconds (9 allocations: 381.470 MiB, 12.49% gc time)
parallel: 0.776483 seconds (9 allocations: 381.470 MiB)
parallel: 0.962529 seconds (9 allocations: 381.470 MiB, 14.78% gc time)
parallel: 0.985055 seconds (9 allocations: 381.470 MiB, 19.63% gc time)
parallel: 0.831015 seconds (9 allocations: 381.470 MiB, 4.25% gc time)
parallel: 0.840257 seconds (9 allocations: 381.470 MiB, 4.67% gc time)
parallel: 0.838799 seconds (9 allocations: 381.470 MiB, 4.48% gc time)
parallel: 0.859712 seconds (9 allocations: 381.470 MiB, 4.76% gc time)
***** Astruct *****
serial: 0.947084 seconds (3 allocations: 762.940 MiB, 0.39% gc time)
serial: 0.990493 seconds (3 allocations: 762.940 MiB, 5.19% gc time)
serial: 0.978335 seconds (3 allocations: 762.940 MiB, 4.95% gc time)
serial: 0.976766 seconds (3 allocations: 762.940 MiB, 4.94% gc time)
serial: 0.975440 seconds (3 allocations: 762.940 MiB, 5.05% gc time)
serial: 0.977785 seconds (3 allocations: 762.940 MiB, 4.94% gc time)
serial: 0.980397 seconds (3 allocations: 762.940 MiB, 4.95% gc time)
serial: 0.976771 seconds (3 allocations: 762.940 MiB, 4.94% gc time)
serial: 0.982649 seconds (3 allocations: 762.940 MiB, 4.94% gc time)
serial: 0.981036 seconds (3 allocations: 762.940 MiB, 4.92% gc time)
parallel: 0.952661 seconds (3.42 k allocations: 763.113 MiB, 0.23% gc time, 0.77% compilation time)
parallel: 8.064466 seconds (9 allocations: 762.940 MiB, 88.19% gc time)
parallel: 0.940224 seconds (9 allocations: 762.940 MiB)
parallel: 14.162925 seconds (9 allocations: 762.940 MiB, 93.19% gc time)
parallel: 0.930567 seconds (9 allocations: 762.940 MiB)
parallel: 0.926054 seconds (9 allocations: 762.940 MiB)
parallel: 14.008024 seconds (9 allocations: 762.940 MiB, 93.24% gc time)
parallel: 0.942324 seconds (9 allocations: 762.940 MiB)
parallel: 0.938320 seconds (9 allocations: 762.940 MiB)
parallel: 20.819898 seconds (9 allocations: 762.940 MiB, 95.42% gc time)
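
If useful, per-collection details for the slow runs can be gathered with GC.enable_logging, which is available since Julia 1.8. A minimal sketch (not part of the timings above), reusing partest, as and N from the MWE:

GC.enable_logging(true)   # print timing information for every collection to stderr
for _ in 1:3              # a few runs of the slow parallel Astruct case
    partest(as, 100N)
end
GC.enable_logging(false)

This should show whether the multi-second pauses come from a few very long collections or from many collections in a row.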