From 0160e33666b8ae8fdfb245bb0ac3b8d5d0a49817 Mon Sep 17 00:00:00 2001
From: Rafal Slota
Date: Wed, 30 Jul 2025 12:41:33 +0200
Subject: [PATCH] Fix graceful shutdown race condition

Currently, when `Topology` is shutting down, it calls
`Terminator.trap_exit/1` so that `Terminator` can handle its
`terminate/2` callback and shut down producers gracefully. However,
`Terminator.trap_exit/1` uses `GenServer.cast/2`, which is fully
asynchronous. `Topology`'s terminate callback can therefore return
before `Terminator` starts trapping exits, so `Terminator` may be shut
down without putting producers into the "draining" state.

In my system, which runs ~50-100 Broadway instances, ~50% of them shut
down properly while the rest get stuck with producers going full blast,
unaware of the shutdown.

This change makes `Terminator.trap_exit/1` use `GenServer.call/2`
instead of `GenServer.cast/2` so that it is fully synchronous, which
fixed the issue.

Since I don't know Broadway internals at all, please let me know if
there is a better way to fix this.
---
 lib/broadway/topology/terminator.ex | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/broadway/topology/terminator.ex b/lib/broadway/topology/terminator.ex
index 92f3b6ad..7a41b311 100644
--- a/lib/broadway/topology/terminator.ex
+++ b/lib/broadway/topology/terminator.ex
@@ -9,7 +9,7 @@ defmodule Broadway.Topology.Terminator do
 
   @spec trap_exit(GenServer.server()) :: :ok
   def trap_exit(terminator) do
-    GenServer.cast(terminator, :trap_exit)
+    GenServer.call(terminator, :trap_exit)
   end
 
   @impl true
@@ -24,9 +24,9 @@ defmodule Broadway.Topology.Terminator do
   end
 
   @impl true
-  def handle_cast(:trap_exit, state) do
+  def handle_call(:trap_exit, _from, state) do
     Process.flag(:trap_exit, true)
-    {:noreply, state}
+    {:reply, :ok, state}
   end
 
   @impl true