The current bandit reward mixes heterogeneous signals (token savings, latency, and a quality proxy) that live on different scales and sometimes pull in opposite directions. This makes the learning signal noisy, unstable, and hard to interpret.
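To make the scale mismatch concrete, here is a minimal sketch of a scalarized reward with per-signal running normalization. The component names, value ranges, and weights are assumptions for illustration, not the current implementation.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    tokens_saved: float   # assumed range: hundreds to thousands of tokens
    latency_s: float      # assumed range: seconds
    quality: float        # assumed 0..1 proxy score

class RunningNorm:
    """Track a running mean/std so each signal is compared on a common scale."""
    def __init__(self) -> None:
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> float:
        # Welford's online update, then return the z-score of x.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0
        return (x - self.mean) / (std or 1.0)

norms = {k: RunningNorm() for k in ("tokens_saved", "latency_s", "quality")}

def reward(o: Outcome, w=(0.4, 0.3, 0.3)) -> float:
    # Without normalization, raw token counts dominate the 0..1 quality proxy;
    # normalizing each signal first keeps the weights meaningful.
    z_tok = norms["tokens_saved"].update(o.tokens_saved)
    z_lat = norms["latency_s"].update(o.latency_s)
    z_qual = norms["quality"].update(o.quality)
    return w[0] * z_tok - w[1] * z_lat + w[2] * z_qual
```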
At the moment, quality is treated as a soft component of the reward rather than a hard constraint, which allows the optimizer to trade quality away for token or latency gains unintentionally.
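A hedged sketch of treating quality as a hard constraint instead, reusing the `Outcome` and `reward` names from the sketch above; `QUALITY_FLOOR` and the fixed penalty value are illustrative assumptions.

```python
QUALITY_FLOOR = 0.8        # assumed acceptance threshold for the quality proxy
CONSTRAINT_PENALTY = -1.0  # fixed worst-case reward for any constraint violation

def constrained_reward(o: Outcome) -> float:
    # Any pull below the quality floor receives the same worst-case reward,
    # so the bandit can only optimize token savings and latency among arms
    # that already meet the quality bar.
    if o.quality < QUALITY_FLOOR:
        return CONSTRAINT_PENALTY
    return reward(o)
```

With a fixed worst-case reward below the floor, no amount of token or latency gain can offset an unacceptable quality score, which removes the unintended trade-off.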