The environment is non-stationary: cache warm-up, traffic drift, evolving heuristics, and routing changes all shift the reward distributions over time. The bandit, however, currently assumes those distributions are stable.
This mismatch can cause the policy to decay as its reward estimates go stale, or to overfit to early conditions that no longer hold.
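To make the mismatch concrete, here is a minimal sketch comparing two ways of estimating one arm's value when its reward shifts partway through. The reward sequence, the function names, and the `step_size` value are illustrative assumptions, not the system's actual estimator. The sample average is the estimate a stationarity assumption typically implies; the exponential recency-weighted average is one common alternative for drifting rewards, shown only for contrast.

```python
def sample_average(rewards):
    """Equal-weight running mean: appropriate when rewards are stationary."""
    estimate = 0.0
    for n, reward in enumerate(rewards, start=1):
        # Step size 1/n shrinks over time, so late observations barely move the estimate.
        estimate += (reward - estimate) / n
    return estimate


def recency_weighted(rewards, step_size=0.1):
    """Constant step size: older observations decay geometrically."""
    estimate = 0.0
    for reward in rewards:
        estimate += step_size * (reward - estimate)
    return estimate


# Hypothetical arm: reward is 1.0 for the first 100 pulls, then drops to 0.0.
rewards = [1.0] * 100 + [0.0] * 100

stationary_estimate = sample_average(rewards)   # stays near 0.5, anchored to early data
drifting_estimate = recency_weighted(rewards)   # close to the current reward of 0.0

print(f"sample average:   {stationary_estimate:.3f}")
print(f"recency weighted: {drifting_estimate:.3f}")
```

In this toy case the equal-weight estimate still credits the arm with half of its early reward long after the shift, which is the "overfitting to early conditions" failure in miniature. The constant step size trades that bias for more variance under noisy rewards, so the right value would depend on how fast the real environment drifts.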