Description
Currently the project uses the simplest possible model for routing incoming requests: the Gateway parses them and pushes them onto a single queue, while the Scheduler asynchronously picks up tasks from that same queue and forwards them to the appropriate model server. Even though the Scheduler can process more than one task from the queue at a time using multiple threads, this design can be suboptimal when the queue is dominated by requests targeting one model server: those requests block others that could be routed in parallel to a different, possibly idle, server.
One potential way to tackle the above would be to use multiple queues for routing. Given that the number of model servers is dynamic, using a different queue for each one can be tricky as there's a limit to how many tasks/threads should be scheduled in parallel, while it's also harder to reason about avoiding starvation in a scenario with a dynamically changing set of queues.
Focusing on a constant number of queues, we could implement a consistent hashing approach for allocating requests to a given queue. Even though that doesn't guarantee each model server will have its dedicated queue, it decreases the chance of unrelated requests blocking each other.
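As a rough illustration of the idea, here is a minimal Python sketch of a hash ring that maps a model server identifier to one of a fixed number of queues. Everything named here (`QueueRing`, `queue_index`, `enqueue`, the `vnodes` parameter, and the use of the model server ID as the hash key) is hypothetical and not taken from the project's code; it only assumes that each request carries some stable identifier for its target model server.

```python
import bisect
import hashlib
import queue


def _stable_hash(key: str) -> int:
    """Hash that is stable across processes (unlike the built-in hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class QueueRing:
    """Hypothetical router: a hash ring over a constant number of queues."""

    def __init__(self, num_queues: int, vnodes: int = 64) -> None:
        if num_queues < 1 or vnodes < 1:
            raise ValueError("num_queues and vnodes must be positive")
        self.queues = [queue.Queue() for _ in range(num_queues)]
        # Place several virtual points per queue on the ring so that
        # keys spread more evenly across the queues.
        points = sorted(
            (_stable_hash(f"queue-{q}#{v}"), q)
            for q in range(num_queues)
            for v in range(vnodes)
        )
        self._hashes = [h for h, _ in points]
        self._owners = [q for _, q in points]

    def queue_index(self, model_server_id: str) -> int:
        """Return the queue owning the first ring point after the key's hash."""
        pos = bisect.bisect_right(self._hashes, _stable_hash(model_server_id))
        # Wrap around to the first point when the key hashes past the last one.
        return self._owners[pos % len(self._owners)]

    def enqueue(self, model_server_id: str, request: object) -> int:
        """Push a request onto the queue chosen for its target model server."""
        idx = self.queue_index(model_server_id)
        self.queues[idx].put((model_server_id, request))
        return idx


if __name__ == "__main__":
    ring = QueueRing(num_queues=4)
    for server in ("model-a", "model-b", "model-c"):
        print(server, "->", ring.enqueue(server, {"payload": "..."}))
```

All requests for the same model server land on the same queue, so a burst aimed at one server can only delay the servers that happen to share its queue. With a truly constant queue count, a plain stable hash modulo the number of queues would give the same guarantee; the ring only pays off if the number of queues is ever changed, since it limits how many model servers get remapped.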