# LLM Routing

When an agent sends an LLM request to the AIOS kernel, it can specify either a single LLM backend or multiple LLM backends. 1) If only one backend is specified, the request is sent to that LLM. 2) If multiple backends are specified, the agent allows the request to be processed by any of them; in this case, AIOS provides two LLM routing strategies, Sequential Routing and Smart Routing, to decide which of the specified LLMs will process the request.
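As a rough illustration, a request permitting multiple backends might look like the sketch below. The field names here are hypothetical; the exact request schema depends on the AIOS SDK version.

```python
# Hypothetical sketch: field names are illustrative, not the exact AIOS request schema.
single_backend_request = {
    "query": "Summarize this document.",
    "llms": ["gpt-4o-mini"],                # routed directly to this backend
}
multi_backend_request = {
    "query": "Summarize this document.",
    "llms": ["gpt-4o-mini", "llama-3-8b"],  # any of these may be chosen by the router
}
```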

### Overview of Routing Strategies

AIOS provides two routing strategies to distribute requests across multiple LLM backends:

| Strategy Type          | Description                                                                                                                     |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| **Sequential Routing** | Cycles through the specified models in round-robin order and assigns each request to the next one in the rotation               |
| **Smart Routing**      | A cost-quality optimized strategy that selects the lowest-cost LLM predicted to still meet the required processing quality      |

### Sequential Routing

The `SequentialRouting` strategy implements a basic round-robin approach for load-balancing LLM requests: it cycles through the models specified by the agent and assigns each request to the next one in order.

#### Core Functions

```python
def get_model_idxs(self, selected_llms: List[dict], n_queries: int = 1):
    model_idxs = []

    for _ in range(n_queries):
        # Pick the next model config in round-robin order.
        current = selected_llms[self.idx]
        # Map the pick back to its index in the full config list.
        for i, llm_config in enumerate(self.llm_configs):
            if llm_config["name"] == current["name"]:
                model_idxs.append(i)
                break
        # Advance the cursor, wrapping around the selected models.
        self.idx = (self.idx + 1) % len(selected_llms)

    return model_idxs
```
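#### Usage Example

A minimal usage sketch. This assumes `SequentialRouting` takes an `llm_configs` list like `SmartRouting` does and starts its cursor at index 0; the constructor signature is an assumption here, so check the AIOS source for the exact interface.

```python
configs = [
    {"name": "gpt-4o-mini", "backend": "openai"},
    {"name": "llama-3-8b", "backend": "ollama"},
]

# Assumed constructor, mirroring SmartRouting's; the actual signature may differ.
strategy = SequentialRouting(llm_configs=configs)

# Three queries allowed on both backends: the cursor cycles 0 -> 1 -> 0.
model_idxs = strategy.get_model_idxs(configs, n_queries=3)
# model_idxs == [0, 1, 0]
```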

### Smart Routing

The `SmartRouting` strategy implements a cost-quality optimized selection approach for LLM requests: it uses historical performance data to predict which models will perform well on a given query while minimizing cost, via a two-stage constrained optimization method. The figure below shows the overall pipeline of this smart routing strategy.

<figure><img src="/files/T9tA7psKmNY8gKzhT2YZ" alt="" width="375"><figcaption></figcaption></figure>

#### Optimization Method

* Uses Lagrangian dual optimization to select models globally across all queries (one plausible formulation is sketched after this list)
* Balances overall performance against total cost
* Ensures each query is assigned to exactly one model
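For concreteness, here is one plausible way to write the selection problem these bullets describe. This is a sketch inferred from the description, not necessarily the exact objective used in the implementation; see the ECCOS paper for the precise formulation. Let `c_ij` and `p_ij` be the predicted cost and performance of model `j` on query `i`, `x_ij` a binary assignment variable, and `tau` the performance requirement:

```latex
% Sketch only: the exact formulation in ECCOS may differ.
\min_{x_{ij} \in \{0,1\}} \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij}
\quad \text{s.t.} \quad
\sum_{j=1}^{m} x_{ij} = 1 \;\; \forall i,
\qquad
\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} x_{ij} \ge \tau
```

Relaxing the performance constraint with a Lagrangian multiplier `lambda >= 0` decouples the problem across queries, so each query can be assigned to the model minimizing `c_ij - lambda * p_ij`.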

```python
def get_model_idxs(self, selected_llms, queries, input_token_lengths):
    model_idxs = []
    
    for i in range(len(queries)):
        query = queries[i]
        input_token_length = input_token_lengths[i]
        
        # Get performance and length predictions
        perf_scores, length_scores = self.store.predict(
            query, selected_llms, n_similar=self.n_similar
        )
        
        # Calculate cost scores
        cost_scores = []
        for j in range(len(selected_llms)):
            pred_output_length = length_scores[0][j]
            input_cost = input_token_length * selected_llms[j].get("cost_per_input_token", 0)
            output_cost = pred_output_length * selected_llms[j].get("cost_per_output_token", 0)
            weighted_score = input_cost + output_cost
            cost_scores.append(weighted_score)
        
        # Select optimal model
        selected_idx = self.optimize_model_selection(
            selected_llms,
            perf_scores[0],
            np.array(cost_scores)
        )
        
        # Find index in original llm_configs
        for idx, config in enumerate(self.llm_configs):
            if config["name"] == selected_llms[selected_idx]["name"]:
                model_idxs.append(idx)
                break
    
    return model_idxs
```

#### Usage Example

```python
configs = [
    {"name": "gpt-4o-mini", "backend": "openai", "cost_per_input_token": 0.00005, "cost_per_output_token": 0.00015},
    {"name": "llama-3-8b", "backend": "ollama", "cost_per_input_token": 0.00001, "cost_per_output_token": 0.00002}
]

strategy = SmartRouting(
    llm_configs=configs,
    performance_requirement=0.7,
    n_similar=16
)

queries = ["What is the capital of France?", "Explain quantum computing"]
token_lengths = [8, 12]

model_idxs = strategy.get_model_idxs(configs, queries, token_lengths)
```
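The returned indices refer to positions in `llm_configs` (as the `enumerate(self.llm_configs)` lookup above shows), so they can be mapped back to the chosen backends. A minimal illustration continuing the example above:

```python
# Map each routed query back to the chosen backend's name.
for query, idx in zip(queries, model_idxs):
    print(f"{query!r} -> {configs[idx]['name']}")
```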

For implementation details and experimental results, see our [official repository](https://github.com/agiresearch/ECCOS) and [research paper](https://arxiv.org/abs/2502.20576).

### Reference

```
@article{mei2025eccos,
  title={ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving},
  author={Mei, Kai and Xu, Wujiang and Lin, Shuhang and Zhang, Yongfeng},
  journal={arXiv preprint arXiv:2502.20576},
  year={2025}
}
```

