# LLM Routing

When an agent sends an LLM request to the AIOS kernel, it can specify either a single LLM backend or multiple LLM backends. 1) If only one backend is specified, the request is sent to that LLM. 2) If multiple backends are specified, the agent allows the request to be processed by any of them; in this case, AIOS provides two LLM routing strategies, Sequential Routing and Smart Routing, to decide which of the specified LLMs will process the request.
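As a rough illustration, a request permitting multiple backends might look like the sketch below. The field names here are hypothetical; the exact request schema depends on the AIOS SDK version.

```python
# Hypothetical sketch: field names are illustrative, not the exact AIOS request schema.
single_backend_request = {
    "query": "Summarize this document.",
    "llms": ["gpt-4o-mini"],                # routed directly to this backend
}
multi_backend_request = {
    "query": "Summarize this document.",
    "llms": ["gpt-4o-mini", "llama-3-8b"],  # any of these may be chosen by the router
}
```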

### Overview of Routing Strategies

AIOS provides two routing strategies to distribute requests across multiple LLM backends:

| Strategy Type          | Description                                                                                                                     |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| **Sequential Routing** | Cycles through the specified models in round-robin order and assigns each request to the next one in the rotation               |
| **Smart Routing**      | A cost-quality optimized strategy that selects the lowest-cost LLM predicted to still meet the required processing quality      |

### Sequential Routing

The `SequentialRouting` strategy implements a basic round-robin approach for load-balancing LLM requests: it cycles through the models specified by the agent and assigns each request to the next one in order.

#### Core Functions

```python
def get_model_idxs(self, selected_llms: List[dict], n_queries: int = 1):
    model_idxs = []

    for _ in range(n_queries):
        # Pick the next model config in round-robin order.
        current = selected_llms[self.idx]
        # Map the pick back to its index in the full config list.
        for i, llm_config in enumerate(self.llm_configs):
            if llm_config["name"] == current["name"]:
                model_idxs.append(i)
                break
        # Advance the cursor, wrapping around the selected models.
        self.idx = (self.idx + 1) % len(selected_llms)

    return model_idxs
```
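#### Usage Example

A minimal usage sketch. This assumes `SequentialRouting` takes an `llm_configs` list like `SmartRouting` does and starts its cursor at index 0; the constructor signature is an assumption here, so check the AIOS source for the exact interface.

```python
configs = [
    {"name": "gpt-4o-mini", "backend": "openai"},
    {"name": "llama-3-8b", "backend": "ollama"},
]

# Assumed constructor, mirroring SmartRouting's; the actual signature may differ.
strategy = SequentialRouting(llm_configs=configs)

# Three queries allowed on both backends: the cursor cycles 0 -> 1 -> 0.
model_idxs = strategy.get_model_idxs(configs, n_queries=3)
# model_idxs == [0, 1, 0]
```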

### Smart Routing

The `SmartRouting` strategy implements a cost-quality optimized selection approach for LLM requests: it uses historical performance data to predict which models will perform well on a given query while minimizing cost, via a two-stage constrained optimization method. The figure below shows the overall pipeline of this smart routing strategy.

<figure><img src="/files/T9tA7psKmNY8gKzhT2YZ" alt="" width="375"><figcaption></figcaption></figure>

#### Optimization Method

* Uses Lagrangian dual optimization to select models globally across all queries (one plausible formulation is sketched after this list)
* Balances overall performance against total cost
* Ensures each query is assigned to exactly one model
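For concreteness, here is one plausible way to write the selection problem these bullets describe. This is a sketch inferred from the description, not necessarily the exact objective used in the implementation; see the ECCOS paper for the precise formulation. Let `c_ij` and `p_ij` be the predicted cost and performance of model `j` on query `i`, `x_ij` a binary assignment variable, and `tau` the performance requirement:

```latex
% Sketch only: the exact formulation in ECCOS may differ.
\min_{x_{ij} \in \{0,1\}} \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij}
\quad \text{s.t.} \quad
\sum_{j=1}^{m} x_{ij} = 1 \;\; \forall i,
\qquad
\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} x_{ij} \ge \tau
```

Relaxing the performance constraint with a Lagrangian multiplier `lambda >= 0` decouples the problem across queries, so each query can be assigned to the model minimizing `c_ij - lambda * p_ij`.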

```python
def get_model_idxs(self, selected_llms, queries, input_token_lengths):
    model_idxs = []
    
    for i in range(len(queries)):
        query = queries[i]
        input_token_length = input_token_lengths[i]
        
        # Get performance and length predictions
        perf_scores, length_scores = self.store.predict(
            query, selected_llms, n_similar=self.n_similar
        )
        
        # Calculate cost scores
        cost_scores = []
        for j in range(len(selected_llms)):
            pred_output_length = length_scores[0][j]
            input_cost = input_token_length * selected_llms[j].get("cost_per_input_token", 0)
            output_cost = pred_output_length * selected_llms[j].get("cost_per_output_token", 0)
            weighted_score = input_cost + output_cost
            cost_scores.append(weighted_score)
        
        # Select optimal model
        selected_idx = self.optimize_model_selection(
            selected_llms,
            perf_scores[0],
            np.array(cost_scores)
        )
        
        # Find index in original llm_configs
        for idx, config in enumerate(self.llm_configs):
            if config["name"] == selected_llms[selected_idx]["name"]:
                model_idxs.append(idx)
                break
    
    return model_idxs
```

#### Usage Example

```python
configs = [
    {"name": "gpt-4o-mini", "backend": "openai", "cost_per_input_token": 0.00005, "cost_per_output_token": 0.00015},
    {"name": "llama-3-8b", "backend": "ollama", "cost_per_input_token": 0.00001, "cost_per_output_token": 0.00002}
]

strategy = SmartRouting(
    llm_configs=configs,
    performance_requirement=0.7,
    n_similar=16
)

queries = ["What is the capital of France?", "Explain quantum computing"]
token_lengths = [8, 12]

model_idxs = strategy.get_model_idxs(configs, queries, token_lengths)
```
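The returned indices refer to positions in `llm_configs` (as the `enumerate(self.llm_configs)` lookup above shows), so they can be mapped back to the chosen backends. A minimal illustration continuing the example above:

```python
# Map each routed query back to the chosen backend's name.
for query, idx in zip(queries, model_idxs):
    print(f"{query!r} -> {configs[idx]['name']}")
```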

For implementation details and experimental results, see our [official repository](https://github.com/agiresearch/ECCOS) and [research paper](https://arxiv.org/abs/2502.20576).

### Reference

```
@article{mei2025eccos,
  title={ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving},
  author={Mei, Kai and Xu, Wujiang and Lin, Shuhang and Zhang, Yongfeng},
  journal={arXiv preprint arXiv:2502.20576},
  year={2025}
}
```

