
LLM Routing

When an agent sends an LLM request to the AIOS kernel, it can specify either a single LLM backend or multiple LLM backends. 1) If only one backend is specified, the request is sent directly to that LLM. 2) If multiple backends are specified, the agent allows the request to be processed by any of them; in this case, AIOS provides two routing strategies — Sequential Routing and Smart Routing — to decide which of the specified LLMs will process the request.
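Concretely, a request might carry its allowed backends as shown below. This is a hypothetical payload shape for illustration only; the actual AIOS request format may differ.

```python
# Hypothetical request payloads, only to illustrate the two cases.
single_backend = {
    "query": "Summarize this document.",
    "llms": [{"name": "gpt-4o-mini"}],  # one backend: routed directly
}
multi_backend = {
    "query": "Summarize this document.",
    "llms": [{"name": "gpt-4o-mini"},
             {"name": "llama-3-8b"}],   # several backends: the router picks one
}
```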

Overview of Routing Strategies

AIOS provides two routing strategies to distribute requests across multiple LLM backends:

  • Sequential Routing: cycles through the specified models in order and selects the next available one.

  • Smart Routing: a cost-quality optimized strategy that chooses the lowest-cost LLM while maintaining the quality of request processing.

Sequential Routing

The SequentialRouting strategy implements a basic model-selection approach for load-balancing LLM requests: it cycles through the specified models in order and selects the next one in the rotation.

Core Functions

def get_model_idxs(self, selected_llms: List[Dict], n_queries: int = 1):
    """Return, for each query, the index (in self.llm_configs) of the
    next model in the round-robin rotation over selected_llms."""
    model_idxs = []

    for _ in range(n_queries):
        current = selected_llms[self.idx]
        # Map the selected model back to its index in the full config list
        for i, llm_config in enumerate(self.llm_configs):
            if llm_config["name"] == current["name"]:
                model_idxs.append(i)
                break
        # Advance the round-robin pointer
        self.idx = (self.idx + 1) % len(selected_llms)

    return model_idxs
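The method above relies on router state (self.idx, self.llm_configs) set elsewhere in the class. The following self-contained sketch (the constructor here is an assumption for illustration, not the actual AIOS class) shows the resulting round-robin behavior:

```python
from typing import Dict, List

class SequentialRouting:
    """Minimal stand-in for the AIOS router, just to show the rotation."""
    def __init__(self, llm_configs: List[Dict]):
        self.llm_configs = llm_configs
        self.idx = 0  # round-robin position within selected_llms

    def get_model_idxs(self, selected_llms: List[Dict], n_queries: int = 1) -> List[int]:
        model_idxs = []
        for _ in range(n_queries):
            current = selected_llms[self.idx]
            for i, llm_config in enumerate(self.llm_configs):
                if llm_config["name"] == current["name"]:
                    model_idxs.append(i)  # index in the full config list
                    break
            self.idx = (self.idx + 1) % len(selected_llms)
        return model_idxs

configs = [{"name": "gpt-4o-mini"}, {"name": "llama-3-8b"}]
router = SequentialRouting(configs)
idxs = router.get_model_idxs(configs, n_queries=3)  # cycles 0, 1, 0
```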

Smart Routing

The SmartRouting strategy implements a cost-quality optimized selection for LLM requests, using historical performance data to predict which models will perform best for a given query while minimizing cost. It leverages a two-stage constrained optimization method.

Optimization Methods

  • Uses Lagrangian dual optimization to globally optimize model selection

  • Balances overall performance against total cost

  • Ensures each query is assigned to exactly one model
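As a toy illustration of the constrained selection above (not the actual ECCOS implementation — select_models, the dual-ascent schedule, and the numbers are all made up for this sketch), one can raise a Lagrange multiplier on performance until the batch-level performance requirement is met:

```python
import numpy as np

def select_models(perf, cost, perf_req, steps=200, lr=0.05):
    """Toy Lagrangian-dual sketch (hypothetical, not the AIOS code).
    perf, cost: (n_queries, n_models) predicted scores.
    Assigns exactly one model per query by minimizing cost - lam * perf,
    increasing the multiplier lam until mean performance >= perf_req."""
    lam = 0.0
    for _ in range(steps):
        choice = np.argmin(cost - lam * perf, axis=1)  # one model per query
        if perf[np.arange(len(choice)), choice].mean() >= perf_req:
            break
        lam += lr  # constraint violated: weight performance more heavily
    return choice

# Two queries, two models: model 0 is strong but costly, model 1 is cheap.
perf = np.array([[0.9, 0.6], [0.8, 0.75]])
cost = np.array([[1.0, 0.2], [1.0, 0.3]])
idxs = select_models(perf, cost, perf_req=0.8)  # upgrades query 0 to model 0
```

With a loose requirement the cheapest models win everywhere; tightening perf_req forces individual queries onto stronger, costlier models.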

def get_model_idxs(self, selected_llms, queries, input_token_lengths):
    model_idxs = []
    
    for i in range(len(queries)):
        query = queries[i]
        input_token_length = input_token_lengths[i]
        
        # Get performance and length predictions
        perf_scores, length_scores = self.store.predict(
            query, selected_llms, n_similar=self.n_similar
        )
        
        # Calculate cost scores
        cost_scores = []
        for j in range(len(selected_llms)):
            pred_output_length = length_scores[0][j]
            input_cost = input_token_length * selected_llms[j].get("cost_per_input_token", 0)
            output_cost = pred_output_length * selected_llms[j].get("cost_per_output_token", 0)
            weighted_score = input_cost + output_cost
            cost_scores.append(weighted_score)
        
        # Select optimal model
        selected_idx = self.optimize_model_selection(
            selected_llms,
            perf_scores[0],
            np.array(cost_scores)
        )
        
        # Find index in original llm_configs
        for idx, config in enumerate(self.llm_configs):
            if config["name"] == selected_llms[selected_idx]["name"]:
                model_idxs.append(idx)
                break
    
    return model_idxs

Usage Example

configs = [
    {"name": "gpt-4o-mini", "backend": "openai", "cost_per_input_token": 0.00005, "cost_per_output_token": 0.00015},
    {"name": "llama-3-8b", "backend": "ollama", "cost_per_input_token": 0.00001, "cost_per_output_token": 0.00002}
]

strategy = SmartRouting(
    llm_configs=configs,
    performance_requirement=0.7,
    n_similar=16
)

queries = ["What is the capital of France?", "Explain quantum computing"]
token_lengths = [8, 12]

model_idxs = strategy.get_model_idxs(configs, queries, token_lengths)

Reference

@article{mei2025eccos,
  title={ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving},
  author={Mei, Kai and Xu, Wujiang and Lin, Shuhang and Zhang, Yongfeng},
  journal={arXiv preprint arXiv:2502.20576},
  year={2025}
}


For implementation details and experimental results, see our official repository and research paper.