
Context


The LLM generation process consists of two key stages: the prefilling stage and the decoding stage.

  1. Prefilling Stage: During this stage, the LLM computes the key-value (KV) caches, which store intermediate results that can be reused during the subsequent decoding stage.

  2. Decoding Stage: In this stage, the LLM generates tokens progressively, working in an autoregressive manner to predict each next token based on the previously generated ones.
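
To make the two stages concrete, here is a minimal standalone sketch (using Hugging Face Transformers and gpt2 as an example, independent of AIOS): a single prefill pass builds the KV cache, and each decode step reuses it while feeding only the newest token.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")

# Prefilling: one forward pass over the full prompt populates the KV cache
out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

# Decoding: each step feeds only the newest token and reuses the cached keys/values
generated = [next_token]
for _ in range(10):
    out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tok.decode(torch.cat(generated, dim=-1)[0]))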

The SimpleContextManager class provides a practical implementation of context management for different types of LLM backends.

Key Features

  • Support for different model types (API-based, OpenAI client, HuggingFace local)

  • Time limit enforcement for generations

  • Context saving and restoration

  • Handling of special input types (tools, JSON formatting)
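
The method excerpts in the following sections rely on a small amount of shared bookkeeping: a context_dict keyed by process ID, plus load_context and clear_context helpers. Below is a simplified illustration of that pattern, written here for clarity rather than copied from the AIOS source (the save_context name is an assumption; the excerpts below write to context_dict directly).

class MinimalContextManager:
    """Illustrative save/load/clear pattern behind SimpleContextManager."""

    def __init__(self):
        # Maps a process ID (as a string) to its saved intermediate state:
        # partial text for API backends, or token/KV-cache state for local models.
        self.context_dict = {}

    def save_context(self, pid, state):
        self.context_dict[str(pid)] = state

    def load_context(self, pid):
        # Returns None when there is nothing to restore for this process
        return self.context_dict.get(str(pid))

    def clear_context(self, pid):
        self.context_dict.pop(str(pid), None)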

Context Switching Approaches

The implementation supports two primary approaches to context switching:

Text-Based Context Switch

  • Description: Stores generated text as intermediate results

  • Compatibility: API-based models and OpenAI clients

  • Implementation: Uses streaming responses with partial content accumulation

Logits-Based Context Switch

  • Description: Stores model state, including the KV cache and logits

  • Compatibility: Currently only supported for HuggingFace models

  • Implementation: Saves generated tokens, past key values, and other state information
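
Which strategy applies is determined by the backend type. A hypothetical dispatch sketch (the switch_context method name and the backend labels are assumptions, not part of the AIOS API), wired to the two methods shown in the following sections:

def switch_context(self, backend_type, *args, **kwargs):
    # Text-based switching for remote API/OpenAI-style backends,
    # logits-based switching for local HuggingFace models.
    if backend_type in ("api", "openai"):
        return self.process_completion_streaming_response(*args, **kwargs)
    if backend_type == "huggingface":
        return self.generate_with_time_limit_hf(*args, **kwargs)
    raise ValueError(f"Unsupported backend type: {backend_type}")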

Text-Based Context Switching

For API-based models and OpenAI clients, the context manager uses a streaming approach:

def process_completion_streaming_response(self, response, initial_content, time_limit, pid):
    # Note: pid is added to the signature here for clarity; it identifies the
    # process whose partial output is saved when the time limit is hit.
    start_time = time.time()
    completed_response = initial_content  # resume from any previously saved text
    finished = True

    for part in response:
        # Accumulate the streamed delta for this chunk (may be empty)
        delta_content = part.choices[0].delta.content or ""
        completed_response += delta_content

        if time.time() - start_time > time_limit:
            # Time slice expired; mark as unfinished if the model was still generating
            if part.choices[0].finish_reason is None:
                finished = False
            break

    if not finished:
        # Save the partial text so generation can resume later
        self.context_dict[str(pid)] = completed_response
    else:
        self.clear_context(str(pid))

    return completed_response, finished

This approach:

  • Accumulates partial content from streamed responses

  • Enforces time limits by breaking out of the processing loop

  • Returns both the accumulated text and a completion status
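
A hypothetical usage sketch, assuming the pid parameter added to the excerpt above, an argument-free SimpleContextManager constructor, and the official OpenAI Python client (the model name is a placeholder):

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caches briefly."}],
    stream=True,
)

manager = SimpleContextManager()  # constructor arguments assumed to be optional
text, finished = manager.process_completion_streaming_response(
    response=stream,
    initial_content="",   # or previously saved partial text when resuming
    time_limit=2.0,       # seconds allowed in this time slice
    pid=42,               # process ID used as the key for saved context
)
if not finished:
    print("Time slice expired; partial output saved for pid 42")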

Logits-Based Context Switching

For HuggingFace models, the context manager implements a more sophisticated approach:

def generate_with_time_limit_hf(self, model, messages, max_tokens, temperature, pid, time_limit):
    context_data = self.load_context(pid)
    
    if context_data:
        # Restore previous state
        start_idx = context_data["start_idx"]
        generated_tokens = context_data["generated_tokens"]
        past_key_values = context_data["past_key_values"]
        input_length = context_data["input_length"]
    else:
        # Initialize a new generation: tokenize the messages, run the prefill pass,
        # and set start_idx, generated_tokens, past_key_values, and input_length
        # [Initialization code...]
    
    start_time = time.time()
    finished = True
    
    # Generate tokens incrementally with time checking
    for i in range(start_idx, max_tokens):
        if time.time() - start_time > time_limit:
            finished = False
            break
        
        # [Token generation code...]
    
    # Store state if not finished
    if not finished:
        self.context_dict[str(pid)] = {
            "generated_tokens": generated_tokens,
            "past_key_values": past_key_values,
            "start_idx": start_idx,
            "input_length": input_length
        }
    else:
        self.clear_context(str(pid))
    
    return result, finished

This approach:

  • Saves and restores the model's generation state (generated tokens, KV cache, and position counters)

  • Continues generation from the exact point it was interrupted

  • Resumes more efficiently, since previously computed tokens and KV-cache entries are not recomputed

[Figure: Logits-based context switch]
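
The elided parts of generate_with_time_limit_hf can be pictured with a self-contained illustration of the same save/resume pattern. This is a sketch, not the AIOS implementation: the timed_generate function, the gpt2 model, and greedy decoding are assumptions made for the example.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def timed_generate(model, tok, prompt, saved_state=None, max_tokens=64, time_limit=1.0):
    """Greedy decoding that can be interrupted by a time limit and resumed later."""
    if saved_state:
        # Restore the interrupted generation exactly where it stopped
        generated = saved_state["generated_tokens"]
        past_key_values = saved_state["past_key_values"]
        start_idx = saved_state["start_idx"]
        next_token = generated[:, -1:]  # last produced token has not been fed yet
    else:
        # Prefill: a single pass over the prompt builds the KV cache
        inputs = tok(prompt, return_tensors="pt")
        out = model(**inputs, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = next_token
        start_idx = 1

    start_time = time.time()
    finished = True
    for i in range(start_idx, max_tokens):
        if time.time() - start_time > time_limit:
            finished = False  # time slice expired; stop before the next decode step
            break
        out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tok.eos_token_id:
            break

    # Save everything needed to resume, or discard it once generation completes
    state = None if finished else {
        "generated_tokens": generated,
        "past_key_values": past_key_values,
        "start_idx": i,
    }
    return tok.decode(generated[0]), finished, state

# Example: run in one-second slices until the generation finishes
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
text, finished, state = timed_generate(model, tok, "Operating systems for agents")
while not finished:
    text, finished, state = timed_generate(model, tok, "", saved_state=state)
print(text)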