## Context
The LLM generation process consists of two key stages: the prefilling stage and the decoding stage.

- **Prefilling stage:** the LLM processes the prompt and computes the key-value (KV) cache, which stores intermediate attention results that are reused during the subsequent decoding stage.
- **Decoding stage:** the LLM generates tokens one at a time, working autoregressively to predict each next token from the previously generated ones.
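As a concrete illustration of this split, here is a minimal sketch against a HuggingFace causal LM (the model name is only a placeholder): one forward pass over the prompt builds `past_key_values` (prefilling), after which each step feeds only the newest token plus that cache (decoding).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Prefilling: one forward pass over the full prompt builds the KV cache.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values

# Decoding: each step feeds only the newest token plus the cached keys/values.
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
for _ in range(5):
    with torch.no_grad():
        out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
```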

The `SimpleContextManager` class provides a practical implementation of context management for different types of LLM backends.
### Key Features

- Support for different model types (API-based, OpenAI client, HuggingFace local)
- Time limit enforcement for generations
- Context saving and restoration
- Handling of special input types (tools, JSON formatting)
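Concretely, the storage side of these features reduces to a small per-process dictionary. The following is a hedged sketch of the class core inferred from the snippets below; only `SimpleContextManager`, `context_dict`, `load_context`, and `clear_context` appear in the source, and everything else is an assumption:

```python
class SimpleContextManager:
    """Sketch of the storage core; constructor arguments are assumed."""

    def __init__(self, default_time_limit: float = 30.0):  # assumed parameter
        # Maps a process id (stringified) to saved context: accumulated
        # text for API backends, or a state dict for HuggingFace backends.
        self.context_dict = {}
        self.default_time_limit = default_time_limit

    def load_context(self, pid):
        # Return previously saved state for this pid, or None if absent.
        return self.context_dict.get(str(pid))

    def clear_context(self, pid):
        # Drop saved state once a generation finishes cleanly.
        self.context_dict.pop(str(pid), None)
```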
## Context Switching Approaches

The implementation supports two primary approaches to context switching:

### Text-Based Context Switch

- Stores generated text as intermediate results
- Compatible with API-based models and OpenAI clients
- Uses streaming responses with partial content accumulation

### Logits-Based Context Switch

- Stores model state, including the KV cache and logits
- Currently only supported for HuggingFace models
- Saves generated tokens, past key values, and other state information
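The practical difference is the shape of what gets stored under each process id. A hedged illustration, assuming `manager` is a `SimpleContextManager` instance and using placeholder values for the HuggingFace state:

```python
# Text-based: the saved context is just the accumulated string.
manager.context_dict["42"] = "The generation so far..."

# Logits-based: the saved context is a dict of HuggingFace generation state
# (placeholder values here; the real ones come from the model's forward pass).
manager.context_dict["42"] = {
    "generated_tokens": [464, 3280],  # token ids produced so far
    "past_key_values": None,          # KV cache from the last forward pass
    "start_idx": 2,                   # token index at which to resume
    "input_length": 7,                # prompt length, for slicing out new text
}
```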
## Text-Based Context Switching
For API-based models and OpenAI clients, the context manager uses a streaming approach:
```python
import time

def process_completion_streaming_response(self, response, initial_content, pid, time_limit):
    """Accumulate streamed chunks, stopping early if the time limit is exceeded."""
    start_time = time.time()
    completed_response = initial_content
    finished = True
    for part in response:
        delta_content = part.choices[0].delta.content or ""
        completed_response += delta_content
        if time.time() - start_time > time_limit:
            # Only mark as unfinished if the stream had not already ended.
            if part.choices[0].finish_reason is None:
                finished = False
            break
    if not finished:
        # Save the partial text so a later call can resume from it.
        self.context_dict[str(pid)] = completed_response
    else:
        self.clear_context(str(pid))
    return completed_response, finished
```
This approach:

- Accumulates partial content from the streamed response
- Enforces the time limit by breaking out of the processing loop
- Returns both the accumulated text and a completion status
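As a usage sketch, the method would typically be fed an OpenAI streaming response. This assumes `manager` is a `SimpleContextManager` instance and uses the `pid` parameter shown above; the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Write a short poem."}],
    stream=True,  # yields chunks exposing choices[0].delta.content
)

text, finished = manager.process_completion_streaming_response(
    response, initial_content="", pid=42, time_limit=5.0
)
if not finished:
    # The partial text is now stored under "42" and can be resumed later.
    print("Timed out with partial output:", text)
```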
## Logits-Based Context Switching
For HuggingFace models, the context manager implements a more sophisticated approach:
```python
import time

def generate_with_time_limit_hf(self, model, messages, max_tokens, temperature, pid, time_limit):
    """Generate tokens incrementally, saving HF model state if time runs out."""
    context_data = self.load_context(pid)
    if context_data:
        # Restore previous state
        start_idx = context_data["start_idx"]
        generated_tokens = context_data["generated_tokens"]
        past_key_values = context_data["past_key_values"]
        input_length = context_data["input_length"]
    else:
        # Initialize new generation: sets start_idx, generated_tokens,
        # past_key_values, and input_length.
        # [Initialization code...]
        ...

    start_time = time.time()
    finished = True
    # Generate tokens incrementally with time checking
    for i in range(start_idx, max_tokens):
        if time.time() - start_time > time_limit:
            finished = False
            break
        # [Token generation code...]

    # Store state if not finished
    if not finished:
        self.context_dict[str(pid)] = {
            "generated_tokens": generated_tokens,
            "past_key_values": past_key_values,
            "start_idx": i,  # resume from the token index where we stopped
            "input_length": input_length,
        }
    else:
        self.clear_context(str(pid))
    # `result` is assembled from generated_tokens in the elided code above.
    return result, finished
```
This approach:

- Saves and restores the full model state (generated tokens, KV cache, position counters)
- Continues generation from the exact token at which it was interrupted
- Resumes more efficiently, since already-computed prefill and decode work is not repeated
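For context, the elided token-generation step in such a loop usually amounts to a single-token forward pass against the cached state. Below is a hedged sketch of one typical HuggingFace decode step, not the repository's actual code; `decode_one_token` is a hypothetical helper:

```python
import torch

def decode_one_token(model, next_input_ids, past_key_values, temperature):
    """One incremental decoding step: forward pass, sample, updated cache."""
    with torch.no_grad():
        out = model(next_input_ids, past_key_values=past_key_values, use_cache=True)
    logits = out.logits[:, -1, :]  # logits for the next-token position
    if temperature > 0:
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
    else:
        next_token = logits.argmax(dim=-1, keepdim=True)
    return next_token, out.past_key_values
```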