Context
The LLM generation process consists of two key stages: the prefilling stage and the decoding stage.
Prefilling Stage: During this stage, the LLM computes the key-value (KV) caches, which store intermediate results that can be reused during the subsequent decoding stage.
Decoding Stage: In this stage, the LLM generates tokens progressively, working in an autoregressive manner to predict each next token based on the previously generated ones.
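The sketch below illustrates the two stages with a small HuggingFace causal LM: a single forward pass over the prompt builds the KV cache (prefilling), and then tokens are generated one at a time while reusing that cache (decoding). The model name, prompt, and greedy decoding loop are illustrative choices, not part of SimpleContextManager.

```python
# Minimal sketch of prefilling vs. decoding with a HuggingFace causal LM.
# "gpt2" and the prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Prefilling stage: one forward pass over the whole prompt builds the KV cache.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values

# Decoding stage: generate tokens one at a time, reusing the cached keys/values
# so only the newest token is fed through the model at each step.
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
generated = [next_token]
for _ in range(10):
    with torch.no_grad():
        out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```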

The SimpleContextManager class provides a practical implementation of context management for different types of LLM backends.
Key Features
Support for different model types (API-based, OpenAI client, HuggingFace local)
Time limit enforcement for generations
Context saving and restoration
Handling of special input types (tools, JSON formatting)
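A minimal skeleton of how these features might fit together is sketched below. The method names, constructor arguments, and backend labels are illustrative assumptions rather than the project's actual API.

```python
# Illustrative skeleton only; names are assumptions, not the project's real API.
import time
from typing import Any, Optional


class SimpleContextManager:
    """Manages generation context across different LLM backends."""

    def __init__(self, backend: str = "api", time_limit: Optional[float] = None):
        # Backend selection: "api", "openai", or "huggingface" (local model).
        self.backend = backend
        # Optional wall-clock budget (seconds) for a single generation call.
        self.time_limit = time_limit
        # Saved context: text for API backends, token/KV state for local models.
        self.saved_context: Optional[dict] = None

    def within_time_limit(self, start_time: float) -> bool:
        """Return True while the generation is still inside its time budget."""
        return self.time_limit is None or (time.time() - start_time) < self.time_limit

    def save_context(self, context: dict) -> None:
        """Store intermediate results so generation can resume later."""
        self.saved_context = context

    def restore_context(self) -> Optional[dict]:
        """Return the previously saved context, if any."""
        return self.saved_context

    def prepare_inputs(self, prompt: str, tools: Optional[list] = None,
                       json_mode: bool = False) -> dict:
        """Handle special input types such as tool schemas or JSON formatting."""
        request: dict[str, Any] = {"prompt": prompt}
        if tools:
            request["tools"] = tools
        if json_mode:
            request["response_format"] = {"type": "json_object"}
        return request
```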
Context Switching Approaches
The implementation supports two primary approaches to context switching:
Text-Based Context Switch
Stores generated text as intermediate results
Compatible with API-based models and OpenAI clients
Uses streaming responses with partial content accumulation
Logits-Based Context Switch
Stores model state including KV cache and logits
Currently only supported for HuggingFace models
Saves generated tokens, past key values, and other state information
Text-Based Context Switching
For API-based models and OpenAI clients, the context manager uses a streaming approach:
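A sketch of what such a streaming loop can look like is shown below, assuming the official openai Python client; the function name, model name, and default time limit are placeholders.

```python
# Sketch of a streaming, text-based context switch with an OpenAI-compatible client.
import time
from openai import OpenAI

client = OpenAI()

def generate_with_time_limit(messages, time_limit=5.0, model="gpt-4o-mini"):
    """Stream a completion, accumulating partial text until done or out of time."""
    start = time.time()
    partial_text = ""
    finished = True
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            partial_text += delta  # accumulate partial content from the stream
        if time.time() - start > time_limit:
            finished = False       # time budget exhausted: break out of the loop
            break
    return partial_text, finished
```

The accumulated partial text can then be saved and fed back (for example, as part of the next prompt) so that a follow-up request continues from where this one stopped.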
This approach:
Accumulates partial content from streamed responses
Enforces time limits by breaking out of the processing loop
Returns both the accumulated text and a completion status
Logits-Based Context Switching
For HuggingFace models, the context manager implements a more sophisticated approach:
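A sketch of this pattern is shown below, assuming a local HuggingFace causal LM. The shape of the saved state dictionary, the function name, and the greedy decoding strategy are illustrative assumptions.

```python
# Sketch of a logits-based context switch: save KV cache, tokens, and logits,
# then resume decoding from the exact interruption point.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def generate(state, max_new_tokens, time_limit=None):
    """Greedy decoding that stops early on the time limit and returns resumable state."""
    start = time.time()
    past = state.get("past_key_values")
    ids = state["generated_ids"]
    for _ in range(max_new_tokens):
        with torch.no_grad():
            if past is None:
                out = model(input_ids=ids, use_cache=True)  # prefill the prompt
            else:
                out = model(input_ids=ids[:, -1:], past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_token], dim=-1)
        if time_limit and time.time() - start > time_limit:
            break  # interrupted: state below lets a later call pick up here
    # Save everything needed to continue generation without recomputation.
    return {"past_key_values": past, "generated_ids": ids, "logits": out.logits}


prompt_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
state = {"generated_ids": prompt_ids, "past_key_values": None}
state = generate(state, max_new_tokens=8, time_limit=0.5)  # may be interrupted
state = generate(state, max_new_tokens=8)                  # resumes from the saved KV cache
print(tokenizer.decode(state["generated_ids"][0]))
```

Because the KV cache is carried over, the resumed call only processes tokens that have not yet been seen, which is what makes this approach cheaper than re-prompting with the accumulated text.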
This approach:
Saves and restores the full model state (generation state, position counters)
Continues generation from the exact point it was interrupted
Resumes more efficiently, since the prompt and previously generated tokens are not recomputed