Context
The LLM generation process consists of two key stages: the prefilling stage and the decoding stage.
Prefilling Stage: During this stage, the LLM processes the full input prompt and computes the key-value (KV) caches, which store intermediate results that are reused during the subsequent decoding stage.
Decoding Stage: In this stage, the LLM generates tokens one at a time, working autoregressively to predict each next token based on the previously generated ones.
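As a rough illustration of the two stages, the following sketch uses a plain HuggingFace decode loop (the model name and loop structure are illustrative only and are not part of SimpleContextManager):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

# Prefilling stage: one forward pass over the whole prompt builds the KV cache.
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1:].argmax(dim=-1)

# Decoding stage: tokens are produced one at a time, each step reusing and
# extending the cached keys/values instead of re-encoding the whole prompt.
generated = prompt_ids
for _ in range(10):
    generated = torch.cat([generated, next_token], dim=-1)
    with torch.no_grad():
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)

print(tokenizer.decode(generated[0]))
```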
The SimpleContextManager class provides a practical implementation of context management for different types of LLM backends (a skeleton sketch follows the feature list below). Its key features include:
Support for different model types (API-based, OpenAI client, HuggingFace local)
Time limit enforcement for generations
Context saving and restoration
Handling of special input types (tools, JSON formatting)
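A minimal skeleton of such a manager might look like the following (field and method names here are hypothetical and may differ from the actual SimpleContextManager API):

```python
import time
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class SavedContext:
    text: str = ""                 # accumulated text (API / OpenAI backends)
    state: Optional[dict] = None   # tokens, KV cache, logits (HuggingFace backends)
    finished: bool = False

class SimpleContextManager:
    def __init__(self, backend: str, model: Any, time_limit: float = 30.0):
        self.backend = backend        # "api", "openai", or "huggingface"
        self.model = model
        self.time_limit = time_limit  # per-generation time budget in seconds
        self.saved: Optional[SavedContext] = None

    def generate(self, prompt: str, tools: Optional[list] = None,
                 json_mode: bool = False) -> SavedContext:
        """Run or resume a generation, saving context if the time limit is hit."""
        start = time.monotonic()
        if self.backend in ("api", "openai"):
            return self._text_based_switch(prompt, start, tools, json_mode)
        return self._logits_based_switch(prompt, start)

    def _text_based_switch(self, prompt, start, tools, json_mode) -> SavedContext:
        ...  # sketched in the text-based section below

    def _logits_based_switch(self, prompt, start) -> SavedContext:
        ...  # sketched in the logits-based section below
```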
The implementation supports two primary approaches to context switching:
Text-Based Context Switch
Stores generated text as intermediate results
Compatible with API-based models and OpenAI clients
Uses streaming responses with partial content accumulation
Logits-Based Context Switch
Stores model state including KV cache and logits
Currently only supported for HuggingFace models
Saves generated tokens, past key values, and other state information
Text-Based Context Switching
For API-based models and OpenAI clients, the context manager uses a streaming approach:
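A minimal sketch of this pattern with the OpenAI Python client is shown below (the stream_with_time_limit function and its resume_text parameter are illustrative, not the actual SimpleContextManager API):

```python
import time
from openai import OpenAI

def stream_with_time_limit(client: OpenAI, messages: list, model: str,
                           time_limit: float, resume_text: str = ""):
    """Accumulate streamed content, stopping early once the time budget is spent.

    Returns (accumulated_text, finished). resume_text is a hypothetical way to
    carry previously accumulated text back into a resumed call.
    """
    if resume_text:
        # Restore context by feeding the earlier partial answer back as an
        # assistant message so the model continues where it left off.
        messages = messages + [{"role": "assistant", "content": resume_text}]

    start = time.monotonic()
    accumulated = resume_text
    finished = True

    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            accumulated += chunk.choices[0].delta.content
        if time.monotonic() - start > time_limit:
            finished = False   # time limit hit: break out and save what we have
            break

    return accumulated, finished
```

A caller would treat finished == False as the signal to save the partial text and resume later.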
This approach:
Accumulates partial content from streamed responses
Enforces the time limit by breaking out of the processing loop
Returns both the accumulated text and a completion status
Logits-Based Context Switching
For HuggingFace models, the context manager implements a more sophisticated approach:
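A sketch of how the interruptible decode loop could be structured is shown below (a manual greedy loop; the keys in the saved-state dictionary are illustrative, not necessarily the names used by SimpleContextManager):

```python
import time
from typing import Optional

import torch

def generate_with_state(model, tokenizer, prompt: str, max_new_tokens: int,
                        time_limit: float, saved_state: Optional[dict] = None):
    """Token-by-token decoding that can pause and later resume from its KV cache.

    Returns (text, state). A non-None state means the generation was interrupted
    and can be resumed by passing that state back in.
    """
    if saved_state is None:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        generated, past_key_values, num_new = input_ids, None, 0
    else:
        # Restore the exact point of interruption: tokens so far, the KV cache,
        # and the count of new tokens already produced.
        generated = saved_state["generated_ids"]
        past_key_values = saved_state["past_key_values"]
        num_new = saved_state["num_new_tokens"]
        input_ids = generated[:, -1:]  # only the last token still needs a forward pass

    start = time.monotonic()
    while num_new < max_new_tokens:
        with torch.no_grad():
            out = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)  # greedy decoding for brevity
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token
        num_new += 1
        if next_token.item() == tokenizer.eos_token_id:
            break
        if time.monotonic() - start > time_limit:
            # Interrupted: hand back everything needed to resume without recomputation.
            state = {"generated_ids": generated,
                     "past_key_values": past_key_values,
                     "num_new_tokens": num_new}
            return tokenizer.decode(generated[0], skip_special_tokens=True), state

    return tokenizer.decode(generated[0], skip_special_tokens=True), None
```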
This approach:
Saves and restores the full model state (generated tokens, KV cache, and position counters)
Continues generation from the exact point it was interrupted
Provides more efficient resumption as it doesn't repeat computation