## Context
The LLM generation process consists of two key stages: the prefilling stage and the decoding stage.

- **Prefilling stage:** the LLM processes the prompt and computes the key-value (KV) cache, which stores intermediate attention results that are reused during the subsequent decoding stage.
- **Decoding stage:** the LLM generates tokens one at a time, working autoregressively to predict each next token from the previously generated ones.
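As a concrete illustration of this split, here is a minimal sketch against a HuggingFace causal LM (the model name is only a placeholder): one forward pass over the prompt builds `past_key_values` (prefilling), after which each step feeds only the newest token plus that cache (decoding).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Prefilling: one forward pass over the full prompt builds the KV cache.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values

# Decoding: each step feeds only the newest token plus the cached keys/values.
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
for _ in range(5):
    with torch.no_grad():
        out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
```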

The `SimpleContextManager` class provides a practical implementation of context management for different types of LLM backends.
### Key Features

- Support for different model types (API-based, OpenAI client, HuggingFace local)
- Time limit enforcement for generations
- Context saving and restoration
- Handling of special input types (tools, JSON formatting)
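Concretely, the storage side of these features reduces to a small per-process dictionary. The following is a hedged sketch of the class core inferred from the snippets below; only `SimpleContextManager`, `context_dict`, `load_context`, and `clear_context` appear in the source, and everything else is an assumption:

```python
class SimpleContextManager:
    """Sketch of the storage core; constructor arguments are assumed."""

    def __init__(self, default_time_limit: float = 30.0):  # assumed parameter
        # Maps a process id (stringified) to saved context: accumulated
        # text for API backends, or a state dict for HuggingFace backends.
        self.context_dict = {}
        self.default_time_limit = default_time_limit

    def load_context(self, pid):
        # Return previously saved state for this pid, or None if absent.
        return self.context_dict.get(str(pid))

    def clear_context(self, pid):
        # Drop saved state once a generation finishes cleanly.
        self.context_dict.pop(str(pid), None)
```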
## Context Switching Approaches

The implementation supports two primary approaches to context switching:

### Text-Based Context Switch

- Stores generated text as intermediate results
- Compatible with API-based models and OpenAI clients
- Uses streaming responses with partial content accumulation

### Logits-Based Context Switch

- Stores model state, including the KV cache and logits
- Currently only supported for HuggingFace models
- Saves generated tokens, past key values, and other state information
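The practical difference is the shape of what gets stored under each process id. A hedged illustration, assuming `manager` is a `SimpleContextManager` instance and using placeholder values for the HuggingFace state:

```python
# Text-based: the saved context is just the accumulated string.
manager.context_dict["42"] = "The generation so far..."

# Logits-based: the saved context is a dict of HuggingFace generation state
# (placeholder values here; the real ones come from the model's forward pass).
manager.context_dict["42"] = {
    "generated_tokens": [464, 3280],  # token ids produced so far
    "past_key_values": None,          # KV cache from the last forward pass
    "start_idx": 2,                   # token index at which to resume
    "input_length": 7,                # prompt length, for slicing out new text
}
```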
## Text-Based Context Switching
For API-based models and OpenAI clients, the context manager uses a streaming approach:
```python
import time

def process_completion_streaming_response(self, response, initial_content, pid, time_limit):
    """Accumulate streamed chunks, stopping early if the time limit is exceeded."""
    start_time = time.time()
    completed_response = initial_content
    finished = True
    for part in response:
        delta_content = part.choices[0].delta.content or ""
        completed_response += delta_content
        if time.time() - start_time > time_limit:
            # Only mark as unfinished if the stream had not already ended.
            if part.choices[0].finish_reason is None:
                finished = False
            break
    if not finished:
        # Save the partial text so a later call can resume from it.
        self.context_dict[str(pid)] = completed_response
    else:
        self.clear_context(str(pid))
    return completed_response, finished
```
This approach:

- Accumulates partial content from the streamed response
- Enforces the time limit by breaking out of the processing loop
- Returns both the accumulated text and a completion status
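As a usage sketch, the method would typically be fed an OpenAI streaming response. This assumes `manager` is a `SimpleContextManager` instance and uses the `pid` parameter shown above; the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Write a short poem."}],
    stream=True,  # yields chunks exposing choices[0].delta.content
)

text, finished = manager.process_completion_streaming_response(
    response, initial_content="", pid=42, time_limit=5.0
)
if not finished:
    # The partial text is now stored under "42" and can be resumed later.
    print("Timed out with partial output:", text)
```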
## Logits-Based Context Switching
For HuggingFace models, the context manager implements a more sophisticated approach:
```python
import time

def generate_with_time_limit_hf(self, model, messages, max_tokens, temperature, pid, time_limit):
    """Generate tokens incrementally, saving HF model state if time runs out."""
    context_data = self.load_context(pid)
    if context_data:
        # Restore previous state
        start_idx = context_data["start_idx"]
        generated_tokens = context_data["generated_tokens"]
        past_key_values = context_data["past_key_values"]
        input_length = context_data["input_length"]
    else:
        # Initialize new generation: sets start_idx, generated_tokens,
        # past_key_values, and input_length.
        # [Initialization code...]
        ...

    start_time = time.time()
    finished = True
    # Generate tokens incrementally with time checking
    for i in range(start_idx, max_tokens):
        if time.time() - start_time > time_limit:
            finished = False
            break
        # [Token generation code...]

    # Store state if not finished
    if not finished:
        self.context_dict[str(pid)] = {
            "generated_tokens": generated_tokens,
            "past_key_values": past_key_values,
            "start_idx": i,  # resume from the token index where we stopped
            "input_length": input_length,
        }
    else:
        self.clear_context(str(pid))
    # `result` is assembled from generated_tokens in the elided code above.
    return result, finished
```
This approach:

- Saves and restores the full model state (generated tokens, KV cache, position counters)
- Continues generation from the exact token at which it was interrupted
- Resumes more efficiently, since already-computed prefill and decode work is not repeated
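For context, the elided token-generation step in such a loop usually amounts to a single-token forward pass against the cached state. Below is a hedged sketch of one typical HuggingFace decode step, not the repository's actual code; `decode_one_token` is a hypothetical helper:

```python
import torch

def decode_one_token(model, next_input_ids, past_key_values, temperature):
    """One incremental decoding step: forward pass, sample, updated cache."""
    with torch.no_grad():
        out = model(next_input_ids, past_key_values=past_key_values, use_cache=True)
    logits = out.logits[:, -1, :]  # logits for the next-token position
    if temperature > 0:
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
    else:
        next_token = logits.argmax(dim=-1, keepdim=True)
    return next_token, out.past_key_values
```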