Context

The LLM generation process consists of two key stages: the prefilling stage and the decoding stage.

  1. Prefilling Stage: During this stage, the LLM processes the input prompt and computes the key-value (KV) caches, intermediate results that are reused during the subsequent decoding stage.

  2. Decoding Stage: In this stage, the LLM generates tokens one at a time, working autoregressively to predict each next token from the previously generated ones (see the sketch after this list).
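For concreteness, here is a minimal sketch of the two stages using HuggingFace transformers. The model choice (gpt2) and greedy decoding are illustrative assumptions, not part of the implementation described below.

```python
# Minimal sketch of the prefilling and decoding stages with a HuggingFace
# causal LM. "gpt2" and greedy decoding are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

# Prefilling stage: one forward pass over the whole prompt builds the KV cache.
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past_key_values = out.past_key_values

# Decoding stage: generate autoregressively, reusing the cache so each
# step only processes the single most recent token.
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
generated = [next_token]
for _ in range(10):
    with torch.no_grad():
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```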

Logits-based context switch

The SimpleContextManager class provides a practical implementation of context management for different types of LLM backends.

Key Features

  • Support for different model types (API-based, OpenAI client, HuggingFace local)

  • Time limit enforcement for generations

  • Context saving and restoration

  • Handling of special input types (tools, JSON formatting)

Context Switching Approaches

The implementation supports two primary approaches to context switching:

| Approach | Description | Compatibility | Implementation |
| --- | --- | --- | --- |
| Text-Based Context Switch | Stores generated text as intermediate results | Compatible with API-based models and OpenAI clients | Uses streaming responses with partial content accumulation |
| Logits-Based Context Switch | Stores model state, including the KV cache and logits | Currently supported only for HuggingFace models | Saves generated tokens, past key values, and other state information |

Text-Based Context Switching

For API-based models and OpenAI clients, the context manager uses a streaming approach:
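A sketch of what this streaming path can look like is shown below, assuming the OpenAI Python client. The function name and the time_limit parameter are illustrative, not the actual SimpleContextManager API.

```python
# Sketch of the text-based path: stream a completion, accumulate partial
# content, and stop early when a time budget is exhausted.
import time
from openai import OpenAI

client = OpenAI()

def generate_with_time_limit(messages, time_limit, model="gpt-4o-mini"):
    """Stream a completion, stopping early if time_limit (seconds) is hit.

    Returns (accumulated_text, finished), where finished is False when the
    generation was cut off and the partial text should be saved as context.
    """
    start = time.monotonic()
    chunks = []
    finished = True
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True
    )
    for chunk in stream:
        # Accumulate the partial content carried by each streamed chunk.
        if chunk.choices and chunk.choices[0].delta.content:
            chunks.append(chunk.choices[0].delta.content)
        # Enforce the time limit by breaking out of the processing loop.
        if time.monotonic() - start > time_limit:
            finished = False
            break
    return "".join(chunks), finished
```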

This approach:

  • Accumulates partial content from streamed responses

  • Enforces time limits by breaking out of the processing loop

  • Returns both the accumulated text and a completion status

Logits-Based Context Switching

For HuggingFace models, the context manager implements a more sophisticated approach:
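A minimal sketch of what this path can look like follows, assuming a local HuggingFace causal LM. The SavedState container and function names are illustrative assumptions; the key idea is that past_key_values lets decoding resume without recomputing the prompt or the prefix.

```python
# Sketch of the logits-based path: decode token by token, and on timeout
# save the generated tokens plus the KV cache so generation can resume
# from the exact interruption point.
import time
from dataclasses import dataclass

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class SavedState:
    input_ids: torch.Tensor  # prompt plus tokens generated so far
    past_key_values: object  # KV cache at the interruption point
    finished: bool

def generate_step_by_step(model, input_ids, past_key_values, time_limit,
                          max_new_tokens=256, eos_token_id=None):
    """Greedy-decode one token at a time; return a SavedState on timeout/EOS."""
    start = time.monotonic()
    for _ in range(max_new_tokens):
        # With a cache present, only the most recent token needs a forward pass.
        step_ids = input_ids[:, -1:] if past_key_values is not None else input_ids
        with torch.no_grad():
            out = model(step_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            return SavedState(input_ids, past_key_values, finished=True)
        if time.monotonic() - start > time_limit:
            # Save the full generation state so we can pick up exactly here.
            return SavedState(input_ids, past_key_values, finished=False)
    return SavedState(input_ids, past_key_values, finished=True)

# Usage: run until the budget expires, then resume from the saved state
# without re-running the prefill.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
state = generate_step_by_step(model, ids, None, time_limit=0.5,
                              eos_token_id=tokenizer.eos_token_id)
if not state.finished:
    state = generate_step_by_step(model, state.input_ids, state.past_key_values,
                                  time_limit=5.0, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(state.input_ids[0]))
```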

This approach:

  • Saves and restores the full model state (generated tokens, KV cache, position counters)

  • Continues generation from the exact point it was interrupted

  • Resumes more efficiently, since it does not recompute the prompt or previously generated tokens
