# Context

The LLM generation process consists of two key stages: the **prefilling stage** and the **decoding stage**.

1. **Prefilling Stage**: The LLM processes the full prompt in a single forward pass and computes the key-value (KV) caches, intermediate results that are reused throughout the subsequent decoding stage.
2. **Decoding Stage**: The LLM generates tokens one at a time, autoregressively predicting each next token from the prompt and the previously generated tokens (a minimal sketch of both stages follows).
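
To make the two stages concrete, here is a minimal sketch using HuggingFace `transformers` (the `gpt2` checkpoint and greedy decoding are illustrative assumptions, not part of this project):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefilling: one forward pass over the whole prompt builds the KV cache.
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # Decoding: generate one token at a time, reusing the cached keys/values.
    generated = [next_token]
    for _ in range(9):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```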

<figure><img src="https://4233234616-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5h7XvlMFgKMtRboLGG1i%2Fuploads%2FrxngggQw9ZFJDhNRjWFu%2Fcontext.png?alt=media&#x26;token=b7ffd851-c22d-404c-8acc-6002be469cb0" alt="" width="563"><figcaption><p>Logits-based context switch</p></figcaption></figure>

The `SimpleContextManager` class provides a practical implementation of context management for different types of LLM backends.

#### Key Features

* Support for multiple backend types (API-based models, the OpenAI client, and local HuggingFace models)
* Time limit enforcement for generations
* Context saving and restoration
* Handling of special input types (tools, JSON formatting)
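
The method excerpts below rely on a small amount of shared state and two helpers, `load_context` and `clear_context`. A minimal skeleton of what these might look like (a sketch under assumed structure, not the actual implementation) is:

```python
import time  # used by the generation methods shown below


class SimpleContextManager:
    """Tracks partially finished generations, keyed by process/request id."""

    def __init__(self):
        # pid (as a string) -> saved intermediate state (text or model state)
        self.context_dict = {}

    def load_context(self, pid):
        # Return the saved state for this pid, or None if nothing was saved.
        return self.context_dict.get(str(pid))

    def clear_context(self, pid):
        # Drop any saved state once a generation completes normally.
        self.context_dict.pop(str(pid), None)
```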

#### Context Switching Approaches

The implementation supports two primary approaches to context switching:

| Approach                        | Description                                      | Compatibility                                       | Implementation                                                       |
| ------------------------------- | ------------------------------------------------ | --------------------------------------------------- | -------------------------------------------------------------------- |
| **Text-Based Context Switch**   | Stores generated text as intermediate results    | Compatible with API-based models and OpenAI clients | Uses streaming responses with partial content accumulation           |
| **Logits-Based Context Switch** | Stores model state including KV cache and logits | Currently only supported for HuggingFace models     | Saves generated tokens, past key values, and other state information |

**Text-Based Context Switching**

For API-based models and OpenAI clients, the context manager uses a streaming approach:

```python
def process_completion_streaming_response(self, response, initial_content, time_limit, pid):
    """Accumulate streamed chunks, stopping early if the time limit is hit."""
    start_time = time.time()
    completed_response = initial_content
    finished = True

    for part in response:
        # Each streamed chunk carries an incremental piece of the reply.
        delta_content = part.choices[0].delta.content or ""
        completed_response += delta_content

        if time.time() - start_time > time_limit:
            # A missing finish_reason means the model was still mid-generation.
            if part.choices[0].finish_reason is None:
                finished = False
            break

    if not finished:
        # Save the partial text so a later call can pick up where this one stopped.
        self.context_dict[str(pid)] = completed_response
    else:
        self.clear_context(str(pid))

    return completed_response, finished
```

This approach:

* Accumulates partial content from streamed responses
* Enforces the time limit by breaking out of the processing loop
* Saves the partial text under the caller's `pid` when interrupted
* Returns both the accumulated text and a completion status
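
A hypothetical call site for this method, using the OpenAI Python client (the client setup, model name, and `pid` value are assumptions for illustration):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
manager = SimpleContextManager()
pid = "worker-0"

# Resume from any text saved for this pid by an earlier, interrupted call.
initial = manager.load_context(pid) or ""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,  # streaming is what makes mid-generation interruption possible
)
text, finished = manager.process_completion_streaming_response(
    response, initial_content=initial, time_limit=5.0, pid=pid
)
```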

**Logits-Based Context Switching**

For HuggingFace models, the context manager implements a more sophisticated approach:

```python
def generate_with_time_limit_hf(self, model, messages, max_tokens, temperature, pid, time_limit):
    context_data = self.load_context(pid)

    if context_data:
        # Restore the state saved when this pid was last interrupted
        start_idx = context_data["start_idx"]
        generated_tokens = context_data["generated_tokens"]
        past_key_values = context_data["past_key_values"]
        input_length = context_data["input_length"]
    else:
        # Initialize a new generation
        ...  # [Initialization code...]

    start_time = time.time()
    finished = True

    # Generate tokens incrementally, checking the time limit at each step
    for i in range(start_idx, max_tokens):
        if time.time() - start_time > time_limit:
            finished = False
            break

        # [Token generation code...]

    # Store the state if the generation was interrupted
    if not finished:
        self.context_dict[str(pid)] = {
            "generated_tokens": generated_tokens,
            "past_key_values": past_key_values,
            "start_idx": i,  # resume from the step where generation stopped
            "input_length": input_length,
        }
    else:
        self.clear_context(str(pid))

    # `result` is decoded from `generated_tokens` in the elided code above
    return result, finished
```

This approach:

* Saves and restores the full generation state (generated tokens, KV cache, position counters)
* Continues generation from the exact token at which it was interrupted
* Resumes more efficiently, since none of the earlier computation is repeated
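
The excerpt elides the per-token generation step. Under the assumption that it follows the standard incremental HuggingFace decoding pattern, a single step might look like this sketch (variable names match the excerpt; this is not the project's actual code):

```python
import torch

with torch.no_grad():
    out = model(
        input_ids=generated_tokens[:, -1:],  # feed only the newest token
        past_key_values=past_key_values,     # reuse the cached keys/values
        use_cache=True,
    )
past_key_values = out.past_key_values

# Sample the next token from the temperature-scaled distribution.
probs = torch.softmax(out.logits[:, -1, :] / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated_tokens = torch.cat([generated_tokens, next_token], dim=-1)
```

Because only the newest token is fed through the model on each resumed step, resumption costs the same as an uninterrupted step, which is what makes this approach cheaper than replaying accumulated text.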
