# Context

The LLM generation process consists of two key stages: the **prefilling stage** and the **decoding stage**.

1. **Prefilling Stage**: During this stage, the LLM processes the input prompt and computes the key-value (KV) cache, which stores intermediate results that can be reused during the subsequent decoding stage.
2. **Decoding Stage**: In this stage, the LLM generates tokens one at a time in an autoregressive manner, predicting each next token from the previously generated ones.
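
The two stages can be illustrated with a toy sketch. Nothing here is a real model: `prefill`, `decode_step`, and `generate` are hypothetical names, and the "KV cache" is just a list of integers. The point is only the control flow, in which the prompt is processed once and each decoding step reuses and extends the cache.

```python
from typing import List, Tuple

# Toy stand-in for a transformer (an assumption for illustration only):
# the "KV cache" holds one integer per processed token.
KVCache = List[int]


def prefill(prompt_tokens: List[int]) -> KVCache:
    """Prefilling stage: process the whole prompt once and build the cache."""
    return [t * 2 for t in prompt_tokens]


def decode_step(cache: KVCache) -> Tuple[int, KVCache]:
    """One decoding step: predict a token from the cache, then extend the cache."""
    next_token = sum(cache) % 50
    return next_token, cache + [next_token * 2]


def generate(prompt_tokens: List[int], max_new_tokens: int) -> List[int]:
    cache = prefill(prompt_tokens)       # runs once per request
    output: List[int] = []
    for _ in range(max_new_tokens):      # runs once per generated token
        token, cache = decode_step(cache)
        output.append(token)
    return output
```

Because each step depends only on the cache, saving the cache together with the tokens generated so far is enough to pause and later continue the loop.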

<figure><img src="/files/A1TyH27uEZkdpfpRShdx" alt="" width="563"><figcaption><p>Logits-based context switch</p></figcaption></figure>

The `SimpleContextManager` class provides a practical implementation of context management for different types of LLM backends.

#### Key Features

* Support for different model types (API-based, OpenAI client, HuggingFace local)
* Time limit enforcement for generations
* Context saving and restoration
* Handling of special input types (tools, JSON formatting)
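
Context saving and restoration can be pictured as a small key-value store indexed by process ID. The sketch below is a minimal illustration, not the actual `SimpleContextManager` code: the class name `MiniContextStore` and the `save_context` method are hypothetical, while `context_dict`, `load_context`, and `clear_context` mirror names used in the excerpts on this page.

```python
from typing import Any, Dict, Optional


class MiniContextStore:
    """Hypothetical in-memory context store keyed by process ID."""

    def __init__(self) -> None:
        self.context_dict: Dict[str, Any] = {}

    def save_context(self, pid: Any, context: Any) -> None:
        # Keys are normalized to strings so int and str PIDs map to one entry
        self.context_dict[str(pid)] = context

    def load_context(self, pid: Any) -> Optional[Any]:
        # Returns None when no context was saved for this process
        return self.context_dict.get(str(pid))

    def clear_context(self, pid: Any) -> None:
        # Removing a missing entry is a no-op
        self.context_dict.pop(str(pid), None)
```

The stored value can be anything the backend needs to resume: plain text for API-based models, or a dictionary of tokens and cache tensors for local models.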

#### Context Switching Approaches

The implementation supports two primary approaches to context switching:

| Approach                        | Description                                      | Compatibility                                       | Implementation                                                       |
| ------------------------------- | ------------------------------------------------ | --------------------------------------------------- | -------------------------------------------------------------------- |
| **Text-Based Context Switch**   | Stores generated text as intermediate results    | Compatible with API-based models and OpenAI clients | Uses streaming responses with partial content accumulation           |
| **Logits-Based Context Switch** | Stores model state including KV cache and logits | Currently only supported for HuggingFace models     | Saves generated tokens, past key values, and other state information |

**Text-Based Context Switching**

For API-based models and OpenAI clients, the context manager uses a streaming approach:

```python
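import time

def process_completion_streaming_response(self, response, initial_content, time_limit, pid):
    # `pid` identifies the process whose context is saved; it is assumed
    # to be supplied by the caller.
    start_time = time.time()
    completed_response = initial_content  # text restored from an earlier slice, if any
    finished = True

    for part in response:
        # Append the newly streamed fragment (content may be None)
        delta_content = part.choices[0].delta.content or ""
        completed_response += delta_content

        # Stop once the time slice is used up
        if time.time() - start_time > time_limit:
            # No finish_reason means the response is still incomplete
            if part.choices[0].finish_reason is None:
                finished = False
            break

    if not finished:
        # Save the partial text so generation can resume later
        self.context_dict[str(pid)] = completed_response
    else:
        self.clear_context(str(pid))

    return completed_response, finished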
```

This approach:

* Accumulates partial content from streamed responses
* Enforces time limits by breaking out of the processing loop
* Returns both the accumulated text and a completion status
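
The behavior can be tried without a real API. The following is a sketch under stated assumptions: `make_chunk`, `consume_stream`, and `make_fake_clock` are hypothetical helpers, the chunks only imitate the shape of OpenAI-style streaming chunks, and the clock is injected so the time limit triggers deterministically. Reusing the same iterator for the second slice is a simplification; how the saved text is fed back to a real backend is not shown here.

```python
import time
from types import SimpleNamespace


def make_chunk(text, finish_reason=None):
    """Build an object shaped like one OpenAI-style streaming chunk."""
    choice = SimpleNamespace(delta=SimpleNamespace(content=text),
                             finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


def consume_stream(response, initial_content, time_limit, pid, context_dict,
                   clock=time.monotonic):
    """Accumulate streamed text until the stream ends or the time slice expires."""
    start = clock()
    text = initial_content
    finished = True
    for part in response:
        choice = part.choices[0]
        text += choice.delta.content or ""
        if clock() - start > time_limit:
            if choice.finish_reason is None:
                finished = False      # interrupted before the model was done
            break
    if finished:
        context_dict.pop(str(pid), None)
    else:
        context_dict[str(pid)] = text  # keep the partial text for the next slice
    return text, finished


def make_fake_clock():
    """Deterministic clock for the demo: advances by 1 on every call."""
    ticks = iter(range(1_000))
    return lambda: next(ticks)


context_dict = {}
clock = make_fake_clock()
stream = iter([make_chunk("Hello, "), make_chunk("wor"),
               make_chunk("ld"), make_chunk("!", "stop")])

# First slice: the limit is exceeded while the second chunk is processed
partial, done_first = consume_stream(stream, "", 1.5, 1, context_dict, clock)

# Second slice: continue from the saved text with a generous limit
full, done_second = consume_stream(stream, context_dict["1"], 10, 1,
                                   context_dict, clock)
```

After the first call the partial text is stored under the process ID; after the second call the entry is removed because the response completed.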

**Logits-Based Context Switching**

For HuggingFace models, the context manager saves the generation state itself, including the KV cache, so an interrupted generation can resume without recomputing earlier tokens:

```python
def generate_with_time_limit_hf(self, model, messages, max_tokens, temperature, pid, time_limit):
    context_data = self.load_context(pid)
    
    if context_data:
        # Restore previous state
        start_idx = context_data["start_idx"]
        generated_tokens = context_data["generated_tokens"]
        past_key_values = context_data["past_key_values"]
        input_length = context_data["input_length"]
    else:
        # Initialize new generation
        # [Initialization code...]
        ...  # elided in this excerpt
    
    start_time = time.time()
    finished = True
    
    # Generate tokens incrementally with time checking
    for i in range(start_idx, max_tokens):
        if time.time() - start_time > time_limit:
            finished = False
            break
        
        # [Token generation code...]
    
    # Store state if not finished
    if not finished:
        self.context_dict[str(pid)] = {
            "generated_tokens": generated_tokens,
            "past_key_values": past_key_values,
            "start_idx": start_idx,
            "input_length": input_length
        }
    else:
        self.clear_context(str(pid))
    
    return result, finished
```

This approach:

* Saves and restores the generation state (generated tokens, `past_key_values`, start index, and input length)
* Continues generation from the exact point at which it was interrupted
* Resumes more efficiently than the text-based approach, because the restored KV cache avoids recomputing earlier tokens
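
The resumption logic can be sketched end to end with a toy model. This is an illustration under assumptions, not the HuggingFace code path: `toy_step` stands in for a forward pass, the "KV cache" is a plain list, and `generate_resumable` and `ticking_clock` are hypothetical names. The sketch records the index of the next token to generate when it is interrupted, so the resumed loop neither repeats nor skips a step.

```python
import time
from typing import Callable, Dict, List, Tuple


def toy_step(cache: List[int]) -> Tuple[int, List[int]]:
    """Toy stand-in for one forward pass: returns a token and the grown cache."""
    token = (sum(cache) * 31 + len(cache)) % 97
    return token, cache + [token]


def generate_resumable(prompt: List[int], max_tokens: int, time_limit: float,
                       pid: int, context_dict: Dict[str, dict],
                       clock: Callable[[], float] = time.monotonic):
    state = context_dict.get(str(pid))
    if state:
        # Restore previous state
        next_idx = state["start_idx"]
        generated = list(state["generated_tokens"])
        cache = list(state["past_key_values"])
    else:
        # Fresh generation: the toy "prefill" just copies the prompt
        next_idx, generated, cache = 0, [], list(prompt)

    start = clock()
    finished = True
    for i in range(next_idx, max_tokens):
        if clock() - start > time_limit:
            finished = False
            next_idx = i              # token i has not been generated yet
            break
        token, cache = toy_step(cache)
        generated.append(token)

    if finished:
        context_dict.pop(str(pid), None)
    else:
        context_dict[str(pid)] = {"generated_tokens": generated,
                                  "past_key_values": cache,
                                  "start_idx": next_idx}
    return generated, finished


def ticking_clock() -> Callable[[], int]:
    """Deterministic clock for the demo: advances by 1 on every call."""
    ticks = iter(range(10_000))
    return lambda: next(ticks)


contexts: Dict[str, dict] = {}
prompt = [3, 1, 4]

# Reference run with no interruption
reference, _ = generate_resumable(prompt, 6, 1_000, 1, contexts, ticking_clock())

# Interrupted run: a short slice, then a second slice that finishes the job
first, first_done = generate_resumable(prompt, 6, 2, 2, contexts, ticking_clock())
second, second_done = generate_resumable(prompt, 6, 1_000, 2, contexts, ticking_clock())
```

In this sketch the interrupted run produces the same tokens as the uninterrupted one, which is the property a logits-based context switch is meant to preserve.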

