Hugging Face Backend

The Hugging Face Local Backend runs models on local hardware using the Hugging Face Transformers library.

The HF Local Backend is instantiated when the configuration specifies a "huggingface" model:

case "huggingface":
    self.llms.append(HfLocalBackend(
        model_name=config.name,
        max_gpu_memory=config.max_gpu_memory,
        eval_device=config.eval_device
    ))

It handles loading and running Hugging Face models locally; the max_gpu_memory option controls GPU memory allocation, and eval_device selects the device used for inference.
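
As a rough sketch of how these options might map onto the Transformers API (the class name, defaults, and wiring below are illustrative assumptions, not the project's actual code), max_gpu_memory fits naturally onto transformers' max_memory argument, while device_map="auto" lets Transformers place the weights:

from transformers import AutoModelForCausalLM, AutoTokenizer

class HfLocalBackendSketch:
    def __init__(self, model_name, max_gpu_memory=None, eval_device="cuda:0"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # max_gpu_memory plausibly maps onto transformers' max_memory dict,
        # e.g. {0: "24GiB"}; device_map="auto" shards weights across devices.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            max_memory=max_gpu_memory,
        )
        self.eval_device = eval_device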

Standard Text Input

For standard text requests, the backend uses the generate() method:

completed_response = model.generate(**completion_kwargs)
return completed_response, True
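
A minimal sketch of what such a generate() might do internally, assuming completion_kwargs carries a messages list and the backend renders it with the tokenizer's chat template (this signature and body are assumptions, not the actual implementation):

def generate(self, messages, max_new_tokens=512, **kwargs):
    # Render the chat messages with the model's template and move to device.
    input_ids = self.tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(self.model.device)
    output_ids = self.model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Return only the newly generated tokens, not the echoed prompt.
    return self.tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )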

Tool Calls

As Hugging Face models do not natively support tool calls, the adapter merges tool information into the messages before generation and decodes tool calls from the text after generation.

if tools:
    new_messages = merge_messages_with_tools(messages, tools)
    completion_kwargs["messages"] = new_messages

completed_response = model.generate(**completion_kwargs)

# After generation: decode any tool calls from the text response
if tools:
    if isinstance(model, HfLocalBackend):
        if finished:
            tool_calls = decode_hf_tool_calls(completed_response)
            tool_calls = double_underscore_to_slash(tool_calls)
            return LLMResponse(
                response_message=None,
                tool_calls=tool_calls,
                finished=finished
            )

The merge_messages_with_tools() function formats the tool information into the prompt, and decode_hf_tool_calls() extracts tool calls from the text response.
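
The helpers could look roughly like this (illustrative sketches only; the real functions may use a different prompt layout and a more robust parser):

import json

def merge_messages_with_tools(messages, tools):
    # Prepend a system message describing the tools and the JSON shape
    # the model should emit when it wants to call one.
    tool_prompt = (
        "You may call these tools by replying with a JSON list of "
        '{"name": ..., "parameters": ...} objects:\n'
        + json.dumps(tools, indent=2)
    )
    return [{"role": "system", "content": tool_prompt}] + messages

def decode_hf_tool_calls(text):
    # Recover tool calls from the plain-text completion.
    try:
        calls = json.loads(text)
        return calls if isinstance(calls, list) else []
    except json.JSONDecodeError:
        return []

def double_underscore_to_slash(tool_calls):
    # Presumed to restore slash-delimited tool names (e.g. "server/tool")
    # that were encoded as "server__tool" for the model-facing prompt.
    for call in tool_calls:
        call["name"] = call["name"].replace("__", "/")
    return tool_calls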

JSON-Formatted Responses

JSON formatting is handled by merging the response format into the messages:

elif message_return_type == "json":
    new_messages = merge_messages_with_response_format(messages, response_format)
    completion_kwargs["messages"] = new_messages

The merge_messages_with_response_format() function presumably injects an instruction telling the model to respond in the requested JSON format.
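
A plausible sketch, assuming the helper simply appends a formatting instruction built from response_format:

import json

def merge_messages_with_response_format(messages, response_format):
    # Append a system message telling the model to emit only valid JSON.
    instruction = (
        "Respond only with valid JSON matching this format:\n"
        + json.dumps(response_format, indent=2)
    )
    return messages + [{"role": "system", "content": instruction}]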
