A Single Inference Wrapper for OpenAI, Together AI, Hugging Face Inference TGI, Ollama, etc.
Chris Levy
March 8, 2024

import os
from dotenv import load_dotenv

load_dotenv()
Until recently I thought that the openai library was only for connecting to OpenAI endpoints. It was not until I was testing out LLM inference with together.ai that I came across a section in their documentation on OpenAI API compatibility. The idea of using the openai client to do inference with open source models was completely new to me. In the together.ai documentation example they use the openai library to connect to an open source model.
import os
import openai

system_content = "You are a travel agent. Be descriptive and helpful."
user_content = "Tell me about San Francisco"

client = openai.OpenAI(
    api_key=os.environ.get("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1",
)
chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    temperature=0.7,
    max_tokens=1024,
)
response = chat_completion.choices[0].message.content
print("Together response:\n", response)
Then a week later I saw that Hugging Face had also released support for OpenAI compatibility with Text Generation Inference (TGI) and Inference Endpoints. Again, you simply modify the base_url, api_key, and model as seen in this example from their blog post announcement.
from openai import OpenAI

# initialize the client but point it to TGI
client = OpenAI(
    base_url="<ENDPOINT_URL>" + "/v1/",  # replace with your endpoint url
    api_key="<HF_API_TOKEN>",  # replace with your token
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")
What about working with LLMs locally? Two such options are Ollama and LM Studio. Ollama recently added support for the openai client, and LM Studio supports it too. For example, here is how one can use mistral-7b locally with Ollama to run inference with the openai client:
ollama pull mistral
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required, but unused
)

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful assistant and always talk like a pirate."},
        {"role": "user", "content": "Write a haiku."},
    ],
)
print(response.choices[0].message.content)
There are other services and libraries for running LLM inference that are compatible with the openai library too. I find it all very exciting because it means less code to write and maintain for running inference with LLMs. All I need to change is a base_url, an api_key, and the name of the model.
At the same time that I was learning about openai client compatibility, I was also looking into the instructor library. Since it patches some additional functionality into the openai client, I thought it would be fun to discuss it here too.
Start by creating a virtual environment:
python3 -m venv env
source env/bin/activate
Then install:
pip install openai
pip install instructor # only if you want to try out instructor library
pip install python-dotenv # or define your environment variables differently
I also have:
ollama pull gemma:2b-instruct
and ollama pull llama2
In my .env file I have the following:
OPENAI_API_KEY=your_key
HUGGING_FACE_ACCESS_TOKEN=your_key
TOGETHER_API_KEY=your_key
You could go ahead and just start using client.chat.completions.create directly as in the examples from the introduction. However, I do like wrapping third party services into classes for reusability, maintainability, etc.
The class below, OpenAIChatCompletion, does several things:
- it caches clients, keyed by (base_url, api_key), in the clients dict
- it wraps client.chat.completions.create in the __call__ method
- it can run a batch of requests concurrently; there is the AsyncOpenAI client, but sometimes I prefer simply using futures.ThreadPoolExecutor as seen in the function create_chat_completions_async
- it patches the OpenAI client with the instructor library. If you don't want to play around with the instructor library then simply remove the instructor.patch code.
I also added some logging functionality which keeps track of every outgoing LLM request. This was inspired by the awesome blog post by Hamel Husain, Fuck You, Show Me The Prompt. In that post, Hamel writes about how various LLM tools can often hide the prompts, making it tricky to see what requests are actually sent to the LLM behind the scenes. I created a simple logger class OpenAIMessagesLogger which keeps track of all the requests sent to the openai client. Later when we try out the instructor library for getting structured output, we will utilize this debugging logger to see some additional messages that were sent to the client.
import ast
import logging
import re
from concurrent import futures
from typing import Any, Dict, List, Optional, Union

import instructor
from openai import APITimeoutError, OpenAI
from openai._streaming import Stream
from openai.types.chat.chat_completion import ChatCompletion
from openai.types.chat.chat_completion_chunk import ChatCompletionChunk


class OpenAIChatCompletion:
    clients: Dict = dict()

    @classmethod
    def _load_client(cls, base_url: Optional[str] = None, api_key: Optional[str] = None) -> OpenAI:
        client_key = (base_url, api_key)
        if OpenAIChatCompletion.clients.get(client_key) is None:
            OpenAIChatCompletion.clients[client_key] = instructor.patch(OpenAI(base_url=base_url, api_key=api_key))
        return OpenAIChatCompletion.clients[client_key]

    def __call__(
        self,
        model: str,
        messages: list,
        base_url: Optional[str] = None,
        api_key: Optional[str] = None,
        **kwargs: Any,
    ) -> Union[ChatCompletion, Stream[ChatCompletionChunk]]:
        # https://platform.openai.com/docs/api-reference/chat/create
        # https://github.com/openai/openai-python
        client = self._load_client(base_url, api_key)
        return client.chat.completions.create(model=model, messages=messages, **kwargs)

    @classmethod
    def create_chat_completions_async(
        cls, task_args_list: List[Dict], concurrency: int = 10
    ) -> List[Union[ChatCompletion, Stream[ChatCompletionChunk]]]:
        """
        Make a series of calls to chat.completions.create endpoint in parallel and collect back
        the results.

        :param task_args_list: A list of dictionaries where each dictionary contains the keyword
            arguments required for __call__ method.
        :param concurrency: the max number of workers
        """

        def create_chat_task(
            task_args: Dict,
        ) -> Union[None, ChatCompletion, Stream[ChatCompletionChunk]]:
            try:
                return cls().__call__(**task_args)
            except APITimeoutError:
                return None

        with futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
            results = list(executor.map(create_chat_task, task_args_list))
        return results


class OpenAIMessagesLogger(logging.Handler):
    def __init__(self):
        super().__init__()
        self.log_messages = []

    def emit(self, record):
        # Append the log message to the list
        log_record_str = self.format(record)
        match = re.search(r"Request options: (.+)", log_record_str, re.DOTALL)
        if match:
            text = match[1].replace("\n", "")
            log_obj = ast.literal_eval(text)
            self.log_messages.append(log_obj)


def debug_messages():
    msg = OpenAIMessagesLogger()
    openai_logger = logging.getLogger("openai")
    openai_logger.setLevel(logging.DEBUG)
    openai_logger.addHandler(msg)
    return msg
Here is how you use the inference class to call the LLM. If you have ever used the openai client you will be familiar with the input and output format.
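The code cell for this call is collapsed in the rendered post, so here is a minimal sketch of what it looks like with the wrapper above. The variable name debug_logger is my own; llm is reused later in the post, and the "Hello!" prompt matches the logged request shown below.

llm = OpenAIChatCompletion()
debug_logger = debug_messages()  # start capturing every outgoing request

response = llm(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response)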
ChatCompletion(id='chatcmpl-90N4hSh3AG1Sz68zjUnfcEtAjvFn5', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello! How can I assist you today?', role='assistant', function_call=None, tool_calls=None))], created=1709875727, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint='fp_2b778c6b35', usage=CompletionUsage(completion_tokens=9, prompt_tokens=9, total_tokens=18))
And our logger is keeping track of all the outgoing requests:
[{'method': 'post',
'url': '/chat/completions',
'files': None,
'json_data': {'messages': [{'role': 'user', 'content': 'Hello!'}],
'model': 'gpt-3.5-turbo-0125'}}]
Now we can define some different models that can all be accessed through the same inference class.
class Models:
    # OpenAI GPT Models
    GPT4 = dict(model="gpt-4-0125-preview", base_url=None, api_key=None)
    GPT3 = dict(model="gpt-3.5-turbo-0125", base_url=None, api_key=None)

    # Hugging Face Inference Endpoints
    OPENHERMES2_5_MISTRAL_7B = dict(
        model="tgi",
        base_url="https://xofunqxk66baupmf.us-east-1.aws.endpoints.huggingface.cloud" + "/v1/",
        api_key=os.environ["HUGGING_FACE_ACCESS_TOKEN"],
    )

    # Ollama Models
    LLAMA2 = dict(
        model="llama2",
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    )
    GEMMA2B = dict(
        model="gemma:2b-instruct",
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    )

    # Together AI endpoints
    GEMMA7B = dict(model="google/gemma-7b-it", base_url="https://api.together.xyz/v1", api_key=os.environ.get("TOGETHER_API_KEY"))
    MISTRAL7B = dict(model="mistralai/Mistral-7B-Instruct-v0.1", base_url="https://api.together.xyz/v1", api_key=os.environ.get("TOGETHER_API_KEY"))
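The cell that produced the responses below is collapsed in the post. Here is a sketch of roughly how it could look, reusing the llm instance from above; the system and user prompts are taken from the logged request shown further down, and the all_models construction is an assumption on my part.

messages = [
    {"role": "system", "content": "You are a helpful assistant. Your replies are short, brief and to the point."},
    {"role": "user", "content": "Who was the first person to walk on the Moon, and in what year did it happen?"},
]

# collect the (name, config) pairs defined on the Models class (assumed construction)
all_models = [(name, cfg) for name, cfg in vars(Models).items() if not name.startswith("_")]

for model_name, model_config in all_models:
    resp = llm(messages=messages, **model_config)
    print(f"Model: {model_name}")
    print(f"Response: {resp.choices[0].message.content}")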
Model: GPT4
Response: Neil Armstrong, 1969.
Model: GPT3
Response: The first person to walk on the Moon was Neil Armstrong in 1969.
Model: OPENHERMES2_5_MISTRAL_7B
Response: Neil Armstrong was the first person to walk on the Moon. It happened on July 20, 1969.
Model: LLAMA2
Response: The first person to walk on the Moon was Neil Armstrong, who stepped onto the lunar surface on July 20, 1969 as part of the Apollo 11 mission.
Model: GEMMA2B
Response: There is no evidence to support the claim that a person walked on the Moon in any year.
Model: GEMMA7B
Response: Sure, here is the answer:
Neil Armstrong was the first person to walk on the Moon in 1969.
Model: MISTRAL7B
Response: The first person to walk on the Moon was Neil Armstrong, and it happened on July 20, 1969.
We can also send the same requests in parallel like this:
task_args_list = []
for model_name, model_config in all_models:
    task_args_list.append(dict(messages=messages, **model_config))

# execute the same calls in parallel
model_names = [m[0] for m in all_models]
resps = llm.create_chat_completions_async(task_args_list)

for model_name, resp in zip(model_names, resps):
    print(f"Model: {model_name}")
    print(f"Response: {resp.choices[0].message.content}")
Model: GPT4
Response: Neil Armstrong, 1969.
Model: GPT3
Response: The first person to walk on the Moon was Neil Armstrong in 1969.
Model: OPENHERMES2_5_MISTRAL_7B
Response: The first person to walk on the Moon was Neil Armstrong, and it happened in 1969.
Model: LLAMA2
Response: Nice question! The first person to walk on the Moon was Neil Armstrong, and it happened in 1969 during the Apollo 11 mission. Armstrong stepped onto the lunar surface on July 20, 1969, famously declaring "That's one small step for man, one giant leap for mankind" as he took his first steps.
Model: GEMMA2B
Response: There is no evidence or record of any person walking on the Moon.
Model: GEMMA7B
Response: Sure, here is the answer:
Neil Armstrong was the first person to walk on the Moon in 1969.
Model: MISTRAL7B
Response: The first person to walk on the Moon was Neil Armstrong, and it happened on July 20, 1969.
I love that! The ability to use various models (open source and OpenAI GPT) all through the same interface. And we have all our outgoing requests logged for debugging if needed. We have made 15 requests up to this point.
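Assuming the debug_logger handler created earlier, the request count and the most recent request can be inspected like this:

print(len(debug_logger.log_messages))  # 15
debug_logger.log_messages[-1]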
{'method': 'post',
'url': '/chat/completions',
'files': None,
'json_data': {'messages': [{'role': 'system',
'content': 'You are a helpful assistant. Your replies are short, brief and to the point.'},
{'role': 'user',
'content': 'Who was the first person to walk on the Moon, and in what year did it happen?'}],
'model': 'mistralai/Mistral-7B-Instruct-v0.1'}}
There are various approaches to getting structured output from LLMs. For example see JSON mode and Function calling. Some open source models and inference providers are also starting to offer these capabilities. For example see the together.ai docs. The instructor blog also has lots of examples and tips for getting structured output from LLMs. See this recent blog post for getting structured output from open source and Local LLMs.
One thing that is neat about the instructor library is that you can define a Pydantic schema and then pass it to the patched openai client. It also adds in schema validation and retry logic.
First we will clear out our debugging log messages.
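With the debug_logger handler from earlier, that is simply:

# reset the captured requests so only the instructor calls below get logged
debug_logger.log_messages = []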
from typing import List

from pydantic import BaseModel, field_validator


class Character(BaseModel):
    name: str
    race: str
    fun_fact: str
    favorite_food: str
    skills: List[str]
    weapons: List[str]


class Characters(BaseModel):
    characters: List[Character]

    @field_validator("characters")
    @classmethod
    def validate_characters(cls, v):
        if len(v) < 20:
            raise ValueError(f"The number of characters must be at least 20, but it is {len(v)}")
        return v
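The call itself is collapsed in the post. Based on the logged request shown later, and the max_retries=4 value referenced below, it looks roughly like this; response_model is the keyword the instructor-patched client adds.

characters = llm(
    messages=[{"role": "user", "content": "Who are the main characters from Lord of the Rings?."}],
    response_model=Characters,  # instructor returns a validated Characters instance
    max_retries=4,
    **Models.GPT4,
)

# print each character field by field (matches the output below)
for character in characters.characters:
    for field, value in character.model_dump().items():
        print(f"{field}: {value}")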
name: Frodo Baggins
race: Hobbit
fun_fact: Bearer of the One Ring
favorite_food: Mushrooms
skills: ['Courage', 'Stealth']
weapons: ['Sting', 'Elven Dagger']
name: Samwise Gamgee
race: Hobbit
fun_fact: Frodo's gardener and friend
favorite_food: Potatoes
skills: ['Loyalty', 'Cooking']
weapons: ['Barrow-blade']
name: Gandalf
race: Maia
fun_fact: Known as Gandalf the Grey and later as Gandalf the White
favorite_food: N/A
skills: ['Wisdom', 'Magic']
weapons: ['Glamdring', 'Staff']
name: Aragorn
race: Human
fun_fact: Heir of Isildur and rightful king of Gondor
favorite_food: Elvish waybread
skills: ['Swordsmanship', 'Leadership']
weapons: ['Andúril', 'Bow']
name: Legolas
race: Elf
fun_fact: Prince of the Woodland Realm
favorite_food: Lembas bread
skills: ['Archery', 'Agility']
weapons: ['Elven bow', 'Daggers']
name: Gimli
race: Dwarf
fun_fact: Son of Glóin
favorite_food: Meat
skills: ['Axe fighting', 'Stout-heartedness']
weapons: ['Battle axe', 'Throwing axes']
name: Boromir
race: Human
fun_fact: Son of Denethor, Steward of Gondor
favorite_food: Stew
skills: ['Swordsmanship', 'Leadership']
weapons: ['Sword', 'Shield']
name: Meriadoc Brandybuck
race: Hobbit
fun_fact: Member of the Fellowship
favorite_food: Ale
skills: ['Stealth', 'Strategy']
weapons: ['Elven dagger']
name: Peregrin Took
race: Hobbit
fun_fact: Often known simply as Pippin
favorite_food: Cakes
skills: ['Curiosity', 'Bravery']
weapons: ['Sword']
name: Galadriel
race: Elf
fun_fact: Lady of Lothlórien
favorite_food: N/A
skills: ['Wisdom', 'Telepathy']
weapons: ['Nenya (Ring of Power)']
name: Elrond
race: Elf
fun_fact: Lord of Rivendell
favorite_food: N/A
skills: ['Wisdom', 'Healing']
weapons: ['Sword']
name: Eowyn
race: Human
fun_fact: Niece of King Théoden of Rohan; slayer of the Witch-king
favorite_food: Bread
skills: ['Swordsmanship', 'Courage']
weapons: ['Sword', 'Shield']
name: Faramir
race: Human
fun_fact: Brother of Boromir
favorite_food: Bread
skills: ['Archery', 'Strategy']
weapons: ['Bow', 'Sword']
name: Gollum
race: Hobbit-like creature
fun_fact: Once the bearer of the One Ring, known as Sméagol
favorite_food: Raw fish
skills: ['Stealth', 'Persuasion']
weapons: ['Teeth and claws']
name: Saruman
race: Maia
fun_fact: Head of the White Council before being corrupted
favorite_food: N/A
skills: ['Magic', 'Persuasion']
weapons: ['Staff']
name: Sauron
race: Maia
fun_fact: The Dark Lord and creator of the One Ring
favorite_food: N/A
skills: ['Necromancy', 'Deception']
weapons: ['One Ring', 'Mace']
name: Bilbo Baggins
race: Hobbit
fun_fact: Original discoverer of the One Ring
favorite_food: Everything
skills: ['Stealth', 'Story-telling']
weapons: ['Sting']
name: Théoden
race: Human
fun_fact: King of Rohan
favorite_food: Meat
skills: ['Leadership', 'Horsemanship']
weapons: ['Herugrim', 'Sword']
name: Treebeard
race: Ent
fun_fact: Oldest of the Ents, protectors of Fangorn Forest
favorite_food: Water
skills: ['Strength', 'Wisdom']
weapons: ['None']
name: Witch-king of Angmar
race: Undead/Nazgûl
fun_fact: Leader of the Nazgûl
favorite_food: N/A
skills: ['Fear-induction', 'Swordsmanship']
weapons: ['Morgul-blade', 'Flail']
name: Gríma Wormtongue
race: Human
fun_fact: Advisor to King Théoden under Saruman's influence
favorite_food: N/A
skills: ['Deception', 'Speechcraft']
weapons: ['Knife']
name: Éomer
race: Human
fun_fact: Nephew of King Théoden; later king of Rohan
favorite_food: Meat
skills: ['Swordsmanship', 'Horsemanship']
weapons: ['Sword', 'Spear']
It is likely that GPT would not return 20 characters in the first request. If max_retries=0, it would likely raise a Pydantic validation error. But since we have max_retries=4, the instructor library sends the validation error back as a message and asks again. How exactly does it do that? We can look at the messages that we have logged for debugging.
[{'method': 'post',
'url': '/chat/completions',
'files': None,
'json_data': {'messages': [{'role': 'user',
'content': 'Who are the main characters from Lord of the Rings?.'}],
'model': 'gpt-4-0125-preview',
'tool_choice': {'type': 'function', 'function': {'name': 'Characters'}},
'tools': [{'type': 'function',
'function': {'name': 'Characters',
'description': 'Correctly extracted `Characters` with all the required parameters with correct types',
'parameters': {'$defs': {'Character': {'properties': {'name': {'title': 'Name',
'type': 'string'},
'race': {'title': 'Race', 'type': 'string'},
'fun_fact': {'title': 'Fun Fact', 'type': 'string'},
'favorite_food': {'title': 'Favorite Food', 'type': 'string'},
'skills': {'items': {'type': 'string'},
'title': 'Skills',
'type': 'array'},
'weapons': {'items': {'type': 'string'},
'title': 'Weapons',
'type': 'array'}},
'required': ['name',
'race',
'fun_fact',
'favorite_food',
'skills',
'weapons'],
'title': 'Character',
'type': 'object'}},
'properties': {'characters': {'items': {'$ref': '#/$defs/Character'},
'title': 'Characters',
'type': 'array'}},
'required': ['characters'],
'type': 'object'}}}]}},
{'method': 'post',
'url': '/chat/completions',
'files': None,
'json_data': {'messages': [{'role': 'user',
'content': 'Who are the main characters from Lord of the Rings?.'},
{'role': 'assistant',
'content': '',
'tool_calls': [{'id': 'call_kjUg9ogoR1OdRr0OkmTzabue',
'function': {'arguments': '{"characters":[{"name":"Frodo Baggins","race":"Hobbit","fun_fact":"Bearer of the One Ring","favorite_food":"Mushrooms","skills":["Courage","Stealth"],"weapons":["Sting","Elven Dagger"]},{"name":"Samwise Gamgee","race":"Hobbit","fun_fact":"Frodo\'s gardener and friend","favorite_food":"Potatoes","skills":["Loyalty","Cooking"],"weapons":["Barrow-blade"]},{"name":"Gandalf","race":"Maia","fun_fact":"Known as Gandalf the Grey and later as Gandalf the White","favorite_food":"N/A","skills":["Wisdom","Magic"],"weapons":["Glamdring","Staff"]},{"name":"Aragorn","race":"Human","fun_fact":"Heir of Isildur and rightful king of Gondor","favorite_food":"Elvish waybread","skills":["Swordsmanship","Leadership"],"weapons":["Andúril","Bow"]},{"name":"Legolas","race":"Elf","fun_fact":"Prince of the Woodland Realm","favorite_food":"Lembas bread","skills":["Archery","Agility"],"weapons":["Elven bow","Daggers"]},{"name":"Gimli","race":"Dwarf","fun_fact":"Son of Glóin","favorite_food":"Meat","skills":["Axe fighting","Stout-heartedness"],"weapons":["Battle axe","Throwing axes"]}]}',
'name': 'Characters'},
'type': 'function'}]},
{'role': 'tool',
'tool_call_id': 'call_kjUg9ogoR1OdRr0OkmTzabue',
'name': 'Characters',
'content': "Recall the function correctly, fix the errors and exceptions found\n1 validation error for Characters\ncharacters\n Value error, The number of characters must be at least 20, but it is 6 [type=value_error, input_value=[{'name': 'Frodo Baggins'...axe', 'Throwing axes']}], input_type=list]\n For further information visit https://errors.pydantic.dev/2.6/v/value_error"}],
'model': 'gpt-4-0125-preview',
'tool_choice': {'type': 'function', 'function': {'name': 'Characters'}},
'tools': [{'type': 'function',
'function': {'name': 'Characters',
'description': 'Correctly extracted `Characters` with all the required parameters with correct types',
'parameters': {'$defs': {'Character': {'properties': {'name': {'title': 'Name',
'type': 'string'},
'race': {'title': 'Race', 'type': 'string'},
'fun_fact': {'title': 'Fun Fact', 'type': 'string'},
'favorite_food': {'title': 'Favorite Food', 'type': 'string'},
'skills': {'items': {'type': 'string'},
'title': 'Skills',
'type': 'array'},
'weapons': {'items': {'type': 'string'},
'title': 'Weapons',
'type': 'array'}},
'required': ['name',
'race',
'fun_fact',
'favorite_food',
'skills',
'weapons'],
'title': 'Character',
'type': 'object'}},
'properties': {'characters': {'items': {'$ref': '#/$defs/Character'},
'title': 'Characters',
'type': 'array'}},
'required': ['characters'],
'type': 'object'}}}]}}]
If you look through the above messages carefully you can see the retry logic: the validation error is sent back to the model as a tool message asking it to correct its output.
Recall the function correctly, fix the errors and exceptions found
1 validation error for Characters
Value error, The number of characters must be at least 20, …
You can even use structured output with some of the open source models. I would refer to the instructor blog or documentation for further information on that. I have not fully looked into the different patching modes yet. But here is a simple example of using MISTRAL7B through together.ai.
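The exact prompt isn't shown in the post, so the one below is a placeholder; otherwise this is the same pattern as before, just with the MISTRAL7B config.

superhero = llm(
    messages=[{"role": "user", "content": "Tell me about a superhero."}],  # placeholder prompt
    response_model=Character,
    **Models.MISTRAL7B,
)
print(superhero.model_dump())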
{'name': 'Superman', 'race': 'Kryptonian', 'fun_fact': 'Can fly', 'favorite_food': 'Pizza', 'skills': ['Super strength', 'Flight', 'Heat vision', 'X-ray vision'], 'weapons': ['Laser vision', 'Heat vision', 'X-ray vision']}
Again, I really like the idea of using a single interface for interacting with multiple LLMs. I hope the space continues to mature so that more open source models and services support JSON mode and function calling. I think instructor is a cool library and the corresponding blog is interesting too. I also like the idea of logging all the outgoing prompts/messages just to make sure I fully understand what is happening under the hood.