Proxy Config.yaml
Set model list, api_base, api_key, temperature & proxy server settings (master-key) on the config.yaml.
| Param Name | Description |
|---|---|
| model_list | List of supported models on the server, with model-specific configs |
| router_settings | litellm Router settings, e.g. routing_strategy="least-busy" (see all) |
| litellm_settings | litellm module settings, e.g. litellm.drop_params=True, litellm.set_verbose=True, litellm.api_base, litellm.cache (see all) |
| general_settings | Server settings, e.g. setting master_key: sk-my_special_key |
| environment_variables | Environment variables, e.g. REDIS_HOST, REDIS_PORT |
Complete List: Check the Swagger UI docs on <your-proxy-url>/#/config.yaml (e.g. http://0.0.0.0:4000/#/config.yaml) for everything you can pass in the config.yaml.
Quick Start
Set a model alias for your deployments.
In the config.yaml, the model_name parameter is the user-facing name to use for your deployment.
In the config below:
- model_name: the name to pass TO litellm from the external client
- litellm_params.model: the model string passed to the litellm.completion() function
E.g.:
- model=vllm-models will route to openai/facebook/opt-125m.
- model=gpt-3.5-turbo will load balance between azure/gpt-turbo-small-eu and azure/gpt-turbo-small-ca
model_list:
- model_name: gpt-3.5-turbo ### RECEIVED MODEL NAME ###
litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
model: azure/gpt-turbo-small-eu ### MODEL NAME sent to `litellm.completion()` ###
api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
api_key: "os.environ/AZURE_API_KEY_EU" # does os.getenv("AZURE_API_KEY_EU")
rpm: 6 # [OPTIONAL] Rate limit for this deployment: in requests per minute (rpm)
- model_name: bedrock-claude-v1
litellm_params:
model: bedrock/anthropic.claude-instant-v1
- model_name: gpt-3.5-turbo
litellm_params:
model: azure/gpt-turbo-small-ca
api_base: https://my-endpoint-canada-berri992.openai.azure.com/
api_key: "os.environ/AZURE_API_KEY_CA"
rpm: 6
- model_name: anthropic-claude
litellm_params:
model: bedrock/anthropic.claude-instant-v1
### [OPTIONAL] SET AWS REGION ###
aws_region_name: us-east-1
- model_name: vllm-models
litellm_params:
model: openai/facebook/opt-125m # the `openai/` prefix tells litellm it's openai compatible
api_base: http://0.0.0.0:4000/v1
api_key: none
rpm: 1440
model_info:
version: 2
# Use this if you want to make requests to `claude-3-haiku-20240307`,`claude-3-opus-20240229`,`claude-2.1` without defining them on the config.yaml
# Default models
# Works for ALL Providers and needs the default provider credentials in .env
- model_name: "*"
litellm_params:
model: "*"
litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
drop_params: True
success_callback: ["langfuse"] # OPTIONAL - if you want to start sending LLM Logs to Langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your env
general_settings:
master_key: sk-1234 # [OPTIONAL] Only use this if you want to require all calls to contain this key (Authorization: Bearer sk-1234)
alerting: ["slack"] # [OPTIONAL] If you want Slack Alerts for Hanging LLM requests, Slow llm responses, Budget Alerts. Make sure to set `SLACK_WEBHOOK_URL` in your env
For more provider-specific info, go here
Step 2: Start Proxy with config
$ litellm --config /path/to/config.yaml
Run with --detailed_debug if you need detailed debug logs:
$ litellm --config /path/to/config.yaml --detailed_debug
Using Proxy - Curl Request, OpenAI Package, Langchain, Langchain JS
Calling a model group
- Curl Request
- Curl Request: Bedrock
- OpenAI v1.0.0+
- Langchain Python
Sends a request to the model where model_name=gpt-3.5-turbo on the config.yaml. If multiple deployments share model_name=gpt-3.5-turbo, the proxy load balances between them.
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
Sends this request to the model where model_name=bedrock-claude-v1 on the config.yaml:
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "bedrock-claude-v1",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
# Sends request to model where `model_name=gpt-3.5-turbo` on config.yaml.
response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
{
"role": "user",
"content": "this is a test request, write a short poem"
}
])
print(response)
# Sends this request to model where `model_name=bedrock-claude-v1` on config.yaml
response = client.chat.completions.create(model="bedrock-claude-v1", messages = [
{
"role": "user",
"content": "this is a test request, write a short poem"
}
])
print(response)
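The catch-all "*" entry in the Quick Start config also lets you request models that aren't explicitly listed. A minimal sketch, assuming the matching provider credentials (e.g. ANTHROPIC_API_KEY) are set in the proxy's environment:
import openai
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)
# Routed via the catch-all `model_name: "*"` entry on config.yaml
response = client.chat.completions.create(model="claude-3-haiku-20240307", messages=[
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])
print(response)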
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
ChatPromptTemplate,
HumanMessagePromptTemplate,
SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
messages = [
SystemMessage(
content="You are a helpful assistant that im using to make a test request to."
),
HumanMessage(
content="test from litellm. tell me why it's amazing in 1 sentence"
),
]
# Sends request to model where `model_name=gpt-3.5-turbo` on config.yaml.
chat = ChatOpenAI(
openai_api_base="http://0.0.0.0:4000", # set openai base to the proxy
model = "gpt-3.5-turbo",
temperature=0.1
)
response = chat(messages)
print(response)
# Sends request to model where `model_name=bedrock-claude-v1` on config.yaml.
claude_chat = ChatOpenAI(
openai_api_base="http://0.0.0.0:4000", # set openai base to the proxy
model = "bedrock-claude-v1",
temperature=0.1
)
response = claude_chat(messages)
print(response)
Save Model-specific params (API Base, Keys, Temperature, Max Tokens, Organization, Headers etc.)
You can use the config to save model-specific information like api_base, api_key, temperature, max_tokens, etc.
Step 1: Create a config.yaml file
model_list:
- model_name: gpt-4-team1
litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
model: azure/chatgpt-v-2
api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
api_version: "2023-05-15"
azure_ad_token: eyJ0eXAiOiJ
seed: 12
max_tokens: 20
- model_name: gpt-4-team2
litellm_params:
model: azure/gpt-4
api_key: sk-123
api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
temperature: 0.2
- model_name: openai-gpt-3.5
litellm_params:
model: openai/gpt-3.5-turbo
extra_headers: {"AI-Resource Group": "ishaan-resource"}
api_key: sk-123
organization: org-ikDc4ex8NB
temperature: 0.2
- model_name: mistral-7b
litellm_params:
model: ollama/mistral
api_base: your_ollama_api_base
Step 2: Start server with config
$ litellm --config /path/to/config.yaml
Multiple OpenAI Organizations
Add all OpenAI models across all OpenAI organizations with just one model definition:
- model_name: "*"
litellm_params:
model: openai/*
api_key: os.environ/OPENAI_API_KEY
organization:
- org-1
- org-2
- org-3
LiteLLM will automatically create separate deployments for each org.
Confirm this via
curl --location 'http://0.0.0.0:4000/v1/model/info' \
--header 'Authorization: Bearer ${LITELLM_KEY}' \
--data ''
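The same check in Python (a minimal sketch using requests; it assumes the proxy runs on http://0.0.0.0:4000, LITELLM_KEY is set in your environment, and the response follows the {"data": [...]} shape shown in the proxy's Swagger docs):
import os
import requests

# Ask the proxy which deployments it created from the config
resp = requests.get(
    "http://0.0.0.0:4000/v1/model/info",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_KEY']}"},
)
resp.raise_for_status()

for deployment in resp.json().get("data", []):
    params = deployment.get("litellm_params", {})
    print(params.get("model"), "-> organization:", params.get("organization"))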
Provider specific wildcard routing
Proxy all models from a provider
Use this if you want to proxy all models from a specific provider without defining them on the config.yaml
Step 1 - define provider specific routing on config.yaml
model_list:
# provider specific wildcard routing
- model_name: "anthropic/*"
litellm_params:
model: "anthropic/*"
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: "groq/*"
litellm_params:
model: "groq/*"
api_key: os.environ/GROQ_API_KEY
Step 2 - Run litellm proxy
$ litellm --config /path/to/config.yaml
Step 3 - Test it
Test with anthropic/ - all models with the anthropic/ prefix will get routed to anthropic/*:
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "anthropic/claude-3-sonnet-20240229",
"messages": [
{"role": "user", "content": "Hello, Claude!"}
]
}'
Test with groq/ - all models with the groq/ prefix will get routed to groq/*:
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "groq/llama3-8b-8192",
"messages": [
{"role": "user", "content": "Hello, Claude!"}
]
}'
Load Balancing
For more on this, go to this page
Use this to call multiple instances of the same model and configure things like routing strategy.
For optimal performance:
- Set tpm/rpm per model deployment. Weighted picks are then based on the set tpm/rpm.
- Select your optimal routing strategy in router_settings:routing_strategy. LiteLLM supports ["simple-shuffle", "least-busy", "usage-based-routing", "latency-based-routing"], default="simple-shuffle". When tpm/rpm is set + routing_strategy==simple-shuffle, litellm will use a weighted pick based on the set tpm/rpm. In our load tests, setting tpm/rpm for all deployments + routing_strategy==simple-shuffle maximized throughput (see the sketch after the config below).
- When using multiple LiteLLM servers / Kubernetes, set the redis settings (router_settings: redis_host, etc.).
model_list:
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8001
rpm: 60 # Optional[int]: When rpm/tpm set - litellm uses weighted pick for load balancing. rpm = Rate limit for this deployment: in requests per minute (rpm).
tpm: 1000 # Optional[int]: tpm = Tokens Per Minute
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8002
rpm: 600
- model_name: zephyr-beta
litellm_params:
model: huggingface/HuggingFaceH4/zephyr-7b-beta
api_base: http://0.0.0.0:8003
rpm: 60000
- model_name: gpt-3.5-turbo
litellm_params:
model: gpt-3.5-turbo
api_key: <my-openai-key>
rpm: 200
- model_name: gpt-3.5-turbo-16k
litellm_params:
model: gpt-3.5-turbo-16k
api_key: <my-openai-key>
rpm: 100
litellm_settings:
num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
request_timeout: 10 # raise Timeout error if call takes longer than 10s. Sets litellm.request_timeout
fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if call fails num_retries
context_window_fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo-16k"]}, {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]}] # fallback to gpt-3.5-turbo-16k if context window error
allowed_fails: 3 # cooldown model if it fails > 3 calls in a minute.
router_settings: # router_settings are optional
routing_strategy: simple-shuffle # Literal["simple-shuffle", "least-busy", "usage-based-routing","latency-based-routing"], default="simple-shuffle"
model_group_alias: {"gpt-4": "gpt-3.5-turbo"} # all requests with `gpt-4` will be routed to models with `gpt-3.5-turbo`
num_retries: 2
timeout: 30 # 30 seconds
redis_host: <your redis host> # set this when using multiple litellm proxy deployments, load balancing state stored in redis
redis_password: <your redis password>
redis_port: 1992
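To make "weighted pick" concrete: with routing_strategy: simple-shuffle, a deployment's chance of being picked scales with its configured rpm/tpm. An illustrative sketch of that idea (not litellm's actual router code), using the zephyr-beta rpm values above:
import random

# rpm values of the three zephyr-beta deployments above
deployments = {"http://0.0.0.0:8001": 60, "http://0.0.0.0:8002": 600, "http://0.0.0.0:8003": 60000}

def weighted_pick(deployments: dict) -> str:
    # probability of picking a deployment is proportional to its rpm
    names = list(deployments)
    return random.choices(names, weights=[deployments[n] for n in names], k=1)[0]

picks = [weighted_pick(deployments) for _ in range(10_000)]
for name, rpm in deployments.items():
    print(f"{name} (rpm={rpm}): picked {picks.count(name)} times")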
You can view your cost once you set up Virtual keys or custom_callbacks
Load API Keys
Load API Keys / config values from Environment
If you have secrets saved in your environment, and don't want to expose them in the config.yaml, here's how to load model-specific keys from the environment. This works for ANY value on the config.yaml
os.environ/<YOUR-ENV-VAR> # runs os.getenv("YOUR-ENV-VAR")
model_list:
- model_name: gpt-4-team1
litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
model: azure/chatgpt-v-2
api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
api_version: "2023-05-15"
api_key: os.environ/AZURE_NORTH_AMERICA_API_KEY # 👈 KEY CHANGE
s/o to @David Manouchehri for helping with this.
Load API Keys from Azure Vault
- Install Proxy dependencies
$ pip install 'litellm[proxy]' 'litellm[extra_proxy]'
- Save Azure details in your environment
export["AZURE_CLIENT_ID"]="your-azure-app-client-id"
export["AZURE_CLIENT_SECRET"]="your-azure-app-client-secret"
export["AZURE_TENANT_ID"]="your-azure-tenant-id"
export["AZURE_KEY_VAULT_URI"]="your-azure-key-vault-uri"
- Add to proxy config.yaml
model_list:
- model_name: "my-azure-models" # model alias
litellm_params:
model: "azure/<your-deployment-name>"
api_key: "os.environ/AZURE-API-KEY" # reads from key vault - get_secret("AZURE_API_KEY")
api_base: "os.environ/AZURE-API-BASE" # reads from key vault - get_secret("AZURE_API_BASE")
general_settings:
use_azure_key_vault: True
You can now test this by starting your proxy:
litellm --config /path/to/config.yaml
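Then send a test request to the my-azure-models alias defined above (a minimal sketch with the OpenAI Python SDK, assuming the proxy is reachable on http://0.0.0.0:4000):
import openai
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)
# `my-azure-models` is the model alias from the key-vault-backed config above
response = client.chat.completions.create(model="my-azure-models", messages=[
    {
        "role": "user",
        "content": "test request - are the key vault secrets loading?"
    }
])
print(response)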
Set Custom Prompt Templates
LiteLLM by default checks if a model has a prompt template and applies it (e.g. if a huggingface model has a saved chat template in its tokenizer_config.json). However, you can also set a custom prompt template on your proxy in the config.yaml:
Step 1: Save your prompt template in a config.yaml
# Model-specific parameters
model_list:
- model_name: mistral-7b # model alias
litellm_params: # actual params for litellm.completion()
model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1"
api_base: "<your-api-base>"
api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
initial_prompt_value: "\n"
roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
final_prompt_value: "\n"
bos_token: "<s>"
eos_token: "</s>"
max_tokens: 4096
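For intuition, here's roughly what those initial_prompt_value / roles / final_prompt_value settings produce for a short conversation (a hand-rolled sketch of the templating idea, not litellm's internal implementation):
# Mirrors the roles / pre_message / post_message values from the config above
roles = {
    "system": {"pre_message": "<|im_start|>system\n", "post_message": "<|im_end|>"},
    "user": {"pre_message": "<|im_start|>user\n", "post_message": "<|im_end|>"},
    "assistant": {"pre_message": "<|im_start|>assistant\n", "post_message": "<|im_end|>"},
}

def render_prompt(messages, initial_prompt_value="\n", final_prompt_value="\n"):
    # wrap each message in its role's pre/post markers, bracketed by the initial/final values
    rendered = [initial_prompt_value]
    for message in messages:
        role = roles[message["role"]]
        rendered.append(role["pre_message"] + message["content"] + role["post_message"])
    rendered.append(final_prompt_value)
    return "".join(rendered)

print(render_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))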
Step 2: Start server with config
$ litellm --config /path/to/config.yaml
Setting Embedding Models
See supported Embedding Providers & Models here
Use Sagemaker, Bedrock, Azure, OpenAI, XInference
Create Config.yaml
- Bedrock Completion/Chat
- Sagemaker, Bedrock Embeddings
- Hugging Face Embeddings
- Azure OpenAI Embeddings
- OpenAI Embeddings
- XInference
- OpenAI Compatible Embeddings
model_list:
- model_name: bedrock-cohere
litellm_params:
model: "bedrock/cohere.command-text-v14"
aws_region_name: "us-west-2"
- model_name: bedrock-cohere
litellm_params:
model: "bedrock/cohere.command-text-v14"
aws_region_name: "us-east-2"
- model_name: bedrock-cohere
litellm_params:
model: "bedrock/cohere.command-text-v14"
aws_region_name: "us-east-1"
Here's how to route between GPT-J embedding (sagemaker endpoint), Amazon Titan embedding (Bedrock) and Azure OpenAI embedding on the proxy server:
model_list:
- model_name: sagemaker-embeddings
litellm_params:
model: "sagemaker/berri-benchmarking-gpt-j-6b-fp16"
- model_name: amazon-embeddings
litellm_params:
model: "bedrock/amazon.titan-embed-text-v1"
- model_name: azure-embeddings
litellm_params:
model: "azure/azure-embedding-model"
api_base: "os.environ/AZURE_API_BASE" # os.getenv("AZURE_API_BASE")
api_key: "os.environ/AZURE_API_KEY" # os.getenv("AZURE_API_KEY")
api_version: "2023-07-01-preview"
general_settings:
master_key: sk-1234 # [OPTIONAL] if set all calls to proxy will require either this key or a valid generated token
model_list:
- model_name: deployed-codebert-base
litellm_params:
# send request to deployed hugging face inference endpoint
model: huggingface/microsoft/codebert-base # add huggingface prefix so it routes to hugging face
api_key: hf_LdS # api key for hugging face inference endpoint
api_base: https://uysneno1wv2wd4lw.us-east-1.aws.endpoints.huggingface.cloud # your hf inference endpoint
- model_name: codebert-base
litellm_params:
# no api_base set, sends request to hugging face free inference api https://api-inference.huggingface.co/models/
model: huggingface/microsoft/codebert-base # add huggingface prefix so it routes to hugging face
api_key: hf_LdS # api key for hugging face
model_list:
- model_name: azure-embedding-model # model group
litellm_params:
model: azure/azure-embedding-model # model name for litellm.embedding(model=azure/azure-embedding-model) call
api_base: your-azure-api-base
api_key: your-api-key
api_version: 2023-07-01-preview
model_list:
- model_name: text-embedding-ada-002 # model group
litellm_params:
model: text-embedding-ada-002 # model name for litellm.embedding(model=text-embedding-ada-002)
api_key: your-api-key-1
- model_name: text-embedding-ada-002
litellm_params:
model: text-embedding-ada-002
api_key: your-api-key-2
https://docs.litellm.ai/docs/providers/xinference
Note: add the xinference/ prefix to litellm_params: model so litellm knows to route to Xinference.
model_list:
- model_name: embedding-model # model group
litellm_params:
model: xinference/bge-base-en # model name for litellm.embedding(model=xinference/bge-base-en)
api_base: http://0.0.0.0:9997/v1
Use this for calling the /embeddings endpoint on OpenAI-compatible servers.
Note: add the openai/ prefix to litellm_params: model so litellm knows to route to OpenAI.
model_list:
- model_name: text-embedding-ada-002 # model group
litellm_params:
model: openai/<your-model-name> # model name for litellm.embedding(model=text-embedding-ada-002)
api_base: <model-api-base>
Start Proxy
litellm --config config.yaml
Make Request
Sends a request to bedrock-cohere:
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "bedrock-cohere",
"messages": [
{
"role": "user",
"content": "gm"
}
]
}'
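To hit one of the embedding model groups instead of a chat model, call the /embeddings route. A minimal sketch with the OpenAI Python SDK, using the azure-embeddings group (and the sk-1234 master_key) from the Sagemaker/Bedrock/Azure config above:
import openai
client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)
# Sends the request to the deployment where `model_name=azure-embeddings` on config.yaml
response = client.embeddings.create(
    model="azure-embeddings",
    input=["write a litellm poem"],
)
print(len(response.data[0].embedding))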
✨ IP Address Filtering
You need a LiteLLM License to unlock this feature. Grab time to get one today!
Restrict which IPs can call the proxy endpoints.
general_settings:
allowed_ips: ["192.168.1.1"]
Expected Response (if IP not listed)
{
"error": {
"message": "Access forbidden: IP address not allowed.",
"type": "auth_error",
"param": "None",
"code": 403
}
}
Disable Swagger UI
To disable the Swagger docs from the base url, set NO_DOCS="True" in your environment, and restart the proxy.
Configure DB Pool Limits + Connection Timeouts
general_settings:
database_connection_pool_limit: 100 # sets connection pool for prisma client to postgres db at 100
database_connection_timeout: 60 # sets a 60s timeout for any connection call to the db
All settings
{
"environment_variables": {},
"model_list": [
{
"model_name": "string",
"litellm_params": {},
"model_info": {
"id": "string",
"mode": "embedding",
"input_cost_per_token": 0,
"output_cost_per_token": 0,
"max_tokens": 2048,
"base_model": "gpt-4-1106-preview",
"additionalProp1": {}
}
}
],
"litellm_settings": {}, # ALL (https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py)
"general_settings": {
"completion_model": "string",
"disable_spend_logs": "boolean", # turn off writing each transaction to the db
"disable_master_key_return": "boolean", # turn off returning master key on UI (checked on '/user/info' endpoint)
"disable_retry_on_max_parallel_request_limit_error": "boolean", # turn off retries when max parallel request limit is reached
"disable_reset_budget": "boolean", # turn off reset budget scheduled task
"disable_adding_master_key_hash_to_db": "boolean", # turn off storing master key hash in db, for spend tracking
"enable_jwt_auth": "boolean", # allow proxy admin to auth in via jwt tokens with 'litellm_proxy_admin' in claims
"enforce_user_param": "boolean", # requires all openai endpoint requests to have a 'user' param
"allowed_routes": "list", # list of allowed proxy API routes - a user can access. (currently JWT-Auth only)
"key_management_system": "google_kms", # either google_kms or azure_kms
"master_key": "string",
"database_url": "string",
"database_connection_pool_limit": 0, # default 100
"database_connection_timeout": 0, # default 60s
"database_type": "dynamo_db",
"database_args": {
"billing_mode": "PROVISIONED_THROUGHPUT",
"read_capacity_units": 0,
"write_capacity_units": 0,
"ssl_verify": true,
"region_name": "string",
"user_table_name": "LiteLLM_UserTable",
"key_table_name": "LiteLLM_VerificationToken",
"config_table_name": "LiteLLM_Config",
"spend_table_name": "LiteLLM_SpendLogs"
},
"otel": true,
"custom_auth": "string",
"max_parallel_requests": 0, # the max parallel requests allowed per deployment
"global_max_parallel_requests": 0, # the max parallel requests allowed on the proxy all up
"infer_model_from_keys": true,
"background_health_checks": true,
"health_check_interval": 300,
"alerting": [
"string"
],
"alerting_threshold": 0
}
}