inference-manager
The inference-manager manages inference runtimes (e.g., vLLM and Ollama) in containers, loads models, and processes requests.
Set up Inference Server/Engine for development
Requirements:
Run the following command:
make setup-llmariner setup-cluster helm-apply-inference
[!TIP]
- To redeploy a single component, run only make helm-reapply-inference-server or make helm-reapply-inference-engine. Either command rebuilds the inference-manager container images, deploys them using the local Helm chart, and restarts the containers.
- You can configure parameters in .values.yaml.
Try out inference APIs
with curl:
curl --request POST http://localhost:8080/v1/chat/completions -d '{
"model": "google-gemma-2b-it-q4_0",
"messages": [{"role": "user", "content": "hello"}]
}'
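When trying several models or prompts, the same request can be wrapped in small shell helpers. The following is a minimal sketch, not part of the repository: the function names build_chat_payload and send_chat and the ENDPOINT variable are hypothetical, the endpoint defaults to the http://localhost:8080 address used above, and the explicit Content-Type header is an assumption about what the server accepts rather than a documented requirement.

```shell
# Hypothetical helper: build the JSON body for /v1/chat/completions.
# Assumes the model name and message contain no characters that need
# JSON escaping (no double quotes, backslashes, or newlines).
build_chat_payload() {
  model="$1"
  content="$2"
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' \
    "$model" "$content"
}

# Hypothetical helper: POST the payload to the inference API.
# ENDPOINT is an assumed override; it falls back to the local dev address.
send_chat() {
  curl --silent --request POST \
    "${ENDPOINT:-http://localhost:8080}/v1/chat/completions" \
    --header 'Content-Type: application/json' \
    -d "$(build_chat_payload "$1" "$2")"
}
```

For example, `send_chat google-gemma-2b-it-q4_0 hello` sends the same request as the curl command above.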
with llma:
export LLMARINER_API_KEY=dummy
llma chat completions create \
--model google-gemma-2b-it-q4_0 \
--role system \
--completion 'hi'