We understand that there are use cases that require that no data be sent to third-party services, including LLM providers like OpenAI. For this reason, we’ve set up Skald so that it can be deployed without connecting to any third-party provider at all, giving you full control over your data. As we’re new to the self-hosted LLM space, we’ve done the groundwork to make this happen, but we know there’s more to do to make it work really well. If you’re knowledgeable in this area, don’t hesitate to reach out or even submit a PR on our repo. We’ve tested and ensured that the setup outlined here works, but it’s certainly limited.

Pre-requisites

  • Have a running LLM inference server with an OpenAI-compatible API (e.g. Ollama, llama.cpp)

Setup

We’ve built Skald to work on top of as much open-source infrastructure as possible, meaning we use Postgres with pgvector instead of a service like Pinecone, for example. That means the only parts of the stack that need to be adapted for a fully “local” setup are the LLM and the embeddings. On our Cloud deployment, we use OpenAI as the primary LLM provider and Voyage AI for vector embeddings, so we need substitutes for both of those when self-hosting.

Embeddings

We’ll start with embeddings because it’s the simpler of the two. Our docker-compose.yml comes configured with an optional embedding-service component that’s attached to the local-embedding profile. This service uses the Python package sentence-transformers to expose an /embed and a /rerank endpoint that our other services can use to generate vector embeddings and rerank results without sending data elsewhere. To use it, set the env var EMBEDDING_PROVIDER=local and run the stack with docker-compose --profile local-embedding up. This should then just work out of the box.
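As a minimal sketch, assuming you’re running the stack from the repo root and setting the variable in your shell rather than a .env file, this looks like:

```bash
# Tell Skald to use the bundled embedding service instead of Voyage AI.
export EMBEDDING_PROVIDER=local

# Start the stack with the optional embedding-service container included.
docker-compose --profile local-embedding up
```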

LLM

For the LLM configuration, you’ll need to run an LLM inference server yourself. This server should expose an OpenAI-compatible API and support tool-calling. We’ve tested this using llama.cpp, but equivalent providers should also work. Once you have it running, set LLM_PROVIDER=local and point LOCAL_LLM_BASE_URL at your server’s endpoint. As of today, the URL is the only configuration we support, meaning this is best suited to a server running on the same machine for security reasons. We intend to add support for security-related variables that would make it viable to run the LLM server on another machine, and we’d be keen to hear from people deploying open-source models about how they’re securing these servers.
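As a rough sketch using llama.cpp’s llama-server (the model path and port are placeholders, and we’re assuming the base URL includes the /v1 prefix under which llama.cpp serves its OpenAI-compatible endpoints):

```bash
# Start an OpenAI-compatible inference server with llama.cpp.
# Pick a model that supports tool-calling; the path and port are placeholders.
llama-server -m ./models/your-model.gguf --port 8080

# Point Skald at it.
export LLM_PROVIDER=local
export LOCAL_LLM_BASE_URL=http://localhost:8080/v1
```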

Postgres

By default we will spin up a Postgres instance as part of the Docker Compose stack for you, and we will install pgvector on it. If you’re running a production deployment, you would ideally host and manage Postgres yourself. If you do so, you just need to set the DATABASE_URL env var to point to your instance and run the stack without starting the Postgres service.
If you do host Postgres elsewhere, the one thing you need to remember is to install the pgvector extension on the instance.
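As an illustrative sketch (the hostname, credentials, and database name are placeholders), pointing Skald at an externally managed instance and enabling pgvector on it might look like:

```bash
# Point Skald at your managed Postgres instance.
export DATABASE_URL=postgres://skald:your-password@your-db-host:5432/skald

# Enable the pgvector extension on that database (run once, with a role
# that has the CREATE privilege).
psql "$DATABASE_URL" -c 'CREATE EXTENSION IF NOT EXISTS vector;'
```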

RabbitMQ

The same concepts that apply to Postgres apply to RabbitMQ. Ideally you’d host this yourself in a production deployment; to do that, spin up the stack without the RabbitMQ service and set the following vars (see the example below):
  • RABBITMQ_HOST
  • RABBITMQ_PORT
  • RABBITMQ_USER
  • RABBITMQ_PASSWORD
  • RABBITMQ_VHOST
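For example (all values here are placeholders; 5672 and the / vhost are simply RabbitMQ’s defaults):

```bash
# Point Skald at an externally managed RabbitMQ instance.
export RABBITMQ_HOST=rabbitmq.example.internal
export RABBITMQ_PORT=5672
export RABBITMQ_USER=skald
export RABBITMQ_PASSWORD=your-password
export RABBITMQ_VHOST=/
```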