feat(halo): serve bge-m3 embeddings alongside coder

Add a multilingual bge-m3 embedding model to the llama-server preset and
raise --models-max to 2 so it stays co-resident with the coder model.
This gives the RAG stack a local embeddings endpoint without a second
service, keeping all inference on halo. Embedding-specific overrides
(ubatch-size, context, pooling) are pinned since the global defaults
would truncate or misconfigure embedding requests.
Harald Hoyer 2026-05-22 00:35:28 +02:00
parent a1b55fe2ec
commit ab729a0720
2 changed files with 14 additions and 1 deletions

@@ -29,7 +29,7 @@
 "--host 0.0.0.0"
 "--port 8000"
 "--models-preset ${./models.ini}"
-"--models-max 1"
+"--models-max 2"
 ];
 Restart = "on-failure";
 RestartSec = 10;