feat(halo): serve bge-m3 embeddings alongside coder

Add a multilingual bge-m3 embedding model to the llama-server preset and raise --models-max to 2 so it stays co-resident with the coder model. This gives the RAG stack a local embeddings endpoint without a second service, keeping all inference on halo. Embedding-specific overrides (ubatch-size, context, pooling) are pinned since the global defaults would truncate or misconfigure embedding requests.
2026-05-22 00:35:28 +02:00 · ab729a0720
commit ab729a0720
parent a1b55fe2ec
2 changed files with 14 additions and 1 deletions
--- a/systems/x86_64-linux/halo/llama-server.nix
+++ b/systems/x86_64-linux/halo/llama-server.nix
@@ -29,7 +29,7 @@
        "--host 0.0.0.0"
        "--port 8000"
        "--models-preset ${./models.ini}"
-        "--models-max 1"
+        "--models-max 2"
      ];
      Restart = "on-failure";
      RestartSec = 10;