Add a multilingual bge-m3 embedding model to the llama-server preset and raise --models-max to 2 so it stays co-resident with the coder model. This gives the RAG stack a local embeddings endpoint without a second service, keeping all inference on halo. The embedding-specific overrides (ubatch-size, context, pooling) are pinned because the global defaults would truncate or misconfigure embedding requests.
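
As a rough illustration of how a RAG client might call the resulting endpoint, here is a minimal sketch against llama-server's OpenAI-compatible `/v1/embeddings` route. The base URL (`http://halo:8080`), the model alias `bge-m3`, and the helper names are assumptions for illustration; the actual port and model identifier depend on the preset.

```python
import json
import urllib.request

# Assumptions: host/port and the model alias are illustrative, not taken from the preset.
BASE_URL = "http://halo:8080"
MODEL = "bge-m3"


def build_embedding_request(texts, model=MODEL, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible embeddings route."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_json):
    """Return the embedding vectors ordered by their input index."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, timeout=30):
    """Send the texts to the server and return one vector per input."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(json.load(response))
```

Calling `embed(["hola mundo", "hello world"])` would return one vector per input, provided the server is reachable and the model name matches the preset entry.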

| Name |
|---|
| aarch64-darwin |
| aarch64-linux |
| x86_64-darwin/mpro |
| x86_64-linux |
| nixbuild.nix |