Add a multilingual bge-m3 embedding model to the llama-server preset and raise --models-max to 2 so it stays co-resident with the coder model. This gives the RAG stack a local embeddings endpoint without a second service, keeping all inference on halo. Embedding-specific overrides (ubatch-size, context, pooling) are pinned since the global defaults would truncate or misconfigure embedding requests. |
||
|---|---|---|
| .. | ||
| amd | ||
| attic | ||
| halo | ||
| mx | ||
| nixtee1 | ||
| sgx | ||
| t15 | ||
| x1 | ||