feat(halo): serve bge-m3 embeddings alongside coder
Add a multilingual bge-m3 embedding model to the llama-server preset and raise --models-max to 2 so it stays co-resident with the coder model. This gives the RAG stack a local embeddings endpoint without a second service, keeping all inference on halo. Embedding-specific overrides (ubatch-size, context, pooling) are pinned since the global defaults would truncate or misconfigure embedding requests.
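
One way to smoke-test the new endpoint from the RAG side is sketched below. It assumes llama-server's OpenAI-compatible `/v1/embeddings` route is reachable on the port set in this unit (8000). The host name `halo` and the model alias `bge-m3` are guesses, as are the helper names, which are hypothetical. Adjust them to match the actual section name in models.ini.

```python
import json
import urllib.request

# Assumptions (not confirmed by this commit): the host name and the model alias.
# Only the port (8000) comes from the service arguments in the diff.
BASE_URL = "http://halo:8000"
EMBED_MODEL = "bge-m3"


def build_embedding_request(texts, model=EMBED_MODEL, base_url=BASE_URL):
    """Build (but do not send) a POST request for the embeddings route."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, timeout=30):
    """Send the request and return one embedding vector per input text."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry vectors under data[i].embedding;
    # sort by index so the output order matches the input order.
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Calling `embed(["hello", "bonjour"])` while the coder model is also loaded would confirm that both models stay resident with `--models-max 2`. If the server evicts one model to load the other, the first request after each switch is likely to be noticeably slower.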
parent a1b55fe2ec
commit ab729a0720
2 changed files with 14 additions and 1 deletion
@@ -29,7 +29,7 @@
   "--host 0.0.0.0"
   "--port 8000"
   "--models-preset ${./models.ini}"
-  "--models-max 1"
+  "--models-max 2"
 ];
 Restart = "on-failure";
 RestartSec = 10;