feat(rag): route the fast model and use it for synthesis by default

Expose halo's [fast] MoE preset through the LiteLLM gateway and make it the rag CLI's default chat model (overridable via RAG_CHAT_MODEL), so query synthesis runs faster than it does with the larger coder model.
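
A minimal sketch of the override behaviour described above, assuming the rag CLI reads RAG_CHAT_MODEL from the environment and falls back to the gateway's "fast" model name. The helper name `resolve_chat_model`, the constant `DEFAULT_CHAT_MODEL`, and the treatment of an empty value as unset are illustrative assumptions, not the CLI's actual code.

```python
import os

# Assumption: the default matches the `model_name = "fast"` gateway entry.
DEFAULT_CHAT_MODEL = "fast"


def resolve_chat_model(env=None):
    """Return RAG_CHAT_MODEL when set and non-empty, else the default.

    `env` is any mapping; it defaults to the process environment so the
    lookup can be exercised without mutating os.environ.
    """
    env = os.environ if env is None else env
    return env.get("RAG_CHAT_MODEL") or DEFAULT_CHAT_MODEL
```

Under these assumptions, `resolve_chat_model({})` yields "fast", while a mapping that sets RAG_CHAT_MODEL to another gateway model name (for example a hypothetical "coder" entry) yields that name instead.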
2026-05-22 09:15:59 +02:00
commit bc0d79db57
parent 2b1bba0703
2 changed files with 10 additions and 1 deletion
--- a/systems/x86_64-linux/sgx/litellm.nix
+++ b/systems/x86_64-linux/sgx/litellm.nix
@@ -22,6 +22,15 @@
            api_key = "none"; # llama-server requires no key; value is ignored
          };
        }
+        {
+          # Faster MoE chat model (the `[fast]` preset), default for rag synthesis.
+          model_name = "fast";
+          litellm_params = {
+            model = "openai/fast";
+            api_base = "http://halo:8000/v1";
+            api_key = "none";
+          };
+        }
        {
          # Multilingual embeddings, also served by halo's router (the `[bge-m3]`
          # preset). Exposes /v1/embeddings on this gateway for the rag CLI.