Expose halo's [fast] MoE preset through the LiteLLM gateway and make it
the rag CLI's default chat model (overridable via RAG_CHAT_MODEL), so
query synthesis runs faster than it does with the larger coder model.
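
A minimal sketch of the override, assuming the gateway exposes the preset
under the model name "fast" (the actual alias is not stated here) and that
an empty RAG_CHAT_MODEL should fall back to the default. The function name
is hypothetical:

```python
import os

# Assumption: "fast" is the gateway-side name of halo's [fast] preset.
DEFAULT_CHAT_MODEL = "fast"


def resolve_chat_model(env=None):
    """Return RAG_CHAT_MODEL when set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("RAG_CHAT_MODEL") or DEFAULT_CHAT_MODEL
```

Setting RAG_CHAT_MODEL=coder would then switch synthesis back to the
larger model without touching the CLI.
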
Uptime Kuma already binds 4000, so the gateway never got the port and
requests hit the wrong service. Move LiteLLM to 4001 and update the rag
CLI default endpoint to match.
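
One way to catch this kind of clash before assigning a port is to try
binding it on the host. This is only an illustrative check, not part of
the change; the helper name is hypothetical:

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind (host, port), else False."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use": another service holds it.
            return False
        return True
```

Run on the gateway host, `port_is_free(4000)` would be expected to return
False while Uptime Kuma is running.
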
Stand up document retrieval as shared, client-agnostic primitives rather
than locking it inside Open WebUI:
- Qdrant as the LAN-reachable vector store
- LiteLLM gains a bge-m3 route so sgx:4000 also serves /v1/embeddings
- a thin `rag` CLI (ingest/query, optional coder synthesis) usable from
  any machine and from scripts
Embeddings and synthesis run on halo via the gateway; the CLI is
configured entirely through RAG_* env vars.
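
As a rough sketch of what a client call to the embeddings route could look
like: the request below targets /v1/embeddings on the gateway with the
bge-m3 model. The variable names RAG_GATEWAY_URL, RAG_EMBED_MODEL and
RAG_API_KEY are hypothetical stand-ins for whatever RAG_* variables the
CLI actually reads, and the function only builds the request without
sending it:

```python
import json
import os
import urllib.request


def build_embeddings_request(texts, env=None):
    """Build (but do not send) an OpenAI-style embeddings request."""
    env = os.environ if env is None else env
    # Hypothetical variable names; defaults mirror the address and model
    # named in this message.
    base = env.get("RAG_GATEWAY_URL", "http://sgx:4000").rstrip("/")
    model = env.get("RAG_EMBED_MODEL", "bge-m3")
    headers = {"Content-Type": "application/json"}
    key = env.get("RAG_API_KEY")
    if key:
        headers["Authorization"] = f"Bearer {key}"
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        f"{base}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; the
vectors in the response could then be upserted into Qdrant.
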
Expose an OpenAI-compatible endpoint on sgx:4000 (LAN-reachable) that
routes the `coder` model to halo's llama-server, so clients get a stable
gateway with per-key auth instead of hardcoding halo's address. The
master key is sourced from a sops-encrypted env file.