feat: add multimodal image marker support with Ollama vision
This commit is contained in:
parent
63aacb09ff
commit
dcd0bf641d
21 changed files with 1152 additions and 78 deletions
|
|
@ -51,6 +51,22 @@ Notes:
|
|||
- Model cache previews come from `zeroclaw models refresh --provider <ID>`.
|
||||
- These are runtime chat commands, not CLI subcommands.
|
||||
|
||||
## Inbound Image Marker Protocol
|
||||
|
||||
ZeroClaw supports multimodal input through inline message markers:
|
||||
|
||||
- Syntax: ``[IMAGE:<source>]``
|
||||
- `<source>` can be:
|
||||
- Local file path
|
||||
- Data URI (`data:image/...;base64,...`)
|
||||
- Remote URL only when `[multimodal].allow_remote_fetch = true`
|
||||
|
||||
Operational notes:
|
||||
|
||||
- Marker parsing applies to user-role messages before provider calls.
|
||||
- Provider capability is enforced at runtime: if the selected provider does not support vision, the request fails with a structured capability error (`capability=vision`).
|
||||
- Linq webhook `media` parts with `image/*` MIME type are automatically converted to this marker format.
|
||||
|
||||
## Channel Matrix
|
||||
|
||||
---
|
||||
|
|
@ -349,4 +365,3 @@ If a specific channel task crashes or exits, the channel supervisor in `channels
|
|||
- `Channel message worker crashed:`
|
||||
|
||||
These messages indicate automatic restart behavior is active, and you should inspect preceding logs for root cause.
|
||||
|
||||
|
|
|
|||
|
|
@ -62,6 +62,24 @@ Notes:
|
|||
- `reasoning_enabled = true` explicitly requests reasoning for supported providers (`think: true` on `ollama`).
|
||||
- Unset keeps provider defaults.
|
||||
|
||||
## `[multimodal]`
|
||||
|
||||
| Key | Default | Purpose |
|
||||
|---|---|---|
|
||||
| `max_images` | `4` | Maximum image markers accepted per request |
|
||||
| `max_image_size_mb` | `5` | Per-image size limit before base64 encoding |
|
||||
| `allow_remote_fetch` | `false` | Allow fetching `http(s)` image URLs from markers |
|
||||
|
||||
Notes:
|
||||
|
||||
- Runtime accepts image markers in user messages with syntax: ``[IMAGE:<source>]``.
|
||||
- Supported sources:
|
||||
- Local file path (for example ``[IMAGE:/tmp/screenshot.png]``)
|
||||
- Data URI (for example ``[IMAGE:data:image/png;base64,...]``)
|
||||
- Remote URL only when `allow_remote_fetch = true`
|
||||
- Allowed MIME types: `image/png`, `image/jpeg`, `image/webp`, `image/gif`, `image/bmp`.
|
||||
- When the active provider does not support vision, requests fail with a structured capability error (`capability=vision`) instead of silently dropping images.
|
||||
|
||||
## `[gateway]`
|
||||
|
||||
| Key | Default | Purpose |
|
||||
|
|
|
|||
|
|
@ -56,6 +56,13 @@ credential is not reused for fallback providers.
|
|||
| `lmstudio` | `lm-studio` | Yes | (optional; local by default) |
|
||||
| `nvidia` | `nvidia-nim`, `build.nvidia.com` | No | `NVIDIA_API_KEY` |
|
||||
|
||||
### Ollama Vision Notes
|
||||
|
||||
- Provider ID: `ollama`
|
||||
- Vision input is supported through user message image markers: ``[IMAGE:<source>]``.
|
||||
- After multimodal normalization, ZeroClaw sends image payloads through Ollama's native `messages[].images` field.
|
||||
- If a non-vision provider is selected, ZeroClaw returns a structured capability error instead of silently ignoring images.
|
||||
|
||||
### Bedrock Notes
|
||||
|
||||
- Provider ID: `bedrock` (alias: `aws-bedrock`)
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue