# Vision
TAU's Vision MCP gives any model image understanding, including models without native vision support.
## How It Works
Vision MCP provides a bridge between models and visual content:
```
┌───────────────────────────────────────────────────┐
│                  Vision MCP Flow                  │
├───────────────────────────────────────────────────┤
│                                                   │
│  User/Agent                                       │
│      │                                            │
│      ▼                                            │
│  understand_image(base64_image)                   │
│      │                                            │
│      ├── MLX VLM (macOS, local, free)             │
│      │   └── Qwen3-VL on Apple Silicon            │
│      │                                            │
│      └── Cloud fallback (if MLX unavailable)      │
│          ├── Claude 4.5 Vision                    │
│          ├── GPT-5 Vision                         │
│          └── Gemini 3 Pro Vision                  │
│                                                   │
│      ▼                                            │
│  Text description returned to agent               │
│                                                   │
└───────────────────────────────────────────────────┘
```
## Vision Tools
| Tool | Description |
|---|---|
| `understand_image` | Analyze a single image (base64-encoded) |
| `understand_images` | Compare and analyze multiple images |
| `file_describe_image` | Describe an image from a file path |
| `browser_screenshot_describe` | Take a browser screenshot and analyze it in one call |
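All of these are also reachable from the shell through `tau mcp call` (the documented form for `understand_image` appears later on this page). A minimal sketch; note that only the `--image_path` flag is shown in these docs, so the `--prompt` flag below is an assumption:

```bash
# Documented form: describe a single image from disk.
tau mcp call understand_image --image_path ./screenshot.png

# Hypothetical: steer the analysis with a custom prompt.
# --prompt is an assumed flag, not confirmed by these docs.
tau mcp call understand_image --image_path ./screenshot.png \
  --prompt "Focus on the navigation menu"
```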
## MLX Local Vision (macOS)
On Apple Silicon Macs, TAU uses a local vision-language model (VLM) powered by MLX:
- Model: Qwen3-VL (quantized for MLX)
- Performance: ~500 ms per image on an M3 Max
- Privacy: Images never leave your machine
- Cost: Free (local inference)
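To pin TAU to the local backend and sanity-check the latency figure above, a quick sketch (the environment variables are documented under Configuration below; the timing invocation itself is just illustrative):

```bash
# Force local MLX inference (see Configuration below).
export TAU_VISION_PROVIDER=mlx
export TAU_VISION_MODEL=qwen3-vl

# Rough latency check: expect on the order of ~500 ms per image on an
# M3 Max, plus one-time model load on the first call.
time tau mcp call understand_image --image_path ./screenshot.png
```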
## Cloud Fallback
When MLX is unavailable (on non-macOS systems, or without a supported GPU), TAU falls back to cloud vision:
```bash
# Configure the cloud vision provider
export TAU_VISION_PROVIDER=anthropic   # or: openai, google

# Fallback chain:
# 1. MLX (macOS Apple Silicon)
# 2. Anthropic Claude 4.5 Vision
# 3. OpenAI GPT-5 Vision
# 4. Google Gemini 3 Pro Vision
```
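The cloud providers also need credentials. TAU's exact variable names are not documented here, so this sketch uses each provider's conventional environment variables; verify against your TAU version:

```bash
# Assumed credential variables (conventional provider names).
export ANTHROPIC_API_KEY=...   # Claude 4.5 Vision
export OPENAI_API_KEY=...      # GPT-5 Vision
export GOOGLE_API_KEY=...      # Gemini 3 Pro Vision
```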
## Usage Examples

### Analyze Screenshot
```
# In a TAU conversation:
> Take a screenshot of the login page and tell me what's wrong

# TAU automatically:
# 1. Opens the browser
# 2. Takes a screenshot
# 3. Sends it to the vision model
# 4. Returns the analysis
```
### Analyze Local Image

```bash
# Drag and drop an image into the TUI, or:
> Describe this image: @screenshot.png

# Or via the CLI:
tau mcp call understand_image --image_path ./screenshot.png
```
### Compare UI States

```
# Compare before/after screenshots:
> Compare these two images and tell me what changed:
> @before.png @after.png
```
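The same comparison should be reachable from the CLI via `understand_images`. Only the `understand_image --image_path` form is documented above, so the repeated flag and `--prompt` below are assumptions:

```bash
# Hypothetical CLI form of the comparison; the repeated --image_path
# and the --prompt flag are assumptions, not documented behavior.
tau mcp call understand_images \
  --image_path ./before.png \
  --image_path ./after.png \
  --prompt "What changed between these two screenshots?"
```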
## Browser Integration

Vision works seamlessly with the browser MCP:
```
# browser_screenshot_describe combines screenshot + analysis:
> Open https://my-app.com and describe what you see

# Internally, this runs:
# 1. browser_open("https://my-app.com")
# 2. browser_screenshot()
# 3. understand_image(screenshot_base64)
# 4. Returns a visual description

# Useful for:
# - Visual regression testing
# - Accessibility audits
# - UI debugging
# - Layout verification
```
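For instance, a scripted visual smoke check could call `browser_screenshot_describe` directly. This is a sketch: these docs only show `tau mcp call` with `understand_image`, so the `--url` and `--prompt` flags are assumptions:

```bash
# Sketch of a one-shot visual check; --url and --prompt are assumed flags.
tau mcp call browser_screenshot_describe \
  --url https://my-app.com \
  --prompt "List any broken layout, overlapping elements, or missing images" \
  > visual-report.txt
```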
## Configuration

```bash
# Environment variables
export TAU_VISION_PROVIDER=mlx     # mlx, anthropic, openai, google
export TAU_VISION_MODEL=qwen3-vl   # For MLX
export TAU_VISION_MAX_TOKENS=1024  # Max response tokens
```
```toml
# In config.toml
[vision]
provider = "mlx"
fallback_provider = "anthropic"
max_image_size_mb = 10
```
## TUI Indicator

When vision tools are active, the TUI status bar shows:
`[B] [W] [V] [T]`, where `[V]` indicates Vision is active.
## Best Practices
- **Use specific prompts.** "Describe the navigation menu" gets better results than "Describe the image".
- **Optimize image size.** Large images slow down analysis; see the resize sketch below.
- **Prefer `browser_screenshot_describe`.** It combines the screenshot and the analysis in a single call.
- **Use MLX when possible.** Local inference on macOS is faster and free.
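As referenced above, one way to shrink screenshots before analysis on macOS is the built-in `sips` tool; the 1600 px width cap is a sensible default, not a documented TAU limit:

```bash
# Downscale a large screenshot before sending it for analysis.
# `sips` ships with macOS; 1600 px wide is a reasonable, not mandated, cap.
sips --resampleWidth 1600 ./screenshot.png --out ./screenshot-small.png
tau mcp call understand_image --image_path ./screenshot-small.png
```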