
Vision

TAU's Vision MCP gives any model image understanding — even those without native vision support.

How It Works

Vision MCP provides a bridge between models and visual content:

┌─────────────────────────────────────────────────────────────────┐
│                       Vision MCP Flow                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   User/Agent                                                    │
│       │                                                         │
│       ▼                                                         │
│   understand_image(base64_image)                                │
│       │                                                         │
│       ├── MLX VLM (macOS, local, free)                         │
│       │   └── Qwen3-VL on Apple Silicon                        │
│       │                                                         │
│       └── Cloud fallback (if MLX unavailable)                  │
│           ├── Claude 4.5 Vision                                │
│           ├── GPT-5 Vision                                     │
│           └── Gemini 3 Pro Vision                              │
│                                                                 │
│       ▼                                                         │
│   Text description returned to agent                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
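
Any MCP client can drive this flow directly. Below is a minimal Python sketch using the official MCP client SDK; the stdio launch command (tau mcp serve) and the image parameter name are assumptions, so adjust them to your install:

import asyncio
import base64

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Assumption: the Vision MCP server is exposed over stdio via a command like this.
    server = StdioServerParameters(command="tau", args=["mcp", "serve"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # understand_image expects the image as a base64 string.
            with open("screenshot.png", "rb") as f:
                image_b64 = base64.b64encode(f.read()).decode()

            result = await session.call_tool(
                "understand_image",
                arguments={"image": image_b64},  # parameter name is an assumption
            )

            # The tool returns a text description for the calling agent.
            for item in result.content:
                if getattr(item, "text", None):
                    print(item.text)

asyncio.run(main())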

Vision Tools

Tool                          Description
understand_image              Analyze a single image (base64 encoded)
understand_images             Compare and analyze multiple images
file_describe_image           Describe image from file path
browser_screenshot_describe   Take browser screenshot and analyze in one call

MLX Local Vision (macOS)

On Apple Silicon Macs, TAU uses a local Vision-Language Model powered by MLX:

  • Model: Qwen3-VL (quantized for MLX)
  • Performance: ~500ms per image on M3 Max
  • Privacy: Images never leave your machine
  • Cost: Free (local inference)

Cloud Fallback

When MLX is unavailable (non-macOS systems, or Macs without Apple Silicon), TAU falls back to cloud vision:

# Configure cloud vision provider
export TAU_VISION_PROVIDER=anthropic    # or openai, google

# Fallback chain:
# 1. MLX (macOS Apple Silicon)
# 2. Anthropic Claude 4.5 Vision
# 3. OpenAI GPT-5 Vision
# 4. Google Gemini 3 Pro Vision
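
The sketch below shows how such a chain could be resolved at runtime. It is illustrative logic, not TAU's actual source; the API-key variable names are the providers' conventional ones and may not match what TAU checks:

import os
import platform

def pick_vision_provider() -> str:
    """Illustrative provider selection that mirrors the fallback chain above."""
    explicit = os.environ.get("TAU_VISION_PROVIDER")
    if explicit and explicit != "mlx":
        return explicit  # user pinned a specific cloud provider

    # MLX is only usable on Apple Silicon Macs.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"

    # Otherwise walk the cloud chain in order, taking the first provider
    # that has credentials configured.
    for provider, key_var in [
        ("anthropic", "ANTHROPIC_API_KEY"),
        ("openai", "OPENAI_API_KEY"),
        ("google", "GOOGLE_API_KEY"),
    ]:
        if os.environ.get(key_var):
            return provider

    raise RuntimeError("no vision provider available")

print(pick_vision_provider())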

Usage Examples

Analyze Screenshot

# In TAU conversation:
> Take a screenshot of the login page and tell me what's wrong

# TAU automatically:
# 1. Opens browser
# 2. Takes screenshot
# 3. Sends to vision model
# 4. Returns analysis

Analyze Local Image

# Drag and drop image into TUI, or:
> Describe this image: @screenshot.png

# Or via CLI:
tau mcp call file_describe_image --image_path ./screenshot.png

Compare UI States

# Compare before/after screenshots:
> Compare these two images and tell me what changed:
> @before.png @after.png
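
Programmatically, the same comparison maps to the understand_images tool. A minimal Python sketch using the MCP client SDK, with assumed launch command and parameter names:

import asyncio
import base64

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

async def main() -> None:
    # Launch command is an assumption; see the single-image example above.
    server = StdioServerParameters(command="tau", args=["mcp", "serve"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "understand_images",
                arguments={
                    # Both argument names are assumptions.
                    "images": [encode("before.png"), encode("after.png")],
                    "prompt": "What changed between these two screenshots?",
                },
            )
            print(result.content)

asyncio.run(main())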

Browser Integration

Vision works seamlessly with the browser MCP:

# browser_screenshot_describe combines screenshot + analysis:
> Open https://my-app.com and describe what you see

# This internally:
# 1. browser_open("https://my-app.com")
# 2. browser_screenshot()
# 3. understand_image(screenshot_base64)
# 4. Returns visual description

# Useful for:
# - Visual regression testing
# - Accessibility audits
# - UI debugging
# - Layout verification
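
The same chain can be written out as explicit MCP tool calls. The sketch below is illustrative only; the launch command, the argument names, and the shape of the screenshot result are assumptions:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch command is an assumption.
    server = StdioServerParameters(command="tau", args=["mcp", "serve"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # 1. Open the page.
            await session.call_tool("browser_open", arguments={"url": "https://my-app.com"})

            # 2. Capture a screenshot; assumed to come back as base64 text content.
            shot = await session.call_tool("browser_screenshot", arguments={})
            screenshot_b64 = shot.content[0].text

            # 3. Describe it.
            description = await session.call_tool(
                "understand_image",
                arguments={"image": screenshot_b64},  # parameter name is an assumption
            )
            print(description.content)

            # Or let the combined tool do all three steps in one call.
            combined = await session.call_tool(
                "browser_screenshot_describe",
                arguments={"url": "https://my-app.com"},  # parameter name is an assumption
            )
            print(combined.content)

asyncio.run(main())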

Configuration

# Environment variables
export TAU_VISION_PROVIDER=mlx           # mlx, anthropic, openai, google
export TAU_VISION_MODEL=qwen3-vl         # For MLX
export TAU_VISION_MAX_TOKENS=1024        # Max response tokens

# In config.toml
[vision]
provider = "mlx"
fallback_provider = "anthropic"
max_image_size_mb = 10
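
A small sketch of how the two sources might be merged, with environment variables assumed to take precedence over config.toml (the precedence order and defaults here are illustrative, not confirmed behavior):

import os
import tomllib  # Python 3.11+

DEFAULTS = {
    "provider": "mlx",
    "fallback_provider": "anthropic",
    "max_image_size_mb": 10,
    "max_tokens": 1024,
}

def load_vision_config(path: str = "config.toml") -> dict:
    """Merge defaults, the [vision] table from config.toml, and TAU_VISION_* env vars."""
    cfg = dict(DEFAULTS)

    try:
        with open(path, "rb") as f:
            cfg.update(tomllib.load(f).get("vision", {}))
    except FileNotFoundError:
        pass  # the config file is optional

    # Environment variables override the file (assumed precedence).
    if "TAU_VISION_PROVIDER" in os.environ:
        cfg["provider"] = os.environ["TAU_VISION_PROVIDER"]
    if "TAU_VISION_MAX_TOKENS" in os.environ:
        cfg["max_tokens"] = int(os.environ["TAU_VISION_MAX_TOKENS"])

    return cfg

print(load_vision_config())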

TUI Indicator

When vision tools are active, the TUI status bar shows:

[B] [W] [V] [T] — [V] = Vision active

Best Practices

  • Use specific prompts — "Describe the navigation menu" vs "Describe the image"
  • Optimize image size — Large images slow down analysis (see the resizing sketch at the end of this section)
  • Prefer browser_screenshot_describe — Combines actions efficiently
  • Use MLX when possible — Faster and free on macOS
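
For the image-size tip above, downscaling screenshots before encoding keeps payloads under max_image_size_mb and speeds up analysis. A small Pillow sketch; the 1600-pixel limit is illustrative:

import base64
import io

from PIL import Image  # pip install pillow

def image_to_b64(path: str, max_side: int = 1600) -> str:
    """Downscale so the longest side is at most max_side, return base64-encoded PNG."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio, never upscales
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

# Pass the result to understand_image instead of the raw, full-size file.
print(len(image_to_b64("screenshot.png")))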