v0.1.0 · open source

vision APIs describe.
farscry gives coordinates.

38ms. Offline. Typed coordinates.
Local. $0. No GPU. No API key.

$ farscry extract screen.png
=== farscry visual context ===
source: screen.png
screen_type: config
state_id: phash:8f4a2c9d1e3b7f6a
confidence: high
agent_context: "Payment settings - Save available"
---

[middle-right] button "Save Changes" enabled:true
[middle-center] input "Card number" empty:true
[middle-center] input "Expiry" value:"12/26"
[bottom] error "Value must be ≤ 10000"

affordances:
click "Save Changes" at (400,300) enabled:true
type "Card number" at (300,200) current:""

Coordinates, not captions

Screenshot tooling returns descriptions. Workflows guess where to click, miss the target, and fail. OSWorld benchmarks show a 58% overall task failure rate for the GPT-4V baseline (arXiv:2404.07972).

Agents that guess coordinates fail. Agents with exact coordinates act.

farscry returns typed elements with exact pixel coordinates. The workflow knows the button is at (400,300). It clicks. It succeeds.
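To make that concrete, here is a minimal sketch that turns the affordance line above into a real click. It assumes the plain-text output format shown in the example; the parsing is illustrative, not a farscry API, and xdotool (the Linux input tool) is not part of farscry:

# sketch: parse the affordance line shown above, then click (Linux, xdotool)
coords=$(farscry extract screen.png \
  | grep 'click "Save Changes"' \
  | grep -oE '\([0-9]+,[0-9]+\)' | tr -d '()')
x=${coords%,*}; y=${coords#*,}
xdotool mousemove "$x" "$y" click 1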

Works on a screenshot, an error printout, a Figma export, even a phone photo of a screen. Offline. No GPU.

farscry output
[middle-right] button "Submit" enabled:true at (640,480)
[middle-center] input "Username" empty:true at (400,200)
[middle-center] input "Password" empty:true at (400,280)
[bottom] error "Invalid password" at (400,340)
[bottom] link "Forgot password?" at (400,380)

affordances:
type "Username" at (400,200)
type "Password" at (400,280)
click "Submit" at (640,480)

Performance
65× faster than Tesseract on 4K screens · 38ms warm · $0 · offline · N=223, ScreenSpot-Pro (MIT)
Speed
                          baseline    farscry   speedup
vs Tesseract 5.5 (4K)     ~2,500ms    38ms      65×
vs Cloud Vision API       ~2–5s       38ms      65–130×
Token usage per image
                                  baseline        farscry        reduction
vs Claude Vision · 1080p          1,568 tokens    ~175 tokens    ~9× (measured)
vs Claude Vision · 4K (N=223)     1,568 tokens    ~97 tokens     ~16×
Cost · 10,000 images / day
                                   cloud           farscry
Claude Sonnet 4.6 · $3/MTok        $17,000/year    $0
Claude Opus 4.7 · $5/MTok (4K)     $90,000/year    $0
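The Sonnet figure is straight arithmetic from the token table above:

$ echo '1568 * 10000 * 365 * 3 / 1000000' | bc
17169

i.e. 1,568 tokens/image × 10,000 images/day × 365 days at $3/MTok ≈ $17,000/year.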
Methodology → github.com/teles-forge/farscry/benchmarks

Automatic Diff

38ms instead of 5 seconds, on every action.

After an action, farscry diffs before → after. No re-upload. No tokens wasted on pixels that didn't change.

without farscry
action → re-screenshot → 1,568 tokens to cloud → wait 2–5s → read result
~$0.0047 · ~3 seconds

with farscry
action → farscry diff (38ms) → read result
$0
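In an agent loop that might look like the following sketch (screencapture is the macOS screenshot tool; run_action is a placeholder for whatever the agent just did):

screencapture -x before.png
run_action                       # placeholder: the action being verified
screencapture -x after.png
farscry diff before.png after.png | claude -p "did the action succeed?"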

One binary. Four ways to use it.

All four run locally. No server, no GPU, no account.

mode 1: describe · Any image → typed coordinates

The output is typed UI elements with exact pixel coordinates, not a description. Pipe to any agent, MCP client, or CLI tool.

$ farscry extract checkout.png
=== farscry visual context ===
screen_type: form
---

[top-center] heading "Checkout"
[middle-center] input "Name" empty:true
[middle-center] input "Card number" empty:true
[middle-center] input "Expiry" value:"12/26"
[middle-right] input "CVV" masked:true
[bottom] button "Pay $24.99" enabled:true
[bottom] button "Cancel" enabled:true

affordances:
type "Name" at (340,180)
type "Card number" at (340,240)
click "Pay $24.99" at (400,420)
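Pipe it straight into an agent prompt, for example:

farscry extract checkout.png | claude -p "which required fields are still empty?"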
mode 2: diff · After every action, only what changed

In MCP mode, the daemon tracks state automatically; no extra command needed. After each farscry_extract call, the next call returns what changed since the last one. Via the CLI, pass two screenshots directly.

$ farscry diff before.png after.png
=== farscry diff ===
from: phash:8f4a2c3d
to: phash:3d9b1e7a
---

appeared: error "Card declined" at (0,48)
changed: button "Submit" enabled:true → false
changed: button "Pay $24.99" loading:true
removed: spinner at (450,200)

38ms · 3 of 12 elements changed
mode 3: pipe · Cmd+Shift+4, pipe, done

Capture a region with your system shortcut. Pipe straight to your agent. Zero files. Zero friction.

farscry extract --from-clipboard | claude -p "fix this"

Smart Paste

Cmd+V becomes image-aware.

After farscry setup, Cmd+V detects whether your clipboard holds an image or text and routes it accordingly. No command. No alias. Just paste.

$ farscry setup
Configure smart Cmd+V? [y/N]: y
✓ ~/.farscry/smart-paste.sh created
iTerm2 · Warp · Kitty · Gnome Terminal · Windows Terminal
Screenshot → Cmd+V → image detected → sent to your agent.
Text in clipboard falls back to normal paste.
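Conceptually, the generated script only has to branch on clipboard type. A rough macOS sketch, not the literal contents of ~/.farscry/smart-paste.sh:

# rough sketch of the routing, macOS flavor
# 'clipboard info' lists clipboard data types; PNGf/TIFF means an image
if osascript -e 'clipboard info' | grep -qE 'PNGf|TIFF'; then
  farscry extract --from-clipboard   # image → typed coordinates
else
  pbpaste                            # text → plain paste
fi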

Install

Up in 30 seconds.

npm npm install -g farscry
pip pip install farscry
homebrew brew install teles-forge/tap/farscry
cargo cargo install farscry
curl curl -fsSL https://farscry.dev/install | sh
$ farscry setup
auto-detects MCP clients and wires them up in 30 seconds

Models (~12MB, English) download to ~/.farscry/models/ on first run. No account needed.


The open protocol behind farscry

VASP (Visual Application State Protocol) defines how workflows receive visual context: typed elements with coordinates, not descriptions. farscry is the reference implementation.

Just as MCP standardized tool connectivity, VASP standardizes visual context for workflows. One format, any framework, any workflow.

vasp-protocol.github.io/spec · Local docs

MCP = how workflows connect to tools
VASP = how workflows understand visual state
$ farscry setup
farscry v0.1.0

Detected: Claude Code, Cursor

── Claude Code ──
Config file: ~/.claude/mcp.json

"mcpServers": {
  "farscry": {
    "command": "farscry",
    "args": ["serve", "--mcp"]
  }
}

farscry never modifies your config files automatically.