38ms. Offline. Typed coordinates.
Local. $0. No GPU. No API key.
Screenshot tooling returns descriptions. Workflows guess where to click, miss the target, and fail. On the OSWorld benchmark, the GPT-4V baseline shows a 58% overall task failure rate (arXiv:2404.07972).
Agents that guess coordinates fail. Agents with exact coordinates act.
farscry returns typed elements with exact pixel coordinates. The workflow knows the button is at (400,300). It clicks. It succeeds.
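A sketch of what one element might look like. The field names here (role, text, bbox, center) are illustrative assumptions, not farscry's documented schema:

```python
# Illustrative shape of one extracted element. Field names are
# assumptions, not farscry's documented schema.
element = {
    "role": "button",                # element type, not a pixel blob
    "text": "Submit",                # recognized label
    "bbox": [370, 285, 430, 315],    # left, top, right, bottom in pixels
    "center": [400, 300],            # exact click target
}

# A workflow clicks the center point directly. No guessing.
x, y = element["center"]
print(f"click at ({x}, {y})")
```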
Screenshot, error printout, Figma export, phone photo of a screen. Offline. No GPU.
After an action, farscry diffs before → after. No re-upload. No tokens wasted on pixels that didn't change.
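A sketch of a possible delta payload; the added/removed/changed structure is an assumption for illustration, not farscry's documented output:

```python
# Illustrative delta after an action. Only changed elements appear;
# the structure is an assumption, not farscry's documented output.
delta = {
    "added":   [{"role": "dialog", "text": "Saved", "center": [512, 240]}],
    "removed": [{"role": "button", "text": "Submit", "center": [400, 300]}],
    "changed": [],
}
print(len(delta["added"]), "new elements since the last frame")
```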
All four input types are processed locally. No server, no GPU, no account.
The output is typed UI elements with exact pixel coordinates, not a description. Pipe to any agent, MCP client, or CLI tool.
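A sketch of a downstream consumer reading that output from a pipe; the extract subcommand and the field names are assumptions:

```python
# consume.py -- sketch of a tool reading farscry output from a pipe,
# e.g. `farscry extract shot.png | python consume.py`. The `extract`
# subcommand and the field names are assumptions.
import json
import sys

for el in json.load(sys.stdin):
    x, y = el["center"]
    print(f'{el["role"]} "{el["text"]}" at ({x}, {y})')
```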
In MCP mode, the daemon tracks state automatically; no extra command is needed. After each farscry_extract call, the next call returns only what changed since the last one.
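A minimal client sketch using the official MCP Python SDK (pip install mcp); the farscry mcp launch command and the empty argument dict are assumptions:

```python
# Sketch: two extract calls over MCP. The `farscry mcp` launch command
# and the empty argument dict are assumptions; the SDK calls are from
# the official MCP Python SDK.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="farscry", args=["mcp"])  # assumed
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            full = await session.call_tool("farscry_extract", {})   # full element list
            # ... the workflow clicks something here ...
            delta = await session.call_tool("farscry_extract", {})  # only the changes
            print(delta.content)

asyncio.run(main())
```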
Via the CLI, pass two screenshots directly.
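A sketch of scripting that; the diff subcommand and JSON-on-stdout behavior are assumptions, not documented flags:

```python
# Sketch: diffing two saved screenshots from a script. The `diff`
# subcommand and JSON output are assumptions; check farscry's help.
import json
import subprocess

proc = subprocess.run(
    ["farscry", "diff", "before.png", "after.png"],  # assumed subcommand
    capture_output=True, text=True, check=True,
)
print(json.loads(proc.stdout))
```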
Capture a region with your system shortcut. Pipe straight to your agent. Zero files. Zero friction.
After running farscry setup, Cmd+V detects whether your clipboard holds an image or text and routes it accordingly. No command. No alias. Just paste.
Models (~12 MB for English) download to ~/.farscry/models/ on first run. No account needed.
VASP (Visual Application State Protocol) defines how workflows receive visual context: as typed elements with exact positions, not descriptions. farscry is the reference implementation.
Just as MCP standardized tool connectivity, VASP standardizes visual context for workflows. One format, any framework, any workflow.
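Illustrative only, a guess at what a VASP element record could look like; the real schema is defined in the spec linked below:

```python
# Illustrative only: a guess at a VASP-style element record. The
# actual schema is defined by the spec linked below.
from dataclasses import dataclass

@dataclass
class VaspElement:
    role: str                         # e.g. "button", "textfield"
    text: str                         # recognized label
    bbox: tuple[int, int, int, int]   # left, top, right, bottom in pixels
    center: tuple[int, int]           # exact click target

el = VaspElement("button", "Submit", (370, 285, 430, 315), (400, 300))
print(el.center)
```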
Spec: vasp-protocol.github.io/spec