v0.1.0 · open source

vision APIs describe.
farscry gives coordinates.

38ms. Offline. Typed coordinates.
Local. $0. No GPU. No API key.

$ farscry extract screen.png
=== farscry visual context ===
source: screen.png
screen_type: config
state_id: phash:8f4a2c9d1e3b7f6a
confidence: high
agent_context: "Payment settings - Save available"
---

[middle-right] button "Save Changes" enabled:true
[middle-center] input "Card number" empty:true
[middle-center] input "Expiry" value:"12/26"
[bottom] error "Value must be ≤ 10000"

affordances:
click "Save Changes" at (400,300) enabled:true
type "Card number" at (300,200) current:""

Coordinates, not captions

Screenshot tooling returns descriptions. Workflows guess where to click, miss the target, and fail. OSWorld benchmarks show a 58% overall task failure rate for the GPT-4V baseline (arXiv:2404.07972).

Agents that guess coordinates fail. Agents with exact coordinates act.

farscry returns typed elements with exact pixel coordinates. The workflow knows the button is at (400,300). It clicks. It succeeds.
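To make that concrete, here is a minimal sketch that turns the affordance line above into a real click. It assumes the plain-text output format shown in the example; the parsing is illustrative, not a farscry API, and xdotool (the Linux input tool) is not part of farscry:

# sketch: parse the affordance line shown above, then click (Linux, xdotool)
coords=$(farscry extract screen.png \
  | grep 'click "Save Changes"' \
  | grep -oE '\([0-9]+,[0-9]+\)' | tr -d '()')
x=${coords%,*}; y=${coords#*,}
xdotool mousemove "$x" "$y" click 1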

Works on a screenshot, an error printout, a Figma export, even a phone photo of a screen. Offline. No GPU.

farscry output
[middle-right] button "Submit" enabled:true at (640,480)
[middle-center] input "Username" empty:true at (400,200)
[middle-center] input "Password" empty:true at (400,280)
[bottom] error "Invalid password" at (400,340)
[bottom] link "Forgot password?" at (400,380)

affordances:
type "Username" at (400,200)
type "Password" at (400,280)
click "Submit" at (640,480)

Performance
65× faster than Tesseract on 4K screens · 38ms warm · $0 · offline · N=223, ScreenSpot-Pro (MIT)
Speed
                          baseline    farscry   speedup
vs Tesseract 5.5 (4K)     ~2,500ms    38ms      65×
vs Cloud Vision API       ~2–5s       38ms      65–130×
Token usage per image
                                  baseline        farscry        reduction
vs Claude Vision · 1080p          1,568 tokens    ~175 tokens    ~9× (measured)
vs Claude Vision · 4K (N=223)     1,568 tokens    ~97 tokens     ~16×
Cost · 10,000 images / day
                                   cloud           farscry
Claude Sonnet 4.6 · $3/MTok        $17,000/year    $0
Claude Opus 4.7 · $5/MTok (4K)     $90,000/year    $0
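The Sonnet figure is straight arithmetic from the token table above:

$ echo '1568 * 10000 * 365 * 3 / 1000000' | bc
17169

i.e. 1,568 tokens/image × 10,000 images/day × 365 days at $3/MTok ≈ $17,000/year.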
Methodology → github.com/teles-forge/farscry/benchmarks

Automatic Diff

38ms instead of 5 seconds, on every action.

After an action, farscry diffs before → after. No re-upload. No tokens wasted on pixels that didn't change.

without farscry
action → re-screenshot → 1,568 tokens to cloud → wait 2–5s → read result
~$0.0047 · ~3 seconds

with farscry
action → farscry diff (38ms) → read result
$0
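In an agent loop that might look like the following sketch (screencapture is the macOS screenshot tool; run_action is a placeholder for whatever the agent just did):

screencapture -x before.png
run_action                       # placeholder: the action being verified
screencapture -x after.png
farscry diff before.png after.png | claude -p "did the action succeed?"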

One binary. Four ways to use it.

All four run locally. No server, no GPU, no account.

mode 1: describe · Any image → typed coordinates

The output is typed UI elements with exact pixel coordinates, not a description. Pipe to any agent, MCP client, or CLI tool.

$ farscry extract checkout.png
=== farscry visual context ===
screen_type: form
---

[top-center] heading "Checkout"
[middle-center] input "Name" empty:true
[middle-center] input "Card number" empty:true
[middle-center] input "Expiry" value:"12/26"
[middle-right] input "CVV" masked:true
[bottom] button "Pay $24.99" enabled:true
[bottom] button "Cancel" enabled:true

affordances:
type "Name" at (340,180)
type "Card number" at (340,240)
click "Pay $24.99" at (400,420)
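Pipe it straight into an agent prompt, for example:

farscry extract checkout.png | claude -p "which required fields are still empty?"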
mode 2: diff · After every action, only what changed

In MCP mode, the daemon tracks state automatically; no extra command needed. After each farscry_extract call, the next call returns what changed since the last one. Via the CLI, pass two screenshots directly.

$ farscry diff before.png after.png
=== farscry diff ===
from: phash:8f4a2c3d
to: phash:3d9b1e7a
---

appeared: error "Card declined" at (0,48)
changed: button "Submit" enabled:true → false
changed: button "Pay $24.99" loading:true
removed: spinner at (450,200)

38ms · 3 of 12 elements changed
mode 3: pipe · Cmd+Shift+4, pipe, done

Capture a region with your system shortcut. Pipe straight to your agent. Zero files. Zero friction.

farscry extract --from-clipboard | claude -p "fix this"

Smart Paste

Cmd+V becomes image-aware.

After farscry setup, Cmd+V detects whether your clipboard holds an image or text and routes it accordingly. No command. No alias. Just paste.

$ farscry setup
Configure smart Cmd+V? [y/N]: y
✓ ~/.farscry/smart-paste.sh created
iTerm2 · Warp · Kitty · Gnome Terminal · Windows Terminal
Screenshot → Cmd+V → image detected → sent to your agent.
Text in clipboard falls back to normal paste.
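Conceptually, the generated script only has to branch on clipboard type. A rough macOS sketch, not the literal contents of ~/.farscry/smart-paste.sh:

# rough sketch of the routing, macOS flavor
# 'clipboard info' lists clipboard data types; PNGf/TIFF means an image
if osascript -e 'clipboard info' | grep -qE 'PNGf|TIFF'; then
  farscry extract --from-clipboard   # image → typed coordinates
else
  pbpaste                            # text → plain paste
fi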

Install

Up in 30 seconds.

npm npm install -g farscry
pip pip install farscry
homebrew brew install teles-forge/tap/farscry
cargo cargo install farscry
curl curl -fsSL https://farscry.dev/install | sh
$ farscry setup
auto-detects MCP clients and wires them up in 30 seconds

Models (~12MB, English) download to ~/.farscry/models/ on first run. No account needed.


The open protocol behind farscry

VASP (Visual Application State Protocol) defines how workflows receive visual context: typed elements with coordinates, not descriptions. farscry is the reference implementation.

Just as MCP standardized tool connectivity, VASP standardizes visual context for workflows. One format, any framework, any workflow.

vasp-protocol.github.io/spec · Local docs

MCP = how workflows connect to tools
VASP = how workflows understand visual state
$ farscry setup
farscry v0.1.0

Detected: Claude Code, Cursor

── Claude Code ──
Config file: ~/.claude/mcp.json

"mcpServers": {
  "farscry": {
    "command": "farscry",
    "args": ["serve", "--mcp"]
  }
}

farscry never modifies your config files automatically.