glovebox-mcp — Segentic Lab

glovebox-mcp is a Model Context Protocol server that gives an AI agent a real desktop to drive — mouse, keyboard, screenshots, and vision grounding — sealed inside a nested X11 window so it can never touch your actual screen, files, or other apps.

Like a lab glovebox: the agent reaches in and manipulates real applications, isolated from everything else. The host can run Wayland; the sandbox gives the agent a real X server to drive, and you can watch it live or close it instantly.

Demo: an AI agent fills in a sign-up form inside the sandbox (cursor movement, Unicode typing, submit).

What makes it different

Any MCP client / harness. Claude Code, Cursor, Codex, or your own agent — it’s a standard MCP server, not tied to any host.
Selectable vision backends. none (the agent’s own vision reads the screenshots), basic (Tesseract OCR — text + coordinates, CPU), or local (OmniParser on a GPU — icons and text with pixel-precise boxes). One install flag switches modes.
Multi-instance, truly parallel. launch_app spins up any GUI app in its own display — each with its own cursor — so several sub-agents can work at once, one window each.
Unicode-safe input. Diacritics (č/š/ž…) are inserted via the clipboard, because some GTK apps silently drop synthetic Unicode keystrokes.
Files, both directions. Browser uploads go through the Chrome DevTools Protocol, because the native file picker hangs the snap build of Chromium. Native apps use open_file. Every instance gets a files/<N>/ staging folder, and browser downloads land there automatically.
One-call observe. Actions can return the resulting screenshot in the same call, so routine steps don’t bloat the agent’s context.
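The clipboard fallback described under "Unicode-safe input" can be sketched as a small planning function. This is a minimal illustration, not glovebox-mcp's actual code: the names Step and plan_text_input are hypothetical, and the use of xdotool and xclip is an assumption about how such input could be driven on X11. The function only builds the commands; it does not run them.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    """One command to run, plus optional text to feed on its stdin."""
    argv: list
    stdin: Optional[str] = None


def plan_text_input(text: str) -> list:
    """Plan how to enter `text` into the focused window (hypothetical helper).

    ASCII-only text is typed as synthetic keystrokes. Anything else is
    copied to the clipboard and pasted, so the app never sees a synthetic
    Unicode key event.
    """
    if text.isascii():
        return [Step(["xdotool", "type", text])]
    return [
        # xclip reads the clipboard contents from stdin.
        Step(["xclip", "-selection", "clipboard"], stdin=text),
        # Paste into the focused window.
        Step(["xdotool", "key", "ctrl+v"]),
    ]


if __name__ == "__main__":
    for step in plan_text_input("Ana Žagar"):
        print(step.argv, "stdin:", repr(step.stdin))
```

Keeping the decision separate from execution makes the fallback easy to check without an X server; a real driver would pass each Step to subprocess.run on the sandbox's display.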

Get it

Open-source under MIT → github.com/segentic-lab/glovebox-mcp.

Installation is one line per vision mode; the last argument of the command (none in the example here) selects the mode. The installer sets up the X11 sandbox (automatic on Debian/Ubuntu, guided on Fedora/Arch) and writes your MCP client config with the right paths:

git clone https://github.com/segentic-lab/glovebox-mcp && cd glovebox-mcp && ./install.sh none

It ships with an AGENTS.md to paste into your agent’s system prompt. It covers the observe → act → verify loop, grounding, the upload and Unicode gotchas, and when to stop. If it’s useful, a ⭐ on the repo helps.
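The observe → act → verify loop, together with a stopping rule, might look roughly like the sketch below. The contents of AGENTS.md are not reproduced here, so this shape is an assumption: run_loop and its observe, act, and verify callables are hypothetical stand-ins for whatever tools and checks the agent actually uses. Following the one-call observe idea, act is assumed to return the resulting observation.

```python
from typing import Any, Callable, Optional


def run_loop(
    observe: Callable[[], Any],
    act: Callable[[Any], Any],
    verify: Callable[[Any], bool],
    max_steps: int = 5,
) -> Optional[int]:
    """Run observe -> act -> verify until the goal is met or steps run out.

    Returns the number of actions taken, or None if the step budget was
    exhausted (the "when to stop" case).
    """
    state = observe()              # initial screenshot / observation
    if verify(state):
        return 0                   # already in the desired state
    for step in range(1, max_steps + 1):
        state = act(state)         # the action returns the new observation
        if verify(state):
            return step
    return None


if __name__ == "__main__":
    # Toy stand-in: the "screen" is a counter and the goal is to reach 3.
    taken = run_loop(lambda: 0, lambda s: s + 1, lambda s: s >= 3)
    print("actions taken:", taken)
```

The explicit step budget is the design point: an agent that cannot verify progress stops and reports instead of clicking indefinitely.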