Let AI control anything on your computer through GUI interaction. All in one Exe, no install required.
  • C# 70.5%
  • JavaScript 12.6%
  • HTML 10.7%
  • CSS 6.1%
  • PowerShell 0.1%
Find a file
dikkadev 3c60c04724 feat(agent): render live output as markdown and add watch-based dev runner
Render agent thoughts and results as sanitized markdown so headings, lists, code blocks, and links display cleanly during execution, and switch the live step log to newest-first ordering for easier monitoring. Also document the new watch-based dev script flow and add a no-watch option for one-shot runs.
2026-05-30 15:12:36 +02:00
.github/assets Add icon to readme 2026-05-27 22:26:17 -07:00
src feat(agent): render live output as markdown and add watch-based dev runner 2026-05-30 15:12:36 +02:00
.gitattributes Add .gitattributes and .gitignore. 2026-03-01 13:29:44 -07:00
.gitignore Add .gitattributes and .gitignore. 2026-03-01 13:29:44 -07:00
AGENTS.md docs: add agent onboarding and repo safety notes 2026-05-30 10:45:09 +02:00
LICENSE Add note to license file 2026-05-27 22:01:37 -07:00
README.md feat(agent): render live output as markdown and add watch-based dev runner 2026-05-30 15:12:36 +02:00
run-dev.ps1 feat(agent): render live output as markdown and add watch-based dev runner 2026-05-30 15:12:36 +02:00

Universal Visual Agent

An AI desktop assistant app capable of interacting with your entire computer (and any apps) like you do.

For a one-level-down technical map, see ARCHITECTURE.md.

What It Does (And Why "Universal"?)

Simply put, it lets your AI works across the whole computer. Unlike most AI "computer use" tools which only work in a browser or via command line, this uses the computer like you do.

It controls Windows purely through visual perception and GUI interaction. By interpreting raw pixels and sending hardware-level input (mouse movements, clicks, keystrokes), it operates exactly like a human would. This makes it universally compatible with any graphical application on your machine.

Optional: Human Control Only Mode

It now supports two ways to use it: a Human Control Only Mode that is enabled by default, where the AI guides you step-by-step and shows where to click without sending any real input itself, and an autonomous mode where it can move the mouse and type on its own.

Demonstration

Example of it queuing multiple actions at once, while accurately clicking exact coordinates within the entire 4K screen.

Demo Gif

Prompt: In MS Paint draw a self portrait with multiple colors with the brush tool. Use the full action queue when possible.
Model: Gemini 3.5 Flash

Example Use Cases

  • Ask it to show you how to do something instead of just getting a text description like other AIs.
  • Ask it to troubleshoot some error you're getting and figure out the cause.
  • Ask it to visually find an image in a folder of un-labelled files, based on a description.
  • Tell it to check every 10 seconds if a video render is done, then when it is, copy the file somewhere.
    • (Though not recommended to leave unattended, unless in a controlled sandbox environment)

Frequently Asked Questions

Q: Is this like OpenClaw or Hermes?

A: No, this is not intended to be a 24/7 running agent. It also doesn't rely on CLI/Shell commands. It's meant for individual tasks or problems you'd normally have to do yourself.

Q: How long can it run?

A: There's not actually a limit. You can set the max number of steps to any number in the settings. The default is arbitrarily set to 100 steps.

Q: Which AI Services are supported?

A: Currently ChatGPT, Gemini, and Claude. Your own API key is required. Currently it seems Gemini works the best, especially gemini-flash-latest

Q: Doesn't this use a ton of tokens?

A: Sort of, but not as much as you might think. Each step is maybe 3k tokens, but input tokens are cheaper. Completion tokens are usually as few as 50, up to a few hundred for many queued actions. The big factor is how many thinking tokens are used.

  • For example, with Gemini 3.5 Flash, it seems each step with a single action costs about 1 cent or less.
  • I recommend setting thinking to the minimum, but even then it may use a few thousand tokens if it's going to queue up a lot of actions.
  • There's also context mitigation logic, such as summarization context every X steps, and removing past images from the context window (Both can be disabled in settings)

Q: Can I do other stuff while it's going?

A: Not really. It won't block your mouse or keyboard input or anything. But it's best to not touch anything while it's running to prevent interfering. You can do little stuff between steps to help it though, like if it clicked the wrong thing, click it yourself.

  • You can use global keyboard hotkeys to pause or stop it at any time.

Q: What is Human Only Control Mode?

A: By default it starts in Human Control Only Mode, where the AI tells you what to do and where to click while you perform the actions yourself. It draws crosshairs for where to click, and boxes around where to enter text. It also displays a small text box you can copy the recommended text from. This can be switched to fully autonomous mode in the config settings.


How It's Built Different

  • Single Portable Exe - NO Installation Required - Releases are compiled with single-exe mode, it's just one file.
    • No bloated 🤡Python🤡 or 🤡NodeJS🤡 or other environment installation.
    • ZERO Third-Party Dependencies 😤 (Uses core .NET Libraries and official Microsoft packages only) #AllMyHomiesHateDependencies
    • Ideal for running in VMs and Sandboxes. Spin up a fresh sandbox instance and it's ready to go, just set the API key.
  • Visual-Only Operation: Works on any app, regardless of underlying framework, because it relies strictly on screen pixels.
  • Multiple AI Providers: Supports Google Gemini (default), OpenAI (ChatGPT), and Anthropic (Claude).

Additional Features:

  • Human Control Only Mode: Enabled by default. The AI guides you with step-by-step instructions and on-screen click markers, but it never sends mouse or keyboard input until you disable that mode.
  • Global Hotkey Support: Pause, resume, or terminate the agent instantly even when the web UI is minimized.
  • Live Agent Redirection: Issue mid-flight text instructions to the agent to override or adjust its current execution plan.
  • Config Import/Export: Export your config options to a file and import it. Settings are also stored in the browser to survive between sessions.
  • .NET Based - Theoretically Cross Platform - Currently the only input providers are set up for Windows, but it could work with MacOS or even Linux if someone implemented the interfaces for their APIs.

Comparison With Computer-Use Tools

Feature Universal Visual Agent OpenAI Operator Google Gemini Computer Use Anthropic Computer Use Microsoft Research UFO (UFO³)
Ready-to-Run App
Ready Out of the Box

N/A
(Web Hosted)

Dev API,
Not an app4

Dev API,
Not an app5

Research Framework
Setup Difficulty
Easy
(Just launch the portable .exe)

Easy
(Log into web service)

Hard
(Requires Python, Playwright)7

Hard
(Requires custom tooling)8

Hard
(Conda, pip installs, YAML configuration)9
Computer-Wide Control
Yes

No
(Web-Only)10

No
(Web-Only)1

Not By Itself
(Needs external app to handle input)

Yes
Recommended / Max Resolution
4K+
(Depends on chosen model)

1600x900
(Recommended Resolution)6

1440x900
(Recommended Resolution)2

~2560x1440
(Max For Opus 4.7)3

Theoretically Any Resolution
(Hybrid UIA + Vision)
Supported Models Multiple
(Gemini, OpenAI, Claude)
OpenAI Only Gemini Only Claude Only Multiple
(Gemini, OpenAI, Claude)

1. Gemini computer use announcement post states "It is not yet optimized for desktop OS-level control.""
2. Gemini docs state "The recommended screen size ... is (1440, 900)." and performance "may be impacted" with other resolutions."
3. For Opus 4.7 - Max long edge: 2576 pixels & Max total pixels: 3.75 megapixels"
4. Gemini docs state: "you need to write the client-side application code to ... execute the corresponding actions""
5. Anthropic Computer Use has a demo app implementation, but requires MacOS with Python, or setup in Docker"
6. OpenAI recommends 1440x900 or 1600x900 for optimal click accuracy (see Azure OpenAI Computer Use Guide)."
7. Gemini Computer Use requires Python + dependencies, and downloading browser binaries via Playwright."
8. Anthropic's Computer Use API only outputs proposed tool calls; developers must implement their own OS-level execution harness."
9. UFO³ setup involves installing Conda/Python, and YAML configurations."
10. OpenAI Operator (now called ChatGPT agent mode) runs within a virtual web browser hosted by OpenAI.


How it Works

The Observe-Think-Act Loop:

  1. Observe: Captures the current desktop state as an image. It does not require UI Automation APIs, screen reader support, or application-specific hooks.
  2. Think: Sends the screenshot and prompt to the AI to determine the next action.
  3. Use Input Tools: The AI chooses the desired tool (e.g., LEFT_CLICK, TYPE_TEXT) and outputs the coordinates. Through some clever prompting tricks, this is highly reliable and accurate even with high resolution (4K) screenshots
  4. Act: Simulates physical hardware inputs via native OS APIs (user32.dll / gdi32.dll).

Special Sauce

  • Queued actions - The AI can queue multiple actions where appropriate to speed through multiple similar actions, such as drawing quickly.
  • Accurate Clicking - Prompts optimized so even on a 4K screen, the latest models can be spot-on with coordinates even for small UI elements.

Security Notice

⚠️ Prototype software - Not intended for production use.
This application executes real, unauthenticated OS-level input events. Do not expose the web server port to the internet or untrusted networks. Operate only in a supervised, isolated local environment.

Setup & Usage Instructions

How to Download

  1. Go to the project's Releases page.
  2. For the latest release, look under Assets and download the Windows executable.

Usage Instructions

Note: You'll need to add your own API key for your AI of choice. I'll add local model support when I get the chance.

  1. Run the compiled executable. A local web interface will initialize (default: http://localhost:51122).
  2. Navigate to the Config menu in the web UI.
  3. Enter your required API key (e.g., Google Gemini) and adjust any desired operational parameters (model, temperature, coordinate mode).
  • Human Control Only Mode is enabled by default near the top of Config. Leave it on if you want guided/manual control, or turn it off if you want the agent to click and type autonomously. Click Save to browser.
  1. Navigate to the Agent Control panel. The page shows a prominent Human Control Mode status banner so you can confirm whether the run will be guided or autonomous before starting.
  2. Select a target monitor, enter your task directive in the Goal field, and click Start.
  3. Interrupting Execution: Use the Pause/Stop buttons in the UI, or the default global hotkeys (Ctrl+Shift+Alt+P to pause, Ctrl+Shift+Alt+S to stop).

Screenshots

image

image

Development & Compilation

Requirements:

  • Visual Studio 2026
  • .NET 10.0 SDK

Instructions:

  1. Open the solution file in src/ with Visual Studio 2026.
  2. Select your desired build configuration (Debug or Release).
  3. Compile and run the solution.

For command-line development, run ./run-dev.ps1. It restores, builds, then starts dotnet watch so C# and embedded wwwroot changes rebuild/restart automatically. Use ./run-dev.ps1 -NoWatch for a one-shot run.

Licensing - Personal Use Only

This app is source-available. Free for PERSONAL use only. You may not use it for commercial purposes (it's not ready for production anyway).

Are you a big tech company who wants to buy it and/or bring me on to build it out properly? I could be convinced.