forked from mirrors/Thio-Universal-Agent

Let AI control anything on your computer through GUI interaction. All in one Exe, no install required.

C# 68.1%
JavaScript 13.7%
HTML 11.4%
CSS 6.5%
Shell 0.2%
Other 0.1%

Find a file

dikkadev 76a339e8f9 docs(skill): add repeatable upstream fork reconciliation workflow		2026-06-01 15:11:02 +02:00
.github/assets	Add icon to readme	2026-05-27 22:26:17 -07:00
.pi/skills/upstream-fork-reconciliation	docs(skill): add repeatable upstream fork reconciliation workflow	2026-06-01 15:11:02 +02:00
src	integrate fork updates onto upstream v0.5.2	2026-06-01 15:05:18 +02:00
.gitattributes	Add .gitattributes and .gitignore.	2026-03-01 13:29:44 -07:00
.gitignore	Update .gitignore	2026-05-30 16:40:50 -07:00
AGENTS.md	docs: add agent onboarding and repo safety notes	2026-05-30 10:45:09 +02:00
LICENSE	Add note to license file	2026-05-27 22:01:37 -07:00
README.md	Update README with Human-Only exe version details	2026-05-30 14:36:50 -07:00
run-dev.ps1	fix(dev-script): disable watch mode in run-dev script	2026-06-01 14:39:38 +02:00

README.md

Thio's Universal Agent

An AI desktop assistant app capable of interacting with your entire computer (and any apps) like you do.

➤ Also, it's just one portable .exe file, no bloated install required.

What It Does (And Why "Universal"?)

Simply put, it lets your AI works across the whole computer. Unlike most AI "computer use" tools which only work in a browser or via command line, this uses the computer like you do.

It controls Windows purely through visual perception and GUI interaction. By interpreting raw pixels and sending hardware-level input (mouse movements, clicks, keystrokes), it operates exactly like a human would. This makes it universally compatible with any graphical application on your machine.

Optional: Human Control Only Mode

It now supports two ways to use it: a Human Control Only Mode that is enabled by default, where the AI guides you step-by-step and shows where to click without sending any real input itself, and an autonomous mode where it can move the mouse and type on its own.

There is also a dedicated Human-Only exe version, in cases where you never even want the option to operate in autonomous mode. In this version, the input code is physically excluded from the exe. (The other main exe can still do both)

Also See: Planned Features Page

Demonstration

Example of it queuing multiple actions at once, while accurately clicking exact coordinates within the entire 4K screen.

Demo Gif

Prompt: In MS Paint draw a self portrait with multiple colors with the brush tool. Use the full action queue when possible.
Model: Gemini 3.5 Flash

Example Use Cases

Ask it to show you how to do something instead of just getting a text description like other AIs.
Ask it to troubleshoot some error you're getting and figure out the cause.
Ask it to visually find an image in a folder of un-labelled files, based on a description.
Tell it to check every 10 seconds if a video render is done, then when it is, copy the file somewhere.
- (Though not recommended to leave unattended, unless in a controlled sandbox environment)

Frequently Asked Questions

Q: Is this like OpenClaw or Hermes?

A: No, this is not intended to be a 24/7 running agent. It also doesn't rely on CLI/Shell commands. It's meant for individual tasks or problems you'd normally have to do yourself.

Q: How long can it run?

A: There's not actually a limit. You can set the max number of steps to any number in the settings. The default is arbitrarily set to 100 steps.

Q: Which AI Services are supported?

A: Currently ChatGPT, OpenAI-compatible APIs, Gemini, Claude, and local ONNX models (ChatGPT/Gemini/Claude require your own API key). Currently it seems Gemini works the best, especially gemini-flash-latest.

Q: Doesn't this use a ton of tokens?

A: Sort of, but not as much as you might think. Each step is maybe 3k tokens, but input tokens are cheaper. Completion tokens are usually as few as 50, up to a few hundred for many queued actions. The big factor is how many thinking tokens are used.

For example, with Gemini 3.5 Flash, it seems each step with a single action costs about 1 cent or less.
I recommend setting thinking to the minimum, but even then it may use a few thousand tokens if it's going to queue up a lot of actions.
There's also context mitigation logic, such as summarization context every X steps, and removing past images from the context window (Both can be disabled in settings)

Q: Can I do other stuff while it's going?

A: Not really. It won't block your mouse or keyboard input or anything. But it's best to not touch anything while it's running to prevent interfering. You can do little stuff between steps to help it though, like if it clicked the wrong thing, click it yourself.

You can use global keyboard hotkeys to pause or stop it at any time.

Q: What is Human Only Control Mode?

A: By default it starts in Human Control Only Mode, where the AI tells you what to do and where to click while you perform the actions yourself. It draws crosshairs for where to click, and boxes around where to enter text. It also displays a small text box you can copy the recommended text from. This can be switched to fully autonomous mode in the config settings.

How It's Built Different

Single Portable Exe - NO Installation Required - Releases are compiled with single-exe mode, it's just one file.
- No bloated 🤡Python🤡 or 🤡NodeJS🤡 or other environment installation.
- ZERO Third-Party Dependencies 😤 (Uses core .NET Libraries and official Microsoft packages only) #AllMyHomiesHateDependencies
- Ideal for running in VMs and Sandboxes. Spin up a fresh sandbox instance and it's ready to go, just configure your provider settings.
Visual-Only Operation: Works on any app, regardless of underlying framework, because it relies strictly on screen pixels.
Multiple AI Providers: Supports Google Gemini (default), OpenAI (ChatGPT), OpenAI-compatible chat-completions services, Anthropic (Claude), and local ONNX Runtime GenAI models.

Additional Features:

Human Control Only Mode: Enabled by default. The AI guides you with step-by-step instructions and on-screen click markers, but it never sends mouse or keyboard input until you disable that mode.
Global Hotkey Support: Pause, resume, or terminate the agent instantly even when the web UI is minimized.
Live Agent Redirection: Issue mid-flight text instructions to the agent to override or adjust its current execution plan.
Config Import/Export: Export your config options to a file and import it. Settings are also stored in the browser to survive between sessions.
.NET Based - Theoretically Cross Platform - Currently the only input providers are set up for Windows, but it could work with MacOS or even Linux if someone implemented the interfaces for their APIs.

Comparison With Computer-Use Tools

Feature	Thio's Universal Agent	OpenAI Operator	Google Gemini Computer Use	Anthropic Computer Use	Microsoft Research UFO (UFO³)
Ready-to-Run App	Ready Out of the Box	N/A _{(Web Hosted)}	Dev API, Not an app⁴	Dev API, Not an app⁵	Research Framework
Setup Difficulty	Easy _{(Just launch the portable .exe)}	Easy _{(Log into web service)}	Hard _{(Requires Python, Playwright)}⁷	Hard _{(Requires custom tooling)}⁸	Hard _{(Conda, pip installs, YAML configuration)}⁹
Computer-Wide Control	Yes	No _(Web-Only)¹⁰	No _(Web-Only)¹	Not By Itself _{(Needs external app to handle input)}	Yes
Recommended / Max Resolution	4K+ _{(Depends on chosen model)}	1600x900 _{(Recommended Resolution)}⁶	1440x900 _{(Recommended Resolution)}²	~2560x1440 _{(Max For Opus 4.7)}³	Theoretically Any Resolution _{(Hybrid UIA + Vision)}
Supported Models	Multiple _{(Gemini, OpenAI, OpenAI-Compatible, Claude, Local ONNX)}	OpenAI Only	Gemini Only	Claude Only	Multiple _{(Gemini, OpenAI, OpenAI-Compatible, Claude)}

_{1. Gemini computer use announcement post states "It is not yet optimized for desktop OS-level control."}"
_{2. Gemini docs state "The recommended screen size ... is (1440, 900)." and performance "may be impacted" with other resolutions.}"
_{3. For Opus 4.7 - Max long edge: 2576 pixels & Max total pixels: 3.75 megapixels}"
_{4. Gemini docs state: "you need to write the client-side application code to ... execute the corresponding actions"}"
_{5. Anthropic Computer Use has a demo app implementation, but requires MacOS with Python, or setup in Docker}"
_{6. OpenAI recommends 1440x900 or 1600x900 for optimal click accuracy (see Azure OpenAI Computer Use Guide).}"
_{7. Gemini Computer Use requires Python + dependencies, and downloading browser binaries via Playwright.}"
_{8. Anthropic's Computer Use API only outputs proposed tool calls; developers must implement their own OS-level execution harness.}"
_{9. UFO³ setup involves installing Conda/Python, and YAML configurations.}"
_{10. OpenAI Operator (now called ChatGPT agent mode) runs within a virtual web browser hosted by OpenAI.}

How it Works

The Observe-Think-Act Loop:

Observe: Captures the current desktop state as an image. It does not require UI Automation APIs, screen reader support, or application-specific hooks.
Think: Sends the screenshot and prompt to the AI to determine the next action.
Use Input Tools: The AI chooses the desired tool (e.g., LEFT_CLICK, TYPE_TEXT) and outputs the coordinates. Through some clever prompting tricks, this is highly reliable and accurate even with high resolution (4K) screenshots
Act: Simulates physical hardware inputs via native OS APIs (user32.dll / gdi32.dll).

Special Sauce

Queued actions - The AI can queue multiple actions where appropriate to speed through multiple similar actions, such as drawing quickly.
Accurate Clicking - Prompts optimized so even on a 4K screen, the latest models can be spot-on with coordinates even for small UI elements.

Security Notice

⚠️ Prototype software - Not intended for production use.
This application executes real, unauthenticated OS-level input events. Do not expose the web server port to the internet or untrusted networks. Operate only in a supervised, isolated local environment.

Setup & Usage Instructions

How to Download

Go to the Releases page.
For the latest release, look under Assets and download Thio-Universal-Agent.exe.

Usage Instructions

Note: Configure the provider you want to use in the Config page. ChatGPT, Gemini, and Claude need an API key. OpenAI-compatible endpoints can use a custom URL and may omit the key for local or self-hosted services. Local ONNX models instead need a model folder path that contains genai_config.json; for screenshot-based agent tasks, use a vision-capable ONNX Runtime GenAI export.

Run the compiled executable. A local web interface will initialize (default: http://localhost:51122).
Navigate to the Config menu in the web UI.
Choose your provider and configure its settings. For ChatGPT, Gemini, and Claude, enter your API key. For OpenAI-compatible services, enter the chat-completions endpoint URL and optionally an API key. For local ONNX, set the model folder path and optional execution provider (for example DML on Windows), then adjust any desired operational parameters (model, temperature, coordinate mode).

Human Control Only Mode is enabled by default near the top of Config. Leave it on if you want guided/manual control, or turn it off if you want the agent to click and type autonomously. Click Save to browser.

Navigate to the Agent Control panel. The page shows a prominent Human Control Mode status banner so you can confirm whether the run will be guided or autonomous before starting.
Select a target monitor, enter your task directive in the Goal field, and click Start.
Interrupting Execution: Use the Pause/Stop buttons in the UI, or the default global hotkeys (Ctrl+Shift+Alt+P to pause, Ctrl+Shift+Alt+S to stop).

Screenshots

Development & Compilation

Requirements:

Visual Studio 2026
.NET 10.0 SDK

Instructions:

Open the solution file (Thio-Universal-Agent.slnx) in Visual Studio 2026.
Select your desired build configuration (Debug or Release).
Compile and run the solution.

Licensing - Personal Use Only

This app is source-available. Free for PERSONAL use only. You may not use it for commercial purposes (it's not ready for production anyway).

Are you a big tech company who wants to buy it and/or bring me on to build it out properly? I could be convinced.