UI-TARS Desktop: Local GUI Automation Agent (2026)
What is UI-TARS?
UI-TARS (Task Automation and Reasoning System) represents a fundamental shift in how AI agents interact with computers. Developed by ByteDance with researchers at Tsinghua University, it is one of the first open-source GUI agents to match or exceed cloud-based alternatives like Claude Computer Use and OpenAI Operator.
The key innovation: pure vision-based perception. Unlike traditional automation tools that parse HTML, inspect accessibility trees, or require element IDs, UI-TARS processes raw screenshots as its sole input. It sees your screen exactly as you do, then generates human-like mouse clicks, keyboard inputs, and gestures.
Released under the Apache 2.0 license in January 2025, UI-TARS includes:
- UI-TARS Model: Vision-language models in 2B, 7B, and 72B parameter sizes
- UI-TARS Desktop: Electron-based app for natural language computer control
- Agent TARS: Multimodal stack for browser and terminal automation
- UI-TARS SDK: Cross-platform toolkit for building custom automation
This guide covers local deployment, hardware requirements, benchmarks, and practical integration: everything you need to run AI-powered desktop automation without cloud dependencies.
Why Local GUI Automation Matters
Cloud-based computer use agents like Claude Computer Use and OpenAI Operator offer impressive capabilities, but they come with significant trade-offs:
Privacy concerns: Every screenshot is sent to remote servers for processing. For enterprise users handling sensitive data, this is often a dealbreaker.
Per-interaction costs: API pricing adds up quickly for high-volume automation. A workflow running thousands of interactions per day becomes expensive.
Latency: Round-trip to cloud servers adds 500ms-2s per action. For real-time automation, this is too slow.
Dependency: Your automation workflows break when the API is down or rate-limited.
UI-TARS solves these problems by running entirely on your local hardware. Once the model is loaded, you have zero cloud dependency, zero per-action costs, and complete data sovereignty.
How UI-TARS Works
Vision-Based Perception
UI-TARS uses a 675M-parameter Vision Transformer (from Qwen2-VL) to analyze screenshots. The architecture includes:
- Screenshot Capture: System captures the current screen state as an image
- Visual Encoding: The vision encoder processes the image, understanding UI elements, layouts, text, icons, and spatial relationships
- Multimodal Fusion: Visual features are combined with text instructions using M-RoPE (Multimodal Rotary Position Embedding)
- Action Prediction: The language model outputs specific actions with coordinates
- Execution: Actions are performed via PyAutoGUI (Python) or NutJS (Node.js)
- Feedback Loop: The next screenshot captures the result, enabling iterative task completion
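Stripped of model details, this perceive-act cycle is just a loop. A minimal sketch of the control flow, with callables standing in for the real capture/model/executor components (the `finished()` completion marker here is illustrative, not the actual SDK API):

```python
from typing import Callable

def agent_loop(capture: Callable[[], bytes],
               predict: Callable[[bytes, str], str],
               execute: Callable[[str], None],
               instruction: str,
               max_steps: int = 10) -> int:
    """Iterate the screenshot -> predict -> act cycle until the model
    reports completion. Returns the number of steps taken."""
    for step in range(1, max_steps + 1):
        screenshot = capture()                     # 1. capture screen state
        action = predict(screenshot, instruction)  # 2-4. encode, fuse, predict
        if action == "finished()":
            return step                            # goal state reached
        execute(action)                            # 5. perform the action
        # 6. next iteration re-captures the screen (feedback loop)
    return max_steps
```

Because the components are injected, the same loop structure works whether the model runs locally via Transformers or behind an OpenAI-compatible server.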
Unified Action Space
UI-TARS standardizes actions across platforms:
```python
# Desktop operations
click(x, y)                # Single/double/right clicks at coordinates
type(text)                 # Keyboard text input
hotkey(keys)               # Keyboard shortcuts (Ctrl+C, Alt+Tab)
scroll(direction, amount)  # Vertical/horizontal scrolling
drag(x1, y1, x2, y2)       # Drag-and-drop operations

# Mobile operations (Android/iOS)
long_press(x, y)           # Extended touch
swipe(direction)           # Touch gestures
open_app(name)             # Application launching
press_home()               # System navigation
press_back()               # Back button
```
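Executing these actions means turning a call-style string back into a dispatchable name plus arguments. The official `ui-tars` Python package ships its own parser; a simplified illustration of the idea (assumes no commas inside quoted text):

```python
import re

_CALL = re.compile(r"^(\w+)\((.*)\)$", re.DOTALL)

def parse_action(action: str):
    """Parse a call-style action string such as 'click(120, 340)' into
    (name, args). Quoted arguments become strings, bare numbers ints."""
    m = _CALL.match(action.strip())
    if not m:
        raise ValueError(f"unrecognized action: {action!r}")
    name, raw = m.group(1), m.group(2).strip()
    args = []
    if raw:
        for part in raw.split(","):
            part = part.strip()
            if part[:1] in "'\"" and part[-1:] == part[:1]:
                args.append(part[1:-1])      # quoted string argument
            elif part.lstrip("-").isdigit():
                args.append(int(part))       # integer coordinate
            else:
                args.append(part)            # bare keyword like up/down
    return name, args
```

The (name, args) pair can then be dispatched to PyAutoGUI, NutJS, or a mobile driver, which is what makes the action space portable across platforms.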
System-2 Reasoning
UI-TARS 1.5+ incorporates deliberate, slow thinking for complex decisions:
- Explicit reasoning chains before each action
- Error anticipation and prevention
- Alternative strategy consideration when stuck
- Goal-state verification to confirm task completion
This "reflection thinking" enables self-correction. When an action fails, UI-TARS analyzes what went wrong and adjusts its approach, something simple automation scripts can't do.
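In practice the model emits its reasoning before the action, typically in a `Thought: ... Action: ...` layout; separating the two lets you log the reasoning while executing only the action. A small sketch (the exact template is a model-card detail, so treat the marker strings as assumptions):

```python
def split_thought_action(response: str):
    """Split a 'Thought: ... Action: ...' style response into its
    reasoning and action parts. Returns (thought, action); thought is
    empty if the model skipped explicit reasoning."""
    marker = "Action:"
    if marker not in response:
        return "", response.strip()
    thought, _, action = response.partition(marker)
    thought = thought.replace("Thought:", "", 1).strip()
    return thought, action.strip()
```

Keeping the thought text in your logs is also the easiest way to debug why the agent chose a surprising action.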
Model Versions and Capabilities
UI-TARS (January 2025)
The original release with three model sizes:
| Model | Parameters | VRAM (FP16) | OSWorld Score |
|---|---|---|---|
| UI-TARS-2B | 2 billion | ~4GB | - |
| UI-TARS-7B | 7 billion | ~14GB | 24.6% |
| UI-TARS-72B | 72 billion | ~144GB | - |
UI-TARS 1.5 (April 2025)
Enhanced with reinforcement learning-enabled reasoning:
- Chain-of-thought before action execution
- Improved inference-time scaling
- Better multi-step task handling
| Model | Parameters | OSWorld | ScreenSpot-V2 |
|---|---|---|---|
| UI-TARS-1.5-7B | 7 billion | 27.5% | 94.2% |
| UI-TARS-1.5-72B | 72 billion | 42.5% | 94.2% |
UI-TARS 2 (September 2025)
Major architectural upgrade:
- 532M vision encoder + 23B active parameters (230B total MoE)
- Multi-turn reinforcement learning framework
- Hybrid environment integrating file systems and terminals
- Game-playing capabilities (60% of human performance)
| Benchmark | UI-TARS 2 |
|---|---|
| OSWorld | 47.5% |
| AndroidWorld | 73.3% |
| WindowsAgentArena | 50.6% |
| SWE-Bench | 68.7% |
| Terminal Bench | 45.3% |
Benchmark Comparison
vs. Claude Computer Use
| Benchmark | UI-TARS 2 | Claude Computer Use |
|---|---|---|
| OSWorld (50 steps) | 47.5% | 22.0% |
| AndroidWorld | 73.3% | ~35% |
| WebVoyager | 84.8% | 56% |
Key insight: UI-TARS more than doubles Claude's performance on complex desktop tasks. The gap is even larger on mobile automation.
vs. OpenAI Operator
| Benchmark | UI-TARS 2 | OpenAI Operator |
|---|---|---|
| OSWorld | 47.5% | 38.1% |
| WebVoyager | 84.8% | 87% |
OpenAI Operator edges ahead on web-focused tasks, but UI-TARS leads on general desktop automation.
Grounding Accuracy
Coordinate prediction is critical for GUI automation. UI-TARS achieves:
- ScreenSpot-V2: 94.2% accuracy
- ScreenSpotPro: 61.6% (full model)
This means when UI-TARS decides to click a button, it correctly identifies the pixel coordinates over 94% of the time.
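Grounding output is only useful if the coordinates land in the right pixel space. Some checkpoints predict coordinates on a normalized grid (commonly 0-1000) rather than absolute screen pixels; which convention a given release uses is a model-card detail you should verify. A sketch of the mapping, assuming normalized output:

```python
def denormalize(x: int, y: int, screen_w: int, screen_h: int,
                grid: int = 1000) -> tuple[int, int]:
    """Map model coordinates on a normalized grid (0..grid) onto real
    screen pixels."""
    return round(x * screen_w / grid), round(y * screen_h / grid)
```

Clicking with un-converted coordinates is a common source of "94% grounding accuracy but every click misses" confusion.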
Hardware Requirements
Minimum Setup (7B Model)
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3080 16GB or RTX 4070 Ti Super |
| VRAM | 16GB minimum |
| RAM | 16GB system memory |
| Storage | 20GB for model files |
| CUDA | 11.8 or higher |
| OS | Linux or macOS (Windows in development) |
Recommended Setup (7B Model)
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3090 or RTX 4090 |
| VRAM | 24GB |
| RAM | 32GB system memory |
| Storage | SSD with 50GB free |
VRAM by Quantization
For limited VRAM, use quantized models:
| Model | FP16 | INT8 | INT4 (Q4_K) |
|---|---|---|---|
| UI-TARS-2B | ~4GB | ~2GB | ~1GB |
| UI-TARS-7B | ~14GB | ~7GB | ~4GB |
| UI-TARS-72B | ~144GB | ~72GB | ~36GB |
Practical guidance:
- 8GB VRAM (RTX 4060): Use 7B Q4_K quantization
- 12GB VRAM (RTX 4070): Use 7B Q6_K or INT8
- 24GB VRAM (RTX 4090): Use 7B FP16 (full precision)
- 48GB+ VRAM: Consider 72B quantized
For the 72B model at full precision, you need A100/H100 clusters or cloud GPU providers like RunPod, Lambda Labs, or Vast.ai.
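The VRAM columns above follow directly from parameter count times bits per weight. A quick estimator (weights only; budget roughly 20% extra headroom for activations and KV cache):

```python
def model_vram_gb(params_billion: float, bits: int) -> float:
    """Weights-only VRAM in GB: parameters x bits, divided by
    8 bits per byte. Matches the quantization table above."""
    return params_billion * bits / 8
```

For example, `model_vram_gb(7, 16)` gives the ~14 GB figure for UI-TARS-7B at FP16, and `model_vram_gb(72, 4)` the ~36 GB for 72B at INT4.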
Installation Guide
Method 1: UI-TARS Desktop (Easiest)
The Electron-based desktop application provides a user-friendly interface:
```bash
# Quick start with npx (no install needed)
npx @agent-tars/cli@latest

# Or install globally
npm install @agent-tars/cli@latest -g
agent-tars
```
Configuration with cloud provider:
```bash
agent-tars \
  --provider anthropic \
  --model claude-3-7-sonnet-latest \
  --apiKey sk-ant-YOUR_KEY
```
Configuration with local model:
```bash
agent-tars \
  --provider local \
  --model UI-TARS-1.5-7B \
  --endpoint http://localhost:8000
```
Method 2: vLLM Server (Production)
For maximum performance, run UI-TARS through vLLM:
```bash
# Install vLLM
pip install vllm

# Download model
pip install huggingface_hub
huggingface-cli download ByteDance-Seed/UI-TARS-1.5-7B

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model ByteDance-Seed/UI-TARS-1.5-7B \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --port 8000
```
The server exposes an OpenAI-compatible API at http://localhost:8000/v1.
Method 3: Transformers Direct
For custom integrations:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
import pyautogui

# Load model
processor = AutoProcessor.from_pretrained("ByteDance-Seed/UI-TARS-1.5-7B")
model = AutoModelForVision2Seq.from_pretrained(
    "ByteDance-Seed/UI-TARS-1.5-7B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Capture screenshot
screenshot = pyautogui.screenshot()

# Get action
prompt = "Click the Settings button in the top right corner"
inputs = processor(images=screenshot, text=prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=512)
action = processor.decode(outputs[0], skip_special_tokens=True)
print(action)  # e.g. "click(1450, 50)"
```
Python Integration
OpenAI-Compatible API
With vLLM running, integrate via the standard OpenAI client:
```python
from openai import OpenAI
import pyautogui
import base64
from io import BytesIO

# Connect to local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key"  # Not validated by a local server
)

def screenshot_to_base64():
    screenshot = pyautogui.screenshot()
    buffer = BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

def execute_task(instruction: str):
    screenshot_b64 = screenshot_to_base64()
    response = client.chat.completions.create(
        model="UI-TARS-1.5-7B",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}
                    },
                    {"type": "text", "text": instruction}
                ]
            }
        ],
        max_tokens=512
    )
    return response.choices[0].message.content

# Example usage
action = execute_task("Click the Submit button")
print(action)
```
Action Parsing and Execution
Convert model output to executable code:
```python
import time

from ui_tars import parse_action_to_structure_output, parsing_response_to_pyautogui_code

def execute_action(model_response: str):
    # Parse to structured format
    structured = parse_action_to_structure_output(model_response)
    # Generate PyAutoGUI code
    code = parsing_response_to_pyautogui_code(model_response)
    # Execute (with safety check)
    if "pyautogui" in code:
        exec(code)
    return structured

# Multi-step automation loop
def run_automation(task: str, max_steps: int = 10):
    for step in range(max_steps):
        screenshot_b64 = screenshot_to_base64()
        response = client.chat.completions.create(
            model="UI-TARS-1.5-7B",
            messages=[
                {
                    "role": "system",
                    "content": f"Complete this task: {task}. If done, respond with 'TASK_COMPLETE'."
                },
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
                        {"type": "text", "text": "What action should I take next?"}
                    ]
                }
            ]
        )
        action = response.choices[0].message.content
        if "TASK_COMPLETE" in action:
            print("Task completed successfully!")
            break
        execute_action(action)
        time.sleep(0.5)  # Wait for UI to update
```
JavaScript/TypeScript SDK
For Node.js applications:
```typescript
import { GUIAgent } from '@ui-tars/sdk';
import { NutJSOperator } from '@ui-tars/operator-nut-js';

const guiAgent = new GUIAgent({
  model: {
    baseURL: 'http://localhost:8000/v1',
    apiKey: 'your-api-key',
    model: 'UI-TARS-1.5-7B',
  },
  operator: new NutJSOperator(),
  onData: ({ data }) => console.log('Action:', data),
  onError: ({ error }) => console.error('Error:', error),
});

// Single task
await guiAgent.run('Open Chrome and search for weather');

// Multi-step with callback
await guiAgent.run('Fill out the contact form with test data', {
  onStep: (step) => console.log(`Step ${step.number}: ${step.action}`),
  onComplete: () => console.log('Form submitted!'),
});
```
Use Cases
Automated Testing
Transform testing with natural language:
```python
# Traditional approach
driver.find_element(By.CSS_SELECTOR, "button.submit-form").click()
time.sleep(2)
driver.find_element(By.ID, "email").send_keys("test@example.com")

# UI-TARS approach
agent.run("Click the Submit button on the registration form")
agent.run("Enter test@example.com in the email field")
```
Benefits:
- Anyone familiar with the product can write tests
- Tests adapt to UI changes automatically
- Cross-platform coverage from single descriptions
- No brittle CSS selectors or XPath
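Plugging this into a test runner is straightforward: assuming a helper in the style of `run_automation` above that raises on failure, a natural-language test is just a sequence of steps (names here are illustrative, not an official API):

```python
def make_nl_test(agent_run, steps):
    """Build a zero-argument test function from a list of natural
    language steps; any step failure propagates as a test failure."""
    def test():
        for step in steps:
            agent_run(step)  # expected to raise if the step fails
    return test

# Hypothetical usage with pytest: assign the result to a test_* name.
test_registration = make_nl_test(
    agent_run=print,  # stand-in; replace with your agent runner
    steps=[
        "Open the registration form",
        "Enter test@example.com in the email field",
        "Click the Submit button on the registration form",
    ],
)
```

Because the steps are plain strings, the same suite can run against a redesigned UI without touching a single selector.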
Robotic Process Automation (RPA)
Automate repetitive business processes:
```python
# Invoice processing workflow
tasks = [
    "Open Outlook and find the latest invoice email",
    "Download the PDF attachment",
    "Open QuickBooks",
    "Create a new expense entry",
    "Fill in the vendor name from the invoice",
    "Enter the amount and date",
    "Save the expense",
]

for task in tasks:
    run_automation(task)
```
Advantages over traditional RPA:
- Adapts to layout changes without script updates
- Handles unexpected dialogs and popups
- Works with any application (no connectors needed)
- Understands context to make intelligent decisions
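Real desktop applications throw transient failures (slow loads, surprise dialogs), so production RPA usually wraps each step in retries. A generic sketch; the backoff values are arbitrary defaults, not a recommendation from the UI-TARS project:

```python
import time

def with_retries(step_fn, task: str, attempts: int = 3,
                 backoff: float = 2.0) -> None:
    """Run one automation step, retrying with exponential backoff.
    Re-raises the last error once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            step_fn(task)
            return
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))  # 2s, 4s, 8s, ...
```

Combined with the agent's own reflection thinking, a retry often succeeds because the second attempt starts from a fresh screenshot of whatever dialog derailed the first.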
Data Collection
Navigate complex websites:
```python
# Extract competitor pricing
agent.run("Open the competitor's website")
agent.run("Navigate to the pricing page")
agent.run("Click on the Enterprise plan details")
# Screenshot analysis extracts visible prices
```
Accessibility Enhancement
Voice-controlled computer operation:
```python
import speech_recognition as sr

def voice_to_action():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    command = recognizer.recognize_whisper(audio)
    agent.run(command)

# "Open my email"              -> UI-TARS finds and opens mail app
# "Reply to the first message" -> Clicks reply on first email
```
Security Considerations
Local vs. Cloud Security Model
| Aspect | UI-TARS (Local) | Cloud Agents |
|---|---|---|
| Data transmission | None | Screenshots to servers |
| Processing | Local GPU | Cloud infrastructure |
| Audit trail | Local logs | Provider may retain |
| Compliance | Full control | Depends on provider |
Risk Mitigation
UI-TARS runs with full system access. Protect yourself:
1. Virtual Machine Isolation
```bash
# Run in VM for sensitive tasks:
# use VirtualBox, VMware, or Parallels,
# and snapshot before automation runs.
```
2. Limited User Account
```bash
# Create dedicated automation user
sudo useradd -m ui-tars-agent
# Grant only necessary permissions
```
3. Confirmation Prompts
```python
def execute_with_confirmation(action: str):
    print(f"Agent wants to: {action}")
    if input("Approve? (y/n): ").lower() == 'y':
        execute_action(action)
    else:
        print("Action cancelled")
```
4. Action Logging
```python
import logging

# level=INFO is required: the default (WARNING) would drop info logs
logging.basicConfig(filename='ui_tars_actions.log', level=logging.INFO)

def logged_execute(action: str):
    logging.info(f"Executing: {action}")
    try:
        result = execute_action(action)
        logging.info(f"Success: {result}")
    except Exception as e:
        logging.error(f"Failed: {e}")
```
5. Timeout Limits
```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(task: str, timeout: int = 60):
    with ThreadPoolExecutor() as executor:
        future = executor.submit(run_automation, task)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            # Note: this stops waiting, but the worker thread itself
            # keeps running; for a hard stop, run the agent in a
            # separate process you can terminate.
            print("Task timed out - stopping agent")
            return None
```
Performance Optimization
Quantization for Consumer Hardware
```bash
# Download GGUF quantized model
huggingface-cli download bartowski/UI-TARS-7B-DPO-GGUF \
  --include "UI-TARS-7B-DPO-Q4_K_M.gguf"

# Run with llama.cpp
./llama-server \
  -m UI-TARS-7B-DPO-Q4_K_M.gguf \
  -c 4096 \
  --port 8000
```
Quantization trade-offs (UI-TARS 2):
- W4A8: token speed 29.6 → 47 tokens/sec
- Latency: 4.0 → 2.5 seconds per interaction
- Accuracy: 47.5% → 44.4% on OSWorld (a drop of ~3 percentage points)
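These figures are easy to sanity-check: decode-side latency per action is roughly the number of tokens generated divided by decode speed (prefill, i.e. image encoding, adds more on top, which is why measured totals run higher):

```python
def decode_latency_s(tokens_per_action: int, tokens_per_sec: float) -> float:
    """Decode-side latency per action in seconds, ignoring prefill."""
    return round(tokens_per_action / tokens_per_sec, 2)
```

At an assumed ~118 decode tokens per action, 29.6 tok/s gives ~4.0 s and 47 tok/s gives ~2.5 s, consistent with the reported latency drop.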
vLLM Optimization Flags
```bash
# --gpu-memory-utilization 0.95  -> use more VRAM
# --max-model-len 4096           -> limit context
# --enable-prefix-caching        -> cache repeated prompts
# (comments must not follow a trailing backslash, so they go up here)
python -m vllm.entrypoints.openai.api_server \
  --model ByteDance-Seed/UI-TARS-1.5-7B \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --port 8000
```
Reduce Screenshot Resolution
```python
import pyautogui
from PIL import Image

def optimized_screenshot():
    screenshot = pyautogui.screenshot()
    # Resize to 1280x720 for faster processing
    return screenshot.resize((1280, 720), Image.LANCZOS)
```
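One gotcha with downscaling: the model's predicted coordinates now live in the resized image's space, so they must be scaled back before clicking. A sketch, assuming an aspect-preserving resize (the default sizes below are illustrative):

```python
def to_screen_coords(x: int, y: int,
                     model_size=(1280, 720),
                     screen_size=(2560, 1440)) -> tuple[int, int]:
    """Map a coordinate predicted on the resized screenshot back to
    the physical screen."""
    sx = screen_size[0] / model_size[0]
    sy = screen_size[1] / model_size[1]
    return round(x * sx), round(y * sy)
```

Skipping this step makes every click land in the top-left quadrant of the screen, a classic symptom worth knowing.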
Limitations and Challenges
Current Gaps
- Complex Desktop Tasks: Best models achieve <50% on OSWorld (humans: 72.4%)
- Drag-and-Drop: Remains challenging for precise operations
- Professional Software: CAD, video editing, and specialized apps poorly understood
- Fine-Grained Spatial Reasoning: Small UI elements can be missed
Hallucination Issues
UI-TARS may confidently describe incorrect UI elements. Mitigation:
```python
def verify_element_exists(action: str, screenshot) -> bool:
    # Sketch: re-query the model with an explicit verification prompt
    # before executing. query_model is a placeholder for whichever
    # inference call you use (Transformers, vLLM, etc.).
    verification = query_model(
        image=screenshot,
        prompt=f"Is there a UI element that matches: {action}? Answer YES or NO."
    )
    return "YES" in verification.upper()
```
Latency Considerations
- First action: 2-4 seconds (model loading, initial processing)
- Subsequent actions: 1-2 seconds each
- For real-time automation, consider smaller models or edge deployment
UI-TARS vs Alternatives
Feature Comparison
| Feature | UI-TARS | Claude Computer Use | OpenAI Operator |
|---|---|---|---|
| Open Source | Yes (Apache 2.0) | No | No |
| Local Deployment | Yes | No (API only) | No (Cloud) |
| Desktop Support | Full | Full | Limited |
| Mobile Support | Yes | Limited | No |
| Browser Automation | Yes | Yes | Primary |
| Game Playing | Yes | No | No |
| Offline Mode | Yes | No | No |
| Cost | Free (self-hosted) | API pricing | $200/mo Pro |
When to Choose UI-TARS
- Privacy: All processing stays local
- Cost: No per-interaction charges
- Offline: Works without internet
- Mobile: Need Android/iOS automation
- Games: Automating game interactions
- Open Source: Want to modify or extend
When to Choose Alternatives
- Claude Computer Use: Better coding task performance, simpler API
- OpenAI Operator: Best UX for web tasks, managed infrastructure
- Traditional RPA (UiPath): Enterprise support, compliance certifications
Next Steps
- Set up Ollama for local model management
- Compare AI coding agents for development automation
- Check VRAM requirements for your hardware
- Build custom AI agents with frameworks
- Explore MCP integration for tool connectivity
Key Takeaways
- UI-TARS is the leading open-source GUI agent: 47.5% on OSWorld, more than double Claude's score
- Pure vision-based approach works with any application, no HTML parsing needed
- 7B model runs on RTX 3080+ with 16GB VRAM; quantized versions work on less
- Completely local and offline: zero cloud dependency after model download
- Apache 2.0 license enables commercial use without restrictions
- Multi-platform support for desktop, mobile, web, and games
- Security requires attention: run in VMs or limited accounts for sensitive tasks
UI-TARS represents the future of local AI automation. By processing pure screenshots instead of structured data, it works with any interface: legacy software, custom applications, games, or mobile apps. The 7B model offers accessible performance on consumer GPUs, while larger variants push the boundaries of what's possible. For anyone serious about AI-powered automation without cloud dependencies, UI-TARS is the tool to learn.