
UI-TARS Desktop: Local GUI Automation Agent (2026)

February 6, 2026
Local AI Master Research Team
šŸŽ 4 PDFs included
Newsletter


Quick Facts: UI-TARS Desktop

Performance
āœ“ 47.5% OSWorld (2x Claude)
āœ“ 73.3% AndroidWorld
āœ“ 94.2% coordinate accuracy
Requirements
āœ“ 7B: 16GB+ VRAM
āœ“ Quantized: 4-8GB VRAM
āœ“ macOS, Linux (Windows WIP)
Key Features
āœ“ Pure vision—no HTML parsing
āœ“ Fully open-source (Apache 2.0)
āœ“ Offline capable

What is UI-TARS?

UI-TARS (Task Automation and Reasoning System) represents a fundamental shift in how AI agents interact with computers. Developed by ByteDance and researchers at Tsinghua University, UI-TARS is the first open-source GUI agent that matches or exceeds cloud-based alternatives like Claude Computer Use and OpenAI Operator.

The key innovation: pure vision-based perception. Unlike traditional automation tools that parse HTML, inspect accessibility trees, or require element IDs, UI-TARS processes raw screenshots as its sole input. It sees your screen exactly as you do, then generates human-like mouse clicks, keyboard inputs, and gestures.

Released under the Apache 2.0 license in January 2025, UI-TARS includes:

  • UI-TARS Model: Vision-language models in 2B, 7B, and 72B parameter sizes
  • UI-TARS Desktop: Electron-based app for natural language computer control
  • Agent TARS: Multimodal stack for browser and terminal automation
  • UI-TARS SDK: Cross-platform toolkit for building custom automation

This guide covers local deployment, hardware requirements, benchmarks, and practical integration—everything you need to run AI-powered desktop automation without cloud dependencies.


Why Local GUI Automation Matters

Cloud-based computer use agents like Claude Computer Use and OpenAI Operator offer impressive capabilities, but they come with significant trade-offs:

Privacy concerns: Every screenshot is sent to remote servers for processing. For enterprise users handling sensitive data, this is often a dealbreaker.

Per-interaction costs: API pricing adds up quickly for high-volume automation. A workflow running thousands of interactions per day becomes expensive.

Latency: Round-trip to cloud servers adds 500ms-2s per action. For real-time automation, this is too slow.

Dependency: Your automation workflows break when the API is down or rate-limited.

UI-TARS solves these problems by running entirely on your local hardware. Once the model is loaded, you have zero cloud dependency, zero per-action costs, and complete data sovereignty.


How UI-TARS Works

Vision-Based Perception

UI-TARS uses a 675M-parameter Vision Transformer (from Qwen2-VL) to analyze screenshots. The architecture includes:

  1. Screenshot Capture: System captures the current screen state as an image
  2. Visual Encoding: The vision encoder processes the image, understanding UI elements, layouts, text, icons, and spatial relationships
  3. Multimodal Fusion: Visual features are combined with text instructions using M-RoPE (Multimodal Rotary Position Embedding)
  4. Action Prediction: The language model outputs specific actions with coordinates
  5. Execution: Actions are performed via PyAutoGUI (Python) or NutJS (Node.js)
  6. Feedback Loop: The next screenshot captures the result, enabling iterative task completion
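The six steps above amount to a perceive-act loop. A minimal, library-agnostic sketch of that control flow (the capture, predict, and execute callables are stand-ins for the real screenshot, inference, and PyAutoGUI layers; the `finished()` sentinel is illustrative):

```python
def run_perceive_act_loop(capture, predict, execute, max_steps=10):
    """Generic perceive-act loop: screenshot -> model -> action -> repeat.
    capture() returns the current screen image; predict(image, history)
    returns an action string; execute(action) performs it."""
    history = []
    for _ in range(max_steps):
        image = capture()                 # steps 1-2: grab and encode the screen
        action = predict(image, history)  # steps 3-4: fuse instruction + image, predict
        if action == "finished()":        # step 6: goal-state check
            break
        execute(action)                   # step 5: perform the click/type/etc.
        history.append(action)            # feedback for the next turn
    return history
```

With stub callables this runs anywhere; in practice capture wraps pyautogui.screenshot() and execute dispatches to PyAutoGUI or NutJS.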

Unified Action Space

UI-TARS standardizes actions across platforms:

# Desktop operations
click(x, y)           # Single/double/right clicks at coordinates
type(text)            # Keyboard text input
hotkey(keys)          # Keyboard shortcuts (Ctrl+C, Alt+Tab)
scroll(direction, amount)  # Vertical/horizontal scrolling
drag(x1, y1, x2, y2)  # Drag-and-drop operations

# Mobile operations (Android/iOS)
long_press(x, y)      # Extended touch
swipe(direction)      # Touch gestures
open_app(name)        # Application launching
press_home()          # System navigation
press_back()          # Back button
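A sketch of how such action strings might be parsed and dispatched locally, assuming the simple name(args) syntax shown above (real output formats vary by model version, so treat this as illustrative rather than the official parser):

```python
import re

ACTION_RE = re.compile(r"^(\w+)\((.*)\)$", re.S)

def parse_action(action: str):
    """Split 'click(120, 45)' into ('click', [120, 45]). Quoted string
    arguments are unwrapped; this naive comma split would break on
    commas inside quoted text."""
    m = ACTION_RE.match(action.strip())
    if not m:
        raise ValueError(f"unrecognized action: {action!r}")
    name, raw = m.groups()
    args = []
    for part in raw.split(",") if raw.strip() else []:
        part = part.strip().strip("'\"")
        args.append(int(part) if part.lstrip("-").isdigit() else part)
    return name, args

def dispatch(action: str):
    """Execute a parsed action via PyAutoGUI (imported lazily so parsing
    stays testable on headless machines)."""
    import pyautogui
    handlers = {
        "click": pyautogui.click,
        "type": pyautogui.write,
        "hotkey": pyautogui.hotkey,
        "scroll": lambda direction, amount: pyautogui.scroll(
            amount if direction == "up" else -amount),
    }
    name, args = parse_action(action)
    handlers[name](*args)
```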

System-2 Reasoning

UI-TARS 1.5+ incorporates deliberate, slow thinking for complex decisions:

  • Explicit reasoning chains before each action
  • Error anticipation and prevention
  • Alternative strategy consideration when stuck
  • Goal-state verification to confirm task completion

This "reflection thinking" enables self-correction. When an action fails, UI-TARS analyzes what went wrong and adjusts its approach—something simple automation scripts can't do.
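If the model emits its reasoning inline (a "Thought: ... Action: ..." layout is one common convention, assumed here for illustration), the two parts can be separated before execution, so the reasoning gets logged while only the action runs:

```python
import re

def split_reasoning(response: str):
    """Separate an explicit reasoning chain from the final action.
    The 'Thought:'/'Action:' labels are an assumed convention; check the
    actual output format of the UI-TARS version you deploy."""
    thought = re.search(r"Thought:\s*(.*?)(?=\nAction:|\Z)", response, re.S)
    action = re.search(r"Action:\s*(.+)", response)
    return (
        thought.group(1).strip() if thought else "",
        action.group(1).strip() if action else "",
    )
```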


Model Versions and Capabilities

UI-TARS (January 2025)

The original release with three model sizes:

Model | Parameters | VRAM (FP16) | OSWorld Score
UI-TARS-2B | 2 billion | ~4GB | -
UI-TARS-7B | 7 billion | ~14GB | 24.6%
UI-TARS-72B | 72 billion | ~144GB | -

UI-TARS 1.5 (April 2025)

Enhanced with reinforcement learning-enabled reasoning:

  • Chain-of-thought before action execution
  • Improved inference-time scaling
  • Better multi-step task handling
Model | Parameters | OSWorld | ScreenSpot-V2
UI-TARS-1.5-7B | 7 billion | 27.5% | 94.2%
UI-TARS-1.5-72B | 72 billion | 42.5% | 94.2%

UI-TARS 2 (September 2025)

Major architectural upgrade:

  • 532M vision encoder + 23B active parameters (230B total MoE)
  • Multi-turn reinforcement learning framework
  • Hybrid environment integrating file systems and terminals
  • Game-playing capabilities (60% of human performance)
Benchmark | UI-TARS 2
OSWorld | 47.5%
AndroidWorld | 73.3%
WindowsAgentArena | 50.6%
SWE-Bench | 68.7%
Terminal Bench | 45.3%

Benchmark Comparison

vs. Claude Computer Use

Benchmark | UI-TARS 2 | Claude Computer Use
OSWorld (50 steps) | 47.5% | 22.0%
AndroidWorld | 73.3% | ~35%
WebVoyager | 84.8% | 56%

Key insight: UI-TARS more than doubles Claude's performance on complex desktop tasks. The gap is even larger on mobile automation.

vs. OpenAI Operator

Benchmark | UI-TARS 2 | OpenAI Operator
OSWorld | 47.5% | 38.1%
WebVoyager | 84.8% | 87%

OpenAI Operator edges ahead on web-focused tasks, but UI-TARS leads on general desktop automation.

Grounding Accuracy

Coordinate prediction is critical for GUI automation. UI-TARS achieves:

  • ScreenSpot-V2: 94.2% accuracy
  • ScreenSpotPro: 61.6% (full model)

This means when UI-TARS decides to click a button, it correctly identifies the pixel coordinates over 94% of the time.
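One practical detail behind these numbers: if the screenshot is resized before inference, predicted coordinates live in the resized image's space and must be mapped back to physical pixels before clicking. A minimal helper (the function name is ours, not part of the UI-TARS API):

```python
def to_screen_coords(x, y, model_size, screen_size):
    """Map a coordinate predicted on a resized screenshot back to the
    physical screen. model_size is the (width, height) the image was
    resized to before inference; screen_size is the real resolution."""
    mw, mh = model_size
    sw, sh = screen_size
    return round(x * sw / mw), round(y * sh / mh)
```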


Hardware Requirements

Minimum Setup (7B Model)

Component | Specification
GPU | NVIDIA RTX 3080 16GB or RTX 4070 Ti Super
VRAM | 16GB minimum
RAM | 16GB system memory
Storage | 20GB for model files
CUDA | 11.8 or higher
OS | Linux or macOS (Windows in development)

Recommended Setup

Component | Specification
GPU | NVIDIA RTX 3090 or RTX 4090
VRAM | 24GB
RAM | 32GB system memory
Storage | SSD with 50GB free

VRAM by Quantization

For limited VRAM, use quantized models:

Model | FP16 | INT8 | INT4 (Q4_K)
UI-TARS-2B | ~4GB | ~2GB | ~1GB
UI-TARS-7B | ~14GB | ~7GB | ~4GB
UI-TARS-72B | ~144GB | ~72GB | ~36GB

Practical guidance:

  • 8GB VRAM (RTX 4060): Use 7B Q4_K quantization
  • 12GB VRAM (RTX 4070): Use 7B Q6_K or INT8
  • 24GB VRAM (RTX 4090): Use 7B FP16 (full precision)
  • 48GB+ VRAM: Consider 72B quantized

For the 72B model at full precision, you need A100/H100 clusters or cloud GPU providers like RunPod, Lambda Labs, or Vast.ai.
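The weight footprints in the table follow directly from parameter count times bytes per weight; a rough back-of-envelope helper (weights only; activations, KV cache, and framework overhead add a further margin, which is why the guidance above leaves headroom):

```python
def estimate_weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate VRAM for model weights alone: parameters x bytes per weight.
    Runtime overhead (activations, KV cache) is not included."""
    return params_billions * bits_per_weight / 8

# e.g. the 7B model: 14 GB at FP16, 3.5 GB (~4 GB in practice) at INT4
```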


Installation Guide

Method 1: UI-TARS Desktop (Easiest)

The Electron-based desktop application provides a user-friendly interface:

# Quick start with npx (no install needed)
npx @agent-tars/cli@latest

# Or install globally
npm install @agent-tars/cli@latest -g
agent-tars

Configuration with cloud provider:

agent-tars \
  --provider anthropic \
  --model claude-3-7-sonnet-latest \
  --apiKey sk-ant-YOUR_KEY

Configuration with local model:

agent-tars \
  --provider local \
  --model UI-TARS-1.5-7B \
  --endpoint http://localhost:8000

Method 2: vLLM Server (Production)

For maximum performance, run UI-TARS through vLLM:

# Install vLLM
pip install vllm

# Download model
pip install huggingface_hub
huggingface-cli download ByteDance-Seed/UI-TARS-1.5-7B

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model ByteDance-Seed/UI-TARS-1.5-7B \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --port 8000

The server exposes an OpenAI-compatible API at http://localhost:8000/v1.

Method 3: Transformers Direct

For custom integrations:

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
from PIL import Image

# Load model
processor = AutoProcessor.from_pretrained(
    "ByteDance-Seed/UI-TARS-1.5-7B", trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "ByteDance-Seed/UI-TARS-1.5-7B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Capture screenshot
import pyautogui
screenshot = pyautogui.screenshot()

# Get action
prompt = "Click the Settings button in the top right corner"
inputs = processor(images=screenshot, text=prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=512)
action = processor.decode(outputs[0], skip_special_tokens=True)
print(action)  # "click(1450, 50)"

Python Integration

OpenAI-Compatible API

With vLLM running, integrate via the standard OpenAI client:

from openai import OpenAI
import pyautogui
import base64
from io import BytesIO

# Connect to local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key"  # Not needed for local
)

def screenshot_to_base64():
    screenshot = pyautogui.screenshot()
    buffer = BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

def execute_task(instruction: str):
    screenshot_b64 = screenshot_to_base64()

    response = client.chat.completions.create(
        model="UI-TARS-1.5-7B",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}
                    },
                    {"type": "text", "text": instruction}
                ]
            }
        ],
        max_tokens=512
    )

    return response.choices[0].message.content

# Example usage
action = execute_task("Click the Submit button")
print(action)

Action Parsing and Execution

Convert model output to executable code:

from ui_tars import parse_action_to_structure_output, parsing_response_to_pyautogui_code
import pyautogui
import time

def execute_action(model_response: str):
    # Parse to structured format
    structured = parse_action_to_structure_output(model_response)

    # Generate PyAutoGUI code
    code = parsing_response_to_pyautogui_code(model_response)

    # Execute (with safety check)
    if "pyautogui" in code:
        exec(code)

    return structured

# Multi-step automation loop
def run_automation(task: str, max_steps: int = 10):
    for step in range(max_steps):
        screenshot_b64 = screenshot_to_base64()

        response = client.chat.completions.create(
            model="UI-TARS-1.5-7B",
            messages=[
                {
                    "role": "system",
                    "content": f"Complete this task: {task}. If done, respond with 'TASK_COMPLETE'."
                },
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
                        {"type": "text", "text": "What action should I take next?"}
                    ]
                }
            ]
        )

        action = response.choices[0].message.content

        if "TASK_COMPLETE" in action:
            print("Task completed successfully!")
            break

        execute_action(action)
        time.sleep(0.5)  # Wait for UI to update

JavaScript/TypeScript SDK

For Node.js applications:

import { GUIAgent } from '@ui-tars/sdk';
import { NutJSOperator } from '@ui-tars/operator-nut-js';

const guiAgent = new GUIAgent({
  model: {
    baseURL: 'http://localhost:8000/v1',
    apiKey: 'your-api-key',
    model: 'UI-TARS-1.5-7B',
  },
  operator: new NutJSOperator(),
  onData: ({ data }) => console.log('Action:', data),
  onError: ({ error }) => console.error('Error:', error),
});

// Single task
await guiAgent.run('Open Chrome and search for weather');

// Multi-step with callback
await guiAgent.run('Fill out the contact form with test data', {
  onStep: (step) => console.log(`Step ${step.number}: ${step.action}`),
  onComplete: () => console.log('Form submitted!'),
});

Use Cases

Automated Testing

Transform testing with natural language:

# Traditional approach
driver.find_element(By.CSS_SELECTOR, "button.submit-form").click()
time.sleep(2)
driver.find_element(By.ID, "email").send_keys("test@example.com")

# UI-TARS approach
agent.run("Click the Submit button on the registration form")
agent.run("Enter test@example.com in the email field")

Benefits:

  • Anyone familiar with the product can write tests
  • Tests adapt to UI changes automatically
  • Cross-platform coverage from single descriptions
  • No brittle CSS selectors or XPath

Robotic Process Automation (RPA)

Automate repetitive business processes:

# Invoice processing workflow
tasks = [
    "Open Outlook and find the latest invoice email",
    "Download the PDF attachment",
    "Open QuickBooks",
    "Create a new expense entry",
    "Fill in the vendor name from the invoice",
    "Enter the amount and date",
    "Save the expense"
]

for task in tasks:
    run_automation(task)

Advantages over traditional RPA:

  • Adapts to layout changes without script updates
  • Handles unexpected dialogs and popups
  • Works with any application (no connectors needed)
  • Understands context to make intelligent decisions
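Because individual steps can still fail (a slow-loading window, a mispredicted coordinate), production RPA loops usually wrap each task in a retry. A small sketch that could wrap the run_automation helper from earlier (the wrapper itself is ours, not part of any UI-TARS API):

```python
import time

def run_with_retry(run, task: str, attempts: int = 3, base_delay: float = 2.0):
    """Retry a flaky automation step with exponential backoff.
    'run' is any callable taking the task string, e.g. run_automation."""
    for i in range(attempts):
        try:
            return run(task)
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** i))  # 2s, 4s, 8s, ...
```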

Data Collection

Navigate complex websites:

# Extract competitor pricing
agent.run("Open the competitor's website")
agent.run("Navigate to the pricing page")
agent.run("Click on the Enterprise plan details")
# Screenshot analysis extracts visible prices

Accessibility Enhancement

Voice-controlled computer operation:

import speech_recognition as sr

def voice_to_action():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)

    command = recognizer.recognize_whisper(audio)
    agent.run(command)

# "Open my email" → UI-TARS finds and opens mail app
# "Reply to the first message" → Clicks reply on first email

Security Considerations

Local vs. Cloud Security Model

Aspect | UI-TARS (Local) | Cloud Agents
Data transmission | None | Screenshots to servers
Processing | Local GPU | Cloud infrastructure
Audit trail | Local logs | Provider may retain
Compliance | Full control | Depends on provider

Risk Mitigation

UI-TARS runs with full system access. Protect yourself:

1. Virtual Machine Isolation

# Run in VM for sensitive tasks
# Use VirtualBox, VMware, or Parallels
# Snapshot before automation runs

2. Limited User Account

# Create dedicated automation user
sudo useradd -m ui-tars-agent
# Grant only necessary permissions

3. Confirmation Prompts

def execute_with_confirmation(action: str):
    print(f"Agent wants to: {action}")
    if input("Approve? (y/n): ").lower() == 'y':
        execute_action(action)
    else:
        print("Action cancelled")

4. Action Logging

import logging

logging.basicConfig(filename='ui_tars_actions.log', level=logging.INFO)

def logged_execute(action: str):
    logging.info(f"Executing: {action}")
    try:
        result = execute_action(action)
        logging.info(f"Success: {result}")
    except Exception as e:
        logging.error(f"Failed: {e}")

5. Timeout Limits

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(task: str, timeout: int = 60):
    with ThreadPoolExecutor() as executor:
        future = executor.submit(run_automation, task)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            print("Task timed out - stopping agent")
            return None

Performance Optimization

Quantization for Consumer Hardware

# Download GGUF quantized model
huggingface-cli download bartowski/UI-TARS-7B-DPO-GGUF \
  --include "UI-TARS-7B-DPO-Q4_K_M.gguf"

# Run with llama.cpp
./llama-server \
  -m UI-TARS-7B-DPO-Q4_K_M.gguf \
  -c 4096 \
  --port 8000

Quantization trade-offs (UI-TARS 2):

  • W4A8: Token speed 29.6 → 47 tokens/sec
  • Latency: 4.0 → 2.5 seconds per interaction
  • Accuracy: 47.5% → 44.4% on OSWorld (~3% drop)

vLLM Optimization Flags

# 0.95 GPU memory utilization uses more of the card's VRAM; capping the
# context at 4096 tokens saves memory; prefix caching reuses repeated
# prompt prefixes across requests.
python -m vllm.entrypoints.openai.api_server \
    --model ByteDance-Seed/UI-TARS-1.5-7B \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096 \
    --enable-prefix-caching \
    --port 8000

Reduce Screenshot Resolution

import pyautogui
from PIL import Image

def optimized_screenshot():
    screenshot = pyautogui.screenshot()
    # Resize to 1280x720 for faster processing
    screenshot = screenshot.resize((1280, 720), Image.LANCZOS)
    return screenshot
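A fixed 1280x720 resize distorts non-16:9 screens, which shifts the geometry the model sees and can hurt click accuracy. Computing a single aspect-preserving scale factor avoids that, and also gives you the number needed to map predicted coordinates back to full resolution (helper name is ours):

```python
def fit_within(width: int, height: int, max_w: int = 1280, max_h: int = 720):
    """Return (new_width, new_height, scale) that fits the screen inside
    the target box while preserving aspect ratio; never upscales.
    Divide predicted coordinates by 'scale' to get physical pixels."""
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale), scale
```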

Limitations and Challenges

Current Gaps

  1. Complex Desktop Tasks: Best models achieve <50% on OSWorld (humans: 72.4%)
  2. Drag-and-Drop: Remains challenging for precise operations
  3. Professional Software: CAD, video editing, and specialized apps poorly understood
  4. Fine-Grained Spatial Reasoning: Small UI elements can be missed

Hallucination Issues

UI-TARS may confidently describe incorrect UI elements. Mitigation:

def verify_element_exists(action: str, screenshot):
    # Pseudocode: 'model.generate(image=..., prompt=...)' is shorthand;
    # adapt it to your actual inference client (e.g. the OpenAI-compatible
    # API shown earlier).
    verification = model.generate(
        image=screenshot,
        prompt=f"Is there a UI element that matches: {action}? Answer YES or NO."
    )
    return "YES" in verification.upper()

Latency Considerations

  • First action: 2-4 seconds (model loading, initial processing)
  • Subsequent actions: 1-2 seconds each
  • For real-time automation, consider smaller models or edge deployment

UI-TARS vs Alternatives

Feature Comparison

Feature | UI-TARS | Claude Computer Use | OpenAI Operator
Open Source | Yes (Apache 2.0) | No | No
Local Deployment | Yes | No (API only) | No (cloud)
Desktop Support | Full | Full | Limited
Mobile Support | Yes | Limited | No
Browser Automation | Yes | Yes | Primary
Game Playing | Yes | No | No
Offline Mode | Yes | No | No
Cost | Free (self-hosted) | API pricing | $200/mo Pro

When to Choose UI-TARS

  • Privacy: All processing stays local
  • Cost: No per-interaction charges
  • Offline: Works without internet
  • Mobile: Need Android/iOS automation
  • Games: Automating game interactions
  • Open Source: Want to modify or extend

When to Choose Alternatives

  • Claude Computer Use: Better coding task performance, simpler API
  • OpenAI Operator: Best UX for web tasks, managed infrastructure
  • Traditional RPA (UiPath): Enterprise support, compliance certifications

Next Steps

  1. Set up Ollama for local model management
  2. Compare AI coding agents for development automation
  3. Check VRAM requirements for your hardware
  4. Build custom AI agents with frameworks
  5. Explore MCP integration for tool connectivity

Key Takeaways

  1. UI-TARS is the leading open-source GUI agent—47.5% OSWorld, 2x Claude's performance
  2. Pure vision-based approach works with any application, no HTML parsing needed
  3. 7B model runs on RTX 3080+ with 16GB VRAM; quantized versions work on less
  4. Completely local and offline—zero cloud dependency after model download
  5. Apache 2.0 license enables commercial use without restrictions
  6. Multi-platform support for desktop, mobile, web, and games
  7. Security requires attention—run in VMs or limited accounts for sensitive tasks

UI-TARS represents the future of local AI automation. By processing pure screenshots instead of structured data, it works with any interface—legacy software, custom applications, games, or mobile apps. The 7B model offers accessible performance on consumer GPUs, while larger variants push the boundaries of what's possible. For anyone serious about AI-powered automation without cloud dependencies, UI-TARS is the tool to learn.

Written by Pattanaik Ramswarup