
Tabby: Self-Hosted GitHub Copilot Alternative

April 10, 2026
19 min read
Local AI Master Research Team


GitHub Copilot costs $10/month per developer and sends every keystroke to Microsoft's servers. For a 10-person team, that is $1,200/year -- and your proprietary code passes through infrastructure you do not control. Tabby eliminates both problems: it is free, open-source, and runs entirely on your network.

Tabby has 23,000+ GitHub stars, supports VS Code, JetBrains, Vim, and Neovim, and serves real-time code completions from models you choose. I have been running it for my team of five for three months on a single RTX 4090 workstation. Setup took 20 minutes. The completion quality is 85-90% of Copilot for standard coding patterns, and our code never leaves the building.

Here is how to set it up, which model to pick, and how to tune it for your team.


What is Tabby {#what-is-tabby}

Tabby is an open-source AI code completion server built by TabbyML. It provides:

  • Real-time code completion with sub-200ms latency
  • Multi-IDE support: VS Code, JetBrains (IntelliJ, PyCharm, WebStorm), Vim, Neovim
  • Model flexibility: StarCoder2, DeepSeek-Coder, CodeLlama, Qwen2.5-Coder
  • Repository indexing: learns your codebase patterns for better suggestions
  • Admin dashboard: user management, usage analytics, model configuration
  • Enterprise features: LDAP/OAuth auth, audit logging, access controls

The architecture is simple: Tabby runs as a server (standalone binary, Docker container, or Homebrew install), loads a code model into GPU memory, and serves completions over HTTP. IDE extensions connect to the server and inject suggestions inline, identical to how Copilot works.

What Tabby Does Not Do

Tabby focuses specifically on code completion -- the autocomplete experience. It does not include:

  • Chat interface (use Continue.dev or Claude for that)
  • Code explanation or documentation generation
  • Agent mode or autonomous task execution
  • Code review or PR analysis

This focused scope is actually an advantage: Tabby does one thing and does it well, with minimal resource usage.


Why Self-Host Code Completion {#why-self-host}

Privacy

Every character you type in Copilot gets sent to GitHub's servers. For companies handling regulated data (healthcare, finance, defense), customer PII, or proprietary algorithms, that is a compliance problem. Self-hosted Tabby keeps all code on your network.

Cost at Scale

| Team Size | Copilot Cost/Year | Tabby Hardware Cost | Tabby Breakeven |
|-----------|-------------------|---------------------|-----------------|
| 5 devs    | $600              | $1,600 (RTX 4090)   | 2.7 years       |
| 10 devs   | $1,200            | $1,600 (RTX 4090)   | 1.3 years       |
| 25 devs   | $3,000            | $1,600 (RTX 4090)   | 6.4 months      |
| 50 devs   | $6,000            | $4,500 (A6000 48GB) | 9 months        |
| 100 devs  | $12,000           | $4,500 (A6000 48GB) | 4.5 months      |

At 10+ developers, Tabby pays for itself in the first year. At 25+, the savings are substantial.
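The breakeven column is simple arithmetic, spelled out below with the hardware prices from the table and Copilot at $10/dev/month:

```shell
# Breakeven in months = hardware cost / annual Copilot spend * 12
for devs in 5 10 25 50 100; do
  annual=$((devs * 10 * 12))                       # $10/dev/month
  if [ "$devs" -ge 50 ]; then hw=4500; else hw=1600; fi   # A6000 vs RTX 4090 tier
  awk -v d="$devs" -v a="$annual" -v h="$hw" \
    'BEGIN { printf "%4d devs: $%d/yr, breakeven %.1f months\n", d, a, h / a * 12 }'
done
```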

Customization

Copilot gives you one model -- take it or leave it. Tabby lets you:

  • Choose models optimized for your languages (DeepSeek-Coder for Python, StarCoder2 for polyglot)
  • Index your private repositories for context-aware completions
  • Fine-tune models on your codebase (advanced)
  • Control context window size and completion behavior

Uptime Independence

Copilot goes down when GitHub has an outage. Your Tabby server runs on your infrastructure, on your schedule.


Installation Methods {#installation}

Method 1: Docker (Recommended)

Docker is the cleanest way to deploy Tabby, especially for team servers.

# NVIDIA GPU (CUDA)
docker run -it \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder2-3B --device cuda

# AMD GPU (ROCm)
docker run -it \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby-rocm \
  serve --model StarCoder2-3B --device rocm

After startup, open http://localhost:8080 in your browser. You will see the admin dashboard where you can create user accounts, manage models, and view usage analytics.

Method 2: Homebrew (macOS)

# Install
brew install tabbyml/tabby/tabby

# Run with Apple Metal acceleration
tabby serve --model StarCoder2-3B --device metal

# Verify it is running
curl http://localhost:8080/v1/health

Method 3: Direct Binary (Linux)

# Download the latest release
curl -L https://github.com/TabbyML/tabby/releases/latest/download/tabby_x86_64-unknown-linux-gnu -o tabby
chmod +x tabby

# Run with CUDA
./tabby serve --model StarCoder2-3B --device cuda

# Or run as a systemd service for persistence
sudo tee /etc/systemd/system/tabby.service << 'EOF'
[Unit]
Description=Tabby AI Code Completion Server
After=network.target

[Service]
Type=simple
User=tabby
ExecStart=/usr/local/bin/tabby serve --model StarCoder2-3B --device cuda
Restart=always
RestartSec=10
Environment="TABBY_ROOT=/var/lib/tabby"

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable tabby
sudo systemctl start tabby

Method 4: Docker Compose (Production)

# docker-compose.yml
version: '3.8'
services:
  tabby:
    image: tabbyml/tabby
    command: serve --model StarCoder2-7B --device cuda
    ports:
      - "8080:8080"
    volumes:
      - tabby-data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: always

volumes:
  tabby-data:

Then bring the stack up:

docker compose up -d

Model Selection Guide {#model-selection}

Choosing the right model is the single most important decision for your Tabby setup. The tradeoff is always latency vs quality.

Available Models

| Model               | Parameters | VRAM (Q8) | Completion Latency | Quality   |
|---------------------|------------|-----------|--------------------|-----------|
| StarCoder2-3B       | 3B         | 3.5 GB    | 80-150ms           | Good      |
| StarCoder2-7B       | 7B         | 7.5 GB    | 150-250ms          | Very Good |
| StarCoder2-15B      | 15B        | 16 GB     | 300-500ms          | Excellent |
| DeepSeek-Coder 1.3B | 1.3B       | 1.5 GB    | 40-80ms            | Basic     |
| DeepSeek-Coder 6.7B | 6.7B       | 7 GB      | 140-220ms          | Very Good |
| CodeLlama-7B        | 7B         | 7.5 GB    | 150-250ms          | Good      |
| CodeLlama-13B       | 13B        | 14 GB     | 280-450ms          | Very Good |
| Qwen2.5-Coder-3B    | 3B         | 3.5 GB    | 80-150ms           | Good      |
| Qwen2.5-Coder-7B    | 7B         | 7.5 GB    | 150-250ms          | Very Good |

Which Model to Pick

Under 200ms is the target. Above that, developers notice the delay and start typing ahead of the suggestions. This means:

  • 4 GB VRAM (GTX 1070, RX 580): StarCoder2-3B or DeepSeek-Coder 1.3B
  • 8 GB VRAM (RTX 3060 8GB, RTX 4060): StarCoder2-3B (fast) or DeepSeek-Coder 6.7B (quality)
  • 12-16 GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB): StarCoder2-7B (recommended sweet spot)
  • 24 GB VRAM (RTX 4090): StarCoder2-7B with room for team serving, or StarCoder2-15B for single user
  • Apple Silicon 16 GB: StarCoder2-3B (fast) or Qwen2.5-Coder-3B
  • Apple Silicon 32 GB+: StarCoder2-7B or DeepSeek-Coder 6.7B

My recommendation for most teams: StarCoder2-7B on an RTX 4090. The 7B model hits the sweet spot between quality and latency, and the 24 GB VRAM on the 4090 leaves headroom for serving 15-20 concurrent developers.
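The headroom claim can be roughed out from the table's Q8 figure. The runtime-overhead and per-request numbers below are assumptions, not measurements -- check nvidia-smi on your own deployment:

```shell
# Back-of-envelope VRAM headroom on a 24 GB RTX 4090 with StarCoder2-7B
awk 'BEGIN {
  vram    = 24     # GB total
  weights = 7.5    # GB, StarCoder2-7B at Q8 (table above)
  runtime = 2.0    # GB, CUDA context + allocator overhead (assumed)
  per_req = 0.5    # GB of KV cache per in-flight completion (assumed)
  printf "Headroom: %.1f GB -> ~%d completions in flight at once\n",
         vram - weights - runtime, (vram - weights - runtime) / per_req
}'
```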

Switching Models

# Stop current instance, start with new model
tabby serve --model DeepSeek-Coder-6.7B --device cuda

# Or in Docker
docker run -it --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model DeepSeek-Coder-6.7B --device cuda

Models are downloaded automatically on first use. A 7B model downloads ~7 GB on the first run.
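Downloads land in the Tabby data directory (~/.tabby by default, or wherever TABBY_ROOT points, as in the systemd unit above). A quick way to see what is already on disk -- the models subdirectory name is an assumption and may differ by Tabby version:

```shell
# List downloaded models and their sizes (default data dir assumed)
du -sh "$HOME/.tabby/models"/* 2>/dev/null || echo "no models downloaded yet"
```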


IDE Integration {#ide-integration}

VS Code

# Install the extension
code --install-extension TabbyML.vscode-tabby

Configure in VS Code settings (Cmd+Shift+P > Preferences: Open Settings JSON):

{
  "tabby.api.endpoint": "http://localhost:8080",
  "tabby.api.token": "your-auth-token",
  "tabby.inlineCompletion.triggerMode": "automatic",
  "tabby.inlineCompletion.debounce": 200
}

Completions appear inline as you type, identical to Copilot. Press Tab to accept, Escape to dismiss.

JetBrains (IntelliJ, PyCharm, WebStorm, etc.)

  1. Open Settings > Plugins > Marketplace
  2. Search "Tabby" and install
  3. Settings > Tools > Tabby > Server Endpoint: http://localhost:8080
  4. Enter your auth token
  5. Restart IDE

Vim / Neovim

" Using vim-plug
Plug 'TabbyML/vim-tabby'

" Configuration in .vimrc or init.vim
let g:tabby_server_url = 'http://localhost:8080'
let g:tabby_token = 'your-auth-token'

For Neovim with Lua config:

-- In init.lua
require('tabby').setup({
  server_url = 'http://localhost:8080',
  token = 'your-auth-token',
})

Verifying IDE Connection

After configuring any IDE, type some code and wait 200-300ms. If completions appear grayed out inline, the connection works. If not:

# Check server is running
curl http://localhost:8080/v1/health

# Test completion endpoint directly
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-auth-token" \
  -d '{
    "language": "python",
    "segments": {
      "prefix": "def fibonacci(n):\n    if n <= 1:\n        return n\n    ",
      "suffix": ""
    }
  }'

GPU Requirements and Performance {#gpu-requirements}

Benchmarks: Completions Per Second by GPU

| GPU               | VRAM  | StarCoder2-3B | StarCoder2-7B | DeepSeek-Coder 6.7B |
|-------------------|-------|---------------|---------------|---------------------|
| RTX 3060 12GB     | 12 GB | 45 comp/s     | 22 comp/s     | 24 comp/s           |
| RTX 4060 Ti 16GB  | 16 GB | 65 comp/s     | 35 comp/s     | 38 comp/s           |
| RTX 4070 Ti Super | 16 GB | 72 comp/s     | 40 comp/s     | 42 comp/s           |
| RTX 4090          | 24 GB | 95 comp/s     | 55 comp/s     | 58 comp/s           |
| RTX 5090          | 32 GB | 110 comp/s    | 68 comp/s     | 72 comp/s           |
| A6000             | 48 GB | 85 comp/s     | 48 comp/s     | 50 comp/s           |
| Apple M2 Max 32GB | 32 GB | 38 comp/s     | 20 comp/s     | 22 comp/s           |
| Apple M3 Max 48GB | 48 GB | 52 comp/s     | 30 comp/s     | 32 comp/s           |

Completions/sec to concurrent users: One developer triggers roughly 12-20 completions/minute during active coding. So an RTX 4090 running StarCoder2-3B at 95 comp/s supports ~20 concurrent active developers.
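That sizing is conservative: steady-state traffic from 20 developers sits far below the server's raw throughput, and the spare capacity is what keeps latency low when requests arrive in bursts. A quick load check:

```shell
# 20 active developers at 20 completions/min each (top of the range above)
# against an RTX 4090's 95 comp/s on StarCoder2-3B
awk 'BEGIN {
  rate = 20 * 20 / 60   # aggregate requests per second
  printf "Load: %.1f req/s, utilization %.0f%% of 95 comp/s\n", rate, rate / 95 * 100
}'
```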

Power Consumption

| GPU           | Idle (Model Loaded) | Active Inference | Monthly Cost ($0.12/kWh) |
|---------------|---------------------|------------------|--------------------------|
| RTX 3060 12GB | 25W                 | 170W             | $4-15                    |
| RTX 4090      | 40W                 | 300W             | $7-26                    |
| A6000         | 45W                 | 300W             | $8-26                    |
| Apple M3 Max  | 5W                  | 40W              | $1-3                     |

Apple Silicon is remarkably efficient for this use case. An M3 Max running Tabby draws less power than a desk lamp.
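The monthly cost column can be sanity-checked with a usage assumption -- here, eight busy hours per workday, which lands inside the table's range (lighter or heavier use moves you toward either end):

```shell
# RTX 4090: 300W active for 8h on 22 workdays, 40W idle the rest of a ~730h month
awk 'BEGIN {
  active = 0.300 * 8 * 22          # kWh under load
  idle   = 0.040 * (730 - 8 * 22)  # kWh with the model loaded but idle
  printf "%.1f kWh -> $%.2f/month at $0.12/kWh\n", active + idle, (active + idle) * 0.12
}'
```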


Repository Indexing {#repository-indexing}

One of Tabby's strongest features: it can index your private repositories and use that context to improve completions. Instead of generic code suggestions, you get completions that match your project's patterns, API usage, and naming conventions.

Setting Up Repository Indexing

  1. Open the Tabby admin panel at http://localhost:8080
  2. Navigate to Settings > Repositories
  3. Add your Git repository:
# Through the admin API
curl -X POST http://localhost:8080/v1/repositories \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer admin-token" \
  -d '{
    "name": "my-project",
    "git_url": "file:///path/to/my-project",
    "branch": "main"
  }'
  4. Tabby clones the repository and builds a code index in the background
  5. Once indexed, completions automatically incorporate your codebase patterns

What Indexing Improves

  • Import suggestions: Tabby learns which modules your project uses and suggests correct imports
  • Function signatures: Completions match your naming conventions (camelCase, snake_case, etc.)
  • API patterns: If your codebase always calls db.query().where().first(), Tabby suggests that chain
  • Type patterns: Consistent with your TypeScript types, Python type hints, etc.

Indexing Performance

| Codebase Size | Index Build Time | Index Size | Memory Overhead |
|---------------|------------------|------------|-----------------|
| 10K lines     | 30-60 sec        | ~50 MB     | ~200 MB         |
| 100K lines    | 3-8 min          | ~200 MB    | ~500 MB         |
| 500K lines    | 15-30 min        | ~800 MB    | ~1.5 GB         |
| 1M+ lines     | 45-90 min        | ~2 GB      | ~3 GB           |

The index runs in the background and does not block completions. Memory overhead is additive to model VRAM, so factor it into your GPU planning.
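To pick the right row for your repository, count the lines Tabby will index. The extension list here is only an example -- adjust it to your stack:

```shell
# Total lines across common source extensions in the current repo
find . -type f \( -name '*.py' -o -name '*.ts' -o -name '*.go' -o -name '*.rs' \) \
  -print0 | xargs -0 cat 2>/dev/null | wc -l
```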


Tabby vs Continue.dev {#tabby-vs-continue}

Both Tabby and Continue.dev are open-source tools for local AI coding, but they serve different purposes. For a detailed Continue.dev setup, see our Continue.dev + Ollama guide.

| Feature             | Tabby                          | Continue.dev + Ollama               |
|---------------------|--------------------------------|-------------------------------------|
| Primary purpose     | Code completion (autocomplete) | Full AI coding assistant            |
| Tab autocomplete    | Excellent (purpose-built)      | Good (secondary feature)            |
| Chat                | No                             | Yes                                 |
| Edit mode           | No                             | Yes                                 |
| Agent mode          | No                             | Yes                                 |
| Model hosting       | Built-in                       | Requires Ollama                     |
| Repository indexing | Built-in                       | Via embeddings model                |
| Team features       | Built-in (auth, analytics)     | None                                |
| IDE support         | VS Code, JetBrains, Vim        | VS Code, JetBrains                  |
| Setup complexity    | One command                    | Ollama + Continue + config          |
| Resource usage      | Low (one model)                | Higher (autocomplete + chat models) |

The Ideal Setup: Use Both

The best local AI coding setup combines them:

  1. Tabby for real-time autocomplete (StarCoder2-3B, ~3.5 GB VRAM)
  2. Continue.dev with Ollama for chat, debugging, and refactoring (Qwen2.5-Coder 7B, ~7.5 GB VRAM)
  3. Total VRAM: ~11 GB, fits on RTX 3060 12GB or RTX 4060 Ti 16GB

This gives you Copilot-level autocomplete plus ChatGPT-level code assistance, all running locally.
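The VRAM budget for the combined stack, using the figures from the model table (the overhead line is a rough allowance, not a measured number):

```shell
awk 'BEGIN {
  tabby    = 3.5   # GB, StarCoder2-3B autocomplete
  ollama   = 7.5   # GB, Qwen2.5-Coder 7B chat/edit
  overhead = 0.5   # GB, KV cache + runtime headroom (assumed)
  printf "Combined: %.1f GB -- fits a 12 GB card with little to spare\n",
         tabby + ollama + overhead
}'
```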

Configure Continue.dev to not use its own autocomplete (since Tabby handles that):

# ~/.continue/config.yaml
models:
  - name: Qwen2.5-Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
      - apply
# No autocomplete model - Tabby handles it

Team Deployment {#team-deployment}

Authentication Setup

By default, Tabby runs without authentication. For team deployments, enable auth:

# Start with authentication enabled
tabby serve --model StarCoder2-7B --device cuda

# On first run, create admin account at http://your-server:8080
# Then invite team members through the admin panel

Tabby supports:

  • Built-in email/password authentication
  • OAuth (GitHub, Google, GitLab)
  • LDAP (enterprise)

Network Configuration

For team access, bind Tabby to your LAN:

# Bind to all interfaces
tabby serve --model StarCoder2-7B --device cuda --host 0.0.0.0 --port 8080

# Or use a reverse proxy (nginx)
# /etc/nginx/sites-available/tabby
server {
    listen 443 ssl;
    server_name tabby.internal.company.com;

    ssl_certificate /etc/ssl/certs/internal.crt;
    ssl_certificate_key /etc/ssl/private/internal.key;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Monitoring and Analytics

The Tabby admin dashboard (http://your-server:8080) shows:

  • Active users and sessions
  • Completion acceptance rate (how often developers press Tab)
  • Completions per hour/day
  • Model latency percentiles
  • GPU utilization

A healthy deployment shows:

  • Acceptance rate: 25-35% (similar to Copilot's public metrics)
  • P95 latency: Under 300ms
  • GPU utilization: 30-60% during business hours

Scaling for Larger Teams

For 50+ developers, consider:

  1. Horizontal scaling: Run multiple Tabby instances behind a load balancer
# Instance 1 on GPU 0
CUDA_VISIBLE_DEVICES=0 tabby serve --model StarCoder2-7B --port 8080

# Instance 2 on GPU 1
CUDA_VISIBLE_DEVICES=1 tabby serve --model StarCoder2-7B --port 8081
  2. Dedicated hardware: An NVIDIA A6000 48GB or dual RTX 4090s handle 30-50 concurrent users

  3. Cloud deployment: Deploy on a cloud GPU instance (RunPod, Lambda) if you do not want on-premise hardware
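To put the two instances from step 1 behind a single endpoint, the nginx config from the Network Configuration section extends naturally to an upstream pool. A sketch -- hostnames, ports, and certificate paths assume the examples above:

```nginx
# /etc/nginx/sites-available/tabby -- load-balance across both instances
upstream tabby_pool {
    least_conn;                     # route each request to the least-busy instance
    server 127.0.0.1:8080;          # instance on GPU 0
    server 127.0.0.1:8081;          # instance on GPU 1
}

server {
    listen 443 ssl;
    server_name tabby.internal.company.com;

    ssl_certificate /etc/ssl/certs/internal.crt;
    ssl_certificate_key /etc/ssl/private/internal.key;

    location / {
        proxy_pass http://tabby_pool;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```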


Performance Tuning {#performance-tuning}

Reduce Latency

# Use a smaller model for faster completions
tabby serve --model StarCoder2-3B --device cuda

# Limit completion length (fewer tokens = faster)
# In the admin panel: Settings > Completion > Max tokens: 128

# Increase GPU memory allocation
CUDA_MEM_FRACTION=0.9 tabby serve --model StarCoder2-7B --device cuda

IDE-Side Tuning

In VS Code settings:

{
  "tabby.inlineCompletion.debounce": 250,
  "tabby.inlineCompletion.triggerMode": "automatic",
  "tabby.maxPrefixLines": 20,
  "tabby.maxSuffixLines": 20
}

Increasing debounce from 200ms to 250-300ms reduces server load (fewer requests) at the cost of slightly delayed suggestions. For slow servers, this trade is worth it.

Model Warm-Up

The first completion after a cold start is slow because the model loads into GPU memory. Keep the model warm:

# Send periodic health checks to prevent unloading
while true; do
  curl -s http://localhost:8080/v1/health > /dev/null
  sleep 60
done &

Monitoring GPU Usage

# Watch GPU utilization in real-time
watch -n 1 nvidia-smi

# Or use nvtop for a better visualization
nvtop

If GPU utilization is consistently above 80%, either add another GPU, switch to a smaller model, or increase the debounce time on IDE clients.


Privacy Advantages {#privacy-advantages}

The core value proposition of Tabby over cloud services deserves emphasis:

  1. Zero data exfiltration: Your code stays on your hardware. Period. No telemetry, no training data collection, no third-party access.

  2. Compliance friendly: Self-hosted satisfies SOC 2, HIPAA, GDPR, and FedRAMP data residency requirements. Your security team will appreciate not having to review another SaaS vendor's DPA.

  3. No vendor lock-in: Switch models, modify the source code, or migrate to different hardware anytime. Apache 2.0 license means you own the deployment.

  4. Air-gapped support: Tabby runs entirely offline after the initial model download. Disconnect from the internet and it works identically. Critical for defense, government, and high-security environments.

For teams already running local AI for other tasks, see our local AI programming models guide for complementary tools.


Conclusion

Tabby is the most mature self-hosted code completion server available. The 23,000+ GitHub stars are not hype -- it genuinely works. Setup takes 20 minutes, the completion quality rivals Copilot for common patterns, and the privacy guarantee is absolute.

For a single developer, the ROI calculation depends on how much you value privacy. For a team of 10+, the math is clear: one RTX 4090 ($1,600) replaces $1,200/year in Copilot subscriptions and eliminates code leaving your network.

Start with Docker and StarCoder2-3B. If the completions feel useful, upgrade to StarCoder2-7B. Index your repositories for the biggest quality improvement. And combine it with Continue.dev + Ollama for a complete local AI coding stack that matches what the cloud providers charge $20-50/month per seat to deliver.


Building a complete local AI development environment? Check our best local AI coding models ranking or the AI hardware requirements guide to plan your hardware.


Published: April 10, 2026 · Last Updated: April 10, 2026 · Manually Reviewed

Written by Pattanaik Ramswarup -- AI Engineer & Dataset Architect
