Want to go deeper than this article?
The AI Learning Path covers this topic and more -- hands-on chapters across 10 courses.
Tabby: How to Set Up a Self-Hosted GitHub Copilot Alternative
Published on April 10, 2026 -- 19 min read
GitHub Copilot costs $10/month per developer and sends every keystroke to Microsoft's servers. For a 10-person team, that is $1,200/year -- and your proprietary code passes through infrastructure you do not control. Tabby eliminates both problems: it is free, open-source, and runs entirely on your network.
Tabby has 23,000+ GitHub stars, supports VS Code, JetBrains, Vim, and Neovim, and serves real-time code completions from models you choose. I have been running it for my team of five for three months on a single RTX 4090 workstation. Setup took 20 minutes. The completion quality is 85-90% of Copilot for standard coding patterns, and our code never leaves the building.
Here is how to set it up, which model to pick, and how to tune it for your team.
What is Tabby {#what-is-tabby}
Tabby is an open-source AI code completion server built by TabbyML. It provides:
- Real-time code completion with sub-200ms latency
- Multi-IDE support: VS Code, JetBrains (IntelliJ, PyCharm, WebStorm), Vim, Neovim
- Model flexibility: StarCoder2, DeepSeek-Coder, CodeLlama, Qwen2.5-Coder
- Repository indexing: learns your codebase patterns for better suggestions
- Admin dashboard: user management, usage analytics, model configuration
- Enterprise features: LDAP/OAuth auth, audit logging, access controls
The architecture is simple: Tabby runs as a server (standalone binary, Docker container, or Homebrew install), loads a code model into GPU memory, and serves completions over HTTP. IDE extensions connect to the server and inject suggestions inline, identical to how Copilot works.
What Tabby Does Not Do
Tabby focuses specifically on code completion -- the autocomplete experience. It does not include:
- Chat interface (use Continue.dev or Claude for that)
- Code explanation or documentation generation
- Agent mode or autonomous task execution
- Code review or PR analysis
This focused scope is actually an advantage: Tabby does one thing and does it well, with minimal resource usage.
Why Self-Host Code Completion {#why-self-host}
Privacy
Every character you type in Copilot gets sent to GitHub's servers. For companies handling regulated data (healthcare, finance, defense), customer PII, or proprietary algorithms, that is a compliance problem. Self-hosted Tabby keeps all code on your network.
Cost at Scale
| Team Size | Copilot Cost/Year | Tabby Hardware Cost | Tabby Breakeven |
|---|---|---|---|
| 5 devs | $600 | $1,600 (RTX 4090) | 2.7 years |
| 10 devs | $1,200 | $1,600 (RTX 4090) | 1.3 years |
| 25 devs | $3,000 | $1,600 (RTX 4090) | 6.4 months |
| 50 devs | $6,000 | $4,500 (A6000 48GB) | 9 months |
| 100 devs | $12,000 | $4,500 (A6000 48GB) | 4.5 months |
At 10 developers, Tabby pays for itself in just over a year. At 25+, breakeven arrives within months and the ongoing savings are substantial.
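The breakeven figures in the table are simple division: hardware cost over monthly subscription spend. A quick sketch of the arithmetic, assuming Copilot's $10/seat/month list price (`breakeven_months` is an illustrative helper, not a Tabby tool):

```shell
# Months until self-hosted hardware beats Copilot subscriptions.
# Assumes $10/seat/month; integer shell arithmetic truncates the result.
breakeven_months() {
  local devs=$1 hardware_usd=$2
  echo $(( hardware_usd / (devs * 10) ))
}

breakeven_months 25 1600   # RTX 4090 for a 25-dev team → 6 (months)
breakeven_months 10 1600   # 10-dev team → 16 (months, i.e. ~1.3 years)
```

Plug in your own team size and hardware quote to see where you land.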
Customization
Copilot gives you one model, take it or leave it. Tabby lets you:
- Choose models optimized for your languages (DeepSeek-Coder for Python, StarCoder2 for polyglot)
- Index your private repositories for context-aware completions
- Fine-tune models on your codebase (advanced)
- Control context window size and completion behavior
Uptime Independence
Copilot goes down when GitHub has an outage. Your Tabby server runs on your infrastructure, on your schedule.
Installation Methods {#installation}
Method 1: Docker (Recommended for Teams)
Docker is the cleanest way to deploy Tabby, especially for team servers.
# NVIDIA GPU (CUDA)
docker run -it \
--gpus all \
-p 8080:8080 \
-v $HOME/.tabby:/data \
tabbyml/tabby \
serve --model StarCoder2-3B --device cuda
# AMD GPU (ROCm)
docker run -it \
--device /dev/kfd --device /dev/dri \
--group-add video \
-p 8080:8080 \
-v $HOME/.tabby:/data \
tabbyml/tabby-rocm \
serve --model StarCoder2-3B --device rocm
After startup, open http://localhost:8080 in your browser. You will see the admin dashboard where you can create user accounts, manage models, and view usage analytics.
Method 2: Homebrew (macOS)
# Install
brew install tabbyml/tabby/tabby
# Run with Apple Metal acceleration
tabby serve --model StarCoder2-3B --device metal
# Verify it is running
curl http://localhost:8080/v1/health
Method 3: Direct Binary (Linux)
# Download the latest release
curl -L https://github.com/TabbyML/tabby/releases/latest/download/tabby_x86_64-unknown-linux-gnu -o tabby
chmod +x tabby
# Run with CUDA
./tabby serve --model StarCoder2-3B --device cuda
# Or run as a systemd service for persistence
sudo tee /etc/systemd/system/tabby.service << 'EOF'
[Unit]
Description=Tabby AI Code Completion Server
After=network.target
[Service]
Type=simple
User=tabby
ExecStart=/usr/local/bin/tabby serve --model StarCoder2-3B --device cuda
Restart=always
RestartSec=10
Environment="TABBY_ROOT=/var/lib/tabby"
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable tabby
sudo systemctl start tabby
Method 4: Docker Compose (Production)
# docker-compose.yml
version: '3.8'
services:
  tabby:
    image: tabbyml/tabby
    command: serve --model StarCoder2-7B --device cuda
    ports:
      - "8080:8080"
    volumes:
      - tabby-data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: always
volumes:
  tabby-data:
Start it with:
docker compose up -d
Model Selection Guide {#model-selection}
Choosing the right model is the single most important decision for your Tabby setup. The tradeoff is always latency vs quality.
Available Models
| Model | Parameters | VRAM (Q8) | Completion Latency | Quality |
|---|---|---|---|---|
| StarCoder2-3B | 3B | 3.5 GB | 80-150ms | Good |
| StarCoder2-7B | 7B | 7.5 GB | 150-250ms | Very Good |
| StarCoder2-15B | 15B | 16 GB | 300-500ms | Excellent |
| DeepSeek-Coder 1.3B | 1.3B | 1.5 GB | 40-80ms | Basic |
| DeepSeek-Coder 6.7B | 6.7B | 7 GB | 140-220ms | Very Good |
| CodeLlama-7B | 7B | 7.5 GB | 150-250ms | Good |
| CodeLlama-13B | 13B | 14 GB | 280-450ms | Very Good |
| Qwen2.5-Coder-3B | 3B | 3.5 GB | 80-150ms | Good |
| Qwen2.5-Coder-7B | 7B | 7.5 GB | 150-250ms | Very Good |
Which Model to Pick
Under 200ms is the target. Above that, developers notice the delay and start typing ahead of the suggestions. This means:
- 4 GB VRAM (GTX 1070, RX 580): StarCoder2-3B or DeepSeek-Coder 1.3B
- 8 GB VRAM (RTX 3060 8GB, RTX 4060): StarCoder2-3B (fast) or DeepSeek-Coder 6.7B (quality)
- 12-16 GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB): StarCoder2-7B (recommended sweet spot)
- 24 GB VRAM (RTX 4090): StarCoder2-7B with room for team serving, or StarCoder2-15B for single user
- Apple Silicon 16 GB: StarCoder2-3B (fast) or Qwen2.5-Coder-3B
- Apple Silicon 32 GB+: StarCoder2-7B or DeepSeek-Coder 6.7B
My recommendation for most teams: StarCoder2-7B on an RTX 4090. The 7B model hits the sweet spot between quality and latency, and the 24 GB VRAM on the 4090 leaves headroom for serving 15-20 concurrent developers.
Switching Models
# Stop current instance, start with new model
tabby serve --model DeepSeek-Coder-6.7B --device cuda
# Or in Docker
docker run -it --gpus all \
-p 8080:8080 \
-v $HOME/.tabby:/data \
tabbyml/tabby \
serve --model DeepSeek-Coder-6.7B --device cuda
Models are downloaded automatically on first use. A 7B model downloads ~7 GB on the first run.
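Since models land in ~/.tabby on first use, it is worth confirming free disk space before switching to a larger model. A small pre-flight sketch (`has_space_for_model` is an illustrative helper; the 8 GB figure is a rough assumption for a 7B model):

```shell
# Check that a directory's filesystem has at least need_gb gigabytes free.
# df -Pk prints available space in 1K blocks on both Linux and macOS.
has_space_for_model() {
  local need_gb=$1 dir=${2:-$HOME/.tabby}
  local free_kb
  free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ $(( free_kb / 1024 / 1024 )) -ge "$need_gb" ]
}

# Before pulling a 7B model:
has_space_for_model 8 "$HOME" || echo "free up disk first"
```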
IDE Integration {#ide-integration}
VS Code
# Install the extension
code --install-extension TabbyML.vscode-tabby
Configure in VS Code settings (Cmd+Shift+P > Preferences: Open Settings JSON):
{
"tabby.api.endpoint": "http://localhost:8080",
"tabby.api.token": "your-auth-token",
"tabby.inlineCompletion.triggerMode": "automatic",
"tabby.inlineCompletion.debounce": 200
}
Completions appear inline as you type, identical to Copilot. Press Tab to accept, Escape to dismiss.
JetBrains (IntelliJ, PyCharm, WebStorm, etc.)
- Open Settings > Plugins > Marketplace
- Search "Tabby" and install
- Settings > Tools > Tabby > Server Endpoint: http://localhost:8080
- Enter your auth token
- Restart IDE
Vim / Neovim
" Using vim-plug
Plug 'TabbyML/vim-tabby'
" Configuration in .vimrc or init.vim
let g:tabby_server_url = 'http://localhost:8080'
let g:tabby_token = 'your-auth-token'
For Neovim with Lua config:
-- In init.lua
require('tabby').setup({
server_url = 'http://localhost:8080',
token = 'your-auth-token',
})
Verifying IDE Connection
After configuring any IDE, type some code and wait 200-300ms. If completions appear grayed out inline, the connection works. If not:
# Check server is running
curl http://localhost:8080/v1/health
# Test completion endpoint directly
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-auth-token" \
  -d '{
    "language": "python",
    "segments": {
      "prefix": "def fibonacci(n):\n    if n <= 1:\n        return n\n    ",
      "suffix": ""
    }
  }'
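If the direct call works, the response should carry the suggestion under a choices array (this shape is an assumption based on Tabby's completion API; verify it against your server's actual output). You can pull out just the suggested text with nothing but python3 from the standard library:

```shell
# Extract the first suggestion's text from a completion response on stdin.
extract_completion() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
}

# Canned example of the assumed response shape; pipe the live curl output instead.
echo '{"choices": [{"index": 0, "text": "return fibonacci(n-1) + fibonacci(n-2)"}]}' \
  | extract_completion
# → return fibonacci(n-1) + fibonacci(n-2)
```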
GPU Requirements and Performance {#gpu-requirements}
Benchmarks: Completions Per Second by GPU
| GPU | VRAM | StarCoder2-3B | StarCoder2-7B | DeepSeek-Coder 6.7B |
|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | 45 comp/s | 22 comp/s | 24 comp/s |
| RTX 4060 Ti 16GB | 16 GB | 65 comp/s | 35 comp/s | 38 comp/s |
| RTX 4070 Ti Super | 16 GB | 72 comp/s | 40 comp/s | 42 comp/s |
| RTX 4090 | 24 GB | 95 comp/s | 55 comp/s | 58 comp/s |
| RTX 5090 | 32 GB | 110 comp/s | 68 comp/s | 72 comp/s |
| A6000 | 48 GB | 85 comp/s | 48 comp/s | 50 comp/s |
| Apple M2 Max 32GB | 32 GB | 38 comp/s | 20 comp/s | 22 comp/s |
| Apple M3 Max 48GB | 48 GB | 52 comp/s | 30 comp/s | 32 comp/s |
Completions/sec to concurrent users: One developer triggers roughly 12-20 completions/minute during active coding, so 20 concurrent developers generate at most ~7 requests per second. An RTX 4090 running StarCoder2-3B at 95 comp/s covers that steady-state demand many times over, which is why it comfortably serves ~20 active developers with headroom for bursts and longer prompts.
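The demand side of that sizing is easy to sanity-check, since it scales linearly with team size. A sketch of the arithmetic (`team_requests_per_sec` is an illustrative helper; 20 completions/minute is the high end of the range above):

```shell
# Completion requests per second generated by N active developers.
# Integer math scaled by 100 to keep two decimal places.
team_requests_per_sec() {
  local devs=$1 per_dev_per_min=${2:-20}
  local scaled=$(( devs * per_dev_per_min * 100 / 60 ))
  printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}

team_requests_per_sec 20   # 20 devs at 20 comp/min → 6.66 req/s
```

Compare the result against the comp/s column in the benchmark table to see how much headroom a given GPU leaves you.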
Power Consumption
| GPU | Idle (Model Loaded) | Active Inference | Monthly Cost ($0.12/kWh) |
|---|---|---|---|
| RTX 3060 12GB | 25W | 170W | $4-15 |
| RTX 4090 | 40W | 300W | $7-26 |
| A6000 | 45W | 300W | $8-26 |
| Apple M3 Max | 5W | 40W | $1-3 |
Apple Silicon is remarkably efficient for this use case. An M3 Max running Tabby draws less power than a desk lamp.
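The monthly figures above follow directly from wattage, duty cycle, and electricity price. A sketch of the calculation (`monthly_cost_cents` is an illustrative helper; it assumes a 30-day month and truncates to whole cents):

```shell
# Monthly electricity cost in US cents for a given draw and daily usage.
monthly_cost_cents() {
  local watts=$1 hours_per_day=$2 cents_per_kwh=${3:-12}
  echo $(( watts * hours_per_day * 30 * cents_per_kwh / 1000 ))
}

monthly_cost_cents 300 8    # RTX 4090 under load 8h/day → 864 cents ($8.64)
monthly_cost_cents 40 24    # idle with model loaded, 24/7 → 345 cents ($3.45)
```

Real deployments mix idle and active draw, which is why the table shows ranges rather than single numbers.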
Repository Indexing {#repository-indexing}
One of Tabby's strongest features: it can index your private repositories and use that context to improve completions. Instead of generic code suggestions, you get completions that match your project's patterns, API usage, and naming conventions.
Setting Up Repository Indexing
- Open the Tabby admin panel at http://localhost:8080
- Navigate to Settings > Repositories
- Add your Git repository:
# Through the admin API
curl -X POST http://localhost:8080/v1/repositories \
-H "Content-Type: application/json" \
-H "Authorization: Bearer admin-token" \
-d '{
"name": "my-project",
"git_url": "file:///path/to/my-project",
"branch": "main"
}'
- Tabby clones the repository and builds a code index in the background
- Once indexed, completions automatically incorporate your codebase patterns
What Indexing Improves
- Import suggestions: Tabby learns which modules your project uses and suggests correct imports
- Function signatures: Completions match your naming conventions (camelCase, snake_case, etc.)
- API patterns: If your codebase always calls db.query().where().first(), Tabby suggests that chain
- Type patterns: Consistent with your TypeScript types, Python type hints, etc.
Indexing Performance
| Codebase Size | Index Build Time | Index Size | Memory Overhead |
|---|---|---|---|
| 10K lines | 30-60 sec | ~50 MB | ~200 MB |
| 100K lines | 3-8 min | ~200 MB | ~500 MB |
| 500K lines | 15-30 min | ~800 MB | ~1.5 GB |
| 1M+ lines | 45-90 min | ~2 GB | ~3 GB |
The index runs in the background and does not block completions. Memory overhead is additive to model VRAM, so factor it into your GPU planning.
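To place your repository in the table above, a rough line count is enough. A quick sketch (`count_loc` is an illustrative helper; extend the extension list to match your stack):

```shell
# Count lines across common source files under a directory.
count_loc() {
  find "$1" -type f \( -name '*.py' -o -name '*.ts' -o -name '*.go' -o -name '*.java' \) \
    -exec cat {} + | wc -l | tr -d ' '
}

# Example (hypothetical path -- point it at your checkout):
count_loc ~/projects/my-project
```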
Tabby vs Continue.dev {#tabby-vs-continue}
Both Tabby and Continue.dev are open-source tools for local AI coding, but they serve different purposes. For a detailed Continue.dev setup, see our Continue.dev + Ollama guide.
| Feature | Tabby | Continue.dev + Ollama |
|---|---|---|
| Primary purpose | Code completion (autocomplete) | Full AI coding assistant |
| Tab autocomplete | Excellent (purpose-built) | Good (secondary feature) |
| Chat | No | Yes |
| Edit mode | No | Yes |
| Agent mode | No | Yes |
| Model hosting | Built-in | Requires Ollama |
| Repository indexing | Built-in | Via embeddings model |
| Team features | Built-in (auth, analytics) | None |
| IDE support | VS Code, JetBrains, Vim | VS Code, JetBrains |
| Setup complexity | One command | Ollama + Continue + config |
| Resource usage | Low (one model) | Higher (autocomplete + chat models) |
The Ideal Setup: Use Both
The best local AI coding setup combines them:
- Tabby for real-time autocomplete (StarCoder2-3B, ~3.5 GB VRAM)
- Continue.dev with Ollama for chat, debugging, and refactoring (Qwen2.5-Coder 7B, ~7.5 GB VRAM)
- Total VRAM: ~11 GB, fits on RTX 3060 12GB or RTX 4060 Ti 16GB
This gives you Copilot-level autocomplete plus ChatGPT-level code assistance, all running locally.
Configure Continue.dev to not use its own autocomplete (since Tabby handles that):
# ~/.continue/config.yaml
models:
  - name: Qwen2.5-Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
      - apply
# No autocomplete model - Tabby handles it
Team Deployment {#team-deployment}
Authentication Setup
By default, Tabby runs without authentication. For team deployments, enable auth:
# Start with authentication enabled
tabby serve --model StarCoder2-7B --device cuda
# On first run, create admin account at http://your-server:8080
# Then invite team members through the admin panel
Tabby supports:
- Built-in email/password authentication
- OAuth (GitHub, Google, GitLab)
- LDAP (enterprise)
Network Configuration
For team access, bind Tabby to your LAN:
# Bind to all interfaces
tabby serve --model StarCoder2-7B --device cuda --host 0.0.0.0 --port 8080
# Or use a reverse proxy (nginx)
# /etc/nginx/sites-available/tabby
server {
    listen 443 ssl;
    server_name tabby.internal.company.com;
    ssl_certificate /etc/ssl/certs/internal.crt;
    ssl_certificate_key /etc/ssl/private/internal.key;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
Monitoring and Analytics
The Tabby admin dashboard (http://your-server:8080) shows:
- Active users and sessions
- Completion acceptance rate (how often developers press Tab)
- Completions per hour/day
- Model latency percentiles
- GPU utilization
A healthy deployment shows:
- Acceptance rate: 25-35% (similar to Copilot's public metrics)
- P95 latency: Under 300ms
- GPU utilization: 30-60% during business hours
Scaling for Larger Teams
For 50+ developers, consider:
- Horizontal scaling: Run multiple Tabby instances behind a load balancer
# Instance 1 on GPU 0
CUDA_VISIBLE_DEVICES=0 tabby serve --model StarCoder2-7B --port 8080
# Instance 2 on GPU 1
CUDA_VISIBLE_DEVICES=1 tabby serve --model StarCoder2-7B --port 8081
- Dedicated hardware: An NVIDIA A6000 48GB or dual RTX 4090s handle 30-50 concurrent users
- Cloud deployment: Deploy on a cloud GPU instance (RunPod, Lambda) if you do not want on-premise hardware
Performance Tuning {#performance-tuning}
Reduce Latency
# Use a smaller model for faster completions
tabby serve --model StarCoder2-3B --device cuda
# Limit completion length (fewer tokens = faster)
# In the admin panel: Settings > Completion > Max tokens: 128
# Increase GPU memory allocation
CUDA_MEM_FRACTION=0.9 tabby serve --model StarCoder2-7B --device cuda
IDE-Side Tuning
In VS Code settings:
{
"tabby.inlineCompletion.debounce": 250,
"tabby.inlineCompletion.triggerMode": "automatic",
"tabby.maxPrefixLines": 20,
"tabby.maxSuffixLines": 20
}
Increasing debounce from 200ms to 250-300ms reduces server load (fewer requests) at the cost of slightly delayed suggestions. For slow servers, this trade is worth it.
Model Warm-Up
The first completion after a cold start is slow because the model loads into GPU memory. Keep the model warm:
# Send periodic health checks to prevent unloading
while true; do
curl -s http://localhost:8080/v1/health > /dev/null
sleep 60
done &
Monitoring GPU Usage
# Watch GPU utilization in real-time
watch -n 1 nvidia-smi
# Or use nvtop for a better visualization
nvtop
If GPU utilization is consistently above 80%, either add another GPU, switch to a smaller model, or increase the debounce time on IDE clients.
Privacy Advantages {#privacy-advantages}
The core value proposition of Tabby over cloud services deserves emphasis:
- Zero data exfiltration: Your code stays on your hardware. Period. No telemetry, no training data collection, no third-party access.
- Compliance friendly: Self-hosting satisfies SOC 2, HIPAA, GDPR, and FedRAMP data residency requirements. Your security team will appreciate not having to review another SaaS vendor's DPA.
- No vendor lock-in: Switch models, modify the source code, or migrate to different hardware anytime. The Apache 2.0 license means you own the deployment.
- Air-gapped support: Tabby runs entirely offline after the initial model download. Disconnect from the internet and it works identically. Critical for defense, government, and high-security environments.
For teams already running local AI for other tasks, see our local AI programming models guide for complementary tools.
Conclusion
Tabby is the most mature self-hosted code completion server available. The 23,000+ GitHub stars are not hype -- it genuinely works. Setup takes 20 minutes, the completion quality rivals Copilot for common patterns, and the privacy guarantee is absolute.
For a single developer, the ROI calculation depends on how much you value privacy. For a team of 10+, the math is clear: one RTX 4090 ($1,600) replaces $1,200/year in Copilot subscriptions and eliminates code leaving your network.
Start with Docker and StarCoder2-3B. If the completions feel useful, upgrade to StarCoder2-7B. Index your repositories for the biggest quality improvement. And combine it with Continue.dev + Ollama for a complete local AI coding stack that matches what the cloud providers charge $20-50/month per seat to deliver.
Building a complete local AI development environment? Check our best local AI coding models ranking or the AI hardware requirements guide to plan your hardware.