Vicuna-33B: LMSys ShareGPT Model

Updated: March 16, 2026

LLaMA 1 33B fine-tuned on 70K ShareGPT conversations by LMSys. A historically significant model that pioneered LLM-as-judge evaluation and the Chatbot Arena.

LLaMA 1 Base | Non-Commercial License | Not on Ollama | Legacy Model (2023)

What Is Vicuna-33B?

Vicuna-33B v1.3 is a 33-billion parameter language model created by LMSys (Large Model Systems Organization) — a collaboration between UC Berkeley, CMU, Stanford, and UC San Diego. Released in June 2023, it was fine-tuned from Meta's LLaMA 1 33B on approximately 70,000 user conversations collected from ShareGPT.

Technical Specs

Base Model: LLaMA 1 33B (Meta, Feb 2023)
Parameters: 33 billion
Context Window: 2048 tokens (LLaMA 1 default)
Training Data: ~70K ShareGPT conversations
Version: v1.3 (June 2023)
Architecture: Standard LLaMA 1 transformer (RoPE, SwiGLU, RMSNorm; standard multi-head attention — GQA was not introduced until LLaMA 2)
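The SwiGLU feed-forward block in that spec list can be sketched in a few lines of NumPy. This is an illustrative toy with made-up weight shapes, not the model's actual implementation:

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """LLaMA-style feed-forward: a SiLU-gated product instead of a plain MLP."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))               # toy hidden size of 8
y = swiglu_ffn(x,
               rng.standard_normal((8, 16)),  # gate projection
               rng.standard_normal((8, 16)),  # up projection
               rng.standard_normal((16, 8)))  # down projection
print(y.shape)  # (1, 8)
```

In the real 33B model the hidden and intermediate dimensions are in the thousands; the gating is what distinguishes this from the plain two-layer MLP used in the original transformer.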

Key Facts

Developer: LMSys (UC Berkeley, CMU, Stanford, UCSD)
License: LLaMA 1 license (non-commercial research only)
Ollama: NOT available (only 7B and 13B sizes)
HuggingFace: lmsys/vicuna-33b-v1.3
Status: Legacy — superseded by modern models

License Warning: Many sites incorrectly claim Vicuna-33B is “Apache 2.0.” While the Vicuna delta weights are Apache 2.0, the model requires LLaMA 1 base weights which are under Meta's non-commercial research license. Commercial use is not permitted. Vicuna v1.5 (7B/13B only) uses LLaMA 2 with a more permissive license, but no v1.5 33B exists.

Historical Significance

Vicuna is one of the most historically important open-source LLMs. It contributed two innovations that became industry standards:

1. ShareGPT Training Data

Vicuna was one of the first models to demonstrate that fine-tuning on real user conversations (collected from ShareGPT, where users shared their ChatGPT conversations) produced models that felt significantly more natural and helpful than those trained on synthetic instruction data. This approach influenced the entire field.

2. LLM-as-Judge Evaluation

The Vicuna team pioneered using GPT-4 as an automated evaluator — comparing model outputs head-to-head and having GPT-4 score them. This “LLM-as-judge” approach became the standard evaluation methodology across the industry. The accompanying paper is “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (arXiv:2306.05685).
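In practice this amounts to building a pairwise prompt and asking the judge model for a verdict. A minimal sketch — the wording below is illustrative, not the exact MT-Bench template from the paper:

```python
def build_judge_prompt(question, answer_a, answer_b):
    """Pairwise LLM-as-judge prompt (illustrative wording, not the
    paper's exact template)."""
    return (
        "You are an impartial judge. Compare the two AI assistant "
        "answers to the question below and decide which is better.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}\n\n"
        "Reply with exactly one of: A, B, tie."
    )

print(build_judge_prompt("What is 2+2?", "4", "5"))
```

Because judge models show position bias, the paper recommends scoring each pair twice with the answer order swapped and only counting consistent verdicts.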

Chatbot Arena Legacy

LMSys also created the Chatbot Arena — a crowdsourced platform where users compare model outputs in blind A/B tests. Vicuna was a founding model on this platform. The Arena's Elo-based rankings remain one of the most trusted LLM evaluation benchmarks in 2026.
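The Arena's rating system can be sketched with a standard Elo update. This toy uses K=32, a common default; the Arena's actual rating computation is more involved, so treat the parameters here as assumptions:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a head-to-head vote.

    score_a: 1.0 if model A wins, 0.5 for a tie, 0.0 if A loses.
    K=32 is a common default, not taken from LMSys code.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the blind comparison.
print(elo_update(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```

Aggregated over many thousands of votes, these per-match updates converge to a stable ranking even though each individual vote is noisy.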

Real Benchmarks

MMLU Comparison

Approximate scores from the Open LLM Leaderboard. Note how modern 7B models now outperform Vicuna 33B.

MMLU Score (%):

  • Qwen 2.5 7B: 74
  • Llama 3.1 8B: 68
  • Mistral 7B v0.3: 63
  • Vicuna-33B v1.3: 59
  • Vicuna-13B v1.5: 55

| Benchmark | Score | What It Measures |
|---|---|---|
| MMLU | ~59.2% | Multi-task language understanding |
| HellaSwag | ~82.8% | Commonsense reasoning |
| ARC-Challenge | ~54.7% | Science question answering |
| TruthfulQA | ~51% | Factual accuracy |
| WinoGrande | ~76% | Commonsense coreference |
| HumanEval | ~15-20% | Code generation (not a code-focused model) |

Note: Exact benchmark numbers vary by evaluation setup. Scores shown are approximate from the Open LLM Leaderboard and community evaluations. Vicuna's original evaluation used GPT-4-as-judge (not standardized benchmarks), reporting ~90% of ChatGPT quality in conversational tasks.

VRAM & Quantization Guide

Vicuna-33B is a large model. Quantization is essential for consumer hardware. GGUF files are available from community contributors on HuggingFace.

| Quantization | File Size | VRAM Needed | RAM (CPU) | Quality | GPU Compatibility |
|---|---|---|---|---|---|
| Q2_K | ~13GB | ~14GB | ~16GB | Noticeable degradation | RTX 4070 Ti (16GB) |
| Q3_K_M | ~16GB | ~17GB | ~20GB | Acceptable | RTX 3090/4090 (24GB) |
| Q4_K_M | ~20GB | ~21GB | ~24GB | Good (recommended) | RTX 3090/4090 (24GB) |
| Q5_K_M | ~24GB | ~25GB | ~28GB | Very good | 2x RTX 3090 or A6000 |
| Q8_0 | ~35GB | ~36GB | ~40GB | Near-lossless | A6000 (48GB) |
| FP16 | ~66GB | ~68GB | ~72GB | Full precision | A100 80GB / 2x A6000 |
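The file sizes above follow a simple rule of thumb: parameters times effective bits-per-weight, divided by 8. A back-of-envelope sketch — the bits-per-weight figures below are rough community estimates for the GGUF quant mixes, not exact values:

```python
def gguf_size_gb(params_billion, bits_per_weight):
    """Rough GGUF file size in GB: params (in billions) * bits / 8."""
    return params_billion * bits_per_weight / 8

# Approximate effective bits-per-weight per quant (assumed, not measured).
# These reproduce the ~13/20/35/66GB figures for a 33B model.
for name, bpw in [("Q2_K", 3.2), ("Q4_K_M", 4.85), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{name}: ~{gguf_size_gb(33, bpw):.0f} GB")
```

VRAM needed at runtime is slightly higher than the file size because the KV cache and activations also live on the GPU, which is why each table row adds roughly 1GB on top.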

Hardware Recommendations

System Requirements

  • Operating System: Windows 10/11, macOS 12+, Ubuntu 20.04+
  • RAM: 32GB minimum (Q4_K_M), 64GB for comfortable CPU inference
  • Storage: 20-66GB depending on quantization level
  • GPU: RTX 3090/4090 24GB for Q4_K_M; A6000 48GB for Q8_0. CPU-only is possible but slow.
  • CPU: Intel i7/AMD Ryzen 7 or better. ARM (Apple Silicon M1+) works well with llama.cpp.

Installation (llama.cpp)

Vicuna-33B is NOT on Ollama

Ollama offers Vicuna in 7B and 13B sizes only. The 33B model is not in the Ollama registry. Use llama.cpp or text-generation-webui instead.

Using llama.cpp

Terminal

# Build llama.cpp (add LLAMA_CUBLAS=1 for NVIDIA GPU support,
# or LLAMA_METAL=1 for Apple Silicon)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j

# Download a GGUF file from HuggingFace (community upload),
# e.g. TheBloke/vicuna-33B-v1.3-GGUF
wget https://huggingface.co/TheBloke/vicuna-33B-v1.3-GGUF/resolve/main/vicuna-33b-v1.3.Q4_K_M.gguf

# Run with llama.cpp
./main -m vicuna-33b-v1.3.Q4_K_M.gguf \
  -n 512 \
  --temp 0.7 \
  -p "A chat between a curious user and an assistant.\n\nUSER: What is the significance of the Turing test?\nASSISTANT:"

Note: Vicuna uses a specific chat template. The prompt opens with a system line along the lines of “A chat between a curious user and an artificial intelligence assistant.” followed by alternating “USER: [message] ASSISTANT:” turns — getting this format wrong significantly degrades output quality.
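A small helper that applies the template to a multi-turn conversation. The full system line used here follows the wording commonly reported for Vicuna v1.1+; treat it as an assumption and check the model card if outputs look off:

```python
SYSTEM = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the user's questions.")

def vicuna_prompt(turns):
    """Format [(user_msg, assistant_msg_or_None), ...] for Vicuna.

    Leaving the last assistant message as None ends the prompt with
    'ASSISTANT:', which cues the model to generate its reply.
    """
    out = SYSTEM
    for user, assistant in turns:
        out += f" USER: {user} ASSISTANT:"
        if assistant is not None:
            out += f" {assistant}</s>"
    return out

print(vicuna_prompt([("What is the significance of the Turing test?", None)]))
```

The same function covers follow-up turns: pass earlier (user, assistant) pairs followed by the new user message paired with None.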

Using Python (Transformers)

Terminal

pip install transformers torch accelerate bitsandbytes

Python:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "lmsys/vicuna-33b-v1.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,  # requires bitsandbytes
)

prompt = "A chat between a curious user and an assistant.\n\nUSER: Explain quantum entanglement simply.\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Requires ~20GB VRAM with 4-bit quantization. Full FP16 requires ~66GB VRAM (A100 or multi-GPU setup).

Vicuna-33B vs Modern Alternatives

| Model | Size | RAM Required | Speed | Quality (MMLU) | License |
|---|---|---|---|---|---|
| Vicuna-33B v1.3 | 33B | 20-66GB | Slow | 59% | Non-commercial |
| Llama 3.1 8B | 8B | 5-16GB | Fast | 68% | Meta Community |
| Mistral 7B v0.3 | 7B | 4-14GB | Fast | 63% | Apache 2.0 |
| Qwen 2.5 7B | 7B | 4-14GB | Fast | 74% | Apache 2.0 |
| Vicuna-13B v1.5 | 13B | 8-26GB | Medium | 55% | Llama 2 Community |

Quality = approximate MMLU score. Modern 7B models outperform Vicuna-33B while using 3-5x fewer resources.

Should You Use Vicuna-33B in 2026?

Reasons NOT to Use It

  • Outperformed by smaller models: Llama 3.1 8B beats it on MMLU while using 4x fewer resources
  • Non-commercial license: LLaMA 1 base restricts commercial use
  • Small context window: Only 2048 tokens vs 8K-128K in modern models
  • Not on Ollama: No easy one-command setup; requires llama.cpp knowledge
  • No safety training: No RLHF or constitutional AI alignment
  • Weak at code: ~15-20% HumanEval vs 60%+ for modern code models

Reasons You Might Still Want It

  • Research / Historical study: Understanding the evolution of open-source LLMs
  • Benchmark comparison: As a baseline when evaluating newer models
  • Uncensored output: Less content filtering than modern models (for research)
  • ShareGPT conversation style: Natural conversational feel due to training data source

Recommendation

For new projects in 2026, use Qwen 2.5 7B, Llama 3.1 8B, or Mistral 7B instead. They're faster, more capable, have permissive licenses, larger context windows, and run on Ollama with a single command. Vicuna's legacy is in the innovations it brought to the field, not in its ongoing competitiveness.

Frequently Asked Questions

Is Vicuna-33B available on Ollama?

No. Ollama offers Vicuna in 7B and 13B sizes only. Vicuna-33B is not in the Ollama library because it's based on LLaMA 1 (which has a non-commercial license) and has been superseded by newer models. To run Vicuna-33B locally, use llama.cpp with a GGUF conversion from HuggingFace.

Can I use Vicuna-33B commercially?

No. Vicuna-33B is fine-tuned from LLaMA 1 33B, which was released under Meta's original LLaMA license that restricts commercial use. The Vicuna delta weights are Apache 2.0, but you need the LLaMA 1 base weights (non-commercial) to use the model. For commercial use, consider LLaMA 3 models or Mistral, which have permissive licenses.

What made Vicuna important historically?

Vicuna was one of the first open models to approach GPT-3.5 quality in conversations. Created by LMSys (UC Berkeley, CMU, Stanford, UCSD), it pioneered two important concepts: (1) fine-tuning on real user conversations from ShareGPT, and (2) using GPT-4 as an automated judge for evaluation — the 'LLM-as-judge' approach that became standard in the field.

How much VRAM does Vicuna-33B need?

At full FP16 precision: ~66GB VRAM. With Q4_K_M quantization: ~20GB VRAM (fits on RTX 3090/4090). With Q3_K_M: ~16GB. With Q2_K: ~13GB (fits on RTX 4070 Ti). CPU-only inference is possible but very slow.

What's better than Vicuna-33B today?

In 2026, almost every modern 7B-14B model outperforms Vicuna-33B while using far fewer resources. Llama 3.1 8B (MMLU ~68%), Mistral 7B v0.3 (MMLU ~63%), and Qwen 2.5 7B (MMLU ~74%) are all better choices. These models also have permissive licenses, larger context windows, and Ollama support.

Sources

  • LMSys. “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.” Blog Post (March 2023)
  • Zheng, L., et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” arXiv:2306.05685
  • HuggingFace. “lmsys/vicuna-33b-v1.3.” Model Card

Written by Pattanaik Ramswarup

Published: 2025-10-29