
Free Tool · Mac M1/M2/M3/M4 · Instant

Apple Silicon AI Calculator

Pick your Mac. See exactly which AI models fit in your unified memory, the expected tokens-per-second throughput, and whether MLX, Ollama, or llama.cpp is the right runtime for your hardware. Covers M1 through M4, all RAM tiers, all standard quantization levels.

📅 Published: May 9, 2026 · 🔄 Last Updated: May 9, 2026 · ✓ Manually Reviewed

Example output for a 96 GB Mac:

Usable VRAM: 72 GB (~75% of 96 GB unified · macOS reserves the rest)

Memory bandwidth: 400 GB/s (~320 GB/s effective for inference)

Recommended: Qwen 2.5 32B (~14 tok/s at Q4_K_M)

| Model | Size at Q4_K_M | Fits? | Est. tok/s | Notes |
|---|---|---|---|---|
| Llama 3.2 3B | 1.9 GB | ✓ | ~143 | Edge / mobile / on-device |
| Phi-4 Mini 3.8B | 2.4 GB | ✓ | ~113 | Reasoning at edge |
| Gemma 3 4B | 2.6 GB | ✓ | ~105 | Google small model |
| Mistral 7B | 4.2 GB | ✓ | ~65 | Battle-tested 7B baseline |
| Llama 3.1 8B | 4.7 GB | ✓ | ~58 | Most-deployed open 8B |
| Qwen 2.5 14B | 8.2 GB | ✓ | ~33 | Strong general 14B |
| Phi-4 14B | 8.2 GB | ✓ | ~33 | Best small reasoning model |
| Mistral Small 22B | 13.0 GB | ✓ | ~21 | Mistral mid-tier |
| Gemma 3 27B | 16.0 GB | ✓ | ~17 | Strong general-purpose 27B |
| Qwen3.6-27B | 16.0 GB | ✓ | ~17 | Dense 27B beating older 397B |
| Qwen 2.5 32B | 19.0 GB | ✓ | ~14 | Solid mid-size dense |
| Llama 3.3 70B | 42.0 GB | ✓ | ~6 | Most-deployed open 70B |
| Llama 3.1 70B | 42.0 GB | ✓ | ~6 | Long-context 70B baseline |
| Qwen 2.5 72B | 43.0 GB | ✓ | ~6 | Top open dense 72B |
| Mistral Large 2 | 73.0 GB | ✗ | — | 123B dense multilingual |
| DeepSeek V3 (671B MoE) | 380.0 GB | ✗ | — | 671B MoE / 37B active |
| DeepSeek V4-Pro (1.6T MoE) | 900.0 GB | ✗ | — | Current open frontier — needs M3 Ultra 512GB at ~Q4 only |
| Kimi K2.6 (1T MoE) | 575.0 GB | ✗ | — | 1T MoE / 32B active |
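
The "Fits?" column reduces to a couple of lines of arithmetic. Here's a minimal sketch of that check in Python, assuming the same ~25% macOS reserve shown above; the 2 GB KV-cache allowance is an illustrative assumption rather than a measured constant:

```python
def usable_vram_gb(unified_memory_gb: float, os_reserve: float = 0.25) -> float:
    """Memory realistically available to the GPU: macOS holds back ~25%."""
    return unified_memory_gb * (1 - os_reserve)

def fits(model_size_gb: float, unified_memory_gb: float,
         kv_allowance_gb: float = 2.0) -> bool:
    """True if quantized weights plus a KV-cache allowance fit in usable VRAM.

    kv_allowance_gb is a placeholder; real KV cache size depends on the
    model architecture and context length.
    """
    return model_size_gb + kv_allowance_gb <= usable_vram_gb(unified_memory_gb)

print(usable_vram_gb(96))   # 72.0, matching the figure above
print(fits(19.0, 96))       # True  (Qwen 2.5 32B at Q4_K_M)
print(fits(73.0, 96))       # False (Mistral Large 2)
```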

MLX vs Ollama vs llama.cpp on this Mac

For 96GB+ Macs running 70B+ models, MLX-LM often gives the best throughput — 10-30% faster than llama.cpp Metal on M-series. Ollama remains the easiest path. For 256GB+ M3 Ultra running DeepSeek V3/V4: llama.cpp Metal at Q4_K_M, ~10-15 tok/s.
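
If you want to try the MLX route, the mlx-lm package exposes a small Python API. Here's a minimal sketch; the checkpoint name is just an example, and any MLX-converted model from the mlx-community Hugging Face org should load the same way:

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Example 4-bit MLX checkpoint; substitute any mlx-community model.
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one paragraph.",
    max_tokens=256,
)
print(text)
```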

Estimates use llama.cpp Metal benchmarks calibrated against published numbers from Hugging Face and Reddit r/LocalLLaMA threads. Actual throughput varies ±20% by exact model architecture and macOS version. Larger models may exceed estimates if they fit comfortably; long context reduces throughput proportionally.


Why Apple Silicon is interesting for local AI

The unified memory architecture is the key. On a Mac Studio M3 Ultra 192GB, the GPU can directly access up to 144GB for model weights — more than any single NVIDIA consumer GPU and competitive with H100 80GB on capacity (though slower per token). For inference workloads where you want to run a 70B model on a quiet desktop with no separate GPU box, no PSU upgrade, and no driver fiddling, Apple Silicon is the cheapest path that exists.

The trade-offs: per-token compute throughput is lower than NVIDIA discrete GPUs (M3 Ultra ~800 GB/s bandwidth vs H100 ~3.35 TB/s), so a 7B model on a $1,600 RTX 4090 will out-throughput a 7B model on a $4,000 Mac Studio. The win zone is 32B-200B models, where the Mac's memory capacity matters more than its per-token speed.

Frequently asked questions

Is Apple Silicon actually good for running LLMs?
Surprisingly yes — for inference, especially at 32B+ model sizes. The unified memory architecture means a Mac Studio with 192GB can serve a 70B model that no single discrete GPU can. The trade-off: per-token compute is lower than NVIDIA discrete GPUs, so small-model throughput (7B-13B) is below what an RTX 4090 delivers. The win zone for Apple Silicon: 32B-200B models that need lots of memory but where you only need 10-30 tok/s for personal/dev use.
M3 Max 128GB vs M3 Ultra 192GB — which is better for AI?
M3 Ultra wins on throughput (800 GB/s memory bandwidth vs 400 GB/s) and capacity (192GB vs 128GB), but costs 2x more. If you only run models that fit in 96GB (Llama 3.1 70B at Q4 + 32K context), M3 Max is the better value. If you want to run DeepSeek V3 / Qwen 72B BF16 / multiple models concurrently / large MoE configs, the M3 Ultra 192GB or 256GB is the right call. M3 Ultra 512GB unlocks DeepSeek V4 territory but costs $10K+ and is overkill for most users.
How do these tokens-per-second estimates compare to real benchmarks?
The estimates use a simple bandwidth-bound model: throughput ≈ 0.85 × effective_bandwidth / model_size_GB. Calibrated against published llama.cpp Metal benchmarks on Reddit r/LocalLLaMA, Hugging Face leaderboards, and our own Mac runs. Accuracy is ±20% — actual numbers depend on exact model architecture, macOS version, prompt length, and KV cache size. Long context (32K+) drops throughput proportionally because KV cache reads dominate.
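That bandwidth-bound model is short enough to show directly. Here it is as a Python sketch, with a 0.80 efficiency factor standing in for the peak-to-effective bandwidth drop (e.g. 400 → ~320 GB/s):

```python
def est_tokens_per_sec(model_size_gb: float, peak_bandwidth_gbps: float,
                       efficiency: float = 0.80) -> float:
    """Bandwidth-bound decode estimate: each generated token streams the
    full set of model weights through memory once."""
    effective_bandwidth = peak_bandwidth_gbps * efficiency  # e.g. 400 -> 320 GB/s
    return 0.85 * effective_bandwidth / model_size_gb

# 96 GB Mac (400 GB/s peak), Qwen 2.5 32B at Q4_K_M (19 GB):
print(round(est_tokens_per_sec(19.0, 400)))  # 14, matching the table above
```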
MLX vs Ollama vs llama.cpp on Mac — which should I use?
Ollama for ease (one command install, model registry, OpenAI-compatible API). MLX for fine-tuning and PyTorch-like APIs (Apple's native framework, often 10-30% faster on 70B+ models). llama.cpp directly when you want bleeding-edge quantization formats or need maximum control. For 95% of users on 32GB+ Macs: install Ollama and stop thinking about it. For 96GB+ M3 Max/Ultra running 70B+ models: try MLX-LM if you want maximum throughput.
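Ollama's API is OpenAI-compatible, so the stock openai Python client works against it unchanged. A minimal sketch, assuming the Ollama server is running and you've already pulled the model (e.g. with `ollama pull llama3.1`):

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default local endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.1",  # any model name you've pulled locally
    messages=[{"role": "user", "content": "What fits in 72 GB of usable VRAM?"}],
)
print(resp.choices[0].message.content)
```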
How much memory does macOS reserve from unified memory?
macOS reserves roughly 25% of unified memory for the OS, other apps, KV-cache overhead, and headroom. On a 64GB Mac, expect ~48GB usable for model weights + KV cache. On a 192GB Mac, ~144GB usable. You can push past this with `sudo sysctl iogpu.wired_limit_mb=<MB>` to raise the GPU wired-memory cap, but this risks system instability if the OS hits memory pressure mid-inference.
Can I fine-tune models on Apple Silicon?
Yes for small-to-medium models. MLX-LM supports LoRA / QLoRA fine-tuning natively. On M3 Max 128GB, you can QLoRA-tune up to a 70B model. Throughput is slower than discrete GPUs (1-3 tok/s training throughput vs 30-50 on H100), so plan for hours-to-days runs rather than minutes. For serious fine-tuning workloads at 70B+, cloud H100s are still more cost-effective. For experimentation, prototyping, and 7B-32B fine-tuning: Apple Silicon works well.
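As a rough sanity check on that 70B claim, here is a back-of-envelope QLoRA memory budget in Python. The adapter and activation figures are coarse illustrative assumptions, not measurements:

```python
def qlora_memory_gb(params_billion: float, weight_bits: int = 4,
                    adapter_gb: float = 1.0, activations_gb: float = 8.0) -> float:
    """Very rough QLoRA budget: frozen quantized base weights, plus small
    trainable adapters (and their optimizer state), plus activations.

    adapter_gb and activations_gb are illustrative guesses; real usage
    depends on LoRA rank, batch size, and sequence length.
    """
    base_weights_gb = params_billion * weight_bits / 8  # 70B at 4-bit ~= 35 GB
    return base_weights_gb + adapter_gb + activations_gb

print(qlora_memory_gb(70))  # ~44 GB; well inside ~96 GB usable on a 128 GB Mac
```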
Why does my Mac get slower over long inference sessions?
Three causes. (1) Thermal throttling — sustained inference heats the chip, after 5-15 minutes the SoC down-clocks. M3 Pro / M4 Pro throttle harder than M3 Max / M3 Ultra (better cooling). (2) KV cache growth — long context fills more memory, KV reads dominate decode, throughput drops. (3) Memory pressure — if other apps allocate memory, inference can spill to swap, killing throughput. Mitigations: close other apps, use external cooling for long runs, cap context to what you actually need.
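To put numbers on cause (2), here's a quick KV-cache size calculation. It uses a Llama 3.1 8B-style configuration (32 layers, 8 KV heads under GQA, head dim 128, fp16 cache) as the worked example:

```python
def kv_cache_gb(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size: a key and a value vector per layer, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return context_tokens * per_token_bytes / 1024**3

# ~128 KB per token, so a 32K-token context pins ~4 GB of cache that
# must be re-read on every decoded token.
print(round(kv_cache_gb(32_768), 1))  # 4.0
```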
Will the M5 / M5 Max / M5 Ultra change this picture?
Expected late 2026. Rumors suggest M5 Max/Ultra will push memory bandwidth to 1.0+ TB/s and add native FP8 tensor cores via the Neural Engine. If accurate, that would close most of the per-token throughput gap with H100/H200 for inference, while preserving the 192GB+ unified memory advantage. We'll update this calculator within a week of any M5 launch.

From "what fits" to "how to ship it"

You picked your Mac. Now learn how to actually use it.

The Local AI Deployment course covers MLX, Ollama, Metal, memory management, and fine-tuning on Apple Silicon — including the M3 Ultra / M4 Max workflows that this calculator hints at. First chapter free, no card required.


Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum · ✓ Hands-On Projects · ✓ Open Source Contributor
