Local AI Access Control: Role-Based Permissions for Self-Hosted LLMs
Published April 23, 2026 - 20 min read
The moment your self-hosted LLM has more than three users, "everyone hits the same Ollama port" stops being a tenable design. Someone in marketing pulls a 70B model and saturates your VRAM. The intern uses the same endpoint as your CFO. Logs become a single firehose with no per-user attribution. By the time someone asks "did the contractor see the executive comp data?" you have no answer. Nothing about Ollama, llama.cpp, or vLLM out of the box gives you proper authentication, role-based authorization, per-user budgets, prompt-level redaction, or audit trails. You build that stack yourself, on top of well-understood components. This guide shows you exactly how.
Quick Start: Multi-User Ollama with SSO in 30 Minutes
The minimum viable production stack for a team of 5-50 people:
# docker-compose.yml - quick-start RBAC stack for self-hosted LLMs
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: all, capabilities: [gpu]}]
    volumes:
      - ollama-data:/root/.ollama
    networks: [llm-internal]
    # IMPORTANT: do NOT expose 11434 to the LAN. Only the proxy talks to it.

  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      DATABASE_URL: postgresql://litellm:${DB_PASS}@postgres/litellm
    depends_on: [ollama, postgres]
    ports: ["4000:4000"]  # proxy port for API users
    networks: [llm-internal, public]

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: ${DB_PASS}
    volumes: [pg-data:/var/lib/postgresql/data]
    networks: [llm-internal]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      OPENAI_API_BASE_URL: http://litellm:4000/v1
      OPENAI_API_KEY: ${LITELLM_MASTER_KEY}
      WEBUI_AUTH: "true"
      ENABLE_OAUTH_SIGNUP: "true"
      OAUTH_CLIENT_ID: ${OAUTH_CLIENT_ID}
      OAUTH_CLIENT_SECRET: ${OAUTH_CLIENT_SECRET}
      OPENID_PROVIDER_URL: ${OIDC_DISCOVERY_URL}
      OAUTH_PROVIDER_NAME: "Company SSO"
    ports: ["3000:8080"]
    depends_on: [litellm]
    networks: [llm-internal, public]

volumes:
  ollama-data:
  pg-data:

networks:
  llm-internal:
    internal: true
  public:
Drop that into a host with an NVIDIA GPU, set five environment variables, run docker compose up -d, point your IdP (Okta, Authentik, Keycloak, Google Workspace) at https://ai.yourdomain.com/oauth/oidc/callback, and you have authenticated multi-user LLM access with API key issuance, per-user budgets, and audit-ready logs. The rest of this guide explains how each piece works, where to harden it, and what to add when 50 users becomes 500.
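Before inviting users, verify the proxy is actually gatekeeping. A minimal smoke test, assuming the stack runs on localhost and LITELLM_MASTER_KEY is exported (the /v1/models route is LiteLLM's OpenAI-compatible model listing):

import os
import requests

BASE = "http://localhost:4000"  # the LiteLLM proxy from the compose file above
KEY = os.environ["LITELLM_MASTER_KEY"]

# Authenticated request: should return the configured model list.
r = requests.get(f"{BASE}/v1/models", headers={"Authorization": f"Bearer {KEY}"}, timeout=10)
print(r.status_code, r.json())

# Unauthenticated request: the proxy must reject it.
r = requests.get(f"{BASE}/v1/models", timeout=10)
assert r.status_code in (401, 403), "proxy is not enforcing auth!"
print("auth enforced:", r.status_code)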
Table of Contents
- The Threat Model You Actually Face
- Architecture: Why a Proxy Is Mandatory
- Authentication: OAuth and OIDC Integration
- Authorization: LiteLLM Virtual Keys + Teams
- Per-User Rate Limits and Spend Caps
- Prompt and Response Redaction
- Audit Logging You Can Hand to a Compliance Team
- Network Hardening: Firewalls, mTLS, Egress
- Migration Path from Single-User Ollama
- Common Mistakes and How They Fail
The Threat Model You Actually Face {#threat-model}
Before you write any RBAC code, decide which threats matter. Most teams confuse "we want SSO" with "we have a security need." Be specific:
- Insider misuse. A legitimate user feeds confidential customer data into a model with a logging hook that you do not control. Mitigation: prompt redaction + audit logs.
- Lateral movement. An attacker compromises one developer's laptop and now has unconstrained access to a model that has read all your repos via RAG. Mitigation: per-user keys + revocation + network segmentation.
- Cost runaway. A buggy script ships to production and runs Llama 3 70B in an infinite loop. Mitigation: per-key budget caps + concurrency limits.
- Data exfiltration. A privileged user exports a sensitive document, asks the model to summarize, then pastes the summary into ChatGPT. Mitigation: outbound network policy + DLP on the WebUI.
- Audit gaps. Your security team asks "who used the model on Tuesday between 2 and 4 PM and what did they ask?" You have no answer. Mitigation: structured prompt/response logging with retention policy.
If you cannot describe which of those five matter most for your org, the rest of this guide is academic. Pick the threat. Build for it. Skip the rest.
For broader compliance context, see our GDPR-compliant local AI guide and the SOC 2 self-hosted AI primer.
Architecture: Why a Proxy Is Mandatory {#architecture}
Ollama listens on port 11434 and assumes a single, trusted user. There is no native concept of API keys, user identities, or rate limits. You add those by putting a proxy in front of Ollama and never exposing Ollama directly.
The recommended stack:
[Browser] -> nginx (TLS, WAF)  -> Open WebUI -> LiteLLM -> Ollama    (browser-based UI)
[App/CLI] -> nginx (TLS, WAF)  -> LiteLLM -> Ollama                  (programmatic API access)
[Admin]   -> nginx (TLS, mTLS) -> LiteLLM admin endpoints            (key issuance, budget edits)
Why LiteLLM? Three reasons. It speaks an OpenAI-compatible API on the front end, it routes to Ollama (and other backends) on the back end, and it has a first-class concept of "virtual keys" with per-key budgets, model allowlists, and rate limits. Open WebUI gives you a ChatGPT-style UI for the human users who will not write code; LiteLLM gives you the API for the systems that will. Both authenticate via your IdP.
Hard rule: Ollama's port (11434) must be on an internal Docker network only. If it is on the LAN, anyone who finds the port owns the model. There are tens of thousands of misconfigured Ollama instances exposed on Shodan as of 2026; do not become one of them.
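You can verify that segmentation from any other machine on the LAN. A quick sketch (192.0.2.10 is a placeholder for your GPU host's address): the proxy port should answer, the Ollama port should not.

import socket

HOST = "192.0.2.10"  # placeholder: your GPU host's LAN address

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

assert port_open(HOST, 4000), "LiteLLM proxy should be reachable"
assert not port_open(HOST, 11434), "Ollama must NOT be reachable from the LAN"
print("segmentation looks correct")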
Authentication: OAuth and OIDC Integration {#authentication}
Two authentication scenarios need different setups.
Browser users via Open WebUI + OIDC
Open WebUI's OAuth/OIDC support is solid as of 2026. Configure your IdP to allow Open WebUI as a relying party:
# Required environment variables for Open WebUI
WEBUI_AUTH=true
ENABLE_OAUTH_SIGNUP=true
OAUTH_CLIENT_ID=open-webui
OAUTH_CLIENT_SECRET=<from your IdP>
OPENID_PROVIDER_URL=https://idp.yourdomain.com/.well-known/openid-configuration
OAUTH_REDIRECT_URI=https://ai.yourdomain.com/oauth/oidc/callback
# Restrict signups to a domain
OAUTH_SIGNUP_EMAIL_DOMAIN=yourdomain.com
# Map IdP groups to Open WebUI roles
ENABLE_OAUTH_ROLE_MANAGEMENT=true
OAUTH_ROLES_CLAIM=groups
OAUTH_ALLOWED_ROLES=ai-users,ai-admins
OAUTH_ADMIN_ROLES=ai-admins
For Authentik (an excellent self-hosted IdP for small teams), Keycloak, Okta, or Google Workspace, these variables map cleanly onto the provider's standard OIDC client settings. The critical piece is OAUTH_ROLES_CLAIM=groups: it tells Open WebUI to read the group list from the OIDC ID token and map those groups to admin and user roles inside the WebUI.
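If role mapping does not take effect, the usual culprit is an ID token that carries no groups claim. A quick way to check, assuming PyJWT is installed and you have captured an ID token from your IdP (signature verification is deliberately skipped here because we only want to inspect claims, never to make an auth decision):

import jwt  # pip install PyJWT

id_token = "eyJhbGciOi..."  # paste a captured ID token here

# Decode WITHOUT verifying the signature -- inspection only.
claims = jwt.decode(id_token, options={"verify_signature": False})
print(claims.get("groups"))  # expect e.g. ["ai-users", "ai-admins"]
print(claims.get("email"))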
Programmatic users via LiteLLM virtual keys
For scripts, jobs, and applications, browser OAuth is the wrong tool. Issue per-application virtual keys via LiteLLM:
# As an admin, mint a key for the analytics service
# As an admin, mint a key for the analytics service
curl -X POST https://ai.yourdomain.com/litellm/key/generate \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -d '{
    "models": ["llama3.1-8b", "qwen2.5-14b"],
    "max_budget": 50,
    "budget_duration": "30d",
    "rpm_limit": 60,
    "tpm_limit": 80000,
    "metadata": {"team": "analytics", "environment": "production"}
  }'
# Response:
# {"key": "sk-litellm-abc123...", "expires": "2026-05-23T00:00:00Z"}
Each key is independently revocable, capped, and tagged. Application code uses it like any OpenAI key:
from openai import OpenAI

client = OpenAI(
    api_key="sk-litellm-abc123...",
    base_url="https://ai.yourdomain.com/v1"
)
resp = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Summarize Q1 metrics."}]
)
If the analytics team's contractor leaves, you revoke that single key and everything else keeps working. No password rotations, no shared secrets.
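Revocation is a single admin call. A sketch against LiteLLM's key-management API (the /key/delete route takes a list of keys and requires the master key):

import os
import requests

resp = requests.post(
    "https://ai.yourdomain.com/litellm/key/delete",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={"keys": ["sk-litellm-abc123..."]},  # the contractor's key, nothing else
    timeout=10,
)
resp.raise_for_status()
# Requests with that key now fail with 401; every other key keeps working.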
Authorization: LiteLLM Virtual Keys + Teams {#authorization}
Authentication answers "who is this?" Authorization answers "what can they do?" LiteLLM's "team" abstraction is the cleanest way to encode this for LLM workloads.
# Create teams that match your org
curl -X POST https://ai.yourdomain.com/litellm/team/new \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -d '{
    "team_alias": "engineering",
    "max_budget": 500,
    "budget_duration": "30d",
    "models": ["llama3.1-8b", "qwen2.5-coder-14b", "deepseek-coder-v2-lite"]
  }'

curl -X POST https://ai.yourdomain.com/litellm/team/new \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -d '{
    "team_alias": "marketing",
    "max_budget": 100,
    "budget_duration": "30d",
    "models": ["llama3.1-8b", "qwen2.5-7b"]
  }'

# Issue a key bound to a team (inherits team's model allowlist + budget)
curl -X POST https://ai.yourdomain.com/litellm/key/generate \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -d '{"team_id": "team_engineering", "user_id": "alice@yourdomain.com"}'
A few rules I have learned from running this at multiple companies:
- Map teams to org units, not projects. Project-scoped teams turn into a permission management mess. Org-unit teams (engineering, marketing, finance) align with how access already flows.
- Use metadata to tag everything. Every key gets at least {environment: prod|dev, owner: email, justification: ticket-number}. When you do a quarterly access review, those tags save your team hours. A helper that enforces this convention is sketched after this list.
- Cap models, not just budgets. Marketing should not be able to ask for Llama 3 70B in the first place. Restrict via the models allowlist on the team, not just by budget.
- Sync keys to your secrets manager. Long-lived API keys belong in Vault, AWS Secrets Manager, or 1Password Secrets Automation, not in CI environment variables forever. Rotate every 90 days at most.
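A minting helper that refuses to create untagged keys keeps the tagging rule from depending on memory. This is a sketch under this article's conventions (the required-tag set is ours, not a LiteLLM requirement):

import os
import requests

REQUIRED_TAGS = {"environment", "owner", "justification"}

def mint_key(team_id: str, user_id: str, metadata: dict) -> str:
    """Mint a LiteLLM virtual key, refusing to create untagged keys."""
    missing = REQUIRED_TAGS - metadata.keys()
    if missing:
        raise ValueError(f"refusing to mint key; missing tags: {missing}")
    resp = requests.post(
        "https://ai.yourdomain.com/litellm/key/generate",
        headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
        json={"team_id": team_id, "user_id": user_id, "metadata": metadata},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["key"]

key = mint_key(
    "team_engineering",
    "alice@yourdomain.com",
    {"environment": "prod", "owner": "alice@yourdomain.com", "justification": "ENG-1234"},
)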
For the broader pattern of multi-tenant LLM infrastructure, see our Ollama API rate limiting guide.
Per-User Rate Limits and Spend Caps {#rate-limits}
Three knobs you need on every key:
# RPM (requests per minute), TPM (tokens per minute), and budget
curl -X POST https://ai.yourdomain.com/litellm/key/generate \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -d '{
    "user_id": "intern-summer-2026",
    "models": ["llama3.2-3b", "qwen2.5-7b"],
    "rpm_limit": 30,
    "tpm_limit": 20000,
    "max_budget": 5,
    "budget_duration": "7d",
    "max_parallel_requests": 2
  }'
The interns get tiny budgets and the smaller models. Senior engineers get 10x those numbers. Service accounts get even more. The shape of the limits matters as much as the size: max_parallel_requests=2 prevents one user from monopolizing GPU concurrency, even within their daily token budget.
When a key hits its limit, LiteLLM returns HTTP 429 with a structured error body. Your application code should respect that and back off; if it does not, you have an application bug, not an infrastructure problem.
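Here is what "back off" looks like in application code, using the same OpenAI client as above. The retry count and sleep schedule are illustrative defaults, not LiteLLM requirements:

import time

from openai import OpenAI, RateLimitError

client = OpenAI(api_key="sk-litellm-abc123...", base_url="https://ai.yourdomain.com/v1")

def chat_with_backoff(messages, model="llama3.1-8b", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            # Key hit its rpm/tpm cap: wait 1s, 2s, 4s, ... then retry.
            time.sleep(2 ** attempt)
    raise RuntimeError("still rate-limited after retries; check the key's limits")

resp = chat_with_backoff([{"role": "user", "content": "Summarize Q1 metrics."}])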
Prompt and Response Redaction {#redaction}
This is where most "compliance-friendly" local AI stacks fall over. Authenticating users is easy. Stopping a user from accidentally pasting customer SSNs into the model is harder. The pattern: a redaction middleware between LiteLLM and Ollama that scans both prompts and responses for sensitive data.
LiteLLM supports custom guardrails via a hook. Wire in Microsoft Presidio for PII detection:
# litellm_config.yaml
guardrails:
  - guardrail_name: "presidio-redact-input"
    litellm_params:
      guardrail: "presidio"
      mode: "pre_call"
      anonymize: true
      analyze_args:
        language: "en"
        entities:
          - "EMAIL_ADDRESS"
          - "US_SSN"
          - "CREDIT_CARD"
          - "US_BANK_NUMBER"
          - "PERSON"
          - "PHONE_NUMBER"
  - guardrail_name: "presidio-redact-output"
    litellm_params:
      guardrail: "presidio"
      mode: "post_call"
      anonymize: true
Now an inbound prompt of Process this for John Smith, SSN 123-45-6789 becomes Process this for <PERSON>, SSN <US_SSN> before it reaches Ollama, and the response gets the same scan on the way back. Redaction is recoverable for legitimate workflows (Presidio supports a reversible mode keyed by a per-tenant key) and irreversible for everyone else.
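To test entity coverage before deploying the guardrail, you can run Presidio directly. A sketch using the presidio-analyzer and presidio-anonymizer packages (both need a spaCy NER model installed):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Process this for John Smith, SSN 123-45-6789"

# Detect PII spans in the text...
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, entities=["PERSON", "US_SSN"], language="en")

# ...then replace each detected span with a typed placeholder.
anonymized = AnonymizerEngine().anonymize(text=text, analyzer_results=results)
print(anonymized.text)  # Process this for <PERSON>, SSN <US_SSN>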
For workflows that need full data fidelity (legal review, medical drafting), pair this with a separate "high-trust" team whose keys bypass redaction and whose audit logs are scrutinized weekly.
Audit Logging You Can Hand to a Compliance Team {#audit}
The minimum useful audit record per request:
| Field | Why |
|---|---|
| timestamp (UTC, RFC 3339) | when |
| user_id | who |
| key_id | which key |
| team_id | which org unit |
| model | what model |
| prompt_hash (SHA-256) | content fingerprint without storing PII |
| prompt_redacted | redacted prompt for security review |
| response_hash | response fingerprint |
| tokens_in / tokens_out | cost attribution |
| latency_ms | performance trends |
| client_ip | network attribution |
| status | success / blocked / errored |
| reason | which guardrail or limit triggered |
LiteLLM ships a logging callback system. Wire it to your SIEM:
# litellm_config.yaml
litellm_settings:
  success_callback: ["s3", "datadog", "custom_callback"]
  failure_callback: ["s3", "datadog"]
  s3_callback_params:
    s3_bucket_name: "yourdomain-llm-audit"
    s3_region_name: "us-east-1"
    s3_aws_access_key_id: os.environ/S3_KEY
    s3_aws_secret_access_key: os.environ/S3_SECRET
Audit logs go to S3 (or any object store) with versioning and retention (Object Lock) enabled, which makes them effectively immutable. They are also the only thing your auditor will care about during a SOC 2 review, so structure them well.
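The custom_callback slot above is where the audit record from the table gets built. Here is a sketch of a LiteLLM custom logger that computes the hash fields; the exact kwargs layout varies across LiteLLM versions, so treat the field access as illustrative and check the custom-logger docs for your release:

import hashlib
import json
from datetime import timezone

from litellm.integrations.custom_logger import CustomLogger

class AuditLogger(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Hash the full message list so the log carries a fingerprint, not PII.
        prompt = json.dumps(kwargs.get("messages", []), sort_keys=True)
        record = {
            "timestamp": start_time.astimezone(timezone.utc).isoformat(),
            "model": kwargs.get("model"),
            "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
            "latency_ms": int((end_time - start_time).total_seconds() * 1000),
        }
        print(json.dumps(record))  # replace with a write to your SIEM / object store

# Register the instance in litellm_config.yaml (module path syntax varies by version).
audit_logger = AuditLogger()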
For a deeper audit-trail pattern see our audit trail for local AI guide.
Network Hardening: Firewalls, mTLS, Egress {#network}
Three network controls that pay back fast:
1. Inbound: TLS + WAF. Put nginx or Caddy in front, terminate TLS, use a real cert (Let's Encrypt is fine), and add basic WAF rules. ModSecurity has decent OWASP Core Rule Set bundles.
2. Internal: only the proxy talks to Ollama. Docker user-defined networks make this trivial: networks: {llm-internal: {internal: true}} blocks the network from the outside world. Even if the host is compromised, an attacker on a sibling container cannot reach Ollama directly.
3. Outbound: egress lockdown. A self-hosted LLM should rarely make outbound network calls. Ollama only talks out to pull models. LiteLLM only talks out for callbacks (S3, Datadog). Block everything else with iptables or Cilium NetworkPolicies. This protects you against subtle data-exfiltration through misconfigured custom guardrails or a poisoned dependency.
# Example: deny outbound by default, allow the model registry + your SIEM
# NOTE: iptables resolves hostnames once, at rule-insert time; pin IPs or use
# an egress proxy if these endpoints rotate addresses.
iptables -A OUTPUT -d ollama.com -j ACCEPT
iptables -A OUTPUT -d registry.ollama.ai -j ACCEPT
iptables -A OUTPUT -d s3.us-east-1.amazonaws.com -j ACCEPT
iptables -A OUTPUT -d intake.logs.datadoghq.com -j ACCEPT
iptables -A OUTPUT -p tcp --dport 443 -j REJECT
iptables -A OUTPUT -p tcp --dport 80 -j REJECT
For higher-trust deployments add mTLS between LiteLLM and Ollama using nginx as a TLS terminator on the Ollama side, and require client certs. LiteLLM has native support for client TLS certs in its router config.
Migration Path from Single-User Ollama {#migration}
If you already have an Ollama box that the whole team SSHs into, here is the sane migration path:
- Day 0: stand up the proxy stack in parallel. New users get the proxy URL; existing users keep their direct access for a week.
- Day 7: revoke direct access to port 11434 from anywhere except the proxy. Update SSH and firewall rules.
- Day 14: switch all internal apps to use LiteLLM virtual keys. Audit code for hardcoded localhost:11434 URLs.
- Day 30: turn on prompt redaction for general-purpose teams; high-trust teams keep raw access.
- Day 60: enable mTLS, rotate all keys to the secrets manager, run your first quarterly access review.
Communicate the change as "your existing models still work; you now log in once with your company SSO and we have an audit trail." Engineers who hate auth requirements love that they can self-serve API keys via a portal instead of asking IT.
Common Mistakes and How They Fail {#mistakes}
The list of things I have seen go wrong:
1. Exposing Ollama on the LAN "for convenience." Within a week someone outside the team finds the open port. Mitigation: never expose 11434 to anything but the proxy.
2. Single shared API key for an entire team. Defeats the purpose of RBAC. When the contractor leaves you cannot revoke just their access without breaking the whole team. Mitigation: per-user keys, always.
3. No model allowlist on cheap-tier teams. A junior engineer kicks off a 70B model run, the GPU pegs at 100% for 40 minutes, real production traffic times out. Mitigation: models allowlist on every team.
4. Storing raw prompts in audit logs. First incident, your audit logs themselves become a PII liability. Mitigation: SHA-256 hash + redacted prompt in audit logs, raw prompt only in a separate, encrypted, short-retention store.
5. No rate limit on a free-tier team. Bot or runaway script DDoSes your own infrastructure. Mitigation: rpm_limit and max_parallel_requests defaults set on team creation.
6. Forgetting to renew the IdP signing certificate. SSO breaks at 2 AM and nobody can log in. Mitigation: cert expiry alerts, document the rotation runbook, test it annually.
7. Skipping prompt redaction because "we trust our team." Trust is not the issue; accidental paste of customer data is. Mitigation: turn on Presidio for general teams; reserve unredacted access for explicitly justified high-trust roles.
For deeper hardening see our securing Ollama guide.
External authoritative reference: OWASP LLM Top 10 covers many of these threats with formal mitigations.
Frequently Asked Questions
Q: Does Ollama have built-in authentication?
A: No. Ollama listens on port 11434 and assumes a single trusted user. There is no native concept of API keys, user identities, or per-user rate limits. You add those by putting a proxy like LiteLLM or a custom FastAPI service in front of Ollama and never exposing the Ollama port to anything but the proxy.
Q: Can I use SSO with Open WebUI and Ollama?
A: Yes, via Open WebUI's OAuth/OIDC integration. Set WEBUI_AUTH=true, ENABLE_OAUTH_SIGNUP=true, and provide your IdP's discovery URL plus a client ID/secret. Open WebUI supports any OIDC-compliant identity provider including Okta, Authentik, Keycloak, Google Workspace, and Azure AD/Entra.
Q: How do I issue per-user API keys for a self-hosted LLM?
A: Use LiteLLM's virtual key system. The proxy mints OpenAI-compatible keys (sk-litellm-...) bound to a user, a team, a model allowlist, and budget/rate-limit caps. Your applications use the keys exactly like OpenAI keys. Revoke a single key without affecting others.
Q: What is the best way to redact sensitive prompts before they reach Ollama?
A: Microsoft Presidio integrated as a LiteLLM guardrail in pre_call mode. Presidio detects PII (SSN, credit card, email, phone, person name) and replaces matches with structured placeholders. The model sees the redacted prompt; legitimate workflows can recover originals via the reversible mode keyed by a per-tenant secret.
Q: How do I prevent one user from saturating the GPU?
A: Set max_parallel_requests on the LiteLLM key, plus tpm_limit and rpm_limit. Concurrency caps matter as much as throughput caps; without them a single user with normal token budget can still queue 50 simultaneous requests and starve everyone else.
Q: What does an auditor want to see for a self-hosted LLM?
A: An immutable audit log per request that records who, when, which key, which model, prompt hash, redacted prompt, response hash, tokens in/out, and outcome. Write to S3 or another object store with versioning and retention enabled. Pair with quarterly access reviews and documented key rotation procedures.
Q: Can I run this stack without Docker?
A: Yes, but Docker Compose makes the network segmentation trivial. The same architecture works with systemd-managed services, Kubernetes pods, or Nomad jobs. The hard requirement is that Ollama is unreachable except via the proxy.
Q: How does this compare to commercial AI gateways like AWS Bedrock or Azure OpenAI?
A: Functionally similar (auth, RBAC, budgets, audit logs), but you keep all data on your own infrastructure and pay nothing per token. The tradeoff is operational: you maintain the stack yourself. For teams under ~50 users with a single GPU server, the LiteLLM + Ollama path is dramatically cheaper. For 1000+ users with strict SLA requirements, commercial gateways may make more sense.
Conclusion
Access control on a self-hosted LLM is not optional once you have more than a handful of users. The good news is that the stack is now mature: Ollama for inference, LiteLLM for the proxy, Open WebUI for the browser experience, your existing IdP for identity, Presidio for redaction, S3 for audit logs. None of those parts are new or experimental in 2026; they are well-trodden, well-documented, and they fit together cleanly.
The pattern to remember: never expose your inference engine directly. Authenticate at a proxy. Authorize via virtual keys with model allowlists, budgets, and rate limits. Redact sensitive prompts. Log every request immutably. Lock down outbound network access. Review keys quarterly.
If your security team has been asking when the local AI deployment will be "audit-ready," the answer is now. The pieces are there. The work is plumbing, not research.
Building local AI for a regulated team? Subscribe to the LocalAIMaster newsletter for compliance-focused deep dives every week.