SOC 2 for Self-Hosted AI: What Auditors Actually Want to See
Published April 23, 2026 - 23 min read
I have sat through eight SOC 2 Type II audits over the last six years, three of them for organizations that ran self-hosted LLMs. The first time an auditor asked, "Where is your evidence that the inference server enforces least-privilege access?", we had no answer. The second time we did. The difference is what this guide is about.
The AICPA's SOC 2 framework never mentions "AI" or "LLM" anywhere. That is not a loophole - it means every Trust Service Criterion you are already required to meet now applies to the inference layer too. If your customers ask for a SOC 2 report, your self-hosted Ollama (or vLLM, TGI, llama.cpp) deployment must be in scope.
This is the audit-readiness checklist I wish someone had handed me in 2023. It maps Trust Service Criteria to concrete self-hosted AI controls, shows the evidence auditors actually accept, and flags the questions that come up in every Type II.
Quick Start: The Auditor's First Five Questions
Within 20 minutes of opening the AI portion of an audit, every auditor asks:
- "Show me the inventory of every model you run and where the weights live."
- "Show me access logs for the inference endpoint over the last 90 days."
- "Walk me through how a user is provisioned and deprovisioned for AI access."
- "Show me the change management record for the most recent model version bump."
- "Show me the incident response runbook for a confirmed prompt injection."
If you can answer all five with a click, the rest of the audit is mechanical. If any one of them takes more than a few minutes to find, expect the engagement to drag for weeks.
Table of Contents
- SOC 2 Trust Service Criteria Mapped to AI
- Scoping the AI Boundary
- Security (CC1-CC9) for Inference Servers
- Availability for Self-Hosted Models
- Confidentiality and Prompt Data
- Processing Integrity for AI Outputs
- Privacy When Models Touch User Data
- Evidence Templates
- The 12-Week Readiness Plan
- Pitfalls That Fail Audits
SOC 2 Trust Service Criteria Mapped to AI {#tsc-mapping}
SOC 2 has five Trust Service Criteria. Every one applies to self-hosted AI:
| TSC | What It Means for Self-Hosted AI |
|---|---|
| Security (CC1-CC9, common) | Authenticated, encrypted, monitored access to the inference endpoint |
| Availability (A1) | SLAs on AI service uptime, capacity planning, DR plans |
| Confidentiality (C1) | Prompts and completions handled like other classified data |
| Processing Integrity (PI1) | Outputs are complete, valid, accurate, timely, authorized |
| Privacy (P1-P8) | If models process PII, full GDPR/CCPA-aligned controls |
Most companies start with Security alone (the Common Criteria). If you sell to enterprise, customers often request Confidentiality too. Healthcare and financial services typically demand all five.
For the GDPR-specific overlay, see our GDPR-compliant local AI guide. For HIPAA, see HIPAA-compliant local AI.
Scoping the AI Boundary {#scoping}
Auditors will not accept "the model is local so it's out of scope." If the model touches customer data or affects a customer-facing decision, it's in scope. Documenting scope is your first deliverable.
Sample scope statement (from a real readiness package):
In scope:
- Ollama 0.6.4 inference cluster (3 nodes, us-east-1 colo)
- nginx reverse proxy with mTLS and audit logging
- Model registry (S3 bucket with object lock)
- Embedding pipeline (workers + Postgres pgvector)
- Customer-facing AI features in app.example.com (5 endpoints)
Out of scope:
- Internal experimentation cluster (no customer data)
- Marketing demos (sandbox only, fake data)
- HuggingFace Hub (only used for one-time model downloads)
Pin this. The boundary is what every other control will be evaluated against.
Security (CC1-CC9) for Inference Servers {#security}
Security is the largest section and where most AI audits go sideways. Here is what each control number actually requires for a self-hosted AI box:
CC6.1 - Logical and Physical Access
Auditor asks: "How do you control who can access the inference server?"
Required evidence:
- IAM policy file showing who can SSH, who can call the API, separated by role
- mTLS certificates with documented issuance and revocation procedures
- Quarterly access reviews with sign-off (screenshot of completed reviews)
- Termination checklist that includes revoking AI bearer tokens within 24h
Sample script for the access review evidence:
#!/bin/bash
# /opt/ops/quarterly_access_review.sh
DATE=$(date +%Y-%m-%d)
{
  echo "# Q$(date +%q) $(date +%Y) Inference Access Review"
  echo
  echo "## Active mTLS clients"
  # openssl x509 reads a single input file, so iterate over each client cert
  for crt in /etc/nginx/clients/*.crt; do
    openssl x509 -in "$crt" -noout -subject -enddate
  done
  echo
  echo "## Active bearer tokens (hashed)"
  jq '.tokens[] | {id, owner, last_used, expires}' /etc/ollama-proxy/tokens.json
} > "/var/audit/reviews/$DATE-access-review.md"
CC6.6 - Encryption in Transit
TLS 1.2+ on every API call. Auditors run testssl.sh against your endpoint. If they get a B grade, that's a finding.
docker run --rm drwetter/testssl.sh https://ai.example.com
Aim for A+. The full nginx hardening config is covered in Securing Ollama: Auth, TLS, Network Isolation.
CC6.7 - Encryption at Rest
Model weights, prompt logs, and embedding databases must all be encrypted at rest. Acceptable evidence:
- LUKS or dm-crypt for the model storage volume (cryptsetup status output)
- S3 SSE-KMS for any cloud-stored model artifacts
- Postgres TDE or full-disk encryption for pgvector
- Documented key rotation procedure (typically 90 days for KMS keys)
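To make the rotation procedure auditable rather than aspirational, some teams generate evidence from a key inventory. Here is a minimal sketch of that check, assuming a simple JSON-style inventory with ISO dates; the inventory format, key IDs, and 90-day window are illustrative, not a real KMS API.

```python
# Illustrative sketch: flag encryption keys past their rotation window.
# The inventory structure and key names are assumptions for this example.
from datetime import date

def keys_due_for_rotation(inventory, today, max_age_days=90):
    """Return key IDs whose last rotation is older than max_age_days."""
    overdue = []
    for key in inventory:
        last = date.fromisoformat(key["last_rotated"])
        if (today - last).days > max_age_days:
            overdue.append(key["id"])
    return overdue

inventory = [
    {"id": "model-volume-luks", "last_rotated": "2026-01-02"},
    {"id": "s3-sse-kms",        "last_rotated": "2026-04-01"},
]
print(keys_due_for_rotation(inventory, date(2026, 4, 23)))
```

Run quarterly from cron and archive the output next to the access review, and you have a dated artifact instead of a policy sentence.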
CC7.2 - System Monitoring
Every API request must be logged with user identity, request size, response size, latency, and outcome. Bare logs are not enough; you need alerts:
# /etc/prometheus/alerts/ai.yml
groups:
- name: ai_security
rules:
- alert: HighAuthFailureRate
expr: rate(nginx_http_requests_total{status="401"}[5m]) > 0.5
for: 5m
annotations:
summary: "Sustained auth failures on AI endpoint"
- alert: AnomalousPromptVolume
expr: rate(ai_requests_total[10m]) > 3 * avg_over_time(rate(ai_requests_total[1d])[7d:1h])
for: 10m
annotations:
summary: "Prompt volume anomaly - possible scraping"
CC7.3 - Incident Response
You need a runbook specifically for AI incidents. Generic infosec runbooks will not pass. The categories auditors expect:
- Confirmed prompt injection escalating privilege
- Model output disclosing training data verbatim
- Inference service unavailable beyond SLA
- Tampering with model weights detected (SHA mismatch)
- Data exfiltration via long prompts or tool use
For each, document detection signals, containment, eradication, recovery, and post-mortem template. Test annually with tabletop exercises and keep notes.
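For the weight-tampering category, the detection signal is mechanical: hash the deployed file and compare it to the SHA recorded in your model inventory. A minimal sketch, with paths and the approved SHA supplied by the caller:

```python
# Sketch of the "model weight tampering" detection signal:
# hash the on-disk weights and compare to the approved inventory SHA.
import hashlib

def file_sha256(path):
    """Stream the file in 1 MiB chunks so large weight files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path, approved_sha):
    """True if the deployed weights match the registry entry, else alert."""
    return file_sha256(path) == approved_sha
```

Schedule it alongside the Prometheus alerts above; a mismatch should page, not just log.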
CC8.1 - Change Management
Every model swap is a change. The artifact every auditor wants is a ticket showing: who requested, what changed (model name + SHA), peer review, test results, rollback plan, deployment timestamp.
Sample CHANGE-1284:
Title: Bump primary chat model from llama3.1:70b to llama3.3:70b
Author: alice@example.com
Reviewer: bob@example.com (approved 2026-04-12)
Tested: Eval suite v3.2 - 87.3% pass (vs 84.1% prev), 14 regressions reviewed
Risk: Medium - new model, similar architecture, identical context length
SHA: b8c7e4...d2a1f9
Deployed: 2026-04-15 14:22 UTC by ops
Rollback: ollama pull llama3.1:70b && systemctl reload ai-router
Outcome: No incidents in 7-day post-deployment window
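Because auditors sample change tickets, it pays to lint them before they are closed. A hedged sketch of that check, using field names mirrored from the CHANGE-1284 example above; the record format is an assumption, not a standard ticketing schema:

```python
# Sketch: lint a change-management record for the fields auditors sample.
# REQUIRED mirrors the CHANGE-1284 example; adapt to your ticketing system.
REQUIRED = {"title", "author", "reviewer", "sha", "tested", "rollback", "deployed"}

def missing_fields(record):
    """Return the required evidence fields absent from a change record."""
    return sorted(REQUIRED - set(record))

change = {
    "title": "Bump primary chat model to llama3.3:70b",
    "author": "alice@example.com",
    "reviewer": "bob@example.com",
    "sha": "b8c7e4...d2a1f9",
    "tested": "Eval suite v3.2",
    "deployed": "2026-04-15T14:22Z",
}
print(missing_fields(change))  # this record is missing its rollback plan
```

Wire it into the ticket workflow as a pre-close gate and an incomplete change record never reaches the sample set.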
Availability for Self-Hosted Models {#availability}
A1 controls require:
- Documented SLA (e.g., 99.5% monthly availability for AI features)
- Capacity model (tokens/second supported, peak headroom)
- DR plan with RTO/RPO (typical: RTO 4h, RPO 1h)
- Quarterly DR test with evidence
Capacity model template:
| Component | Capacity | Peak observed | Headroom |
|---|---|---|---|
| Ollama node 1 | 220 tok/s | 140 tok/s | 36% |
| Ollama node 2 | 220 tok/s | 135 tok/s | 39% |
| Ollama node 3 | 220 tok/s | 128 tok/s | 42% |
| Cluster total | 660 tok/s | 403 tok/s | 39% |
If headroom drops below 20%, capacity expansion ticket auto-fires. Auditors love this.
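The auto-fire logic is a one-liner worth writing down. A minimal sketch using the numbers from the sample table; the ticket hook is hypothetical:

```python
# Minimal sketch of the 20% headroom check described above.
# Node figures mirror the sample capacity table; the ticket hook is hypothetical.
def headroom_pct(capacity, peak):
    """Spare capacity as a whole-number percentage."""
    return round(100 * (capacity - peak) / capacity)

nodes = [(220, 140), (220, 135), (220, 128)]      # (capacity, peak) per node
cluster_cap = sum(c for c, _ in nodes)            # 660 tok/s
cluster_peak = sum(p for _, p in nodes)           # 403 tok/s

if headroom_pct(cluster_cap, cluster_peak) < 20:
    # open_capacity_ticket() would go here (hypothetical hook)
    print("capacity expansion ticket fired")
else:
    print(f"headroom OK: {headroom_pct(cluster_cap, cluster_peak)}%")
```

Feed it from the same Prometheus series that drives your alerts so the evidence and the control share one source of truth.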
For DR, document model artifact backups, configuration backups, and a tested restore procedure. Most production teams use S3 with object lock for model weights and Velero or restic for system state.
Confidentiality and Prompt Data {#confidentiality}
This is where AI is genuinely different from traditional SaaS. Prompts and completions can contain:
- Customer credentials they accidentally pasted
- Source code with proprietary algorithms
- PHI, PII, financial data, legal data
- Model outputs that themselves are derivative IP
Confidentiality controls auditors expect:
- DLP scanning on inbound prompts (regex + ML classifier)
- Field-level redaction before persistence in audit logs
- Tiered retention: 30 days for full prompts, 365 days for redacted metadata
- Clear data classification policy that includes "AI prompt content" as a class
- Customer disclosure: "Your prompts are logged for X days, retained in Y location, accessed only by Z roles"
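The regex half of the DLP control is simple enough to sketch. The patterns below (an AWS access key ID shape and a generic email) are illustrative only, not an exhaustive ruleset; production deployments pair regexes with an ML classifier as noted above.

```python
# Hedged sketch of regex-based DLP at the prompt ingress.
# Patterns are illustrative examples, not a complete secret-detection ruleset.
import re

PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(prompt):
    """Replace matched secrets with a class label before the prompt is logged."""
    findings = []
    for label, pattern in PATTERNS.items():
        if pattern.search(prompt):
            findings.append(label)
            prompt = pattern.sub(f"[REDACTED:{label}]", prompt)
    return prompt, findings
```

Run this before persistence, log the findings list as metadata, and the 30-day full-prompt store never contains the raw secret.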
Sample retention policy excerpt:
AI prompt and completion data is classified as Confidential.
- Full prompt content is retained for 30 days in encrypted log storage.
- Redacted metadata (user, latency, tokens, model) is retained 365 days.
- Access requires "ai-auditor" role assignment.
- Quarterly review of access list by Security Officer.
Processing Integrity for AI Outputs {#processing-integrity}
PI1 is the criterion most teams fumble for AI. The question is: "How do you know your AI output is complete, valid, accurate, and authorized?"
You cannot promise the model is correct - it's a probabilistic system. You can promise:
- Inputs are validated against a schema
- Outputs go through guardrails (Llama Guard 3, regex denylists)
- Tool calls are validated and authorized before execution
- Results are stored with provenance (model version, timestamp, input hash)
- Human-in-the-loop for high-stakes decisions
Sample provenance record persisted with every completion:
{
"request_id": "01HW8K...",
"user": "alice@example.com",
"model": "llama3.3:70b",
"model_sha": "b8c7e4...d2a1f9",
"ts": "2026-04-23T14:01:33Z",
"input_hash": "sha256:f3d2...",
"guardrail_pass": true,
"tool_calls": ["search.docs", "ticket.create"],
"tool_authorization": "user_self_ticket_only"
}
This single JSON, persisted alongside outputs, satisfies PI1 evidence requests without breaking a sweat.
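Generating that record at request time is a few lines. A sketch, assuming the field names from the sample above; request-ID generation, the guardrail verdict, and storage are left as assumptions:

```python
# Sketch: assemble the provenance record shown above at request time.
# Field names follow the sample JSON; ID generation and storage are assumed.
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(user, model, model_sha, prompt, tool_calls, request_id):
    return {
        "request_id": request_id,
        "user": user,
        "model": model,
        "model_sha": model_sha,
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "input_hash": "sha256:" + hashlib.sha256(prompt.encode()).hexdigest(),
        "guardrail_pass": True,  # set from the actual guardrail verdict
        "tool_calls": tool_calls,
    }

record = provenance_record(
    "alice@example.com", "llama3.3:70b", "b8c7e4...d2a1f9",
    "Summarize ticket 42", ["search.docs"], "01HW8K-example",
)
print(json.dumps(record, indent=2))
```

Hashing the input rather than storing it keeps the provenance table outside your prompt-retention window.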
Privacy When Models Touch User Data {#privacy}
If your scope includes Privacy criterion (P1-P8), you need:
- A signed Data Processing Agreement covering AI processing
- Data subject access procedures that include AI-derived data
- Documented purpose limitation (AI used only for stated purposes)
- Mechanism to honor right-to-erasure across prompt logs and any fine-tuned weights
- Cross-border transfer assessments if data leaves jurisdiction (none, ideally, for self-hosted)
The "right to erasure" for fine-tuned weights is particularly thorny. The defensible answer: maintain a documented fine-tune lineage with training data references. If a deletion request matches data used in training, retraining or pruning is required within a defined window (typically 30-60 days).
Evidence Templates {#evidence}
Auditors love templates. Here are the four they ask for most.
1. Model Inventory
| Model name | Version | SHA256 | Source | Approved by | Date |
|---|---|---|---|---|---|
| llama3.3:70b | v1.0 | b8c7e4...d2a1f9 | ollama.com/library | alice@ | 2026-03-12 |
| nomic-embed | v1.5 | 3a2c1d...e7f8b2 | ollama.com/library | alice@ | 2026-02-04 |
| guard-3:8b | v1.0 | 9d8e7f...c1b4a3 | ollama.com/library | bob@ | 2026-03-12 |
2. Audit Log Sample
JSON Lines format, one event per line:
{"ts":"2026-04-23T14:01:33Z","user":"alice@example.com","ip":"10.20.30.10","model":"llama3.3:70b","input_tokens":450,"output_tokens":1240,"latency_ms":4823,"status":200,"guardrail_pass":true}
3. Access Review Snapshot (quarterly)
A markdown file generated by the script earlier, signed off by the security officer in your ticketing system.
4. Incident Post-Mortem (when applicable)
Use a standard template with sections for timeline, impact, root cause, remediation, prevention. AI incidents specifically should call out: prompt or input that triggered, model version, guardrail outcome, customer notification status.
The 12-Week Readiness Plan {#readiness}
This is the schedule I run when a customer says "we're going for SOC 2 Type II in 6 months and AI needs to be in scope."
| Week | Deliverable |
|---|---|
| 1 | Scope definition, asset inventory, model registry |
| 2 | TLS hardening, mTLS rollout, network isolation |
| 3 | Authentication and authorization (SSO + token rotation) |
| 4 | Audit logging pipeline + retention policy |
| 5 | Encryption at rest verification + key rotation procedure |
| 6 | Change management workflow + first model bump as test case |
| 7 | Incident response runbooks + tabletop exercise |
| 8 | DR plan + first restore test |
| 9 | DLP and guardrails for prompts and outputs |
| 10 | Access reviews + termination procedures |
| 11 | Vendor management for any third-party tools |
| 12 | Internal audit + remediation |
Then start the formal audit observation window. Most Type II audits cover a 3-month window minimum, 12 months ideal.
Pitfalls That Fail Audits {#pitfalls}
Mistakes I've watched cost real money:
1. Treating "self-hosted" as automatic compliance. It is not. Self-hosting shifts the responsibility from a vendor's SOC 2 report to yours. You absorb every control.
2. No model version pinning. "Latest" is not a version. Auditors will mark this as a change-management finding within minutes.
3. Audit logs without integrity protection. Logs that anyone can edit are worthless. Use append-only sinks (Loki + S3 object lock, or AWS CloudTrail equivalents).
4. No defined retention for prompt data. Indefinite retention is a privacy violation; zero retention loses you incident-response evidence. Pick a defined window and document it.
5. Sharing service accounts. "The team uses one Ollama bearer token" is a CC6.1 fail. Per-user tokens or SSO-mapped tokens, period.
6. Skipping the tabletop. Untested incident runbooks are paperwork, not controls. Run the exercise once a year and document the outcome.
7. Forgetting fine-tuning data. If you fine-tune on customer data, the fine-tuned weights inherit that data's classification. Treat them like a derivative database.
8. No DLP on prompts. A customer accidentally pastes their AWS root key; it lands in your logs; now you have a data-handling incident. DLP at ingress prevents this.
Frequently Asked Questions
Does SOC 2 require self-hosted AI to be a separate certification?
No. Self-hosted AI is just another part of your existing SOC 2 scope. You add it to your asset inventory, map controls to it, and provide evidence like any other system. There is no separate AI certification, though some customers ask for AI-specific addendums to your standard report.
Can I claim SOC 2 if my AI runs on someone else's infrastructure?
If you're running Ollama on AWS or GCP, the cloud provider's SOC 2 covers the underlying compute, but your controls over the AI workload are still yours to evidence. Use the carve-out method: rely on the cloud provider's report for infrastructure, document your controls for everything you operate.
How do auditors test AI access controls?
They request a list of users with access, sample 5-10 of them, ask for evidence those users were properly provisioned (ticket with manager approval), and verify deprovisioned users no longer have tokens valid against the API. Expect them to actually try a revoked token.
What evidence works best for prompt-injection incident response?
A documented runbook plus a recorded tabletop exercise transcript. Auditors do not require you to have had a real incident - they require you to have a tested response capability.
How long should I retain AI audit logs for SOC 2?
Common practice is 365 days for security-relevant audit metadata and 30-90 days for full prompt content. Anything shorter requires explicit risk acceptance. HIPAA and SOX add 6-7 years of retention for any logs in their scope.
Do open-source models reduce SOC 2 burden?
Slightly - open weights mean no vendor SLA dependency for the model itself. You still own every infrastructure control. The savings is mostly in vendor management (CC9) where you can carve out the model provider.
How do I handle SOC 2 for fine-tuned models trained on customer data?
Document the lineage: which customer data was included, who authorized it, when. Treat fine-tuned weights as a controlled artifact. Honor deletion requests by retraining or pruning within your stated SLA. Auditors increasingly ask about this in 2026.
Are LLM benchmarks part of processing integrity evidence?
Yes. An evaluation harness with checked-in test cases, run before every model bump with results stored, satisfies PI1.1 nicely. We use LMEval and our own task-specific suite. Keep the test results forever; they are gold during audits.
Bottom Line
SOC 2 for self-hosted AI is not exotic; it's the same controls you already owe customers, applied to a new system. The hard part is recognizing that the AI server isn't out of scope just because the model file lives on your disk. Once you accept that, the readiness work is straightforward: inventory, isolate, encrypt, authenticate, log, review, document, repeat.
Auditors will not punish you for honest gaps. They will punish you for missing inventories, lazy access controls, or "AI is special, we don't apply normal controls" hand-waving. Treat your inference cluster like a database that occasionally writes English instead of SQL, and you'll pass.
Bookmark this page, walk it once a quarter, and your next audit becomes a paperwork exercise instead of a fire drill.