
SOC 2 for Self-Hosted AI: What Auditors Actually Want to See

April 23, 2026
23 min read
LocalAimaster Research Team


I have sat through eight SOC 2 Type II audits over the last six years, three of them for organizations that ran self-hosted LLMs. The first time an auditor asked, "Where is your evidence that the inference server enforces least-privilege access?", we had no answer. The second time we did. The difference is what this guide is about.

The AICPA's SOC 2 framework never mentions "AI" or "LLM" anywhere. That is not a loophole - it means every Trust Service Criterion you are already required to meet now applies to the inference layer too. If your customers ask for a SOC 2 report, your self-hosted Ollama (or vLLM, TGI, llama.cpp) deployment must be in scope.

This is the audit-readiness checklist I wish someone had handed me in 2023. It maps Trust Service Criteria to concrete self-hosted AI controls, shows the evidence auditors actually accept, and flags the questions that come up in every Type II.

Quick Start: The Auditor's First Five Questions

Within 20 minutes of opening the AI portion of an audit, every auditor asks:

  1. "Show me the inventory of every model you run and where the weights live."
  2. "Show me access logs for the inference endpoint over the last 90 days."
  3. "Walk me through how a user is provisioned and deprovisioned for AI access."
  4. "Show me the change management record for the most recent model version bump."
  5. "Show me the incident response runbook for a confirmed prompt injection."

If you can answer all five with a click, the rest of the audit is mechanical. If any one of them takes more than a few minutes to find, expect the engagement to drag for weeks.

Table of Contents

  1. SOC 2 Trust Service Criteria Mapped to AI
  2. Scoping the AI Boundary
  3. Security (CC1-CC9) for Inference Servers
  4. Availability for Self-Hosted Models
  5. Confidentiality and Prompt Data
  6. Processing Integrity for AI Outputs
  7. Privacy When Models Touch User Data
  8. Evidence Templates
  9. The 12-Week Readiness Plan
  10. Pitfalls That Fail Audits

SOC 2 Trust Service Criteria Mapped to AI {#tsc-mapping}

SOC 2 has five Trust Service Criteria. Every one applies to self-hosted AI:

TSC                           What It Means for Self-Hosted AI
Security (CC1-CC9, common)    Authenticated, encrypted, monitored access to the inference endpoint
Availability (A1)             SLAs on AI service uptime, capacity planning, DR plans
Confidentiality (C1)          Prompts and completions handled like other classified data
Processing Integrity (PI1)    Outputs are complete, valid, accurate, timely, authorized
Privacy (P1-P8)               If models process PII, full GDPR/CCPA-aligned controls

Most companies start with Security only (the Common Criteria). If you sell to enterprise, customers often request Confidentiality too. Healthcare and financial services demand all five.

For the GDPR-specific overlay, see our GDPR-compliant local AI guide. For HIPAA, see HIPAA-compliant local AI.


Scoping the AI Boundary {#scoping}

Auditors will not accept "the model is local so it's out of scope." If the model touches customer data or affects a customer-facing decision, it's in scope. Documenting scope is your first deliverable.

Sample scope statement (from a real readiness package):

In scope:
- Ollama 0.6.4 inference cluster (3 nodes, us-east-1 colo)
- nginx reverse proxy with mTLS and audit logging
- Model registry (S3 bucket with object lock)
- Embedding pipeline (workers + Postgres pgvector)
- Customer-facing AI features in app.example.com (5 endpoints)

Out of scope:
- Internal experimentation cluster (no customer data)
- Marketing demos (sandbox only, fake data)
- HuggingFace Hub (only used for one-time model downloads)

Pin this. The boundary is what every other control will be evaluated against.


Security (CC1-CC9) for Inference Servers {#security}

Security is the largest section and where most AI audits go sideways. Here is what each control number actually requires for a self-hosted AI box:

CC6.1 - Logical and Physical Access

Auditor asks: "How do you control who can access the inference server?"

Required evidence:

  • IAM policy file showing who can SSH, who can call the API, separated by role
  • mTLS certificates with documented issuance and revocation procedures
  • Quarterly access reviews with sign-off (screenshot of completed reviews)
  • Termination checklist that includes revoking AI bearer tokens within 24h

Sample script for the access review evidence:

#!/bin/bash
# /opt/ops/quarterly_access_review.sh
# Snapshot of everyone who can reach the inference endpoint, generated each quarter for sign-off.
DATE=$(date +%Y-%m-%d)
{
  echo "# Q$(date +%q) $(date +%Y) Inference Access Review"
  echo
  echo "## Active mTLS clients"
  for crt in /etc/nginx/clients/*.crt; do
    openssl x509 -in "$crt" -noout -subject -enddate
  done
  echo
  echo "## Active bearer tokens (hashed)"
  jq '.tokens[] | {id, owner, last_used, expires}' /etc/ollama-proxy/tokens.json
} > "/var/audit/reviews/$DATE-access-review.md"

CC6.6 - Encryption in Transit

TLS 1.2+ on every API call. Auditors run testssl.sh against your endpoint. If they get a B grade, that's a finding.

docker run --rm drwetter/testssl.sh https://ai.example.com

Aim for A+. The full nginx hardening config is covered in Securing Ollama: Auth, TLS, Network Isolation.
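
For a lightweight spot check between testssl.sh runs, a small Python sketch like the one below works. It assumes the ai.example.com endpoint from the scope statement and uses only the standard library:

#!/usr/bin/env python3
# tls_spot_check.py - quick TLS sanity check between testssl.sh runs (endpoint is a placeholder)
import socket
import ssl
from datetime import datetime, timezone

HOST, PORT = "ai.example.com", 443   # the inference endpoint from the scope statement

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse anything older than TLS 1.2

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()
        expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
        days_left = (expires - datetime.now(timezone.utc)).days
        print(f"protocol={tls.version()} cipher={tls.cipher()[0]} cert_days_left={days_left}")
        if days_left < 30:
            raise SystemExit("certificate expires within 30 days")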

CC6.7 - Encryption at Rest

Model weights, prompt logs, and embedding databases must all be encrypted at rest. Acceptable evidence:

  • LUKS or dm-crypt for the model storage volume (cryptsetup status output)
  • S3 SSE-KMS for any cloud-stored model artifacts
  • Postgres TDE or full-disk for pgvector
  • Documented key rotation procedure (typically 90 days for KMS keys)
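
Collecting that evidence can be scripted too. A minimal sketch, assuming the model volume is mapped under the hypothetical dm-crypt name "models" and evidence lands in the same /var/audit tree as the access reviews:

#!/usr/bin/env python3
# encryption_evidence.py - archive cryptsetup status output as dated evidence
# The mapping name and paths are assumptions; adjust them to your environment.
import subprocess
from datetime import date
from pathlib import Path

EVIDENCE_DIR = Path("/var/audit/evidence")
MAPPED_VOLUME = "models"    # hypothetical dm-crypt mapping name for the model storage volume

EVIDENCE_DIR.mkdir(parents=True, exist_ok=True)
status = subprocess.run(
    ["cryptsetup", "status", MAPPED_VOLUME],
    capture_output=True, text=True, check=True,
)
out_file = EVIDENCE_DIR / f"{date.today()}-encryption-at-rest.txt"
out_file.write_text(status.stdout)
print(f"wrote {out_file}")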

CC7.2 - System Monitoring

Every API request must be logged with user identity, request size, response size, latency, and outcome. Bare logs are not enough; you need alerts:

# /etc/prometheus/alerts/ai.yml
groups:
- name: ai_security
  rules:
  - alert: HighAuthFailureRate
    expr: rate(nginx_http_requests_total{status="401"}[5m]) > 0.5
    for: 5m
    annotations:
      summary: "Sustained auth failures on AI endpoint"
  - alert: AnomalousPromptVolume
    expr: rate(ai_requests_total[10m]) > 3 * avg_over_time(rate(ai_requests_total[1d])[7d:1h])
    for: 10m
    annotations:
      summary: "Prompt volume anomaly - possible scraping"
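
The exact pipeline matters less than emitting one consistent record per request. A minimal Python sketch of the record writer, with field names matching the audit log sample later in this guide; how you hook it into your proxy or application is up to you:

#!/usr/bin/env python3
# audit_log.py - append one JSON line per inference request
import json
import time
from datetime import datetime, timezone

AUDIT_LOG = "/var/log/ai/audit.jsonl"   # forwarded to an append-only sink, never edited in place

def log_request(user: str, ip: str, model: str, input_tokens: int,
                output_tokens: int, started_monotonic: float,
                status: int, guardrail_pass: bool) -> None:
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user": user,
        "ip": ip,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": int((time.monotonic() - started_monotonic) * 1000),
        "status": status,
        "guardrail_pass": guardrail_pass,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")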

CC7.3 - Incident Response

You need a runbook specifically for AI incidents. Generic infosec runbooks will not pass. The categories auditors expect:

  1. Confirmed prompt injection escalating privilege
  2. Model output disclosing training data verbatim
  3. Inference service unavailable beyond SLA
  4. Tampering with model weights detected (SHA mismatch)
  5. Data exfiltration via long prompts or tool use

For each, document detection signals, containment, eradication, recovery, and post-mortem template. Test annually with tabletop exercises and keep notes.

CC8.1 - Change Management

Every model swap is a change. The artifact every auditor wants is a ticket showing: who requested, what changed (model name + SHA), peer review, test results, rollback plan, deployment timestamp.

Sample CHANGE-1284:

Title:    Bump primary chat model from llama3.1:70b to llama3.3:70b
Author:   alice@example.com
Reviewer: bob@example.com (approved 2026-04-12)
Tested:   Eval suite v3.2 - 87.3% pass (vs 84.1% prev), 14 regressions reviewed
Risk:     Medium - new model, similar architecture, identical context length
SHA:      b8c7e4...d2a1f9
Deployed: 2026-04-15 14:22 UTC by ops
Rollback: ollama pull llama3.1:70b && systemctl reload ai-router
Outcome:  No incidents in 7-day post-deployment window

Availability for Self-Hosted Models {#availability}

A1 controls require:

  • Documented SLA (e.g., 99.5% monthly availability for AI features)
  • Capacity model (tokens/second supported, peak headroom)
  • DR plan with RTO/RPO (typical: RTO 4h, RPO 1h)
  • Quarterly DR test with evidence

Capacity model template:

Component       Capacity         Peak observed     Headroom
Ollama node 1   220 tok/s        140 tok/s         36%
Ollama node 2   220 tok/s        135 tok/s         39%
Ollama node 3   220 tok/s        128 tok/s         42%
Cluster total   660 tok/s        403 tok/s         39%

If headroom drops below 20%, a capacity expansion ticket auto-fires. Auditors love this.
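
A sketch of that headroom check, assuming per-node throughput is exported to Prometheus under illustrative metric names (swap in whatever your exporter actually publishes):

#!/usr/bin/env python3
# capacity_headroom.py - flag capacity expansion when cluster headroom drops below 20%
import requests

PROM = "http://prometheus.internal:9090"   # hypothetical internal Prometheus endpoint

def instant(query: str) -> float:
    # Run an instant query and return the single scalar result.
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

capacity = instant("sum(ollama_node_capacity_tokens_per_second)")            # illustrative metric
peak = instant("max_over_time(sum(ollama_node_tokens_per_second)[30d:5m])")  # illustrative metric
headroom = 1 - peak / capacity

print(f"capacity={capacity:.0f} tok/s  peak={peak:.0f} tok/s  headroom={headroom:.0%}")
if headroom < 0.20:
    print("headroom below 20% - open a capacity expansion ticket")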

For DR, document model artifact backups, configuration backups, and a tested restore procedure. Most production teams use S3 with object lock for model weights and Velero or restic for system state.


Confidentiality and Prompt Data {#confidentiality}

This is where AI is genuinely different from traditional SaaS. Prompts and completions can contain:

  • Customer credentials they accidentally pasted
  • Source code with proprietary algorithms
  • PHI, PII, financial data, legal data
  • Model outputs that themselves are derivative IP

Confidentiality controls auditors expect:

  • DLP scanning on inbound prompts (regex + ML classifier; a redaction sketch follows this list)
  • Field-level redaction before persistence in audit logs
  • Tiered retention: 30 days for full prompts, 365 days for redacted metadata
  • Clear data classification policy that includes "AI prompt content" as a class
  • Customer disclosure: "Your prompts are logged for X days, retained in Y location, accessed only by Z roles"
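
The DLP bullet is where most teams need a starting point. Below is a minimal, regex-only redaction sketch that runs before prompts are persisted; a production deployment would add an ML classifier and a much larger pattern set:

#!/usr/bin/env python3
# prompt_dlp.py - redact obvious secrets/PII from prompts before they reach audit logs
import re

PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "EMAIL":          re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "US_SSN":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "BEARER_TOKEN":   re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}\b"),
}

def redact(prompt: str) -> tuple[str, list[str]]:
    # Return the redacted prompt plus the list of pattern names that fired.
    hits = []
    for name, pattern in PATTERNS.items():
        if pattern.search(prompt):
            hits.append(name)
            prompt = pattern.sub(f"[REDACTED:{name}]", prompt)
    return prompt, hits

if __name__ == "__main__":
    clean, hits = redact("my key is AKIAABCDEFGHIJKLMNOP, email me at alice@example.com")
    print(hits)    # ['AWS_ACCESS_KEY', 'EMAIL']
    print(clean)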

Sample retention policy excerpt:

AI prompt and completion data is classified as Confidential.
- Full prompt content is retained for 30 days in encrypted log storage.
- Redacted metadata (user, latency, tokens, model) is retained 365 days.
- Access requires "ai-auditor" role assignment.
- Quarterly review of access list by Security Officer.

Processing Integrity for AI Outputs {#processing-integrity}

PI1 is the criterion most teams fumble for AI. The question is: "How do you know your AI output is complete, valid, accurate, and authorized?"

You cannot promise the model is correct - it's a probabilistic system. You can promise:

  • Inputs are validated against a schema
  • Outputs go through guardrails (Llama Guard 3, regex denylists)
  • Tool calls are validated and authorized before execution
  • Results are stored with provenance (model version, timestamp, input hash)
  • Human-in-the-loop for high-stakes decisions

Sample provenance record persisted with every completion:

{
  "request_id": "01HW8K...",
  "user": "alice@example.com",
  "model": "llama3.3:70b",
  "model_sha": "b8c7e4...d2a1f9",
  "ts": "2026-04-23T14:01:33Z",
  "input_hash": "sha256:f3d2...",
  "guardrail_pass": true,
  "tool_calls": ["search.docs", "ticket.create"],
  "tool_authorization": "user_self_ticket_only"
}

This single JSON, persisted alongside outputs, satisfies PI1 evidence requests without breaking a sweat.
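
One way to produce that record, sketched with only the standard library. The log path and request-ID scheme are illustrative (the sample above uses a ULID), and the guardrail and tool-authorization values come from whatever your stack computes upstream:

#!/usr/bin/env python3
# provenance.py - persist a provenance record alongside every completion
import hashlib
import json
import uuid
from datetime import datetime, timezone

PROVENANCE_LOG = "/var/log/ai/provenance.jsonl"   # illustrative path

def record_completion(user: str, model: str, model_sha: str, prompt: str,
                      guardrail_pass: bool, tool_calls: list[str],
                      tool_authorization: str) -> dict:
    record = {
        "request_id": str(uuid.uuid4()),   # the sample above uses a ULID; a UUID keeps this stdlib-only
        "user": user,
        "model": model,
        "model_sha": model_sha,
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "input_hash": "sha256:" + hashlib.sha256(prompt.encode()).hexdigest(),
        "guardrail_pass": guardrail_pass,
        "tool_calls": tool_calls,
        "tool_authorization": tool_authorization,
    }
    with open(PROVENANCE_LOG, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record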


Privacy When Models Touch User Data {#privacy}

If your scope includes Privacy criterion (P1-P8), you need:

  • A signed Data Processing Agreement covering AI processing
  • Data subject access procedures that include AI-derived data
  • Documented purpose limitation (AI used only for stated purposes)
  • Mechanism to honor right-to-erasure across prompt logs and any fine-tuned weights
  • Cross-border transfer assessments if data leaves jurisdiction (none, ideally, for self-hosted)

The "right to erasure" for fine-tuned weights is particularly thorny. The defensible answer: maintain a documented fine-tune lineage with training data references. If a deletion request matches data used in training, retraining or pruning is required within a defined window (typically 30-60 days).


Evidence Templates {#evidence}

Auditors love templates. Here are the four they ask for most.

1. Model Inventory

Model name      Version   SHA256             Source              Approved by   Date
llama3.3:70b    v1.0      b8c7e4...d2a1f9    ollama.com/library  alice@        2026-03-12
nomic-embed     v1.5      3a2c1d...e7f8b2    ollama.com/library  alice@        2026-02-04
guard-3:8b      v1.0      9d8e7f...c1b4a3    ollama.com/library  bob@          2026-03-12
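
The inventory becomes a tested control when a script verifies it against what is actually on disk. A sketch, assuming the weights are plain files and the inventory is exported as a CSV with model, sha256, and path columns (both are assumptions about your setup):

#!/usr/bin/env python3
# verify_inventory.py - verify on-disk model weights against the approved inventory
import csv
import hashlib
from pathlib import Path

INVENTORY = Path("/var/audit/model_inventory.csv")   # illustrative: columns model,sha256,path
CHUNK = 1024 * 1024

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(CHUNK):
            digest.update(chunk)
    return digest.hexdigest()

failures = 0
with INVENTORY.open(newline="") as fh:
    for row in csv.DictReader(fh):
        ok = sha256_of(Path(row["path"])) == row["sha256"]
        failures += not ok
        print(f"{row['model']}: {'OK' if ok else 'MISMATCH'}")

raise SystemExit(1 if failures else 0)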

2. Audit Log Sample

JSON Lines format, one event per line:

{"ts":"2026-04-23T14:01:33Z","user":"alice@example.com","ip":"10.20.30.10","model":"llama3.3:70b","input_tokens":450,"output_tokens":1240,"latency_ms":4823,"status":200,"guardrail_pass":true}

3. Access Review Snapshot (quarterly)

A markdown file generated by the script earlier, signed off by the security officer in your ticketing system.

4. Incident Post-Mortem (when applicable)

Use a standard template with sections for timeline, impact, root cause, remediation, prevention. AI incidents specifically should call out: prompt or input that triggered, model version, guardrail outcome, customer notification status.


The 12-Week Readiness Plan {#readiness}

This is the schedule I run when a customer says "we're going for SOC 2 Type II in 6 months and AI needs to be in scope."

Week   Deliverable
1      Scope definition, asset inventory, model registry
2      TLS hardening, mTLS rollout, network isolation
3      Authentication and authorization (SSO + token rotation)
4      Audit logging pipeline + retention policy
5      Encryption at rest verification + key rotation procedure
6      Change management workflow + first model bump as test case
7      Incident response runbooks + tabletop exercise
8      DR plan + first restore test
9      DLP and guardrails for prompts and outputs
10     Access reviews + termination procedures
11     Vendor management for any third-party tools
12     Internal audit + remediation

Then start the formal audit observation window. Most Type II audits cover a 3-month window minimum, 12 months ideal.


Pitfalls That Fail Audits {#pitfalls}

Mistakes I've watched cost real money:

1. Treating "self-hosted" as automatic compliance. It is not. Self-hosted shifts the responsibility from a vendor's SOC 2 report to yours. You absorb every control.

2. No model version pinning. "Latest" is not a version. Auditors will mark this as a change-management finding within minutes.

3. Audit logs without integrity protection. Logs that anyone can edit are worthless. Use append-only sinks (Loki + S3 object lock, or AWS CloudTrail equivalents); a hash-chain fallback is sketched after this list.

4. No defined retention for prompt data. Indefinite retention is a privacy violation. Zero retention loses you incident-response evidence. Pick a defined window and document it.

5. Sharing service accounts. "The team uses one Ollama bearer token" is a CC6.1 fail. Per-user tokens or SSO-mapped tokens, period.

6. Skipping the tabletop. Untested incident runbooks are paperwork, not controls. Run the exercise once a year and document the outcome.

7. Forgetting fine-tuning data. If you fine-tune on customer data, the fine-tuned weights inherit that data's classification. Treat them like a derivative database.

8. No DLP on prompts. A customer accidentally pastes their AWS root key. It lands in your logs. Now you have a data-handling incident. DLP at ingress prevents this.
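
If an append-only sink is not available everywhere, a hash chain at least makes tampering detectable, as referenced in pitfall 3. A minimal sketch:

#!/usr/bin/env python3
# log_chain.py - chain audit log lines with SHA-256 so any edit or deletion breaks verification
import hashlib
import json
import sys

def append(path: str, record: str) -> None:
    previous = "genesis"
    try:
        with open(path, encoding="utf-8") as fh:
            *_, last_line = fh                    # last existing entry, if any
        previous = json.loads(last_line)["chain"]
    except (FileNotFoundError, ValueError):
        pass                                       # missing or empty file starts a new chain
    chain = hashlib.sha256((previous + record).encode()).hexdigest()
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"record": record, "chain": chain}) + "\n")

def verify_chain(path: str) -> bool:
    previous = "genesis"
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            entry = json.loads(line)
            expected = hashlib.sha256((previous + entry["record"]).encode()).hexdigest()
            if entry["chain"] != expected:
                print(f"chain broken at line {line_no}")
                return False
            previous = entry["chain"]
    return True

if __name__ == "__main__":
    append("audit-chain.jsonl", sys.argv[1] if len(sys.argv) > 1 else "test event")
    print("chain intact:", verify_chain("audit-chain.jsonl"))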


Frequently Asked Questions

Does SOC 2 require self-hosted AI to be a separate certification?

No. Self-hosted AI is just another part of your existing SOC 2 scope. You add it to your asset inventory, map controls to it, and provide evidence like any other system. There is no separate AI certification, though some customers ask for AI-specific addendums to your standard report.

Can I claim SOC 2 if my AI runs on someone else's infrastructure?

If you're running Ollama on AWS or GCP, the cloud provider's SOC 2 covers the underlying compute, but your controls over the AI workload are still yours to evidence. Use the carve-out method: rely on the cloud provider's report for infrastructure, document your controls for everything you operate.

How do auditors test AI access controls?

They request a list of users with access, sample 5-10 of them, ask for evidence those users were properly provisioned (ticket with manager approval), and verify deprovisioned users no longer have tokens valid against the API. Expect them to actually try a revoked token.

What evidence works best for prompt-injection incident response?

A documented runbook plus a recorded tabletop exercise transcript. Auditors do not require you to have had a real incident - they require you to have a tested response capability.

How long should I retain AI audit logs for SOC 2?

Common practice is 365 days for security-relevant audit metadata and 30-90 days for full prompt content. Anything shorter requires explicit risk acceptance. HIPAA and SOX add 6-7 years of retention for any logs in their scope.

Do open-source models reduce SOC 2 burden?

Slightly - open weights mean no vendor SLA dependency for the model itself. You still own every infrastructure control. The savings are mostly in vendor management (CC9), where you can carve out the model provider.

How do I handle SOC 2 for fine-tuned models trained on customer data?

Document the lineage: which customer data was included, who authorized it, when. Treat fine-tuned weights as a controlled artifact. Honor deletion requests by retraining or pruning within your stated SLA. Auditors increasingly ask about this in 2026.

Are LLM benchmarks part of processing integrity evidence?

Yes. An evaluation harness with checked-in test cases, run before every model bump with results stored, satisfies PI1.1 nicely. We use LMEval and our own task-specific suite. Keep the test results forever; they are gold during audits.


Bottom Line

SOC 2 for self-hosted AI is not exotic; it's the same controls you already owe customers, applied to a new system. The hard part is recognizing that the AI server isn't out of scope just because the model file lives on your disk. Once you accept that, the readiness work is straightforward: inventory, isolate, encrypt, authenticate, log, review, document, repeat.

Auditors will not punish you for honest gaps. They will punish you for missing inventories, lazy access controls, or "AI is special, we don't apply normal controls" hand-waving. Treat your inference cluster like a database that occasionally writes English instead of SQL, and you'll pass.

Bookmark this page, walk it once a quarter, and your next audit becomes a paperwork exercise instead of a fire drill.
