Build a Local AI Slack & Discord Bot with Ollama (Full Tutorial)
Published on April 23, 2026 • 19 min read
The pitch deck for cloud chatbots is always the same: connect your team chat, get an AI assistant. The fine print is also always the same: the assistant reads your messages, your DMs, your customer data, your internal docs, and the vendor stores it long enough to "improve the service." For most engineering teams that is a non-starter.
I built the first version of this bot for a startup whose CTO had a simple request: "Give my team an AI in Slack that doesn't leak our roadmap." The first iteration took an afternoon. The version this guide describes — with RAG over a private docs folder, per-channel rate limiting, threaded replies, and slash commands — took about two days. It has been running on a $40/month Hetzner box for nine months and processed roughly 180,000 messages.
This is the production blueprint. Both Slack and Discord. Real Python code that you can drop into a repo and run today.
Quick Start: 8 Minutes to a Working Bot {#quick-start}
If you just want to see a bot reply in your channel:
# Prerequisites: Python 3.11+, Ollama already installed
ollama pull llama3.1:8b
# Slack version
pip install slack-bolt ollama python-dotenv
export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
python slack_bot.py
# Or Discord version
pip install discord.py ollama python-dotenv
export DISCORD_TOKEN=...
python discord_bot.py
The minimal Slack bot (full version below) is 35 lines of Python. By the end of this guide you will have something a real team can use without you babysitting it.
Table of Contents
- Why Self-Host Your Team AI Bot
- Architecture Overview
- Hardware & Hosting
- Ollama Setup for Bot Workloads
- Slack Bot — Full Implementation
- Discord Bot — Full Implementation
- Adding RAG over Team Docs
- Slash Commands & Tools
- Rate Limiting & Cost Control
- Production Deployment
- Pitfalls We Hit (So You Do Not)
- FAQs
Why Self-Host Your Team AI Bot {#why-self-host}
Three concrete reasons that came up in customer interviews:
1. Channel content is sensitive by default. Engineering, security, finance, and exec channels routinely contain credentials, customer names, and unannounced product details. Sending them to OpenAI or Anthropic — even with their enterprise privacy promises — creates an audit trail you do not want.
2. Cloud per-message pricing punishes adoption. A 200-person team that adopts an AI bot easily generates 50,000 messages a month routed through the LLM. At GPT-4o pricing that is $300-800/month. A self-hosted bot on a $40 GPU VPS handles the same traffic for the cost of electricity.
3. RAG over private docs needs to stay private. The most useful team bots answer "what does our pricing tier contain?" or "where is the runbook for X service?" That requires indexing internal docs. Cloud RAG means uploading your wiki to a third party. Local RAG keeps it on your hardware.
A well-built local bot is also faster: 200-400ms latency to first token versus 600-1200ms for cloud APIs because the network round trip drops out.
Architecture Overview {#architecture}
┌─────────┐    websocket    ┌────────────┐   HTTP    ┌────────┐
│  Slack  │◄───────────────►│  Bot Proc  │──────────►│ Ollama │
│ Discord │   socket mode   │  (Python)  │  :11434   │  LLM   │
└─────────┘                 └────────────┘           └────────┘
                                  │
                                  ▼
                           ┌─────────────┐
                           │  ChromaDB   │ ← team docs RAG
                           │   (local)   │
                           └─────────────┘
Three components:
- Bot process — Python event loop that listens to Slack/Discord events and orchestrates responses.
- Ollama — Local LLM server. Same machine or a different one on your network.
- ChromaDB (optional) — Vector store for RAG. Docker container on the same host.
No public ingress required. Slack uses Socket Mode and Discord uses websockets, so your bot connects out — no inbound port exposure.
Hardware & Hosting {#hardware}
What you actually need depends on team size:
| Team Size | Avg msgs/day | Concurrent | Hardware | Model |
|---|---|---|---|---|
| 1-10 | <500 | 1-2 | 16 GB RAM, integrated GPU | llama3.1:8b |
| 10-50 | 1,000-5,000 | 3-5 | 32 GB RAM, RTX 3060 12 GB | qwen2.5:7b |
| 50-200 | 5,000-20,000 | 5-15 | RTX 4060 Ti 16 GB | qwen2.5:14b |
| 200-1000 | 20,000+ | 15-40 | RTX 4090 or 2× RTX 3090 | llama3.3:70b-q4 |
Concrete VPS picks that work:
- Hetzner GEX44 (RTX 4000 Ada, 16 GB VRAM, 64 GB RAM): €184/month — fits 50-200 user teams
- Vast.ai RTX 3090: $0.20-0.40/hr on-demand
- Self-host on a NUC + eGPU: ~$1,500 one-time, runs forever
For team deployments behind your firewall, see Ollama production deployment.
Ollama Setup for Bot Workloads {#ollama-setup}
Default Ollama settings are tuned for single-user laptops. For a bot serving multiple concurrent users, change three things:
# /etc/systemd/system/ollama.service.d/override.conf
# (systemd does not support trailing comments, so each note gets its own line)
[Service]
# allow 4 concurrent requests
Environment="OLLAMA_NUM_PARALLEL=4"
# keep 2 models hot
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# don't unload between messages
Environment="OLLAMA_KEEP_ALIVE=24h"
# needed if the bot runs on a different host
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
ollama pull llama3.1:8b
ollama pull nomic-embed-text # for RAG
Verify it's serving (stream disabled so you get a single JSON reply):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Say hello in 5 words",
  "stream": false
}'
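The bots below talk to Ollama through the Python client, so it's worth a smoke test from Python too. A minimal sketch, assuming the ollama package from the Quick Start:

# smoke_test.py (quick check that the Python client can reach the server)
import ollama

client = ollama.Client(host="http://localhost:11434")
resp = client.chat(model="llama3.1:8b",
                   messages=[{"role": "user", "content": "ping"}])
print(resp["message"]["content"])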
Slack Bot — Full Implementation {#slack-bot}
Step 1: Create the Slack App
- Go to api.slack.com/apps → Create New App → From scratch
- Socket Mode: enable it (you don't need a public URL)
- OAuth Scopes (Bot Token Scopes): app_mentions:read, chat:write, channels:history, im:history, im:write, commands
- Event Subscriptions → enable → subscribe to app_mention, message.im
- Slash Commands → create /ai with description "Ask the local AI"
- Install to Workspace — copy the Bot Token (starts with xoxb-)
- Basic Information → App-Level Tokens → generate one with connections:write scope (starts with xapp-)
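With the app installed, put the tokens in a .env file next to the bot code. A minimal sketch; the variable names match what the code below reads, and the Discord token is included so one file covers both bots:

# .env (placeholder values)
SLACK_BOT_TOKEN=xoxb-...
SLACK_APP_TOKEN=xapp-...
DISCORD_TOKEN=...
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b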
Step 2: The Bot Code
# slack_bot.py
import os
import re
import logging
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler
import ollama
from dotenv import load_dotenv

load_dotenv()  # pick up tokens from a local .env file if present
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("slack_bot")
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1:8b")
app = App(token=os.environ["SLACK_BOT_TOKEN"])
ollama_client = ollama.Client(host=OLLAMA_HOST)  # named to avoid clashing with Bolt's injected Slack "client"
SYSTEM_PROMPT = (
"You are a helpful assistant for an engineering team in Slack. "
"Be concise. Format code in fenced blocks. "
"If you do not know, say so. Do not invent facts about the team."
)
def strip_mention(text: str) -> str:
return re.sub(r"<@[A-Z0-9]+>", "", text).strip()
def ask_ollama(prompt: str, history: list[dict]) -> str:
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
messages.extend(history[-10:]) # keep last 10 turns
messages.append({"role": "user", "content": prompt})
    resp = ollama_client.chat(model=MODEL, messages=messages,
                              options={"temperature": 0.4, "num_predict": 800})
return resp["message"]["content"].strip()
# Per-thread conversation memory (production: use Redis)
THREAD_HISTORY: dict[str, list[dict]] = {}
@app.event("app_mention")
def on_mention(event, say, client):  # Bolt injects the Slack WebClient by the name "client"
text = strip_mention(event["text"])
thread_ts = event.get("thread_ts") or event["ts"]
history = THREAD_HISTORY.setdefault(thread_ts, [])
# Show "thinking" reaction
    client.reactions_add(channel=event["channel"], name="hourglass_flowing_sand", timestamp=event["ts"])
try:
answer = ask_ollama(text, history)
history.append({"role": "user", "content": text})
history.append({"role": "assistant", "content": answer})
say(text=answer, thread_ts=thread_ts)
except Exception as e:
log.exception("ollama error")
say(text=f"Sorry — backend error: {e}", thread_ts=thread_ts)
finally:
        client.reactions_remove(channel=event["channel"], name="hourglass_flowing_sand", timestamp=event["ts"])
@app.command("/ai")
def slash_ai(ack, respond, command):
ack()
prompt = command["text"]
if not prompt:
respond("Usage: /ai <your question>")
return
answer = ask_ollama(prompt, [])
respond(text=answer, response_type="in_channel")
@app.event("message")
def on_dm(event, say):
# Only respond to DMs, ignore channel messages (handled by app_mention)
if event.get("channel_type") != "im":
return
if event.get("subtype") == "bot_message":
return
history = THREAD_HISTORY.setdefault(event["channel"], [])
answer = ask_ollama(event["text"], history)
history.append({"role": "user", "content": event["text"]})
history.append({"role": "assistant", "content": answer})
say(answer)
if __name__ == "__main__":
handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
log.info("Slack bot starting...")
handler.start()
That's a fully functional Slack bot. @bot what does our deploy script do? works in any channel where the bot is added, /ai slash command works anywhere, and DMs work too.
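One caveat flagged in the code: THREAD_HISTORY lives in process memory, so it vanishes on restart and grows without bound. Swapping in Redis is a small change. This is a sketch, assuming a local Redis and the redis-py package; get_history and append_history are hypothetical helpers you would call in place of the dict operations:

# redis_history.py (sketch: durable per-thread memory with a 24h TTL)
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
HISTORY_TTL = 24 * 3600  # expire idle threads after a day

def get_history(thread_ts: str) -> list[dict]:
    raw = r.get(f"history:{thread_ts}")
    return json.loads(raw) if raw else []

def append_history(thread_ts: str, user_msg: str, assistant_msg: str) -> None:
    history = get_history(thread_ts)
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": assistant_msg})
    # keep the last 20 turns and refresh the TTL on every write
    r.set(f"history:{thread_ts}", json.dumps(history[-20:]), ex=HISTORY_TTL)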
Discord Bot — Full Implementation {#discord-bot}
Step 1: Create the Discord App
- Discord Developer Portal → New Application
- Bot tab → Reset Token → copy it (this is your DISCORD_TOKEN)
- Bot tab → enable Message Content Intent (required to read message text)
- OAuth2 → URL Generator → scopes: bot, applications.commands → permissions: Send Messages, Read Messages, Add Reactions, Use Slash Commands → invite the bot to your server
Step 2: The Bot Code
# discord_bot.py
import os
import logging
import asyncio
import discord
from discord import app_commands
import ollama
from dotenv import load_dotenv

load_dotenv()  # pick up the token from a local .env file if present
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("discord_bot")
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
MODEL = os.environ.get("OLLAMA_MODEL", "llama3.1:8b")
intents = discord.Intents.default()
intents.message_content = True
client_d = discord.Client(intents=intents)
tree = app_commands.CommandTree(client_d)
ollama_client = ollama.Client(host=OLLAMA_HOST)
SYSTEM_PROMPT = "You are a helpful assistant in Discord. Be concise. Use markdown."
CHANNEL_HISTORY: dict[int, list[dict]] = {}
async def ask_ollama_async(prompt: str, history: list[dict]) -> str:
def _call():
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
messages.extend(history[-10:])
messages.append({"role": "user", "content": prompt})
return ollama_client.chat(model=MODEL, messages=messages,
options={"temperature": 0.4, "num_predict": 800})
resp = await asyncio.to_thread(_call)
return resp["message"]["content"].strip()
@client_d.event
async def on_ready():
await tree.sync()
log.info(f"Logged in as {client_d.user}")
@client_d.event
async def on_message(message: discord.Message):
if message.author == client_d.user or message.author.bot:
return
# Respond on mention or DM
is_dm = isinstance(message.channel, discord.DMChannel)
is_mention = client_d.user in message.mentions
if not (is_dm or is_mention):
return
prompt = message.content.replace(f"<@{client_d.user.id}>", "").strip()
if not prompt:
return
history = CHANNEL_HISTORY.setdefault(message.channel.id, [])
async with message.channel.typing():
try:
answer = await ask_ollama_async(prompt, history)
history.append({"role": "user", "content": prompt})
history.append({"role": "assistant", "content": answer})
# Discord max message length is 2000 chars
for i in range(0, len(answer), 1900):
await message.reply(answer[i:i+1900], mention_author=False)
except Exception as e:
log.exception("ollama error")
await message.reply(f"Backend error: {e}")
@tree.command(name="ai", description="Ask the local AI a question")
async def slash_ai(interaction: discord.Interaction, prompt: str):
await interaction.response.defer()
answer = await ask_ollama_async(prompt, [])
    # followup.send works the same for every chunk, so no special-casing needed
    for i in range(0, len(answer), 1900):
        await interaction.followup.send(answer[i:i+1900])
if __name__ == "__main__":
client_d.run(os.environ["DISCORD_TOKEN"])
Mention the bot or DM it — it replies. /ai prompt slash command works server-wide. The 1900-char chunking handles long responses (Discord limits messages to 2000 chars).
Adding RAG over Team Docs {#rag}
This is the killer feature. The bot answers questions from your internal docs, runbooks, and wikis instead of generic training data.
Step 1: Run ChromaDB
docker run -d -p 8000:8000 -v chroma-data:/chroma/chroma --name chroma chromadb/chroma:latest
Step 2: Index Your Docs
# index_docs.py
import os, glob, chromadb
import ollama
ollama_client = ollama.Client(host="http://localhost:11434")
chroma = chromadb.HttpClient(host="localhost", port=8000)
coll = chroma.get_or_create_collection(name="team_docs")
def embed(text: str):
return ollama_client.embeddings(model="nomic-embed-text", prompt=text)["embedding"]
def chunk(text: str, size: int = 800, overlap: int = 100):
chunks = []
for i in range(0, len(text), size - overlap):
chunks.append(text[i:i+size])
return chunks
for path in glob.glob("./docs/**/*.md", recursive=True):
with open(path) as f:
content = f.read()
for i, ch in enumerate(chunk(content)):
coll.upsert(
ids=[f"{path}:{i}"],
documents=[ch],
embeddings=[embed(ch)],
metadatas=[{"source": path, "chunk": i}],
)
print("Indexed all docs.")
Run python index_docs.py whenever your docs change. For automatic re-indexing on file changes, wrap it in watchdog, as sketched below.
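A sketch of that watcher, assuming the watchdog package. It naively re-runs the full indexer on any markdown change, which is fine for small docs trees; debounce it for large ones:

# watch_docs.py (sketch: re-index whenever a markdown file changes)
import subprocess
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class ReindexHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        if event.src_path.endswith(".md"):
            subprocess.run(["python", "index_docs.py"], check=False)

observer = Observer()
observer.schedule(ReindexHandler(), "./docs", recursive=True)
observer.start()
try:
    while True:
        time.sleep(60)
finally:
    observer.stop()
    observer.join()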
Step 3: RAG-Enabled Chat Function
Replace the ask_ollama function in either bot (the bot process also needs the ollama_client and coll setup from the indexing script):
def ask_with_rag(prompt: str, history: list[dict]) -> str:
q_embedding = ollama_client.embeddings(model="nomic-embed-text", prompt=prompt)["embedding"]
results = coll.query(query_embeddings=[q_embedding], n_results=5)
context = "\n\n---\n\n".join(results["documents"][0])
augmented_system = (
SYSTEM_PROMPT + "\n\n"
"Use the following context from internal team docs to answer. "
"If the answer is not in the context, say so plainly.\n\n"
f"CONTEXT:\n{context}"
)
messages = [{"role": "system", "content": augmented_system}]
messages.extend(history[-6:]) # shorter history when context is large
messages.append({"role": "user", "content": prompt})
resp = ollama_client.chat(model=MODEL, messages=messages,
options={"temperature": 0.2, "num_predict": 800})
return resp["message"]["content"].strip()
Now @bot what is our deploy procedure? returns answers grounded in your actual runbook. For deeper RAG tuning, see RAG local setup guide.
Slash Commands & Tools {#slash-commands}
Add structured commands beyond /ai:
# Slack
@app.command("/summarize")
def summarize_thread(ack, respond, command, client):
    ack()
    channel = command["channel_id"]
    # Fetch last 50 messages (the API returns newest first, so reverse for reading order)
    history = client.conversations_history(channel=channel, limit=50)
    text = "\n".join(m["text"] for m in reversed(history["messages"]) if "text" in m)
summary = ask_ollama(f"Summarize this Slack channel in 5 bullets:\n{text}", [])
respond(summary, response_type="in_channel")
@app.command("/translate")
def translate(ack, respond, command):
ack()
args = command["text"].split(" ", 1)
if len(args) != 2:
respond("Usage: /translate <lang_code> <text>")
return
lang, text = args
answer = ask_ollama(f"Translate to {lang}: {text}", [])
respond(answer)
The Discord equivalent uses the @tree.command decorator with the same logic.
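Here is /summarize for Discord as a sketch, reusing the discord_bot.py setup above. Note that channel.history() yields newest messages first, so the transcript is reversed before summarizing:

@tree.command(name="summarize", description="Summarize the last 50 messages")
async def slash_summarize(interaction: discord.Interaction):
    await interaction.response.defer()
    lines = []
    async for msg in interaction.channel.history(limit=50):
        if msg.content:
            lines.append(f"{msg.author.display_name}: {msg.content}")
    lines.reverse()  # oldest first reads better for the model
    text = "\n".join(lines)
    answer = await ask_ollama_async(
        f"Summarize this Discord channel in 5 bullets:\n{text}", [])
    await interaction.followup.send(answer[:1900])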
Useful slash commands seen in the wild:
- /summarize — collapse a long channel into bullets
- /translate — quick translation
- /sql — natural language to SQL
- /onboard — generate onboarding steps for a new team member
- /runbook — pull a runbook by name from RAG
Rate Limiting & Cost Control {#rate-limiting}
Without limits, one curious user will lock up the bot for everyone. The pattern:
import time
from collections import defaultdict, deque
USER_REQUESTS: dict[str, deque] = defaultdict(deque)
USER_RATE_LIMIT = 10 # max requests
USER_RATE_WINDOW = 60 # per 60 seconds
GLOBAL_QUEUE_SIZE = 4 # match OLLAMA_NUM_PARALLEL
def check_rate_limit(user_id: str) -> bool:
now = time.time()
q = USER_REQUESTS[user_id]
while q and q[0] < now - USER_RATE_WINDOW:
q.popleft()
if len(q) >= USER_RATE_LIMIT:
return False
q.append(now)
return True
In the message handler:
if not check_rate_limit(event["user"]):
say("You're sending too fast — try again in a minute.", thread_ts=thread_ts)
return
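GLOBAL_QUEUE_SIZE above is the other half: the per-user window does not stop ten different users from piling on at once. A global cap matching OLLAMA_NUM_PARALLEL keeps the overflow waiting in the bot instead of stacking up inside Ollama. A sketch for the Slack bot (Bolt runs listeners in threads; use asyncio.Semaphore in the Discord bot):

import threading

# at most GLOBAL_QUEUE_SIZE requests in flight against Ollama
OLLAMA_SLOTS = threading.BoundedSemaphore(GLOBAL_QUEUE_SIZE)

def ask_ollama_limited(prompt: str, history: list[dict]) -> str:
    with OLLAMA_SLOTS:  # blocks until a slot frees up
        return ask_ollama(prompt, history)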
For team-wide cost monitoring, track tokens per user in Redis or Postgres. Most teams flag any user exceeding 50,000 tokens/day for review.
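A minimal version of that tracking, as a sketch assuming redis-py; Ollama's chat responses report token counts in the eval_count and prompt_eval_count fields:

import datetime
import redis

r = redis.Redis(host="localhost", port=6379)

def record_usage(user_id: str, resp) -> int:
    # prompt + completion tokens for this response
    tokens = resp.get("eval_count", 0) + resp.get("prompt_eval_count", 0)
    key = f"tokens:{user_id}:{datetime.date.today().isoformat()}"
    total = r.incrby(key, tokens)
    r.expire(key, 3 * 24 * 3600)  # keep a few days of history for review
    return total  # compare against the 50,000/day flag threshold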
For multi-user rate-limiting at the Ollama layer itself, see Ollama rate limiting for multi-user setups.
Production Deployment {#deployment}
systemd unit (Linux)
# /etc/systemd/system/ai-bot.service
[Unit]
Description=Local AI Slack/Discord bot
After=network.target ollama.service
[Service]
Type=simple
User=botuser
WorkingDirectory=/opt/ai-bot
EnvironmentFile=/opt/ai-bot/.env
ExecStart=/opt/ai-bot/venv/bin/python slack_bot.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now ai-bot
sudo journalctl -u ai-bot -f
Docker
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "slack_bot.py"]
docker build -t ai-bot .
docker run -d --restart always --env-file .env --name ai-bot --network host ai-bot
Monitoring
Bare minimum: log every request, response length, and latency. Plug into Prometheus for real metrics:
from prometheus_client import Counter, Histogram, start_http_server
REQ = Counter("bot_requests_total", "Total requests", ["channel_type"])
LAT = Histogram("bot_latency_seconds", "End-to-end latency")
start_http_server(9100)
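Then route requests through the metrics. A sketch wrapping the existing helper:

def ask_ollama_instrumented(prompt: str, history: list[dict],
                            channel_type: str = "channel") -> str:
    REQ.labels(channel_type=channel_type).inc()
    with LAT.time():  # records end-to-end latency into the histogram
        return ask_ollama(prompt, history)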
For full observability, see Ollama Prometheus + Grafana.
Pitfalls We Hit (So You Do Not) {#pitfalls}
- Slack rate limits are per-method, not global. If you call reactions_add on every message, you will hit 429s under load. Cache per-channel reaction state.
- Discord intents must be enabled in the developer portal AND in code. Forgetting one or the other causes silent message-ignore behavior with no error.
- Streaming responses break Slack. Slack does not support edit-as-you-stream. Buffer the full response then post once.
- OLLAMA_KEEP_ALIVE matters. Default is 5 minutes; once unloaded, first message after a quiet period takes 8-15 seconds to respond. Set it to 24h.
- Thread history grows unbounded in memory. Use Redis with a TTL of 24 hours. Otherwise expect to restart the bot weekly.
- The model will roleplay as your CEO if asked. Add a system prompt rule: "Never impersonate specific people. Never claim to be a human."
- Bot owners get DM'd weird stuff. Add an admin command /audit that lets you see anonymized log samples to spot abuse patterns.
Wrap-Up
A self-hosted team chat bot is one of the highest-leverage pieces of software you can build for your company in 2026. It costs roughly nothing to run, gives non-technical employees an AI helper without leaking proprietary data, and turns your internal documentation into something people actually read. The Slack version above is in production at three companies I know of — including the original startup that asked for "an AI in Slack that doesn't leak our roadmap."
Start with the Quick Start. Get a basic mention working. Then add RAG over your handbook. Then add the two slash commands your team uses most. By month three, your team will treat the bot like a colleague. The fact that nobody outside the company can see anything it does is the part you sell to your security team.
Want a deeper integration story? Read Ollama function calling and tool use for action-taking bots, or private OpenAI-compatible API to expose your bot's brain to other apps.