Enterprise Incident Management Platform · Part 1 of 2

April 16, 2026 · 7 min read · Engineering / Incident Management

Building an Enterprise Slack Incident Command Bot

From manual triage to automated incident coordination across 30+ teams

How I built an enterprise Slack bot that auto-creates incident channels, routes responders from OpsGenie team mappings, and enforces a 30-minute update cadence, and how it was later consolidated into Next.js as the backbone of a full status platform.

TL;DR

Manual incident triage across 30+ siloed teams was unsustainable. I built a Slack bot — initially in Python on AWS Lambda with DynamoDB — that auto-creates incident channels, routes the right responders, and enforces update cadence. Later refactored into Next.js as one codebase with the status dashboard.

The client had over 3,000 locations and a Digital Enablement org spanning more than 30 teams. My team was the DevOps Reliability and Monitoring team. Service dependencies across those teams weren't straightforward — everyone knew their immediate domain well, but few had a clear picture of how their component's health rippled up and down the stack. When something broke, the gaps in the manual process showed fast.


A note on the images

The diagrams and mockups in this post are generalized representations. Component names, team names, and organizational details have been abstracted to protect client confidentiality.

The Problem with Manual Incident Triage

Before the bot existed, an incident meant someone getting paged at 2am, manually creating a Slack channel, cross-referencing a spreadsheet to figure out who to invite, and then DMing stakeholders one by one for status updates. There was no enforced process. Channels ended up with the wrong people. Stakeholders had no visibility unless someone remembered to update them. Incident timelines varied wildly depending on who was on-call that night. The org was growing, and the entropy was growing with it.

Architecture

  • Bot runtime: Python, deployed as an AWS Lambda
  • Trigger: Slack slash command /incident, received via an API Gateway endpoint
  • Data store: AWS DynamoDB (component config, team membership, incident history)
  • Alert integration: OpsGenie API (team lookup, on-call membership, team dropdown population)
  • Multi-workspace: enterprise Slack app installed across multiple workspaces in the org
  • Later consolidated: refactored into Next.js API routes on Vercel, sharing the same DynamoDB schema as the status dashboard
Request flow from /incident to auto-created channel. The dashed green box shows the later consolidation into Next.js API routes.

The `/incident` Slash Command

When a user ran /incident in Slack, the API Gateway endpoint received the request along with the user's Slack ID. The bot resolved that ID to an email address, then used the email to determine which components the user was mapped to and what role they held. Non-admins could declare incidents for their mapped components but couldn't modify component configuration. Admins had a superset — they could create components, update any component's state, and manage team membership. For admins, running /incident configure opened a Slack modal to add or update a component, name it, and select the owning OpsGenie team from a dynamically populated dropdown fetched live from the OpsGenie API.

The /incident modal. Component and severity dropdowns are populated dynamically. Admins see additional fields not shown here.

Role resolution was email-driven

Slack IDs were resolved to email addresses on every request so that Okta-managed team membership changes propagated automatically — no manual bot sync required when someone joined or left a team.
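Under those assumptions, the resolution step reduces to a pure lookup over the component map. A minimal sketch (the record shape, component names, and emails here are illustrative, not the client's actual schema):

```python
# Hypothetical component records: each maps member emails to roles.
COMPONENTS = {
    "checkout-api": {
        "owning_team": "payments-oncall",
        "members": {"alice@example.com": "admin", "bob@example.com": "member"},
    },
    "search-service": {
        "owning_team": "search-oncall",
        "members": {"bob@example.com": "admin"},
    },
}

def resolve_permissions(email: str, components: dict) -> dict:
    """Map a user's email to the components they can declare against
    and whether they hold admin rights on each."""
    result = {}
    for name, cfg in components.items():
        role = cfg["members"].get(email)
        if role is not None:
            result[name] = {
                "can_declare": True,               # any mapped member may declare
                "can_configure": role == "admin",  # only admins edit the component
            }
    return result
```

Because the lookup is keyed on email rather than Slack ID, nothing in the bot's store needs updating when Okta membership changes; the next /incident invocation simply resolves against the fresh roster.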

Component and Team Configuration

Each component in DynamoDB stored the component name, the owning OpsGenie team, and a map of every team member's email address to their role on the team. This mapping was the core of everything. It controlled which components appeared in the /incident dropdown for a given user, which responders were auto-invited when an incident was declared, and whether a user had update permissions or was restricted to read and declare. Component membership was sourced from OpsGenie team rosters, keeping it consistent with who was actually on-call.
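A sketch of what one such item and the roster-sync step might look like (field names and the pk convention are invented for illustration; the actual schema is not shown in this post):

```python
# Hypothetical DynamoDB item for one component.
component_item = {
    "pk": "COMPONENT#checkout-api",
    "name": "checkout-api",
    "opsgenie_team": "payments-oncall",
    "members": {"alice@example.com": "admin", "bob@example.com": "member"},
}

def sync_members_from_roster(item: dict, roster_emails: list[str]) -> dict:
    """Reconcile the member map with an OpsGenie team roster: drop emails
    no longer on the roster, add new ones as plain members, and preserve
    existing role assignments."""
    current = item["members"]
    item["members"] = {email: current.get(email, "member") for email in roster_emails}
    return item

synced = sync_members_from_roster(
    component_item, ["alice@example.com", "carol@example.com"]
)
# alice keeps her admin role, bob is dropped, carol joins as a member
```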

Automated Incident Channels

When an incident was declared, the bot called conversations.create to provision a dedicated Slack channel, then called conversations.invite for every team member mapped to the affected component. It posted an opening message with the incident details, the designated primary POC, and a link to the status page. The incident commander could adjust permissions from inside the channel, granting update rights to specific responders or revoking them as the incident evolved.
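The provisioning flow can be sketched with a fake standing in for slack_sdk's WebClient. The method names (conversations_create, users_lookupByEmail, conversations_invite, chat_postMessage) mirror the real client; everything else here is illustrative:

```python
class FakeSlack:
    """Minimal stand-in for slack_sdk's WebClient, for illustration only."""
    def __init__(self):
        self.posted = []
        self.invited = None
    def conversations_create(self, name):
        return {"channel": {"id": "C123", "name": name}}
    def users_lookupByEmail(self, email):
        return {"user": {"id": "U_" + email.split("@")[0]}}
    def conversations_invite(self, channel, users):
        self.invited = (channel, users)
    def chat_postMessage(self, channel, text):
        self.posted.append((channel, text))

def open_incident(client, component, member_emails, opening_text, date_str):
    """Create the incident channel, invite every responder mapped to the
    component, and post the opening message. Returns the channel id."""
    resp = client.conversations_create(name=f"inc-{component}-{date_str}")
    channel_id = resp["channel"]["id"]
    user_ids = [
        client.users_lookupByEmail(email=e)["user"]["id"] for e in member_emails
    ]
    client.conversations_invite(channel=channel_id, users=",".join(user_ids))
    client.chat_postMessage(channel=channel_id, text=opening_text)
    return channel_id

fake = FakeSlack()
cid = open_incident(fake, "checkout-api", ["alice@example.com"],
                    "SEV2 declared for checkout-api", "20260416")
```

Injecting the client this way also keeps the flow testable without a live workspace.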

Channel naming convention

Using a consistent inc-{component}-{YYYYMMDD} naming scheme made it possible to find any historical incident channel instantly, and kept postmortem linking unambiguous across months of incidents.
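One wrinkle worth handling: Slack channel names must be lowercase, limited to letters, digits, hyphens, and underscores, and at most 80 characters. A small builder along these lines (a sketch, not the exact production code) keeps arbitrary component names within those rules:

```python
import re
from datetime import date

def incident_channel_name(component: str, on: date) -> str:
    """Build the inc-{component}-{YYYYMMDD} channel name, normalized to
    Slack's channel-name constraints (lowercase, [a-z0-9-_], <= 80 chars)."""
    slug = re.sub(r"[^a-z0-9-]+", "-", component.lower()).strip("-")
    return f"inc-{slug}-{on:%Y%m%d}"[:80]
```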

Enforced Update Cadence

One of the most operationally impactful features was a simple reminder loop. If no update had been posted to the incident channel in the previous 30 minutes and the incident wasn't in a closed state, the bot sent a reminder into the channel. This eliminated the most common stakeholder complaint — incidents going silent for long stretches. Engineers deep in diagnosis don't naturally think about stakeholder comms. The bot handled the cadence enforcement automatically so they didn't have to context-switch away from the problem.

The reminder fires whenever 30 minutes pass without an update and the incident is still open. Engineers only need to post an update to silence it.
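The cadence check itself is a small predicate the reminder loop can evaluate on a schedule. A sketch (state names are illustrative):

```python
from datetime import datetime, timedelta, timezone

CADENCE = timedelta(minutes=30)
CLOSED_STATES = {"resolved", "closed"}

def needs_reminder(state: str, last_update: datetime, now: datetime) -> bool:
    """True when the incident is still open and the channel has been
    silent for longer than the cadence window."""
    return state not in CLOSED_STATES and (now - last_update) >= CADENCE

now = datetime(2026, 4, 16, 3, 0, tzinfo=timezone.utc)
needs_reminder("investigating", now - timedelta(minutes=45), now)  # fires
needs_reminder("resolved", now - timedelta(minutes=45), now)       # silent
```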

Permission Model During an Active Incident

Permissions weren't static. Inside an active incident channel, the incident commander had the ability to promote responders to update status, bring in engineers from other teams who weren't originally mapped to the component, or restrict who could post status changes. This was important for cross-team incidents where the owning team needed to pull in a dependency team mid-incident without that team having standing access to update the component's public status.
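The commander-gated mutation described above can be sketched as a pair of guarded operations on the incident record (the record shape and the single-commander rule as expressed here are assumptions for illustration):

```python
def grant_update_rights(incident: dict, actor: str, grantee: str) -> dict:
    """Let the incident commander extend update rights mid-incident,
    including to responders not originally mapped to the component."""
    if actor != incident["commander"]:
        raise PermissionError("only the incident commander can change rights")
    incident["can_update"].add(grantee)
    return incident

def revoke_update_rights(incident: dict, actor: str, grantee: str) -> dict:
    """Commander-only: remove a responder's right to post status changes."""
    if actor != incident["commander"]:
        raise PermissionError("only the incident commander can change rights")
    incident["can_update"].discard(grantee)
    return incident

incident = {"commander": "alice@example.com", "can_update": {"alice@example.com"}}
grant_update_rights(incident, "alice@example.com", "dave@example.com")
```

Keeping the grant on the incident record rather than the component config is what lets a dependency team participate without gaining standing access.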

Migrating from Lambda to Next.js

Later in the engagement, when the custom status dashboard was built on Next.js and deployed on Vercel, it made sense to consolidate. The Lambda functions were refactored into Next.js API routes. The DynamoDB schema remained unchanged — the bot and dashboard had been designed around the same data model from the start. The result was one codebase, one deployment, one authentication layer. If you want to see how the dashboard was built on top of this same data model, read Part 2: Enterprise Health Status Dashboard.

Impact

  • Eliminated manual channel creation and responder lookup — incident channels were provisioned automatically with the right people already inside, within seconds of declaration.
  • The 30-minute reminder cadence enforced a consistent update process without requiring on-call engineers to remember stakeholder communications during active incidents.
  • Presented to senior Grafana engineers at the client's request — recognized as more advanced and purpose-built than any off-the-shelf option available.
  • The shared DynamoDB data model made the status dashboard buildable in days rather than weeks — component and team data infrastructure already existed.
  • Consolidating into Next.js eliminated the Lambda cold-start overhead and reduced infrastructure surface area, making the entire incident platform easier to maintain and extend.

Lessons Learned

  • Use email as the identity anchor. Tying everything to Okta-managed email addresses meant team membership changes propagated automatically, rather than requiring a separate user management flow in the bot.
  • Idempotent channel creation is non-negotiable. Duplicate alerts happen. If the bot attempts to create a channel that already exists, it needs to handle that gracefully and route to the existing channel instead. The Slack API returns a name_taken error — handle it explicitly.
  • Slack API rate limits matter at scale. A multi-component outage firing several alerts simultaneously can exhaust write limits quickly — batching and retry logic with backoff are essential.
  • The hardest part wasn't the API calls. It was agreeing on what a 'component' meant, who owned it, and what counted as an incident worth declaring. The bot forced a conversation the org needed to have anyway.
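The idempotency and backoff lessons can be sketched together. The name_taken branch and the retry loop below are illustrative; NameTaken stands in for slack_sdk's SlackApiError with error == "name_taken", RuntimeError stands in for a 429 ratelimited error, and find_channel is a hypothetical lookup helper:

```python
import time

class NameTaken(Exception):
    """Stand-in for SlackApiError with error == 'name_taken'."""

def create_or_reuse(client, name: str) -> str:
    """Idempotent channel creation: on name_taken, look up and reuse the
    existing channel instead of failing the duplicate alert."""
    try:
        return client.conversations_create(name=name)["channel"]["id"]
    except NameTaken:
        return client.find_channel(name)  # hypothetical lookup helper

def with_backoff(call, retries=3, base_delay=1.0):
    """Retry a rate-limited call with exponential backoff, then let the
    final attempt's error propagate."""
    for attempt in range(retries):
        try:
            return call()
        except RuntimeError:  # stand-in for a ratelimited response
            time.sleep(base_delay * 2 ** attempt)
    return call()

class FakeClient:
    def __init__(self):
        self.names = {"inc-checkout-api-20260416": "C999"}
    def conversations_create(self, name):
        if name in self.names:
            raise NameTaken(name)
        self.names[name] = "C_new"
        return {"channel": {"id": "C_new"}}
    def find_channel(self, name):
        return self.names[name]

fake = FakeClient()
existing = create_or_reuse(fake, "inc-checkout-api-20260416")  # reuses C999
```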
Tags: Slack API · OpsGenie · AWS Lambda · DynamoDB · Next.js · Incident Management · Python
Anthony Paul Ruiz