
April 16, 2026 · 8 min read · Engineering / Observability

Building an Enterprise Health Status Dashboard

When the vendor can't bend, you build your own

How I built a custom Next.js status dashboard with Okta SSO, role-based stakeholder views, and OpsGenie-driven component state — replacing Statuspage.io for a client who needed control the vendor couldn't offer.

TL;DR

A client with 30+ teams outgrew Statuspage.io's seat limits, Okta integration gaps, and customization ceiling. I built a full replacement in Next.js on Vercel — real-time component health, Okta SSO with component-level access control, 30/60/90-day uptime calculations, deep-linked shareable URLs, dynamic OG previews, and scoped API keys — all sharing the DynamoDB backing the Slack bot from Part 1.

This is Part 2 of a two-part series. Part 1 covers the Slack Incident Command Bot — the DynamoDB data model and OpsGenie team mappings built there are what this dashboard is built on top of. You can read them in any order, but Part 1 gives useful context for the architecture decisions here.

The org had been running Statuspage.io for a while when I started piloting it across our sub-org. For a while it worked. Teams could post updates, stakeholders had a URL to check, and the automated SLO-breach state changes via API gave it a real-time feel. Then the org kept growing, and the cracks started showing. What the vendor offered and what we actually needed had drifted apart far enough that the path forward was obvious: build it ourselves.


A note on the images

The diagrams and mockups in this post are generalized representations. Component names, team names, and organizational details have been abstracted to protect client confidentiality.

Why Statuspage.io Hit Its Ceiling

There were four concrete problems. First, seat costs: once you pushed past 1,000 users, the pricing model became untenable for a platform meant for the entire org. Second, Okta SSO worked for authentication, but the assertions weren't being honored for access control — everyone who authenticated saw the same components at the same level of detail, regardless of their team or role. Third, branding and customization were limited — you could inject JavaScript and hack CSS, but it was fragile and outside supported usage. Fourth, there was no native path to integrate with OpsGenie and Slack in the way we needed: component state auto-driven by SLO rules in Datadog and Prometheus, with incident coordination happening in the same toolchain teams were already using.

Architecture

  • Frontend: Next.js (App Router), deployed on Vercel — synced directly from the repo
  • Auth: Okta SSO via NextAuth.js — OIDC, server-side assertion reading for role resolution
  • Data store: AWS DynamoDB — shared with the Slack bot (components, teams, incident history)
  • Component state: OpsGenie API + scoped API keys for SLO-driven auto-update from Datadog / Prometheus rules
  • Incident coordination: Slack bot (Part 1) — channel creation, responder routing, 30-min reminder cadence
  • DNS: Route 53, proxied through Cloudflare — consistent with the rest of the client's infrastructure

The First Version: Porting Statuspage.io

The initial prototype took a few days. I pointed the Next.js app at the Statuspage.io API to pull component data, current status, and incident history — enough to prove the concept worked and show stakeholders a familiar layout on infrastructure I controlled. Once the department head saw it, the scope expanded: build something that matched Statuspage.io feature-for-feature and then went further. I started with uptime and access control, since those were the two gaps that had the most immediate impact.

Calculating Uptime: The Math Behind the Badges

Statuspage.io shows uptime percentages per component, but we needed to compute this ourselves once we moved off their API. The approach: pull incident history for each component with epoch timestamps for start and end. Filter to incidents with severity of Minor Outage or Major Outage — Degraded states were excluded since they represented impaired but not unavailable service. Sum the total outage duration in seconds. Divide the remaining seconds by the total seconds in the window (30 days = 2,592,000 seconds, and so on). The result was a 30-, 60-, and 90-day uptime percentage per component that matched the industry-standard SLA reporting format stakeholders expected.
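The arithmetic above can be sketched as a small pure function. The `Incident` shape and severity names here are illustrative assumptions, not the production DynamoDB schema:

```typescript
// Uptime from incident history with epoch-second timestamps, as described above.
type Severity = "degraded" | "minor_outage" | "major_outage";

interface Incident {
  component: string;
  severity: Severity;
  start: number; // epoch seconds
  end: number;   // epoch seconds (resolution time)
}

const SECONDS_PER_DAY = 86_400;

function uptimePercent(
  incidents: Incident[],
  windowDays: number,
  now: number = Math.floor(Date.now() / 1000)
): number {
  const windowSeconds = windowDays * SECONDS_PER_DAY; // 30 days = 2,592,000 s
  const windowStart = now - windowSeconds;

  const outageSeconds = incidents
    // Only full outages count against uptime; Degraded states are excluded.
    .filter((i) => i.severity !== "degraded")
    // Clamp each incident to the reporting window before summing.
    .map((i) => Math.max(0, Math.min(i.end, now) - Math.max(i.start, windowStart)))
    .reduce((sum, s) => sum + s, 0);

  return ((windowSeconds - outageSeconds) / windowSeconds) * 100;
}
```

Because everything is epoch seconds, the same function serves the 30-, 60-, and 90-day badges by varying `windowDays`.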

Left: incident timeline with outage windows marked within each reporting period. Right: the resulting uptime badges. Color thresholds: green ≥99.9%, yellow ≥99%, orange ≥95%, red <95%.

Why epoch timestamps matter

Storing incident start and end times as epoch seconds made the uptime arithmetic straightforward and timezone-agnostic. No moment.js, no timezone conversion bugs — just subtraction and division.

Okta SSO and Component-Level Access Control

Using NextAuth.js with the Okta OIDC provider, I could read the authenticated user's email assertion server-side on every request. That email was the key into the DynamoDB component membership table — the same table the Slack bot used. Depending on the user's role for a given component, the server would include or exclude that component from the rendered page. Admins saw everything and had access to all configuration options. Team members saw their components with full incident context and update capabilities. The public-facing view — no auth required — showed only coarse component health with no incident details. All of this was enforced server-side, not just in the UI.
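The server-side filtering can be sketched as a pure function over the resolved role and membership set. The role names, detail levels, and membership shape are illustrative; in the real system the membership lookup hits the shared DynamoDB table:

```typescript
// Server-side component filtering keyed on the Okta-resolved role, as described above.
type Role = "admin" | "member" | "public";

interface Component {
  id: string;
  name: string;
  status: string;
}

interface ComponentView extends Component {
  detail: "full" | "coarse";
}

function visibleComponents(
  components: Component[],
  role: Role,
  memberOf: Set<string> // component IDs from the membership table
): ComponentView[] {
  if (role === "admin") {
    // Admins see everything with full configuration access.
    return components.map((c) => ({ ...c, detail: "full" }));
  }
  if (role === "public") {
    // Unauthenticated users see only coarse health, no incident details.
    return components.map((c) => ({ ...c, detail: "coarse" }));
  }
  // Team members: full incident context on their own components only.
  return components.map((c) => ({
    ...c,
    detail: memberOf.has(c.id) ? "full" : "coarse",
  }));
}
```

Because this runs on the server before rendering, a user who tampers with the client never receives the excluded detail in the first place.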

The same component list renders differently based on the Okta email assertion resolved server-side. Public users get coarse status. Team members get incident context for their components only. Admins get full configuration.

Deep-Linking and Open Graph Previews

Every state in the dashboard was serialized into the URL via query strings — which component was selected, which incident modal was open, which history view was active. Sharing a URL during an incident took the recipient directly to that exact view, no navigation required. On top of that, I implemented dynamic Open Graph metadata: every incident URL generated a custom OG image with the incident title, affected component, current status, and severity. When someone shared that URL in Slack or iMessage, the preview showed actionable information before they even clicked.
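The state-in-the-URL pattern is a simple round-trip through query strings. The `DashboardState` fields here are illustrative stand-ins for the dashboard's actual view state:

```typescript
// Serialize dashboard view state into a shareable query string, and back.
interface DashboardState {
  component?: string; // selected component
  incident?: string;  // open incident modal
  view?: "current" | "history";
}

function stateToQuery(state: DashboardState): string {
  const params = new URLSearchParams();
  if (state.component) params.set("component", state.component);
  if (state.incident) params.set("incident", state.incident);
  if (state.view) params.set("view", state.view);
  return params.toString();
}

function queryToState(query: string): DashboardState {
  const params = new URLSearchParams(query);
  const view = params.get("view");
  return {
    component: params.get("component") ?? undefined,
    incident: params.get("incident") ?? undefined,
    // Ignore unknown view values rather than trusting arbitrary input.
    view: view === "history" || view === "current" ? view : undefined,
  };
}
```

Because every view is recoverable from the URL alone, the same serialized state also feeds the server-side OG image generation: the incident ID in the query is enough to render a preview with title, component, status, and severity.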

Incident URLs are fully deep-linked and generate dynamic OG metadata server-side. The Slack unfurl shows the incident title, severity, and current status before the recipient opens the link.

Scoped API Keys for SLO-Driven State Updates

One of the original goals with Statuspage.io was to let teams auto-update component state when an availability SLO was breached — a Datadog or Prometheus alert rule would fire an API call to set the component to Degraded, and another call when the SLO recovered to set it back to Operational. We replicated this on the custom dashboard with scoped API keys. Each key was tied to a user's component membership: it could only update the components the key owner was a member of. Admin keys had broader access. This meant teams could safely distribute API keys to their alerting infrastructure without worrying about accidentally updating components they didn't own.
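The authorization check behind scoped keys reduces to a membership test. The key shape and status names below are illustrative; real keys live alongside the component membership data in DynamoDB:

```typescript
// Scoped API key authorization for SLO-driven state updates, as described above.
interface ApiKey {
  owner: string;           // email of the key's owner
  admin: boolean;          // admin keys may update any component
  components: Set<string>; // component IDs the owner is a member of
}

type ComponentStatus = "operational" | "degraded" | "minor_outage" | "major_outage";

function canUpdateComponent(key: ApiKey, componentId: string): boolean {
  return key.admin || key.components.has(componentId);
}

function applyStateUpdate(
  key: ApiKey,
  componentId: string,
  newStatus: ComponentStatus
): { ok: boolean; status?: ComponentStatus; error?: string } {
  if (!canUpdateComponent(key, componentId)) {
    // Reject loudly rather than silently ignoring, so a misconfigured
    // alert rule pointing at the wrong component surfaces immediately.
    return { ok: false, error: `key for ${key.owner} is not scoped to ${componentId}` };
  }
  return { ok: true, status: newStatus };
}
```

A Datadog or Prometheus webhook carrying a team's key can then only ever flip that team's components between `degraded` and `operational`.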

Incident Management: One Platform, Two Surfaces

Incidents could be declared from either the dashboard or the Slack bot. When declared from the dashboard, the same bot logic from Part 1 fired — a dedicated Slack channel was created, responders were auto-invited, and the 30-minute update reminder was armed. When declared from Slack, the incident state was reflected immediately on the dashboard. Both surfaces read and wrote to the same DynamoDB tables, so there was no sync step and no risk of state drift between the two.
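The no-sync property comes from both surfaces sharing one write path. A minimal sketch, with an in-memory `Map` standing in for the shared DynamoDB table:

```typescript
// One write path for incidents, whichever surface declares them.
type Surface = "dashboard" | "slack";

interface IncidentRecord {
  id: string;
  component: string;
  declaredFrom: Surface;
  status: "open" | "resolved";
}

// Stands in for the shared DynamoDB table from Part 1.
const incidents = new Map<string, IncidentRecord>();

function declareIncident(id: string, component: string, from: Surface): IncidentRecord {
  const record: IncidentRecord = { id, component, declaredFrom: from, status: "open" };
  incidents.set(id, record);
  // In the real system this is also where the Slack channel is created,
  // responders are auto-invited, and the 30-minute reminder is armed.
  return record;
}

// Either surface reads the same record back; there is no cross-surface sync.
function getIncident(id: string): IncidentRecord | undefined {
  return incidents.get(id);
}
```

Because there is exactly one record per incident, "state drift" between Slack and the dashboard is structurally impossible rather than merely unlikely.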

Consolidating Everything into One Codebase

The original Slack bot was Python running on AWS Lambda behind API Gateway. Once the dashboard was stable on Next.js and Vercel, it was a natural consolidation: refactor the Lambda functions into Next.js API routes. The DynamoDB schema was unchanged — the data model had been designed to serve both surfaces from the start. Eliminating Lambda meant no cold-start overhead, no separate deployment pipeline, and no context-switching between Python and TypeScript. Everything was in one repo, deployed in one Vercel project, behind one authentication layer.

Lessons Learned

  • Defining 'degraded' is harder than building the UI. Teams had very different interpretations of what each status level meant for their component. Getting alignment on that definition was more work than any of the technical implementation.
  • Dynamic OG images are worth the effort. Stakeholders and engineers share incident URLs constantly. When the preview shows the incident title and status inline, people act on it faster — they know what they're clicking into before they open it.
  • Server-side access control is the only access control. Client-side rendering guards are useful for UX, but enforcement has to happen on the server where the Okta assertion is authoritative. Any other approach can be bypassed.
  • Scoped API keys need to be the default, not an option. Teams initially wanted broad keys for simplicity. Scoped-by-default meant fewer accidents and made key rotation manageable when team membership changed.
  • Consolidate infrastructure sooner. Running the bot on Lambda and the dashboard on Vercel separately for months created unnecessary friction. The shared data model made consolidation easy — it should have happened earlier.
#Next.js #Vercel #Okta #OpsGenie #NextAuth #DynamoDB #IncidentManagement