
2024-06-01 · 9 min read · Observability

Enterprise IoT Observability with Grafana

How I built a unified observability platform from scratch for a workplace safety IoT company — and automated away 80% of support ticket creation in the process

As lead support engineer at an IoT workplace safety startup, I made the case for Grafana and single-handedly built the company's entire observability stack — integrating Clickhouse SQL, Prometheus, InfluxDB, BigQuery, Databricks, in-house APIs, and inventory databases into a single pane of glass. I then wired Grafana OnCall to auto-generate fully-contextualized Zendesk tickets, eliminating black-box troubleshooting and driving 80%+ of all support ticket creation automatically.

TL;DR

An IoT workplace safety company had a single ambiguous Datadog dashboard and no way to quickly understand what was failing, where, or for which client. I made the case for Grafana, built the entire stack solo from scratch — integrating seven data sources: Clickhouse, Prometheus, InfluxDB, BigQuery, Databricks, an in-house API, and an inventory database — and wired Grafana OnCall to auto-generate fully-contextualized Zendesk tickets. The result: 80%+ of all support tickets created automatically, zero alert fatigue, and hours of manual triage turned into seconds.

The company made wearable IoT sensors that tracked ergonomic risk — bends, twists, and unsafe postures — while workers moved through warehouse floors all day. At shift end, workers docked their devices at Android tablets running a kiosk app. The dock synced the day's data to AWS and the cloud platform. At peak, the platform was live in hundreds of enterprise client sites across the country, each with their own networking environment, IT constraints, and operational expectations.

I was the lead support engineer for this platform — responsible for troubleshooting client issues end-to-end, onboarding new accounts, and serving as the technical voice of the product during sales calls. When I joined, the observability story was one ambiguous Datadog dashboard. It would occasionally surface a signal — an elevated metric for a warehouse_id or dock_id — but turning that signal into an actionable support case took a painful amount of manual work.


A note on the images

The diagrams and mockups in this post are generalized representations. Component names, team names, and organizational details have been abstracted to protect client confidentiality.

The Problem: Black-Box Troubleshooting at Scale

The existing workflow was exactly as painful as it sounds. An alert fires — maybe. You see a warehouse_id in a Datadog panel and a vague metric spike. To figure out what warehouse that actually was, you'd leave Datadog, cross-reference a spreadsheet, then hit an internal tool to get the dock's SAT number. Then you'd reach out to a client manager just to get the contact information for the right person at that client site. By this point you still had no idea *why* there was an issue — only that something was probably wrong.

The support team was operating in the dark. Every issue started with 30 minutes of context-gathering before any real diagnosis could begin. As the client base grew and the device fleet expanded, this process didn't scale — it just created more entropy. New clients had unique networking constraints. Some required working through an IT department before any on-site troubleshooting. With no unified picture of device health, dock status, client context, and infrastructure metrics in one place, there was no way to stay ahead of issues.

Making the Case for Grafana

I had worked with Grafana before and knew what it could do with the right data sources behind it. The key insight was that we didn't just have an alerting problem — we had a *correlation* problem. The signals existed across multiple systems: device telemetry in Clickhouse, infrastructure metrics in Prometheus, time-series data in InfluxDB, analytics in BigQuery and Databricks, and the critical contextual layer — client contacts, dock mappings, SAT numbers — locked in an in-house API and an inventory database. No single tool was going to surface all of that automatically. But Grafana could unify it.

I made the case to leadership: Grafana's multi-datasource architecture meant we could write queries that joined device telemetry with inventory context and surface a complete operational picture per warehouse and dock — not raw metrics, but *answers*. The pitch wasn't just better dashboards. It was replacing the manual lookup workflow entirely. Leadership approved it. I owned the implementation end-to-end.

Data Architecture

  • Device events: Clickhouse — complex SQL queries joining dock activity, sync events, upload success rates, and device usage patterns per warehouse
  • Infrastructure metrics: Prometheus — platform health, AWS service metrics, dock connectivity signals
  • Time-series telemetry: InfluxDB — high-frequency device sensor data streams
  • Analytics: BigQuery — historical usage patterns, device adoption trends per client
  • ML / data platform: Databricks — model output and enriched usage signals from the data science team
  • Client context: In-house REST API — warehouse names, client contacts, account metadata
  • Dock inventory: Inventory database — dock SAT numbers, device serial mappings, site configurations

The most technically interesting work was the Clickhouse layer. The queries weren't simple metric reads — they joined device event tables with the inventory database to produce per-dock, per-warehouse views that merged device health signals with client context in a single query result. Grafana's Clickhouse data source plugin handled the transport, but the SQL work required understanding the full schema across multiple systems. I also built a custom Grafana data source proxy that called the in-house API at panel render time, injecting warehouse names and client contacts as template variables so every dashboard was always contextually accurate — no static lists to maintain.
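To make the enrichment concrete, here's a minimal in-memory sketch of the kind of join those Clickhouse queries performed — device sync events merged with inventory context so a single result row carries both the health signal and the operational context. All field names here (`dock_id`, `sat_number`, and so on) are hypothetical stand-ins, not the real schema.

```python
# Hypothetical sketch of the Clickhouse-style enrichment join, done in
# memory for illustration: device sync events are merged with inventory
# rows so each output row carries the health signal AND the human context.
# Field names are illustrative placeholders, not the production schema.

def enrich_sync_events(events, inventory):
    """Join raw sync events to inventory rows keyed by dock_id."""
    inv_by_dock = {row["dock_id"]: row for row in inventory}
    enriched = []
    for ev in events:
        inv = inv_by_dock.get(ev["dock_id"], {})
        enriched.append({
            "dock_id": ev["dock_id"],
            "sync_success": ev["sync_success"],
            "warehouse_name": inv.get("warehouse_name", "unknown"),
            "sat_number": inv.get("sat_number", "unknown"),
            "client_contact": inv.get("client_contact", "unknown"),
        })
    return enriched

events = [
    {"dock_id": "d-101", "sync_success": False},
    {"dock_id": "d-102", "sync_success": True},
]
inventory = [
    {"dock_id": "d-101", "warehouse_name": "Warehouse A",
     "sat_number": "SAT-0042", "client_contact": "J. Doe / 555-0100"},
]

for row in enrich_sync_events(events, inventory):
    print(row["dock_id"], row["warehouse_name"], row["sat_number"])
```

In the real stack this join ran inside Clickhouse SQL against the inventory tables, so the enrichment happened once at query time rather than in application code.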

The Dashboard Suite

The goal wasn't a single god-dashboard. It was a layered set of views that answered different questions at different levels of zoom. The top-level fleet overview showed every warehouse at a glance: dock sync rates, upload success percentages, devices active today, and any open incidents — color-coded by severity. From there, you could drill into a single warehouse to see individual dock-level health, recent upload history, and the client contact for that site — all in one panel row.

A critical design decision was surfacing the *human* context alongside the technical signal. Before Grafana, the warehouse name, the dock SAT number, and the client contact lived in three different systems. In the new dashboards, they all appeared inline. An engineer looking at a dock failure saw the warehouse name, the SAT number, and the contact name and phone number in the same row as the failure signal. The time from "alert fires" to "I know who to call" went from 20-30 minutes to near-zero.

I also built usage-pattern analytics that rivaled what the dedicated data analytics team was producing in Looker. By merging device sync events with the shift schedule data in Clickhouse and grouping by warehouse and role type, I could surface which client sites had the highest device utilization, which docks were chronically problematic, and which warehouses had adoption patterns that suggested training gaps. The data science team started pulling from my dashboards because the operational joins I was doing surfaced signal they couldn't easily replicate in Looker.
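The utilization rollup described above reduces to a simple group-and-divide: count distinct devices that synced per warehouse, then divide by the devices assigned to that site. A hedged sketch, with hypothetical warehouse and device identifiers:

```python
# Illustrative sketch of the utilization rollup: group one day's device
# sync events by warehouse and compute utilization as distinct active
# devices / assigned devices. Names and numbers are hypothetical.
from collections import defaultdict

def utilization_by_warehouse(sync_events, assigned_counts):
    """sync_events: (warehouse, device_serial) pairs for one day."""
    active = defaultdict(set)
    for warehouse, serial in sync_events:
        active[warehouse].add(serial)  # distinct devices only
    return {
        wh: len(active.get(wh, set())) / assigned
        for wh, assigned in assigned_counts.items()
    }

events = [("WH-A", "dev-1"), ("WH-A", "dev-2"), ("WH-A", "dev-1"),
          ("WH-B", "dev-9")]
assigned = {"WH-A": 4, "WH-B": 2}
print(utilization_by_warehouse(events, assigned))  # {'WH-A': 0.5, 'WH-B': 0.5}
```

A site sitting persistently below its peers on this metric was the kind of signal that suggested a training gap rather than a hardware fault.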

Grafana OnCall: From Alert to Ticket in Seconds

Dashboards solved the context-gathering problem for engineers who were already looking at the right screen. But the support workflow still depended on someone noticing an alert, triaging it manually, and creating a Zendesk ticket with the right fields populated. That part hadn't changed — it was still hours of work per incident.

I built OSS Grafana OnCall into the stack to close that gap. OnCall receives alert payloads from Grafana alerting, applies routing rules, and can fire webhooks with the enriched context. The key was that by this point, my alert rules were already running against the joined Clickhouse + inventory database queries — so when an alert fired, its payload already contained the warehouse name, the dock SAT number, the client contact, and the severity classification. I wrote the routing rules to match on those enriched fields and configured a webhook integration that called the Zendesk API to create a ticket with all custom fields pre-populated directly from the alert payload.
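The webhook's job reduces to a payload translation: map the enriched alert fields onto the body Zendesk's ticket-creation endpoint (`POST /api/v2/tickets.json`) expects. A hedged sketch — the custom field IDs and alert field names below are hypothetical placeholders, not the production configuration:

```python
# Hedged sketch of the OnCall webhook -> Zendesk translation step:
# turn an enriched alert payload into a Zendesk ticket body.
# Zendesk's ticket API really does take {"ticket": {..., "custom_fields":
# [{"id": ..., "value": ...}]}}, but the field IDs and alert keys here
# are invented for illustration.
import json

CUSTOM_FIELD_IDS = {          # hypothetical Zendesk custom field IDs
    "warehouse_name": 360001,
    "sat_number": 360002,
    "client_contact": 360003,
}

def build_zendesk_ticket(alert):
    return {
        "ticket": {
            "subject": f"[{alert['severity'].upper()}] Dock sync failure "
                       f"at {alert['warehouse_name']}",
            "comment": {"body": alert["description"]},
            "priority": "urgent" if alert["severity"] == "critical" else "normal",
            "custom_fields": [
                {"id": fid, "value": alert[key]}
                for key, fid in CUSTOM_FIELD_IDS.items()
            ],
        }
    }

alert = {
    "severity": "critical",
    "warehouse_name": "Warehouse A",
    "sat_number": "SAT-0042",
    "client_contact": "J. Doe / 555-0100",
    "description": "Dock d-101 missed its sync threshold.",
}
print(json.dumps(build_zendesk_ticket(alert), indent=2))
```

Because the alert payload already carried the joined context, this translation needed no lookups of its own — it just rearranged fields.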

The result: when a dock failed a sync threshold, OnCall fired once, created a Zendesk ticket with the warehouse name, dock SAT number, client contact name and phone number, and severity level already in the right fields. No manual lookups. No reaching out to client managers. The engineer assigned to the ticket had everything they needed before they even opened it.

The auto-recovery piece was equally important. Tickets fired exactly once per incident — no duplicate floods. When the alert condition cleared after a configurable time threshold (dock came back online, sync resumed), OnCall fired a recovery event that automatically resolved the Zendesk ticket. There was no alert fatigue. The team didn't tune out the system because it didn't spam them. Every ticket that came in was meaningful.
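The fire-once / auto-resolve behavior is essentially a small state machine: open exactly one incident when the condition trips, and resolve it only after the condition has stayed clear for the configured interval. A sketch of that logic, with an illustrative threshold — this is my rendering of the behavior, not OnCall's internal model:

```python
# Sketch of the fire-once / auto-resolve behavior: one "fire" per
# incident, and a "resolve" only after the condition has been clear
# for a configurable interval. Threshold and state fields are
# illustrative, not Grafana OnCall internals.

RECOVERY_SECONDS = 300  # hypothetical "clear for 5 minutes" threshold

class Incident:
    def __init__(self):
        self.open = False
        self.clear_since = None  # timestamp when the condition first cleared

    def observe(self, failing, now):
        """Feed one evaluation; returns 'fire', 'resolve', or None."""
        if failing:
            self.clear_since = None          # recovery clock resets
            if not self.open:
                self.open = True
                return "fire"                # exactly one ticket per incident
        elif self.open:
            if self.clear_since is None:
                self.clear_since = now       # start the recovery clock
            elif now - self.clear_since >= RECOVERY_SECONDS:
                self.open = False
                self.clear_since = None
                return "resolve"             # auto-close the Zendesk ticket
        return None

inc = Incident()
print(inc.observe(True, 0))     # fire
print(inc.observe(True, 60))    # None -- no duplicate while still failing
print(inc.observe(False, 120))  # None -- recovery clock starts
print(inc.observe(False, 500))  # resolve
```

The hysteresis is the important part: a dock that flaps briefly back online does not resolve (and then re-fire) the ticket, which is what kept duplicate floods at zero.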

Device Management Context

The dock hardware itself added complexity to every client deployment. Each dock was an Android tablet running the company's proprietary app in kiosk mode, managed via VMware WorkspaceOne MDM. The MDM layer handled OS update deployments, remote session access for troubleshooting, and app version management across the fleet. Networking was the wildcard: enterprise client sites varied enormously in their IT posture — some had open corporate Wi-Fi, others had firewall rules that needed whitelisting, VPN tunnels, or proxy configurations. I flew on-site to several clients to diagnose these network-layer issues directly.

The Grafana stack helped here too. By correlating dock connectivity signals with site-specific Prometheus metrics and WorkspaceOne device telemetry, I could distinguish a genuine application failure from a network hiccup or an MDM enrollment issue — and route them to the right resolution path. That classification accuracy was part of why ticket quality improved: the system wasn't just automating tickets, it was automating *correctly categorized* tickets.
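That classification logic can be sketched as a short rule chain over the correlated signals: check the network layer first, then MDM enrollment, then the application heartbeat, since an outer-layer failure masks everything inside it. The signal names and ordering below are hypothetical simplifications of the real correlation:

```python
# Illustrative rule-based sketch of the failure classification:
# combine dock connectivity, MDM enrollment state, and app heartbeat
# into a resolution path, checking outer layers first because a
# network failure masks the MDM and app signals. Signal names and
# rules are hypothetical stand-ins for the Prometheus/WorkspaceOne
# telemetry the real system correlated.

def classify_dock_failure(signals):
    if not signals["network_reachable"]:
        return "network"        # route to client IT / firewall review
    if not signals["mdm_enrolled"]:
        return "mdm"            # route to WorkspaceOne re-enrollment
    if not signals["app_heartbeat"]:
        return "application"    # route to app-level troubleshooting
    return "healthy"

print(classify_dock_failure(
    {"network_reachable": True, "mdm_enrolled": False, "app_heartbeat": False}
))  # mdm -- the enrollment failure explains the missing heartbeat
```

Routing on the classified category, rather than on the raw alert, is what made the auto-created tickets land with the right team the first time.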

Impact

  • 80%+ of all support tickets created automatically via Grafana OnCall webhooks — no manual triage for the majority of incidents
  • Zero alert fatigue — tickets fired exactly once per incident, auto-resolved on recovery, no duplicate floods
  • Time-to-context reduced from 20–30 minutes to near-zero — engineers had warehouse name, dock SAT#, and client contact before opening a ticket
  • Built analytics that rivaled the data analytics team's output — usage pattern dashboards merged device telemetry with shift and client data in ways the org hadn't done before
  • Every support team member used the Grafana dashboards as their primary operational interface — displacing ad-hoc tool-hopping
  • Authored virtually all internal and client-facing documentation — device setup guides, troubleshooting runbooks, platform integration docs
  • Deployed on-site to critical enterprise client locations to resolve complex networking and MDM issues directly

What I'd Do Differently

The Clickhouse queries I wrote were powerful but brittle — schema changes upstream could silently break dashboard panels because there was no contract between the data pipeline and my query layer. Given more time, I'd have added a lightweight materialized view or a structured intermediate table per use case rather than querying raw event tables directly. It would have made the dashboards more resilient to upstream schema evolution.

I'd also invest earlier in dashboard-as-code tooling — Grafana's provisioning API or Grizzly — to version-control the dashboard definitions in Git rather than managing them through the UI. As the dashboard suite grew, having no version history for panel changes became a real operational risk. Dashboard-as-code would have solved that cleanly and made onboarding future engineers to the observability stack considerably easier.
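The dashboard-as-code idea can be as simple as generating the dashboard JSON from a script and committing the output, which Grafana's file-based provisioning can then load. A deliberately stripped-down sketch — the panel definition below is a minimal stand-in, not a complete dashboard model:

```python
# Minimal dashboard-as-code sketch: emit a Grafana dashboard JSON file
# that file-based provisioning could pick up, so panel changes live in
# Git instead of the UI. This is a stripped-down stand-in for the full
# dashboard schema, not a complete definition.
import json

def fleet_overview_dashboard():
    return {
        "title": "Fleet Overview (provisioned)",
        "schemaVersion": 39,   # assumed current dashboard schema version
        "panels": [
            {
                "type": "stat",
                "title": "Docks failing sync",
                "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
            }
        ],
    }

with open("fleet_overview.json", "w") as f:
    json.dump(fleet_overview_dashboard(), f, indent=2)
print("wrote fleet_overview.json")
```

With the JSON generated rather than hand-edited, a panel change becomes a reviewable diff, and onboarding a new engineer starts from `git log` instead of archaeology in the UI.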

If you're building something similar — whether for IoT fleets, distributed field hardware, or any environment where operational context is spread across multiple systems — the pattern holds: unify your context layer first. Alerts are only as useful as the context attached to them. Get the inventory, the human contacts, and the device state into the same query before you write a single routing rule.


Interested in how I've applied similar observability patterns to my homelab? The live Prometheus dashboards on this site pull real-time cluster metrics from the same k3s homelab, and I'm actively building out the rest of my portfolio — more references to my work are coming in the next several days. Stay tuned!

#Grafana · #Grafana OnCall · #Clickhouse · #Prometheus · #InfluxDB · #BigQuery · #Databricks · #SQL · #IoT · #Observability · #Zendesk
Anthony Paul Ruiz