anthonypaulruiz.com

Anthony Paul Ruiz

Observability Engineer · DevOps · Platform Engineering

Summary

Observability and DevOps engineer with 10+ years of experience enabling engineering teams with production-grade telemetry, building observability platforms from scratch, and engineering custom incident response tooling. Expert in Grafana, Datadog, and OpenTelemetry — with a consistent track record of turning black-box troubleshooting into automated, proactive operations and building the purpose-built tools that make it possible.


Experience

Observability Administrator · TekStream Solutions
Feb 2023 — Mar 2026 · Remote
  • Served as the primary Datadog, Grafana, and OpenTelemetry implementation and troubleshooting point of contact for the entire digital enablement sub-org, spanning ~30 teams and hundreds of Kubernetes microservices
  • Single-handedly built a custom Next.js Health Status Dashboard on Vercel, integrating Okta SSO, OpsGenie, and the Slack API, to replace Statuspage.io when the client needed customizations the vendor couldn't provide
  • Built a Slack Incident Command bot from scratch: auto-routes responders per component via OpsGenie, manages dedicated incident channels, and posts live status updates to the health dashboard
  • Established and enforced telemetry collection best practices across 30+ teams: tagging standards, cardinality-driven cost controls, and sampling compliance
  • Rotated on-call as Managed Incident Response coordinator — led cross-functional stakeholder communications following structured runbooks

Datadog · Grafana · OpenTelemetry · OpsGenie · Kubernetes · Argo CD · Next.js · Vercel · Okta · Slack API · Statuspage.io · Incident Management · SRE · AWS

Senior Engineer · StrongArm Technologies
Jun 2021 — Feb 2023 · New York, NY (Hybrid)
  • Single-handedly built the Grafana observability stack from scratch, integrating ClickHouse, Prometheus, InfluxDB, BigQuery, Databricks, in-house APIs, and inventory databases to surface real-time warehouse and device health in a single pane of glass
  • Built Grafana OnCall routing that auto-generated Zendesk tickets with full context (warehouse name, dock SAT number, contact info mapped from payload), automating 80%+ of all ticket creation without adding alert fatigue
  • Built analytics that rivaled the data analytics team's visibility into device and worksite usage patterns by merging data sources in ways the org had never done before
  • Served as technical voice of the product on sales calls and primary troubleshooting POC for all client issues; deployed on-site to critical client locations to resolve issues directly
  • Mentored a growing support team and authored virtually all internal and client-facing documentation for device setup, troubleshooting, and platform integration

Grafana · Grafana OnCall · Datadog · Prometheus · InfluxDB · ClickHouse · BigQuery · Databricks · SQL · AWS · Zendesk · Workspace ONE · Looker · IoT

Jr. System Admin / Full Stack Developer · PFR IT Consulting Co
Apr 2014 — Jun 2021 · Manhattan, NY
  • Identified $10K/month in wasted Google Ads spend immediately upon analysis; took over full $25K/month account management and significantly improved ROI
  • Rebuilt the firm's website on a self-managed Ionos VM (PHP/HTML/JS) — cut page load from 10s+ to under 2 seconds; configured Cloudflare for DNS, image optimization, and email security (DMARC, SPF, DKIM)
  • Devised a guerrilla marketing campaign — branded gear on construction sites + social media hashtag — that generated client leads and still drives visibility years later
  • Built a custom C# client-arrival notification system (SQL Express backend, multi-user UI) — alerted paralegals in real time and gave the office manager a live wait-time view, eliminating missed client visits
  • Built a Grafana-backed security and productivity monitoring suite from scratch: Zabbix metrics, keystroke logging, delta-based screenshot capture (15s interval, no duplicates), idle time tracking, login/logout events, and server room temperature — all surfaced in a single dashboard with scheduled daily PDF reports

Grafana · Zabbix · C# / .NET · SQL Express · Cloudflare · Active Directory · PHP · Google Ads · Adobe Suite · SEO


Skills

Observability & Monitoring: Grafana, Grafana OnCall, Datadog, Prometheus, OpenTelemetry, OpsGenie, Statuspage.io, Zabbix, Loki
Orchestration & GitOps: Kubernetes (k3s), Docker, Argo CD, Helm, GitHub Actions
Infrastructure & Cloud: AWS, Proxmox, Cloudflare, Terraform, NGINX, Vault, Linux
Languages & Scripting: Python, Go, Bash, SQL, TypeScript, C#
Data & Storage: PostgreSQL, Redis, MinIO, InfluxDB, ClickHouse, BigQuery, Databricks

Projects

Self-Hosted k3s Homelab · https://anthonypaulruiz.com

4-node k3s cluster on Proxmox. Hosts personal projects and services via Cloudflare Tunnel with zero open inbound ports. Full GitOps: Argo CD syncs from GitHub, while OpenTelemetry and kube-state-metrics (KSM) ship metrics to both Datadog and Grafana.

k3s · Proxmox · Argo CD · Cloudflare · OpenTelemetry

Portfolio Site · https://anthonypaulruiz.com

Next.js 16 + React 19 portfolio with live Prometheus health checks, React Flow infra diagram, and an xterm.js interactive terminal. Deployed via GitOps: GitHub Actions → ghcr.io → Argo CD → k3s.

Next.js · React · Prometheus · Docker · Kubernetes

Enterprise Health Status Dashboard

Custom Next.js web app built as an internal replacement for Statuspage.io after the client needed customizations the vendor couldn't provide. Integrates Okta SSO, OpsGenie, and the Slack API for fully configurable component health views and role-based stakeholder access. Deployed on Vercel.

Next.js · Vercel · Okta · OpsGenie · Slack API

Slack Incident Command Bot

Solo-built Slack bot that automates incident response coordination via the OpsGenie API. Auto-routes responders into dedicated incident channels by component, manages stakeholder communications, and posts live status updates directly to the Health Status Dashboard, eliminating manual triage and enforcing a consistent incident process.

Slack API · OpsGenie · AWS

Staff Monitoring & Productivity Platform

Built a comprehensive internal monitoring platform from scratch for a 50-person law firm. Integrated Zabbix metrics, a custom keystroke logger (daily counts + application/website tracking), a delta-based screenshot service (15s interval, skips duplicates), login/logout event tracking, idle time analysis, and server room temperature monitoring via web scrape — all surfaced in a Grafana dashboard with scheduled daily PDF reports to management.

Grafana · Zabbix · C# · SQL Express · Windows

Client Arrival Notification System

Designed and built a multi-component C# application to solve a recurring operational problem: clients waiting unnoticed while paralegals claimed they were never notified. Built a receptionist UI to log arrivals, a background Windows service on each paralegal's workstation that surfaced real-time alerts requiring acknowledgment, and an office manager dashboard showing live wait times and acknowledgment timestamps — all backed by a SQL Express database.

C# · .NET · SQL Express · Windows

Law Firm Website Rebuild

Rebuilt the website for Gorayeb & Associates, a prominent Manhattan personal injury law firm, turning a slow, unoptimized site into a fast, well-ranked one. Migrated to a self-managed Ionos VM running PHP/HTML/JS, cutting page load from 10+ seconds to under 2 seconds. Configured Cloudflare for DNS, image compression, and email security (DMARC, SPF, DKIM). Built conversion-focused landing pages that supported a $25K/month Google Ads account.

PHP · Cloudflare · Google Ads · SEO · Linux

anthonypaulruiz.com · linkedin.com/in/anthonypaulruiz · github.com/anthonypaulruiz