apr_

Anthony Ruiz · USA

Observability
Engineer

I build systems that make data tell the story — marrying business intelligence with infrastructure telemetry so teams can stop reacting and start anticipating.

anthony@homelab ~ zsh

$ whoami

About

Anthony Paul Ruiz

Observability Engineer · DevOps · Platform Engineering

I got into technology through curiosity and necessity — starting as a self-taught IT tech at a Manhattan law firm and growing into owning everything from network infrastructure to digital advertising to custom software. That scrappiness never left. Today I design and operate observability platforms that give engineering teams real signal in production: the kind of dashboards and alerting pipelines that mean your on-call engineer wakes up to a clear picture, not a wall of noise. If something needs building, I build it — I've replaced vendor products with purpose-built internal tools, made the case for better solutions and then delivered them solo, and integrated data sources across stacks that were never meant to talk to each other. I care about systems that are honest — metrics that reflect reality, runbooks that actually get used, and infrastructure that can explain itself. When I'm not building for clients, I'm running a self-hosted k3s homelab on Proxmox, shipping this portfolio site through a GitOps pipeline, and looking for ways to push observability deeper into every layer of the stack.

United States - MI - (Remote) · [email protected] · anthonypaulruiz.com

$ ls -la tools/

Grafana
Datadog
OpenTelemetry
Prometheus
Kubernetes
Docker
Argo CD
Helm
OpsGenie
Terraform
GitHub Actions
Proxmox
AWS
Cloudflare
NGINX
PostgreSQL
Redis
Home Assistant
Linux
Python
Go
Bash
VMware
ClickHouse
InfluxDB
Ubuntu
Slack
Next.js
Zendesk
Postman

10+

years in production infrastructure

~80%

ticket reduction at StrongArm via telemetry and Grafana OnCall

Built twice

made the case for better tooling, then delivered it solo — at two separate companies

$ cat resume.yaml

Experience

Observability Administrator

TekStream Solutions

Feb 2023 — Mar 2026

Observability SRE embedded within the digital enablement org at a top-5 U.S. restaurant chain, enabling telemetry adoption (logs, traces, metrics) across ~30 engineering teams and hundreds of Kubernetes microservices. Served as a primary point of contact (POC) for all Grafana, Datadog, OpenTelemetry, and OpsGenie troubleshooting across the digital enablement sub-org. Sole engineer behind a custom Next.js Health Status Dashboard and Slack Incident Command bot, integrating OpsGenie, Okta, and the Slack API, which replaced Statuspage.io with a purpose-built incident coordination platform.

Top 5
US restaurant chain client
~30
teams enabled on Datadog, Grafana & OTel
Solo Engineer
custom incident platform (web app dashboard + bot)
  • Expert in Datadog, Grafana, and OpenTelemetry — served as a primary implementation and troubleshooting POC for the entire digital enablement sub-org across ~30 teams and hundreds of Kubernetes microservices
  • Single-handedly built a custom Next.js Health Status Dashboard on Vercel, integrating Okta SSO, OpsGenie, and the Slack API, after being asked to overcome Statuspage.io's limitations
  • Built a Slack Incident Command bot from scratch: auto-routes responders per component via OpsGenie, manages dedicated incident channels, and posts live status updates to the health dashboard
  • Established and enforced telemetry collection best practices across ~30 teams: tagging standards, cardinality-driven cost controls, and sampling compliance
  • Rotated on-call as Managed Incident Response coordinator — led cross-functional stakeholder communications following structured runbooks
Datadog · Grafana · OpenTelemetry · OpsGenie · Kubernetes · Argo CD · Next.js · Vercel · Okta · Slack API · Statuspage.io · Incident Management · SRE · AWS

Senior Engineer

StrongArm Technologies

Jun 2021 — Feb 2023

Lead support engineer at an IoT workplace safety startup, responsible for all client troubleshooting and onboarding, and the technical voice of the product on sales calls. Made the case for Grafana and single-handedly built the company's entire observability stack from scratch, integrating ClickHouse SQL, in-house APIs, and inventory databases into a unified operational picture. Wired Grafana OnCall to auto-generate Zendesk tickets with full client context, eliminating black-box troubleshooting and driving 80%+ of all support ticket creation automatically.

~80%
of tickets automated via Grafana OnCall
0 → prod
Grafana stack built solo from scratch
Solo Engineer
analytics that rivaled the full data team
  • Single-handedly built the Grafana observability stack from scratch, integrating ClickHouse SQL, Prometheus, InfluxDB, BigQuery, Databricks, in-house APIs, and inventory databases to surface real-time warehouse and device health in a single pane of glass
  • Built Grafana OnCall routing that auto-generated Zendesk tickets with full context (warehouse name, dock SAT number, contact info mapped from payload) — driving 80%+ of all ticket creation automatically with no alert fatigue
  • Built analytics that rivaled the data analytics team's visibility into device and worksite usage patterns by merging data sources in ways the org had never done before
  • Served as technical voice of the product on sales calls and primary troubleshooting POC for all client issues; deployed on-site to critical client locations to resolve issues directly
  • Mentored growing support team and authored virtually all internal and client-facing documentation for device setup, troubleshooting, and platform integration
Grafana · Grafana OnCall · Datadog · Prometheus · InfluxDB · ClickHouse · BigQuery · Databricks · SQL · AWS · Zendesk · WorkspaceOne · Looker · IoT
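The alert-to-ticket mapping described in this role can be sketched as a small pure function. This is an illustration only: the payload keys (`alert_title`, `warehouse_name`, `dock_sat_number`, `contact_email`) and the ticket body layout are assumptions, not the actual Grafana OnCall webhook schema or the production integration.

```python
from typing import Any

# Hypothetical payload keys; the real webhook schema is not shown in the source.
REQUIRED_KEYS = ("alert_title", "warehouse_name", "dock_sat_number", "contact_email")


def build_ticket(payload: dict[str, Any]) -> dict[str, Any]:
    """Map an alert payload to a ticket body that carries full client context."""
    missing = [key for key in REQUIRED_KEYS if not payload.get(key)]
    if missing:
        # Refuse to open a context-free ticket; report what was absent instead.
        raise ValueError(f"payload missing fields: {', '.join(missing)}")

    subject = f"[{payload['warehouse_name']}] {payload['alert_title']}"
    body = "\n".join([
        f"Warehouse: {payload['warehouse_name']}",
        f"Dock SAT number: {payload['dock_sat_number']}",
        f"Contact: {payload['contact_email']}",
    ])
    return {
        "ticket": {
            "subject": subject,
            "comment": {"body": body},
            "requester": {"email": payload["contact_email"]},
        }
    }
```

Rejecting incomplete payloads up front is one way to keep every auto-created ticket actionable; a real integration might route such alerts to a fallback queue instead.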

Jr. System Admin / Full Stack Developer

PFR IT Consulting Co

Apr 2014 — Jun 2021

Grew from part-time IT technician to the sole technical, web, and marketing owner of a prominent 50-person Manhattan law firm. Shortly after joining, identified $10K/month in wasted Google Ads spend and took over full management of their $25K/month account. Rebuilt the website from scratch (10s+ → under 2 seconds), administered the Windows AD network, and independently built a suite of custom C# tools and a Grafana-backed monitoring platform covering security, productivity, and system health.

<2s
page load (was 10s+)
$10K/mo
wasted ad spend identified and eliminated
Solo
IT, web, marketing, and custom tooling
  • Identified $10K/month in wasted Google Ads spend immediately upon analysis; took over full $25K/month account management and significantly improved ROI
  • Rebuilt the firm's website on a self-managed Ionos VM (PHP/HTML/JS) — cut page load from 10s+ to under 2 seconds; configured Cloudflare for DNS, image optimization, and email security (DMARC, SPF, DKIM)
  • Devised a guerrilla marketing campaign — branded gear on construction sites + social media hashtag — that generated client leads and still drives visibility years later
  • Built a custom C# client-arrival notification system (SQL Express backend, multi-user UI) — alerted paralegals in real time and gave the office manager a live wait-time view, eliminating missed client visits
  • Built a Grafana-backed security and productivity monitoring suite from scratch: Zabbix metrics, keystroke logging, delta-based screenshot capture (15s interval, no duplicates), idle time tracking, login/logout events, and server room temperature — all surfaced in a single dashboard with scheduled daily PDF reports
Grafana · Zabbix · C# / .NET · SQL Express · Cloudflare · Active Directory · PHP · Google Ads · Adobe Suite · SEO

$ ls -la ~/projects

Projects

project_01

Self-Hosted k3s Homelab

4-node k3s cluster on Proxmox. Hosts personal projects and services via Cloudflare Tunnel with zero open inbound ports. Full GitOps: Argo CD syncs from GitHub, and OpenTelemetry plus kube-state-metrics (KSM) ship metrics to both Datadog and Grafana.

k3s · Proxmox · Argo CD · +2

project_02

Portfolio Site

Next.js 16 + React 19 portfolio with live Prometheus health checks, React Flow infra diagram, and an xterm.js interactive terminal. Deployed via GitOps: GitHub Actions → ghcr.io → Argo CD → k3s.

Next.js · React · Prometheus · +2

project_03

Enterprise Health Status Dashboard

Custom Next.js web app built as an internal replacement for Statuspage.io after the client needed customizations the vendor couldn't provide. Integrates Okta SSO, OpsGenie, and the Slack API for fully configurable component health views and role-based stakeholder access. Deployed on Vercel.

Next.js · Vercel · Okta · +2

project_04

Slack Incident Command Bot

Solo-built Slack bot that automates incident response coordination via the OpsGenie API. Auto-routes responders into dedicated incident channels by component, manages stakeholder communications, and posts live status updates directly to the Health Status Dashboard, eliminating manual triage and driving a consistent incident process.

Slack API · OpsGenie · AWS
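Routing responders by component can be sketched as a lookup plus a channel-naming rule. Everything named here is hypothetical: the `ROUTES` table, the team names, and the `inc-<id>-<component>` channel pattern are assumptions for illustration, and the Slack and OpsGenie API calls themselves are left out.

```python
from dataclasses import dataclass

# Hypothetical component-to-team routing table; real names are not in the source.
ROUTES = {
    "checkout": "payments-oncall",
    "menu-api": "menu-oncall",
}
DEFAULT_TEAM = "incident-commanders"


@dataclass(frozen=True)
class IncidentPlan:
    channel: str
    team: str


def plan_incident(incident_id: int, component: str) -> IncidentPlan:
    """Pick the responder team and a dedicated channel name for one incident."""
    # Slack channel names are lowercase with no spaces, so normalize first.
    slug = component.strip().lower().replace(" ", "-")
    team = ROUTES.get(slug, DEFAULT_TEAM)
    # Slack caps channel names at 80 characters.
    channel = f"inc-{incident_id}-{slug}"[:80]
    return IncidentPlan(channel=channel, team=team)
```

Falling back to a default team keeps an incident from going unrouted when a component has no explicit owner in the table.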

project_05

Staff Monitoring & Productivity Platform

Built a comprehensive internal monitoring platform from scratch for a 50-person law firm. Integrated Zabbix metrics, a custom keystroke logger (daily counts + application/website tracking), a delta-based screenshot service (15s interval, skips duplicates), login/logout event tracking, idle time analysis, and server room temperature monitoring via web scrape — all surfaced in a Grafana dashboard with scheduled daily PDF reports to management.

Grafana · Zabbix · C# · +2
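The duplicate-skipping screenshot idea can be sketched by comparing each frame to the last one kept. This sketch assumes an exact content-hash comparison; the original service is described only as "delta-based", so it may have used a pixel-difference threshold instead. The capture call and the 15-second timer are omitted because the source does not name them.

```python
import hashlib
from typing import Iterable, Iterator


def unique_frames(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Yield only frames whose content differs from the previously kept frame."""
    last_digest = None
    for frame in frames:
        digest = hashlib.sha256(frame).digest()
        if digest != last_digest:
            # Screen changed since the last stored frame: keep this one.
            last_digest = digest
            yield frame
```

Only consecutive duplicates are dropped, so a screen that changes and later returns to an earlier state is still recorded.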

project_06

Client Arrival Notification System

Designed and built a multi-component C# application to solve a recurring operational problem: clients waiting unnoticed while paralegals claimed they were never notified. Built a receptionist UI to log arrivals, a background Windows service on each paralegal's workstation that surfaced real-time alerts requiring acknowledgment, and an office manager dashboard showing live wait times and acknowledgment timestamps — all backed by a SQL Express database.

C# · .NET · SQL Express · +1

project_07

Law Firm Website Rebuild

Rebuilt the website for Gorayeb & Associates — a prominent Manhattan personal injury law firm — from a slow, unoptimized site to a fast, well-ranked one. Migrated to a self-managed Ionos VM running PHP/HTML/JS, cutting page load from 10+ seconds to under 2 seconds. Configured Cloudflare for DNS, image compression, and email security (DMARC, SPF, DKIM). Built conversion-focused landing pages that supported a $25K/month Google Ads account.

PHP · Cloudflare · Google Ads · +2

$ grafana-cli dashboard import

Observability

End-to-end telemetry: metrics, traces, and incidents wired together · live

Nodes Ready

4/4

all healthy

Pods Running

66

3 pending

Restarts (1h)

12

check logs

Avg CPU

4%

normal

Cluster CPU (24h)
Hourly average across all nodes
Node Resources
CPU % and memory % per node
Pod Restarts (7d)
Daily container restart count — lower is better
Pod Phase Distribution
Current pods by lifecycle state
Service Health
Prometheus up metric · 30-day uptime
8/10 operational · live
Service                              Status  30d Uptime
kubelet                              up      100.0%
apiserver                            up      100.0%
OpenWRT                              down    0.0%
homeClimate                          down    0.0%
coredns                              up      100.0%
kube-prometheus-stack-alertmanager   up      100.0%
kube-state-metrics                   up      100.0%
node-exporter                        up      100.0%
kube-prometheus-stack-prometheus     up      100.0%
kube-prometheus-stack-operator       up      100.0%
stack: Grafana · Datadog · OpenTelemetry · Prometheus · Loki · OpsGenie
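A 30-day uptime figure like the ones in the Service Health panel can be derived from the Prometheus `up` metric, which is 1 when a scrape succeeds and 0 when it fails. In PromQL this is commonly written as `avg_over_time(up[30d]) * 100`; whether this dashboard uses that exact query is an assumption. The same arithmetic as a minimal Python sketch:

```python
from typing import Sequence


def uptime_percent(samples: Sequence[int]) -> float:
    """Return the share of successful scrapes (up == 1) as a percentage."""
    if not samples:
        # No scrapes recorded: report 0.0 rather than dividing by zero.
        return 0.0
    return round(100.0 * sum(samples) / len(samples), 1)
```

For example, `uptime_percent([1, 1, 1, 0])` gives `75.0`, and a target that never answered a scrape gives `0.0`.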
Live Request Trace
Fires a real request through this app's OTel pipeline: spans come from the in-process ring buffer, logs come from Loki, and the two are correlated by spanId.
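Correlating spans and logs by spanId amounts to a keyed join. The sketch below uses plain lists of dicts in place of the ring buffer and Loki; the field names `name` and `line` are hypothetical, and only `spanId` comes from the description above.

```python
from collections import defaultdict
from typing import Any


def attach_logs(
    spans: list[dict[str, Any]], logs: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Return a copy of each span with the log lines that share its spanId."""
    by_span: dict[Any, list[str]] = defaultdict(list)
    for entry in logs:
        # Logs without a spanId are grouped under None and never matched.
        by_span[entry.get("spanId")].append(entry["line"])
    return [{**span, "logs": by_span.get(span["spanId"], [])} for span in spans]
```

Indexing the logs once keeps the join linear in the number of spans plus log lines, which matters little for a single request trace but avoids a nested scan.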

$ pvesh get /nodes

Infrastructure

Proxmox homelab, all workloads on one physical host · scroll or use controls to explore · node metrics live


$ ssh anthony@homelab

Live Terminal

Interactive shell — try help

anthony@k3s-server — zsh — 120×32

$ ping anthony

Get In Touch

Open to interesting infrastructure challenges, consulting, or just talking shop about homelabs and platform engineering.

apr·Anthony Paul Ruiz
ci/cd · hosted on k8s · via cloudflare