NVIDIA DSX
Introduction: For years, building an AI data center meant assembling a puzzle from dozens of vendors, each speaking a different language. Chips from one company, …
Home
Site Reliability Engineer / Platform Engineer
Led infrastructure for 10+ startups across 17+ Kubernetes environments on AWS, GCP, and Azure. Specialized in reliability, observability, and air-gapped enterprise deployments.
Blog
Practical writing on Kubernetes, observability, platform engineering, migrations, and production infrastructure.
Let’s see how the ndots option works in Kubernetes. In Kubernetes, we connect to running pods either directly or via a Kubernetes Service. This post …
This blog is based on my work at CloudRaft! AI jobs often run for long periods on expensive hardware like GPUs. When a job fails halfway, you don’t just …
Skills