Site Reliability Engineer / Platform Engineer
I build and scale Kubernetes platforms for startups and enterprise teams.
I help teams ship reliable infrastructure, modernize CI/CD, improve observability, and run production workloads across AWS, GCP, Azure, and bare metal. Lately I have also been focusing on Slurm, HPC environments, and Apptainer-based workloads. If you need someone who can own platform work end to end, let’s talk.
I’m an SRE and Platform Engineer who has worked with companies such as EnkryptAI, Composio, SkySwitch, Rezolve.ai, NeevCloud, Ditto, Clika, IFF, and DrDroid, building reliable Kubernetes and multi-cloud infrastructure.
How I can help
- Kubernetes platform engineering and production operations
- Observability design with Prometheus, Grafana, OpenTelemetry, Loki, Tempo, VictoriaMetrics, and Mimir
- CI/CD modernization with GitHub Actions, ArgoCD, Tekton, and GitOps workflows
- Multi-cloud and BYOC infrastructure across AWS, GCP, Azure, and bare metal
- GPU and AI/ML infrastructure on Kubernetes
- Slurm, HPC, and Apptainer-based compute environments
- Reliability, migration, and cost optimization work where downtime and mistakes are expensive
Who I work with
- Startup founders and CTOs who need an infrastructure owner
- Platform teams scaling Kubernetes in production
- Teams migrating from legacy tooling to cloud-native stacks
- Companies building AI/ML or GPU-backed platforms
- Organizations that need better reliability, observability, and delivery velocity
Selected work
Built cloud-agnostic air-gapped platform deployments for EnkryptAI
Led infrastructure architecture across AWS, GCP, and Azure for enterprise AI deployments. I built three Helm chart stacks that orchestrated 70+ microservices in air-gapped environments, supported GPU scheduling, and enabled onboarding of enterprise customers on AKS and EKS.
Migrated 100TB+ of monitoring data from InfluxDB to Grafana Mimir
Led a zero-downtime migration for a telecom environment, built a custom Golang conversion tool to translate InfluxDB data into OpenMetrics for Mimir, migrated 1000+ dashboards from InfluxQL to PromQL, and reduced infrastructure compute and memory usage by 30%.
Built a production GPU cloud platform on bare-metal Kubernetes
Designed a commercial GPU platform on kubeadm with KubeVirt and NVIDIA GPUs, implemented GPU-aware VM provisioning in Go, fixed passthrough and networking issues, enforced resource quotas, and delivered 99.9%+ uptime for production workloads.
Composio
Built enterprise onboarding and release workflows using Replicated, GitHub Actions, and Kubernetes. I helped reduce onboarding time from days to around one hour, supported multi-channel release pipelines, and handled production onboarding across AWS EKS and GCP GKE environments.
SkySwitch
Led a large-scale observability migration from InfluxDB to Grafana Mimir, rewrote dashboards from InfluxQL to PromQL, and improved efficiency by re-architecting the Mimir deployment for multi-tenancy and lower infrastructure usage.
Rezolve.ai
Designed and deployed an observability stack across multiple AKS production clusters using Prometheus, Grafana, OpenTelemetry, Loki, and Tempo. I also identified infrastructure inefficiencies that delivered measurable annual cost savings and led Kubernetes version upgrades.
Ditto
Worked on multi-cloud BYOC platform infrastructure using Cluster API across AWS, GCP, and Azure. I automated deployment of core platform components and built migration tooling in Go to support Kubernetes cluster transitions safely.
Clika
Built platform components for AI job orchestration on Kubernetes, including a Go-based scheduler, observability integrations, storage workflows, and GKE infrastructure automation with Terraform, ArgoCD, and GitHub Actions.
IFF
Implemented enterprise SAML-based authentication for ArgoCD to improve access control across organization-wide deployments in a Fortune 500 environment.
DrDroid
Modernized deployment workflows by replacing bash-driven releases with GitHub Actions and ArgoCD, and added Metabase for analytics visibility.
Makerble
Migrated complete environments from AWS to Azure with Terraform, reduced infrastructure costs, modernized CI/CD, improved Kubernetes reliability, and deployed monitoring, logging, VPN, and internal developer tooling.
Kubernetes Homelab
To validate ideas before production, I run a multi-node k3s cluster at pidoku.co.in where I experiment with cutting-edge cloud-native technologies and test infrastructure patterns.
Experience
Site Reliability Engineer - CloudRaft
April 2024 - March 2026 • Remote
Owned infrastructure architecture and platform engineering across 10+ client environments spanning telecom, AI/ML, fintech, and enterprise SaaS.
EnkryptAI - AI/ML Infrastructure & Production Operations
- Owned the complete AWS EKS infrastructure and VPC-level architecture, serving as DevOps Lead for enterprise client deployments
- Architected and developed 3 cloud-agnostic Helm charts (EnkryptAI Stack, Platform Stack, Platform Core) that orchestrate 70+ microservices in fully air-gapped multi-cloud deployments (AWS, GCP, Azure) with GPU node support and VRAM-based scheduling
- Onboarded 2 enterprise clients on AKS and EKS environments
- Re-architected red-teaming job orchestration by migrating from Celery-based execution to a Kubernetes-native stack using Argo Workflows, Argo Events, and NATS
- Created a Golang-based container entrypoint, reducing Kubernetes pod startup time from 5 minutes to 30 seconds for a Next.js application
- Managed on-call rotation using incident.io for guardrails and red-teaming services, ensuring production uptime and incident response
- Deployed the production guardrails application on NVIDIA A30/H100 GPUs using the NVIDIA GPU Operator
- Resolved critical compatibility and deployment issues with NVIDIA GPU Operator on Azure Kubernetes Service (AKS)
- Deployed on-prem OpenFGA integrated with CloudNativePG (CNPG) as the backing PostgreSQL database to implement fine-grained, scalable RBAC across the platform
- Migrated Elasticsearch to OpenSearch using the OpenSearch Kubernetes Operator and successfully transitioned Kibana dashboards to OpenSearch Dashboards, ensuring seamless continuity of audit logging
- Customized and deployed Supabase Helm Chart with CloudNativePG (CNPG) to migrate Supabase from managed cloud to on-prem VPC deployment, enabling full functionality within air-gapped enterprise environments
- Implemented DevSecOps practices by integrating SBOM generation and Grype-based vulnerability scanning into CI/CD pipelines, with automated Slack alerts for high-severity security findings
- Configured DevSpace so developers could deploy rapidly to AWS GPU nodes
- Supported SOC 2 and ISO 27001 compliance initiatives
Composio - Enterprise SaaS Platform & Customer Onboarding
- Implemented Replicated as the enterprise portal for customer onboarding, reducing onboarding time from days to about 1 hour
- Migrated existing clients to Replicated platform, enabling centralized release management and automated updates
- Wrote preflight and postflight validation checks for installation reliability
- Built multi-channel CI/CD pipeline with GitHub Actions supporting Unstable, Stable, and Nightly releases
- Led client onboarding for 6+ production environments across AWS EKS and GCP GKE
- Customized Temporal Helm charts for client-specific requirements
- Automated deployment of unstable Replicated releases to GKE for continuous testing
SkySwitch - 100TB Monitoring Migration (Telecom)
- Led migration of 100TB+ InfluxDB data to Grafana Mimir with zero downtime
- Built custom Golang conversion tool to transform InfluxDB data to OpenMetrics format for Grafana Mimir
- Migrated 1000+ Grafana dashboards from InfluxQL to PromQL
- Re-architected single-tenant Mimir to multi-tenant model, reducing compactor startup time from hours to 5 minutes
- Reduced infrastructure compute and memory usage by 30%
NeevCloud - GPU Cloud Platform
- Built production-grade GPU cloud platform on bare-metal Kubernetes (kubeadm) with KubeVirt and NVIDIA GPUs
- Developed Golang-based VM provisioning API with GPU-aware scheduling and inventory validation logic
- Implemented custom x-api-key authentication middleware for vLLM model endpoints using Traefik
- Configured Cilium CNI to assign public IPs to KubeVirt VMs
- Resolved GPU passthrough issues and fixed VM network persistence using KubeMacPool
- Enforced AWS-style resource quotas to prevent over-provisioning
- Achieved 99.9%+ uptime for commercial GPU workloads with CI/CD via GitHub Actions, ArgoCD, and Harbor
Rezolve.ai - Enterprise Observability & Cost Optimization
- Designed and deployed an observability stack across 4 production AKS clusters using Prometheus, Grafana, OpenTelemetry Operator, Loki, and Tempo
- Enabled RED metrics and service graphs for production services
- Identified infrastructure inefficiencies using Steampipe/Powerpipe, delivering $36,000 annual cost savings
- Led Kubernetes version upgrades to v1.31 across all clusters
- Planned and executed Jenkins server upgrade
Ditto - Multi-Cloud BYOC Platform
- Provisioned kubeadm clusters in client environments across AWS (CAPA), GCP (CAPG), and Azure (CAPZ) using Cluster API
- Developed a custom Golang migration script to transition AWS- and GCP-based Kubernetes clusters from MachinePool to MachineDeployment
- Automated deployment of Velero, Cluster Autoscaler, and Node Problem Detector using ArgoCD ApplicationSets across multi-cloud environments
Clika - AI Job Orchestration Platform
- Built Golang-based Kubernetes job scheduler with credit-based billing and a ledger system
- Implemented S3-compatible storage integration for ML model outputs with access controls
- Integrated VictoriaMetrics and VictoriaLogs for real-time job observability with custom API endpoints
- Provisioned GKE infrastructure via Terraform with CI/CD automation using ArgoCD and GitHub Actions
IFF - Enterprise Authentication Integration (Fortune 500)
- Implemented organization-wide SAML-based authentication for ArgoCD, enhancing secure access management across enterprise deployments
DrDroid - CI/CD Modernization
- Replaced bash-based deployments with a production-grade CI/CD pipeline using GitHub Actions and ArgoCD with Image Updater
- Set up Metabase for database analytics and insights
DevOps Engineer - Makerble
October 2023 - April 2024 • Remote
- Migrated complete staging, pre-production, and production infrastructure from AWS to Azure using Terraform, achieving 40% cost reduction
- Optimized AWS workloads, reducing costs by 17% through request/limit tuning, affinity/toleration rules, and efficient node placement
- Built CI/CD pipelines in Tekton and automated Testsigma test runs after each production sync, with results posted to Slack
- Configured Ingress Controller with custom error pages, integrated Rollbar with ArgoCD hooks for deployment tracking, and implemented Robusta for Kubernetes monitoring
- Deployed Uptime Kuma for uptime/downtime alerts, Redis Insight via Ingress for Redis monitoring, and BotKube for infrastructure logs on Slack
- Resolved AWS IP exhaustion by adding a secondary subnet for EC2
- Created custom GitHub runners on Oracle Cloud for GitHub Actions
- Deployed a WireGuard-based VPN for secure internal access and implemented Passbolt as a self-hosted password manager
- Designed the Azure Kubernetes architecture, documented cluster and CI/CD workflows, and integrated DeepSource, Snyk, and PerfectScale for code quality, security scanning, and cost monitoring
- Implemented liveness, readiness, and startup probes to eliminate downtime in staging and pre-production environments
- Configured Prometheus and Grafana for monitoring and deployed an EFK stack for centralized logging
- Automated Azure VM start for developers via Logic Apps
Work with me
If you’re hiring for SRE, platform engineering, DevOps, or Kubernetes consulting, email [email protected]. The fastest path is to send a short note with your infrastructure stack, the current bottleneck, and the kind of help you need.