Case Study | SNOW Corp Scales GenAI for 200M Users with HAMi GPU Sharing and KEDA Autoscaling
Discover how SNOW Corp orchestrates 1,000+ GPUs to handle 700% viral traffic spikes, achieving 91% MTTR reduction, 85% fewer surge errors, and USD 17.4M in estimated cost savings using HAMi and KEDA on Kubernetes.
Company Overview
SNOW Corp., a subsidiary of NAVER from South Korea, operates a fleet of 1,000+ A100 GPUs serving GenAI features for 200M+ global users across three top-ranked apps (SNOW, EPIK, B612). The infrastructure serves 1,200+ AI workflows and 400+ models, handling extreme traffic volatility from viral AI trends.
Subsidiary of NAVER, South Korea's leading tech platform
Three top-ranked GenAI apps: SNOW, EPIK, B612
1,200+ AI workflows and 400+ models in production
Multi-region on-premise Kubernetes platform
Challenge: GPU Scheduling at Extreme Scale
Kubernetes' native GPU scheduling treats GPUs as atomic resources — a pod either gets a full GPU or nothing. This model broke down under SNOW's heterogeneous workload demands and unpredictable viral traffic spikes.
Heterogeneous Workload Demands
Training pipelines, inference services, and batch processing have vastly different GPU utilization profiles. A training job might consume 80% of GPU compute but only 20% of memory, while inference has the opposite profile.
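To make the complementary profiles concrete, here is a minimal sketch that checks whether workloads fit on one GPU when their compute and memory shares are summed. The `GpuProfile` type and `fits_on_one_gpu` helper are hypothetical names used only for illustration; the 80/20 figures come from the paragraph above, and real co-location also depends on factors this ignores, such as memory bandwidth and kernel contention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuProfile:
    """Share of one GPU a workload uses, in whole percent (hypothetical type)."""
    name: str
    compute_pct: int
    memory_pct: int


def fits_on_one_gpu(*profiles: GpuProfile) -> bool:
    """Return True if summed compute and memory shares each stay within 100%."""
    total_compute = sum(p.compute_pct for p in profiles)
    total_memory = sum(p.memory_pct for p in profiles)
    return total_compute <= 100 and total_memory <= 100


# Figures from the text: training is compute-heavy, inference is the mirror image.
training = GpuProfile("training", compute_pct=80, memory_pct=20)
inference = GpuProfile("inference", compute_pct=20, memory_pct=80)

print(fits_on_one_gpu(training, inference))  # True: complementary profiles
print(fits_on_one_gpu(training, training))   # False: compute would reach 160%
```

Under whole-GPU scheduling these two workloads would occupy two devices even though their combined demand fits on one.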
Traffic Unpredictability
Viral AI trends and heterogeneous inference workflows created 700% traffic spikes. Without GPU-level observability, scaling decisions were either manual or based on crude CPU/memory metrics that don't reflect GPU saturation.
Scheduling Blindness
Before Kubernetes adoption, 2-3 containers in the same pod competed for GPU resources with no coordination. Manual GPU provisioning based on load predictions led to ~2x over-provisioning.
Cost Explosion
To handle peak loads, SNOW over-provisioned by approximately 2x, dedicating one GPU to training and another to inference when a single shared GPU could in principle serve both.
Solution: CNCF-Powered GPU Orchestration Stack
SNOW migrated to a multi-region on-premise Kubernetes platform with the CNCF ecosystem underpinning the entire stack: Cilium for CNI, Helm for GitOps-based deployment, Traefik for ingress, Prometheus/Loki/Grafana for observability, HAMi for GPU sharing, and KEDA for autoscaling.
GPU Sharing with HAMi
Kubernetes' default scheduler enforces strict GPU isolation, which blocked migration of SNOW's sequential Train-to-Inference pipelines — where a trainer and inference engine must share a single GPU. HAMi resolves this by virtualizing GPU resources (vGPU), enabling multiple containers within the same pod to share a single GPU concurrently.
Integrates natively with kube-scheduler with zero changes to application code
Fully compatible with the broader autoscaling ecosystem
Halves the number of GPUs needed for training + inference pipelines
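As a rough illustration of what a shared Train-to-Inference pod could look like, the sketch below builds a pod manifest in Python and prints it as JSON, which `kubectl apply -f` accepts. This is not SNOW's actual configuration. The resource names (`nvidia.com/gpu`, `nvidia.com/gpucores`, `nvidia.com/gpumem-percentage`) follow HAMi's documented defaults and may be renamed in a given cluster; the pod name, image names, and percentages are hypothetical. Whether both slices land on the same physical GPU depends on HAMi's scheduling policy and cluster configuration.

```python
import json


def vgpu_limits(cores_pct: int, mem_pct: int) -> dict:
    """Per-container limits asking for a slice of one GPU.

    Resource names assume HAMi's default NVIDIA settings.
    """
    return {
        "nvidia.com/gpu": 1,                      # one vGPU
        "nvidia.com/gpucores": cores_pct,         # compute share, percent
        "nvidia.com/gpumem-percentage": mem_pct,  # device memory share, percent
    }


def train_to_inference_pod() -> dict:
    """Pod with a trainer and an inference container (names are hypothetical)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "train-to-inference"},
        "spec": {
            "containers": [
                {
                    "name": "trainer",
                    "image": "registry.example.com/trainer:latest",
                    "resources": {"limits": vgpu_limits(80, 20)},
                },
                {
                    "name": "inference",
                    "image": "registry.example.com/inference:latest",
                    "resources": {"limits": vgpu_limits(20, 80)},
                },
            ]
        },
    }


print(json.dumps(train_to_inference_pod(), indent=2))
```

The application containers themselves are unchanged; only the resource limits differ from a whole-GPU request.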
Proactive GPU Orchestration with KEDA
Standard metrics (CPU/RAM, DCGM utilization) proved unreliable for SNOW's heterogeneous workloads. KEDA's built-in RabbitMQ scaler functioned as a lagging indicator — given a ~60-second model warm-up time, scaling triggered after a queue backlog formed was consistently too late.
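The timing problem can be illustrated with simple arithmetic. This sketch assumes constant arrival and per-worker processing rates during the warm-up window; the function name and the example rates are hypothetical, and only the ~60-second warm-up comes from the text.

```python
def backlog_after_warmup(arrival_rate: float, workers: int,
                         per_worker_rate: float, warmup_s: float = 60.0) -> float:
    """Tasks queued while newly requested pods are still warming up.

    Assumes constant rates: anything arriving above current capacity
    accumulates for the whole warm-up window.
    """
    capacity = workers * per_worker_rate
    excess = max(0.0, arrival_rate - capacity)
    return excess * warmup_s


# Hypothetical numbers: 10 workers at 2 tasks/s each, traffic jumps to 30 tasks/s.
print(backlog_after_warmup(arrival_rate=30, workers=10, per_worker_rate=2))  # 600.0
```

A queue-length trigger fires only once some backlog already exists, so this warm-up backlog is added on top of it. A signal that rises before the queue grows leaves room to absorb the warm-up delay.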
Custom Metric Server (Python/FastAPI) exposing a Consumer Saturation metric to KEDA
Proactive scaling: when active_ratio exceeds a threshold (e.g., 0.7), KEDA provisions new GPU pods before the worker pool saturates
Smart scale-in: longer stabilizationWindowSeconds and cooldownPeriod prevent premature deallocation
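The points above can be sketched as follows. This is a simplified illustration, not SNOW's implementation: it uses only the standard library rather than FastAPI, it assumes `active_ratio` means busy consumers divided by total consumers, and the names, URL, and timing values are hypothetical. The manifest uses KEDA's `metrics-api` trigger and standard `ScaledObject` fields.

```python
import json


def active_ratio(busy_consumers: int, total_consumers: int) -> float:
    """Fraction of consumers currently processing a task (assumed definition)."""
    if total_consumers <= 0:
        return 0.0
    return busy_consumers / total_consumers


def should_scale_out(ratio: float, threshold: float = 0.7) -> bool:
    """True once saturation exceeds the threshold, before the pool is full."""
    return ratio > threshold


def scaled_object(name: str, metric_url: str, threshold: float = 0.7) -> dict:
    """KEDA ScaledObject that polls a custom JSON endpoint (hypothetical values)."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": name},
            # Wait after the last active trigger before scaling to zero.
            "cooldownPeriod": 600,
            "advanced": {"horizontalPodAutoscalerConfig": {"behavior": {
                # Longer window so a brief dip does not release GPU pods.
                "scaleDown": {"stabilizationWindowSeconds": 600},
            }}},
            "triggers": [{
                "type": "metrics-api",
                "metricType": "Value",  # compare the ratio itself to the target
                "metadata": {
                    "url": metric_url,
                    "valueLocation": "active_ratio",  # JSON field to read
                    "targetValue": str(threshold),
                },
            }],
        },
    }


ratio = active_ratio(busy_consumers=8, total_consumers=10)
print(ratio, should_scale_out(ratio))
print(json.dumps(scaled_object("gpu-worker", "http://metrics.example.internal/saturation"), indent=2))
```

Because the ratio rises as workers become busy rather than after tasks queue up, it can cross the threshold while spare capacity still exists, which is what gives new pods time to warm up.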
Hybrid Cloud Bursting for Viral Spikes
When viral trends like the 'Ghibli Filter' tripled traffic within 3 hours, SNOW expanded dynamically into CSP regions using a unified GitOps pipeline that deployed identical Helm charts across all clusters.
Unified GitOps pipeline deploying identical Helm charts across all clusters
CSP worker nodes consumed tasks directly from central RabbitMQ queue
Seamless multi-cluster scaling with zero service interruption
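The central-queue pattern can be illustrated with an in-process stand-in. The sketch below uses Python's `queue.Queue` and threads in place of RabbitMQ and GPU worker pods, so it shows only the shape of the design: workers in any cluster pull from one shared queue, and adding burst capacity requires no change to producers. The function and cluster names are hypothetical.

```python
import queue
import threading
from collections import Counter


def run_workers(tasks: list, clusters: dict) -> Counter:
    """Drain one shared queue with workers from several clusters.

    `clusters` maps a cluster name to its worker count. Returns how many
    tasks each cluster processed; the split varies from run to run.
    """
    central = queue.Queue()  # stand-in for the central broker queue
    for task in tasks:
        central.put(task)

    processed = Counter()
    lock = threading.Lock()

    def worker(cluster: str) -> None:
        while True:
            try:
                central.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            with lock:
                processed[cluster] += 1

    threads = [
        threading.Thread(target=worker, args=(name,))
        for name, count in clusters.items()
        for _ in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


result = run_workers([f"task-{i}" for i in range(100)], {"on-prem": 4, "csp-burst": 2})
print(sum(result.values()))  # 100: every task is handled exactly once
```

Each task is taken by exactly one worker regardless of which cluster it runs in, which is why the burst nodes can join or leave without coordination beyond the queue itself.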
Impact and Results
SNOW's cloud-native transformation demonstrates how combining Kubernetes with the broader CNCF ecosystem can overcome fundamental limitations in GPU scheduling and observability at extreme scale.
MTTR Improvement
91%
Reduced from ~2 hrs to ~10 min
Surge Error Reduction
85%
GPU surge-related user errors during peak traffic
Cost Savings
USD 17.4M
Estimated savings vs. on-demand cloud GPU provisioning
GPU Usage Time
-55%
Average GPU usage time reduced by proactive autoscaling
GPU Requirement
2× Fewer
GPUs needed for training + inference via HAMi sharing
Operational Savings
10.8 Man-months
Manual scaling eliminated, metric-driven automation adopted
“SNOW's cloud-native transformation demonstrates how combining Kubernetes with the broader CNCF ecosystem can overcome fundamental limitations in GPU scheduling and observability at extreme scale. By introducing HAMi for GPU sharing and augmenting KEDA with proactive, custom metrics, SNOW shifted from reactive, manual operations to predictive, automated orchestration.”
A Replicable Blueprint for Large-Scale GenAI
SNOW's approach highlights a replicable blueprint for running large-scale GenAI workloads — where intelligent resource utilization, autoscaling precision, and deep observability are critical to turning infrastructure challenges into competitive advantage.