Case Study | SNOW Corp Scales GenAI for 200M Users with HAMi GPU Sharing and KEDA Autoscaling
Discover how SNOW Corp orchestrates 1,000+ GPUs to handle 700% viral traffic spikes, achieving 91% MTTR reduction, 85% fewer surge errors, and USD 17.4M in estimated cost savings using HAMi and KEDA on Kubernetes.
Company Overview
SNOW Corp., a subsidiary of NAVER from South Korea, operates a fleet of 1,000+ A100 GPUs serving GenAI features for 200M+ global users across three top-ranked apps (SNOW, EPIK, B612). The infrastructure serves 1,200+ AI workflows and 400+ models, handling extreme traffic volatility from viral AI trends.
Subsidiary of NAVER, South Korea
Three top-ranked GenAI apps: SNOW, EPIK, B612
1,200+ AI workflows and 400+ models in production
Multi-region on-premises Kubernetes platform
SNOW Corp.
NAVER subsidiary serving 200M+ GenAI users globally
Challenge: GPU Scheduling at Extreme Scale
Kubernetes' native GPU scheduling treats GPUs as atomic resources — a pod either gets a full GPU or nothing. This model broke down under SNOW's heterogeneous workload demands and unpredictable viral traffic spikes.
Heterogeneous Workloads
Training, inference, and batch processing have vastly different GPU utilization profiles
Traffic Unpredictability
700% viral traffic spikes with no GPU-level observability
Scheduling Blindness
2-3 containers competing for GPU resources with no coordination
Cost Explosion
~2x over-provisioning to handle peak loads
Solution: CNCF-Powered GPU Orchestration Stack
SNOW migrated to a multi-region on-premises Kubernetes platform with the CNCF ecosystem underpinning the entire stack: Cilium for CNI, Helm for GitOps-based deployment, Traefik for ingress, Prometheus/Loki/Grafana for observability, HAMi for GPU sharing, and KEDA for autoscaling.
GPU Sharing with HAMi
Kubernetes' default scheduler enforces strict GPU isolation, which blocked migration of SNOW's sequential Train-to-Inference pipelines — where a trainer and inference engine must share a single GPU. HAMi resolves this by virtualizing GPU resources (vGPU), enabling multiple containers within the same pod to share a single GPU concurrently.
Native kube-scheduler integration
Autoscaling ecosystem compatible
Half as many GPUs for train+inference
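The effect of sharing on GPU count can be illustrated with a toy placement model. This is a minimal sketch, not HAMi's scheduler: the 80 GiB device size, the workload memory figures, and all names (`Workload`, `schedule_whole_gpu`, `schedule_shared`) are assumptions made for illustration. It compares whole-device allocation with a simple first-fit placement by memory slice.

```python
from dataclasses import dataclass, field

GPU_MEMORY_GIB = 80  # assumption: 80 GiB device; the source does not state the A100 variant


@dataclass
class Workload:
    name: str
    memory_gib: int  # hypothetical peak GPU memory requirement


@dataclass
class Gpu:
    free_gib: int = GPU_MEMORY_GIB
    tenants: list = field(default_factory=list)


def schedule_whole_gpu(workloads):
    """Whole-device model: every workload occupies its own GPU."""
    gpus = []
    for w in workloads:
        gpu = Gpu()
        gpu.free_gib -= w.memory_gib
        gpu.tenants.append(w.name)
        gpus.append(gpu)
    return gpus


def schedule_shared(workloads):
    """Shared model: first-fit placement by memory slice on existing GPUs."""
    gpus = []
    for w in workloads:
        if w.memory_gib > GPU_MEMORY_GIB:
            raise ValueError(f"{w.name} does not fit on a single GPU")
        # Reuse the first GPU with enough free memory, else open a new one.
        target = next((g for g in gpus if g.free_gib >= w.memory_gib), None)
        if target is None:
            target = Gpu()
            gpus.append(target)
        target.free_gib -= w.memory_gib
        target.tenants.append(w.name)
    return gpus


# Hypothetical train-to-inference pipeline with made-up memory figures.
pipeline = [Workload("trainer", 40), Workload("inference-engine", 30)]
whole = schedule_whole_gpu(pipeline)
shared = schedule_shared(pipeline)
print(len(whole), len(shared))
```

With these assumed figures the whole-device model needs two GPUs and the shared model needs one, which matches the halving reported for train+inference pipelines. Real placement also has to account for compute contention, which this sketch ignores.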
Proactive GPU Orchestration with KEDA
Standard metrics (CPU/RAM, DCGM utilization) proved unreliable for SNOW's heterogeneous workloads. KEDA's built-in RabbitMQ scaler functioned as a lagging indicator — given a ~60-second model warm-up time, scaling triggered after a queue backlog formed was consistently too late.
Custom Metric Server for KEDA
Proactive scaling before saturation
Smart scale-in with cooldown
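To make the lagging-indicator problem concrete, the following is a minimal sketch of a proactive replica calculation. It is not SNOW's custom metric server: the formula, the function names, the 30-second drain target, and the 300-second cooldown are all assumptions for illustration; only the ~60-second warm-up comes from the text. The idea is to size the fleet for the backlog that would exist by the time new replicas finish warming up, rather than for the backlog visible now.

```python
import math


def proactive_replicas(queue_depth, arrival_rate, per_replica_rate,
                       current_replicas, warmup_s=60.0, drain_s=30.0):
    """Replicas needed once newly started pods become ready.

    Rates are in requests per second. drain_s is a hypothetical target
    for clearing the projected backlog after warm-up.
    """
    if per_replica_rate <= 0:
        raise ValueError("per_replica_rate must be positive")
    current_capacity = current_replicas * per_replica_rate
    # Backlog expected at the end of the warm-up window if nothing changes.
    projected_backlog = max(
        0.0, queue_depth + (arrival_rate - current_capacity) * warmup_s
    )
    steady_state = arrival_rate / per_replica_rate
    drain_extra = projected_backlog / (per_replica_rate * drain_s)
    return max(1, math.ceil(steady_state + drain_extra))


def next_replicas(current, desired, seconds_below, cooldown_s=300.0):
    """Scale out immediately; scale in only after a sustained quiet period."""
    if desired >= current:
        return desired
    return desired if seconds_below >= cooldown_s else current


# The queue is still empty, but arrivals already exceed capacity (20 req/s).
print(proactive_replicas(queue_depth=0, arrival_rate=50,
                         per_replica_rate=5, current_replicas=4))
```

In this example a queue-depth trigger would see nothing to react to, while the projection already asks for 22 replicas: 10 to match the arrival rate and 12 to drain the 1,800 requests expected to pile up during warm-up. A value like this could be exposed to KEDA through an external or HTTP metric source, though how SNOW wired its metric server is not described in the source.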
Hybrid Cloud Bursting for Viral Spikes
When viral trends like the 'Ghibli Filter' tripled traffic within 3 hours, SNOW expanded dynamically into CSP regions using a unified GitOps pipeline that deployed identical Helm charts across all clusters.
Unified GitOps pipeline
CSP nodes consume from central queue
Zero service interruption
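The bursting decision itself can be sketched as a capacity split. This is an illustrative assumption rather than SNOW's actual logic: the function name `plan_burst`, the cluster names, and the even spread across CSP clusters are hypothetical. Because every cluster runs the same chart and consumes from the same central queue, the plan only has to decide how many consumer replicas each cluster runs.

```python
def plan_burst(desired_replicas, on_prem_capacity, csp_clusters):
    """Fill on-prem capacity first, then spread the overflow across CSP clusters."""
    if desired_replicas < 0 or on_prem_capacity < 0:
        raise ValueError("replica counts must be non-negative")
    on_prem = min(desired_replicas, on_prem_capacity)
    overflow = desired_replicas - on_prem
    plan = {"on-prem": on_prem}
    if overflow == 0:
        # No burst needed; CSP clusters stay at zero consumers.
        plan.update({name: 0 for name in csp_clusters})
        return plan
    if not csp_clusters:
        raise ValueError("overflow requested but no CSP cluster is available")
    # Even split; the first `extra` clusters take one additional replica.
    base, extra = divmod(overflow, len(csp_clusters))
    for index, name in enumerate(csp_clusters):
        plan[name] = base + (1 if index < extra else 0)
    return plan


# Hypothetical spike: 31 replicas wanted, 20 fit on-prem.
print(plan_burst(31, 20, ["csp-a", "csp-b"]))
```

For the hypothetical spike above the plan keeps 20 replicas on-prem and places 6 and 5 in the two CSP clusters. The per-cluster counts would then be supplied as values to the shared Helm chart.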
Impact and Results
SNOW's cloud-native transformation demonstrates how combining Kubernetes with the broader CNCF ecosystem can overcome fundamental limitations in GPU scheduling and observability at extreme scale.
MTTR Improvement
91%
Reduced from ~2 hrs to ~10 min
Surge Error Reduction
85%
During peak traffic
Cost Savings
USD 17.4M
vs. on-demand cloud GPU
“SNOW's cloud-native transformation demonstrates how combining Kubernetes with the broader CNCF ecosystem can overcome fundamental limitations in GPU scheduling and observability at extreme scale. By introducing HAMi for GPU sharing and augmenting KEDA with proactive, custom metrics, SNOW shifted from reactive, manual operations to predictive, automated orchestration.”
A Replicable Blueprint for Large-Scale GenAI
SNOW's approach highlights a replicable blueprint for running large-scale GenAI workloads — where intelligent resource utilization, autoscaling precision, and deep observability are critical to turning infrastructure challenges into competitive advantage.