This guide is for SREs and platform engineers. It walks through deploying HAMi Enterprise to a Kubernetes cluster, enabling GPU nodes, integrating monitoring, and verifying functionality.
⚠️ Installation ≠ Activation
After completing the Helm installation in this section, HAMi Enterprise core components will be running. However, GPU virtualization and scheduling features require a license activation to work properly.
The installation itself does not depend on a license, so you can complete the deployment first, then apply for and import the license in the subsequent steps. In short: install the software first, then obtain the license. Without activation, vGPU partitioning and scheduling will not work, and verification will fail.
Architecture & Positioning
HAMi Enterprise is the enterprise edition of the open-source HAMi project. It contains:
- Enhanced Device Plugin: replaces NVIDIA's default device-plugin with vGPU partitioning + memory oversubscription
- Scheduler Extensions: GPU topology-aware scheduling
- Exporter: HAMi metrics for Prometheus
- Enterprise Hardening: signed images, CVE patching pipeline, long-term support
Best fit for: multi-tenant GPU sharing, memory oversubscription, and unified scheduling across heterogeneous accelerators (NVIDIA, Ascend, Hygon DCU, and others).
Prerequisites
| Type | Requirement | Verify |
|---|---|---|
| Kubernetes | >= 1.24 | kubectl version (the --short flag was removed in kubectl 1.28) |
| Container Runtime | containerd or Docker | kubectl get nodes -o wide |
| Helm | >= 3.14 | helm version --short |
| GPU Driver | NVIDIA driver >= 470 (>= 550 recommended) | nvidia-smi |
| Prometheus | >= 2.37 (if integrating monitoring) | kubectl get pods -A \| grep prom |
| GPU Operator | Installed AND devicePlugin.enabled = false (recommended version: v25.3.2) | helm list -A \| grep gpu-operator |
Critical constraint: HAMi ships its own device-plugin, which conflicts with the NVIDIA GPU Operator's built-in device-plugin. If GPU Operator is installed, you must disable its device-plugin via --set devicePlugin.enabled=false.
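The version minimums in the prerequisites table can be checked in one pass. The following is a minimal sketch, not an official tool: the helper names (`version_ge`, `check_min`, `check_prereqs`) are hypothetical, the comparison relies on GNU `sort -V`, and `check_prereqs` assumes `kubectl`, `jq`, and `helm` are on the PATH (plus `nvidia-smi` when run on a GPU node).

```shell
#!/bin/sh
# version_ge A B: succeed when version A >= version B (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME FOUND MINIMUM: print one OK/FAIL line for a component.
check_min() {
  if version_ge "$2" "$3"; then
    echo "OK   $1 $2 (>= $3)"
  else
    echo "FAIL $1 $2 (< $3)"
    return 1
  fi
}

# check_prereqs: read versions from the current environment and compare
# them with the minimums from the prerequisites table.
check_prereqs() {
  k8s=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
  helm_v=$(helm version --template '{{.Version}}')
  check_min kubernetes "${k8s#v}" 1.24
  check_min helm "${helm_v#v}" 3.14
  # The driver check only makes sense on a node that has the driver installed.
  if command -v nvidia-smi >/dev/null 2>&1; then
    drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    check_min nvidia-driver "$drv" 470
  fi
}
```

Run `check_prereqs` after selecting the kubeconfig context for the target cluster. Version-aware sorting matters here because a plain string comparison would rank Helm 3.9 above 3.14.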
Install HAMi Enterprise
Two installation paths — choose based on your scenario:
- Online OCI install (evaluation, PoC, clusters with external network access)
- All-in-One Air-gap Bundle (finance / government / telecom isolated networks)
Regardless of path, you must apply for and activate a license at the end.
Path A: Online OCI Chart Install
If you wish to use a Chinese domestic mirror registry, please contact Dynamia.ai sales/support for details.
We recommend keeping the values files for all Helm releases in the cluster under version control. Use -f example-values.yaml to override the corresponding keys in the chart's default values.
After selecting the correct kubeconfig context, proceed:
If you haven't installed nvidia/gpu-operator yet, install it first.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set devicePlugin.enabled=false \
--set dcgmExporter.serviceMonitor.enabled=true \
--version=v25.3.2
If the cluster doesn't have a Prometheus monitoring stack, you'll also need to install one. Here's how to install prometheus-community/kube-prometheus-stack:
helm install prometheus \
oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack \
--version 72.3.0 \
--namespace monitoring \
--create-namespace \
--set alertmanager.enabled=false \
--set grafana.enabled=false
Install dynamia-ai/hami-enterprise (hami-commercial):
helm install hami \
oci://ghcr.io/dynamia-ai/hami-commercial/hami \
--version 2.9.0-rc1 \
--namespace hami-system \
--create-namespace
Common chart customization options for hami-enterprise (hami-commercial) are listed below. For the complete values reference, see: HAMi Helm Values Reference.
| Parameter | Description | Default |
|---|---|---|
| dra.enabled | Enable DRA | false |
| scheduler.leaderElect | Enable leader election for hami-scheduler | true |
| scheduler.replicas | Number of hami-scheduler replicas | 1 |
| scheduler.kubeScheduler.image.registry | Registry for kube-scheduler image used by hami-scheduler | "registry.cn-hangzhou.aliyuncs.com" |
| scheduler.kubeScheduler.image.repository | Repository for kube-scheduler image used by hami-scheduler | "google-containers/kube-scheduler" |
| scheduler.kubeScheduler.image.tag | Tag for kube-scheduler image used by hami-scheduler. If empty, the chart will infer an appropriate version. | "" |
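As an illustration of overriding these keys through a values file, the sketch below writes an example-values.yaml and wraps the install command from Path A in a function. The key names come from the table above; the chosen values (two scheduler replicas) and the function name `install_hami` are assumptions for illustration only.

```shell
#!/bin/sh
# Write a values file that overrides a few keys from the table above.
cat > example-values.yaml <<'EOF'
scheduler:
  leaderElect: true   # keep leader election on when running more than one replica
  replicas: 2         # illustrative value; the chart default is 1
dra:
  enabled: false
EOF

# install_hami: apply the overrides. `helm upgrade --install` covers both the
# first install and later changes to the values file.
install_hami() {
  helm upgrade --install hami \
    oci://ghcr.io/dynamia-ai/hami-commercial/hami \
    --version 2.9.0-rc1 \
    --namespace hami-system \
    --create-namespace \
    -f example-values.yaml
}
```

Committing example-values.yaml alongside the other release values files keeps the override history reviewable; call `install_hami` again after each change.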
Path B: All-in-One Air-gap Bundle
Please contact Dynamia.ai sales/support to obtain the download URL.
Download hami-enterprise-v<VERSION>-airgap-<ARCH>.tar.gz and hami-enterprise-v<VERSION>-airgap-<ARCH>.tar.gz.sha256.
The hami-enterprise air-gap bundle includes dynamia-ai/hami-enterprise, nvidia/gpu-operator, and prometheus-community/kube-prometheus-stack. Install as needed.
# Download the bundle and its .sha256 file
curl -L -O <URL>
# Or: wget <URL>
Verify integrity before extracting:
# Linux / macOS
shasum -a 256 -c hami-enterprise-vX.Y.Z-airgap-amd64.tar.gz.sha256
# Or manually compare
shasum -a 256 hami-enterprise-vX.Y.Z-airgap-amd64.tar.gz
cat hami-enterprise-vX.Y.Z-airgap-amd64.tar.gz.sha256
Extract the outer tar.gz:
# macOS
tar -xzf hami-enterprise-vX.Y.Z-airgap-amd64.tar.gz
# Linux (GNU tar)
tar -xaf hami-enterprise-vX.Y.Z-airgap-amd64.tar.gz
For the subsequent installation steps, refer to the extracted DEPLOY.md file.
Enable GPU Nodes
The HAMi device plugin only starts on nodes labeled gpu=on:
kubectl label nodes <node-name> gpu=on
Verify:
kubectl -n hami-system get pods should show hami-device-plugin-* and hami-scheduler-* in Running state.
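On clusters with many GPU nodes, labeling them one at a time is tedious. The sketch below labels every node that matches a selector. The default selector `nvidia.com/gpu.present=true` is an assumption (GPU Operator's feature discovery normally sets it on NVIDIA GPU nodes), and the function names are hypothetical; pass your own selector if your cluster marks GPU nodes differently.

```shell
#!/bin/sh
# label_gpu_nodes [SELECTOR]: add gpu=on to every node matching SELECTOR.
# Assumption: without an argument, nodes are found through the label that
# GPU Operator usually applies (nvidia.com/gpu.present=true).
label_gpu_nodes() {
  selector=${1:-nvidia.com/gpu.present=true}
  kubectl get nodes -l "$selector" -o name | while read -r node; do
    # $node has the form node/<name>, which kubectl label accepts directly.
    kubectl label "$node" gpu=on --overwrite
  done
}

# show_gpu_nodes: list the nodes the HAMi device plugin is allowed to start on.
show_gpu_nodes() {
  kubectl get nodes -l gpu=on -o name
}
```

Run `label_gpu_nodes`, then `show_gpu_nodes` to confirm the result before checking the hami-system pods.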
Monitoring Integration
Ensure Prometheus can scrape HAMi and DCGM-Exporter metrics.
The ServiceMonitor resource's metadata.labels must match Prometheus's spec.serviceMonitorSelector; otherwise Prometheus won't discover these monitors.
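The matching rule can be made concrete: with a matchLabels selector, every key=value pair in the selector must also be present on the ServiceMonitor. The sketch below implements that comparison in plain shell. The helper names are hypothetical, and the namespaces (`monitoring`, `hami-system`) are assumptions taken from the install commands in this guide; matchExpressions selectors are not handled.

```shell
#!/bin/sh
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists. Succeeds when every
# pair in SELECTOR also appears in LABELS (matchLabels semantics).
selector_matches() {
  old_ifs=$IFS
  IFS=,
  for pair in $1; do
    case ",$2," in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; echo "missing label: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# prometheus_selector: print the matchLabels of each Prometheus resource in
# the monitoring namespace as key=value pairs.
prometheus_selector() {
  kubectl -n monitoring get prometheus -o \
    go-template='{{range .items}}{{range $k, $v := .spec.serviceMonitorSelector.matchLabels}}{{$k}}={{$v}},{{end}}{{end}}'
}

# servicemonitor_labels NAME: print the labels of one ServiceMonitor in
# hami-system as a comma-separated key=value list.
servicemonitor_labels() {
  kubectl -n hami-system get servicemonitor "$1" --no-headers --show-labels | awk '{print $NF}'
}
```

A typical call is `selector_matches "$(prometheus_selector)" "$(servicemonitor_labels <name>)"`. With kube-prometheus-stack installed as release `prometheus`, the selector is commonly `release=prometheus`, so a ServiceMonitor without that label would be reported as `missing label: release=prometheus`.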
Verify Metrics Collection
| Exporter | Query | Expected |
|---|---|---|
| dcgm-exporter | DCGM_FI_DEV_GPU_UTIL | non-empty value |
| hami-exporter | HostCoreUtilization | non-empty value |
| hami-device-plugin-exporter | GPUDeviceCoreAllocated | non-empty value |
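The three queries in the table can also be run from a terminal against the Prometheus HTTP API (`/api/v1/query`). This is a minimal sketch under stated assumptions: Prometheus is reachable at http://localhost:9090 (for example through a port-forward), the function names are hypothetical, and "non-empty" is detected by a simple string check on the compact JSON that Prometheus returns rather than by full JSON parsing.

```shell
#!/bin/sh
# has_samples: read a Prometheus /api/v1/query response on stdin and succeed
# only when the query succeeded and returned at least one series.
has_samples() {
  body=$(cat)
  case "$body" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$body" in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}

# check_metrics [BASE_URL]: run the three queries from the table above.
check_metrics() {
  base=${1:-http://localhost:9090}
  for q in DCGM_FI_DEV_GPU_UTIL HostCoreUtilization GPUDeviceCoreAllocated; do
    if curl -fsS -G "$base/api/v1/query" --data-urlencode "query=$q" | has_samples; then
      echo "OK      $q"
    else
      echo "MISSING $q"
    fi
  done
}
```

One way to expose Prometheus locally, assuming the operator-managed service name, is `kubectl -n monitoring port-forward svc/prometheus-operated 9090` in a second terminal, followed by `check_metrics`.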
License Activation
Please complete the above installation steps and ensure all component pods are running before proceeding with activation.
Run the following script to collect license information (requires kubectl, jq):
# Online install
curl -fsSL https://dynamia.ai/scripts/collect-hami-license-info.sh | bash
# Air-gap install (bundled in the package)
bash collect-hami-license-info.sh
After execution, you will see JSON output like:
{
  "kube_system_uid": "bd8bce4f-f440-48e0-bf74-4ea2b6419c8b",
  "collection_time": "2026-05-28T03:00:39Z",
  "hami_install_location_namespace": "hami-system",
  "total_licenses": 1,
  "licenses": [
    {
      "uuid": "GPU-6762ec8e-2ce2-9ae4-df13-3e2e5cf17e53",
      "reminder": 10,
      "expire": "2026-06-21T10:04:41.468Z",
      "node_name": "172.28.135.11"
    }
  ]
}
Send this information to Dynamia.ai sales/support to obtain your license.
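Before sending the output, it can help to save it to a file and confirm that it is complete. The sketch below checks that `total_licenses` equals the number of `uuid` entries, using only `sed` and `grep` so it also runs on hosts without `jq`. The function name and the file name `hami-license-info.json` are hypothetical, and the check assumes the JSON layout shown above.

```shell
#!/bin/sh
# license_info_consistent FILE: succeed when total_licenses equals the number
# of "uuid" entries found in the collected JSON.
license_info_consistent() {
  total=$(sed -n 's/.*"total_licenses": *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
  count=$(grep -o '"uuid"' "$1" | wc -l | tr -d ' ')
  if [ -n "$total" ] && [ "$total" = "$count" ]; then
    echo "ok: $count license entries"
  else
    echo "mismatch: total_licenses=${total:-?} but found $count uuid entries"
    return 1
  fi
}
```

For example, redirect the collection script's output with `bash collect-hami-license-info.sh > hami-license-info.json`, then run `license_info_consistent hami-license-info.json`. A mismatch usually suggests the output was truncated while copying.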
Post-Activation Verification
# 1. Pod status
kubectl -n hami-system get pods
# 2. Device Plugin GPU resources
# (-A 10 so the nvidia.com/* entries, which follow cpu, storage, hugepages, and memory, are included)
kubectl describe node <gpu-node> | grep -A 10 'Capacity:'
# Expect: nvidia.com/gpu: <N> and nvidia.com/gpumem: <MB>
# 3. Submit a test pod to verify scheduling
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: hami-smoke
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/gpumem: 2000
EOF
# Wait for the pod to finish before reading its logs
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/hami-smoke --timeout=120s
kubectl logs hami-smoke
Expected: nvidia-smi shows GPU information with memory capped at 2000 MiB.
Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| Images fail to pull | Node has no external network or poor connectivity to ghcr.io | Contact Dynamia.ai sales/support for domestic mirror registry or the All-in-One air-gap bundle |
| device-plugin pod Pending or missing | Node not labeled gpu=on | kubectl label nodes <node> gpu=on |
| device-plugin pod CrashLoopBackOff | Conflict with NVIDIA's default device-plugin | Disable GPU Operator's devicePlugin (--set devicePlugin.enabled=false) |
| Prometheus missing HAMi metrics | serviceMonitorSelector doesn't match the ServiceMonitor labels | Align .spec.serviceMonitorSelector of prometheus/prometheus-kube-prometheus-prometheus with the hami-enterprise ServiceMonitor labels |
| nvidia-smi errors | GPU driver not ready | Check driver pod status in gpu-operator namespace |
Get Support
- Email: info@dynamia.ai
- Sales / Support: 400-026-7800