From Device Plugin to DRA: GPU Scheduling Paradigm Shift and HAMi-DRA in Practice

KCD Beijing 2026 was one of the largest Kubernetes community events in recent years, with over 1,000 registrations, setting a new record for KCD Beijing.

The HAMi community was invited to deliver a technical talk and also maintained a booth, engaging deeply with developers and enterprise users from the cloud-native and AI infrastructure space.

The talk was delivered by two core HAMi community contributors:

  • Wang Jifei (Dynamia, HAMi Approver, lead HAMi-DRA contributor)
  • James Deng (4Paradigm, HAMi Reviewer)

The topic: From Device Plugin to DRA: GPU Scheduling Paradigm Shift and HAMi-DRA in Practice.

This article offers a fuller technical review that combines the on-site presentation with the slides. The slides are available for download on GitHub (HAMi-DRA KCD Beijing 2026).

Event Recap

Figure 1: Main conference hall

Figure 2: Attendee registration

Figure 3: Attendees visiting the HAMi booth

Figure 4: Volunteers stamping for attendees

Figure 5: Wang Jifei presenting

Figure 6: James Deng presenting

The GPU Scheduling Paradigm Is Shifting

The core of this talk goes beyond DRA itself — it addresses a larger transformation:

GPUs are evolving from "devices" into "resource objects."

Behind this shift is a fundamental change in how AI workloads consume GPUs. GPUs are no longer suited to simple whole-card exclusive allocation — they need to be shared, partitioned, scheduled, and governed.

The Ceiling of Device Plugin

The traditional Device Plugin model suffers from limited expressiveness:

  • Can only describe "quantity" (nvidia.com/gpu: 1)
  • Cannot express multi-dimensional resources (memory / cores / slices)
  • Cannot express multi-card combinations
  • Cannot express topology (NUMA / NVLink)

These limitations directly lead to:

  • Scheduling logic leaking out of the scheduler into add-on components (extenders / sidecars)
  • Increased system complexity
  • Constrained scheduling concurrency

As AI workloads enter inference serving and multi-tenant mixed scenarios, these problems are rapidly magnified.
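To make the expressiveness gap concrete, here is a minimal Python sketch. It is not HAMi code: the `Gpu` record, the device names, and the memory figures are made up for illustration. It contrasts a node-wide counter, which is all a quantity-only resource can carry, with per-device bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    """Hypothetical per-device record; field names are illustrative only."""
    name: str
    free_memory_mib: int


def fits_by_counter(gpus: list[Gpu], request_mib: int) -> bool:
    # Counter view: only a node-wide total is visible.
    return sum(g.free_memory_mib for g in gpus) >= request_mib


def fits_by_device(gpus: list[Gpu], request_mib: int) -> bool:
    # Device view: the request must fit on a single physical card.
    return any(g.free_memory_mib >= request_mib for g in gpus)


node = [Gpu("gpu-0", 2000), Gpu("gpu-1", 2000)]
print(fits_by_counter(node, 4000))  # True: the total looks sufficient
print(fits_by_device(node, 4000))   # False: no single card can hold it
```

The two answers disagree because the counter has discarded which card owns which memory. That missing per-device detail is the kind of information that otherwise has to be rebuilt in extenders or sidecars.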

DRA: A Leap in Resource Modeling

DRA (Dynamic Resource Allocation) is a significant upgrade to the Kubernetes resource model, with core advantages including:

  • Multi-dimensional resource modeling — going beyond quantity to express fine-grained dimensions like memory and compute
  • Complete device lifecycle management — a full closed loop from resource discovery through allocation to reclamation
  • Fine-grained resource allocation — more flexible resource composition

The key structural change:

Resource requests move from embedded Pod fields to independent ResourceClaim objects.

This gives GPU resources the same "first-class citizen" status as Pods and PVCs, allowing the scheduler to manage GPU resources the same way it manages storage volumes.

The Reality: DRA Is Too Complex

DRA's capabilities are undeniable, but there's an often-overlooked practical issue: the user experience has clearly regressed.

Device Plugin syntax

resources:
  limits:
    nvidia.com/gpu: 1

DRA syntax

spec:
  devices:
    requests:
    - exactly:
        allocationMode: ExactCount
        count: 1
        capacity:
          requests:
            memory: 4194304k

Plus you need to write a CEL selector:

device.attributes["gpu.hami.io"].type == "hami-gpu"

The comparison is stark:

DRA is an upgrade in capability, but a clear downgrade in user experience.

For enterprises already using Device Plugin, the migration cost goes beyond rewriting YAML: the whole team has to learn a new resource declaration paradigm.

HAMi-DRA's Key Breakthrough: Automated Migration

This was one of the most valuable parts of the talk.

HAMi's approach doesn't ask users to "use DRA directly." Instead, it takes a more pragmatic strategy:

Let users keep using Device Plugin syntax, and have the system automatically convert to DRA.

How It Works

Through a Mutating Webhook, HAMi-DRA automatically performs the conversion during Pod creation:

Input (user side, keeping Device Plugin syntax):

nvidia.com/gpu: 1
nvidia.com/gpumemory: 4000

Webhook auto-conversion:

  • Generates ResourceClaim objects
  • Constructs CEL selectors
  • Injects device constraints (UUID / GPU type)

Output (system internal):

  • Standard DRA resource objects
  • Scheduler-recognizable resource expressions

The core value of this design:

Transforming DRA from an "expert interface" into an interface ordinary users can work with.

Users don't need to understand new concepts like ResourceClaim or CEL selectors. They simply write nvidia.com/gpu as before, and the system handles the underlying complexity.
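The conversion step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HAMi-DRA's implementation: the function name `convert`, the DeviceClass name, the claim naming scheme, and the MiB unit for nvidia.com/gpumemory are all hypothetical. Only the resource names and the CEL expression come from the examples above. A real webhook would also patch the Pod so that it references the generated claim, which this sketch omits.

```python
import json

# Assumptions for illustration only; none of these values come from HAMi-DRA.
DEVICE_CLASS = "hami-gpu-class"  # hypothetical DeviceClass name
CEL_EXPR = 'device.attributes["gpu.hami.io"].type == "hami-gpu"'


def convert(pod_name: str, limits: dict) -> dict:
    """Translate Device Plugin style limits into a ResourceClaim-shaped dict."""
    count = int(limits.get("nvidia.com/gpu", 0))
    if count < 1:
        raise ValueError("no GPU requested")
    exactly = {
        "deviceClassName": DEVICE_CLASS,
        "allocationMode": "ExactCount",
        "count": count,
        "selectors": [{"cel": {"expression": CEL_EXPR}}],
    }
    mem = limits.get("nvidia.com/gpumemory")
    if mem is not None:
        # Rendered as a Kubernetes quantity string; the Mi unit is assumed.
        exactly["capacity"] = {"requests": {"memory": f"{int(mem)}Mi"}}
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": f"{pod_name}-gpu"},  # hypothetical naming scheme
        "spec": {"devices": {"requests": [{"name": "gpu", "exactly": exactly}]}},
    }


claim = convert("demo", {"nvidia.com/gpu": 1, "nvidia.com/gpumemory": 4000})
print(json.dumps(claim, indent=2))
```

The point of the sketch is the direction of the mapping: two flat key/value pairs on the user side expand into a nested claim with a count, a capacity request, and a selector, none of which the user has to write.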

DRA Driver: More Than Just "Registering Resources"

Implementing a DRA Driver involves far more than "registering resources with the scheduler." The Driver takes on full device lifecycle management.

Three Core Interfaces

  • Publish Resources: advertising the node's available devices so the scheduler can allocate them
  • Prepare Resources: preparing resources before the Pod's containers start (injecting libvgpu.so, configuring ld.so.preload, managing environment variables and temporary directories)
  • Unprepare Resources: reclaiming resources after the Pod is deleted

This means:

GPU scheduling has entered the runtime orchestration layer — it's no longer just simple resource allocation.

From the user's perspective, the Pod creation timeline is extended — after the scheduler matches resources, the Driver still needs to complete device initialization, runtime injection, and a series of other operations before the Pod can run normally.
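The three responsibilities can be pictured with a small in-memory stand-in. This is a sketch, not HAMi-DRA code: the class `SketchDriver`, its method names, and the temporary directory path are hypothetical, and the prepare and unprepare steps are made idempotent on the assumption that the caller may retry them.

```python
class SketchDriver:
    """In-memory stand-in for the three driver responsibilities."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.prepared = {}  # claim UID -> runtime settings injected for it

    def publish_resources(self):
        # A real driver advertises its inventory through the Kubernetes API.
        return [{"name": d} for d in self.devices]

    def prepare_resources(self, claim_uid, device):
        # Idempotent: repeated calls for one claim return the same record.
        if claim_uid not in self.prepared:
            self.prepared[claim_uid] = {
                "device": device,
                "preload": "libvgpu.so",  # library named in the talk
                "tmp_dir": f"/tmp/sketch/{claim_uid}",  # hypothetical path
            }
        return self.prepared[claim_uid]

    def unprepare_resources(self, claim_uid):
        # Tolerates unknown claims so that cleanup can be retried safely.
        self.prepared.pop(claim_uid, None)


driver = SketchDriver(["gpu-0"])
print(driver.publish_resources())
print(driver.prepare_resources("uid-1", "gpu-0"))
driver.unprepare_resources("uid-1")
print(driver.prepared)
```

Keeping the per-claim state in one place is what lets the reclaim step undo exactly what the prepare step set up.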

Performance Gains: More Than Just "More Elegant"

HAMi-DRA doesn't just offer a cleaner architecture — it delivers tangible performance improvements.

Pod Creation Time Comparison

  • HAMi (traditional mode): peak approximately 42,000
  • HAMi-DRA: significantly lower (an improvement of more than 30%)

This improvement comes from DRA's resource pre-binding mechanism: resource allocation is determined during the scheduling phase, reducing scheduling conflicts and retries.

For large-scale AI clusters, Pod creation speed directly impacts task startup latency and cluster throughput. A 30%+ improvement has significant implications in production environments.

Observability Paradigm Shift

A subtle but important change lies in observability.

Traditional Model

  • Resource information comes from Node
  • Usage information comes from Pod
  • Requires aggregation and inference to build a complete resource view

DRA Model

  • ResourceSlice describes device inventory
  • ResourceClaim describes resource allocation
  • Resource perspective is first-class

This means:

Observability shifts from "inference" to "direct modeling."

Operations teams can directly see through ResourceClaims which GPU is occupied by whom, how much memory is allocated, and how much remains — without having to reverse-engineer this from Node status and Pod configurations.
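As a rough illustration of that direct view, the Python sketch below joins a device inventory with a list of allocations. The records are heavily simplified stand-ins for ResourceSlice and ResourceClaim data, and every name and number in them is invented for the example.

```python
from collections import defaultdict

# Simplified, hypothetical stand-ins for ResourceSlice and ResourceClaim data.
slices = [
    {"device": "gpu-0", "memory_mib": 16000},
    {"device": "gpu-1", "memory_mib": 16000},
]
claims = [
    {"claim": "team-a/train", "device": "gpu-0", "memory_mib": 4000},
    {"claim": "team-b/infer", "device": "gpu-0", "memory_mib": 6000},
]


def usage_report(slices, claims):
    """Per device: who holds it, how much is allocated, how much remains."""
    used = defaultdict(int)
    owners = defaultdict(list)
    for c in claims:
        used[c["device"]] += c["memory_mib"]
        owners[c["device"]].append(c["claim"])
    return {
        s["device"]: {
            "owners": owners[s["device"]],
            "allocated_mib": used[s["device"]],
            "remaining_mib": s["memory_mib"] - used[s["device"]],
        }
        for s in slices
    }


report = usage_report(slices, claims)
for device, row in report.items():
    print(device, row)
```

Both inputs are already keyed by device, so the report is a plain join. No inference from Node capacity or Pod specs is needed.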

Unified Modeling: The Future of Heterogeneous Devices

If device attributes can be standardized, a vendor-agnostic scheduling model becomes possible.

For example, through standardized attribute fields describing:

  • PCIe root complex
  • PCI bus ID
  • GPU core attributes

This points to a larger narrative:

DRA is the starting point for heterogeneous compute abstraction.

When accelerators from different vendors — Huawei Ascend, Cambricon, AMD, and others — all connect to Kubernetes through a unified attribute model, the scheduler can achieve truly cross-vendor resource management, without needing to maintain separate scheduling logic for each hardware vendor.
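A short Python sketch shows why shared attribute names matter. The attribute keys, device names, and PCIe root values are illustrative and do not follow any published schema. The selection logic never inspects the vendor field.

```python
from itertools import combinations

# Attribute keys and values are illustrative, not a published schema.
devices = [
    {"name": "nvidia-0", "vendor": "nvidia", "pcie_root": "pci0000:00"},
    {"name": "ascend-0", "vendor": "huawei", "pcie_root": "pci0000:80"},
    {"name": "nvidia-1", "vendor": "nvidia", "pcie_root": "pci0000:00"},
    {"name": "mlu-0", "vendor": "cambricon", "pcie_root": "pci0000:80"},
]


def same_root_pairs(devices):
    """Return device-name pairs that share a PCIe root, ignoring vendor."""
    return [
        (a["name"], b["name"])
        for a, b in combinations(devices, 2)
        if a["pcie_root"] == b["pcie_root"]
    ]


pairs = same_root_pairs(devices)
print(pairs)  # [('nvidia-0', 'nvidia-1'), ('ascend-0', 'mlu-0')]
```

Because every device reports its topology under the same key, one function covers all vendors, including the mixed-vendor pair in the second result.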

The Bigger Trend: Kubernetes Is Becoming the AI Control Plane

Connecting these changes reveals a clear trend:

  • From scheduling "machines" to scheduling "resource objects" — Node is no longer the minimum scheduling unit
  • From "device" to "virtual resource" — GPU is no longer a physical card, but a divisible, composable resource
  • From "imperative" to "declarative" — scheduling logic is replaced by resource declarations

Fundamentally:

Kubernetes is evolving into the control plane for AI infrastructure.

HAMi's Positioning

Within this trend, HAMi's positioning is becoming increasingly clear:

The GPU resource layer for Kubernetes.

  • Downward: adapting to heterogeneous GPUs (NVIDIA / Huawei Ascend / Cambricon, etc.)
  • Upward: supporting AI workloads (training / inference / Agent)
  • In between: scheduling + virtualization + resource abstraction

And HAMi-DRA is the key step that aligns this resource layer with Kubernetes' native model.

Closing

The real value of this KCD Beijing 2026 talk wasn't just introducing DRA — it was answering a more practical question:

How do you turn a "correct but hard to use" model into a system people can use today?

HAMi-DRA's answer:

  • Don't change user habits — keep using Device Plugin syntax
  • Absorb DRA capabilities — automatically convert to DRA resource model underneath
  • Handle complexity internally — Webhooks, Drivers, and lifecycle management are all handled by the system

This reflects the approach the HAMi community has always championed: advancing AI infrastructure through community collaboration, not closed systems. Contributors from different companies validate solutions in real production environments and share experiences through the community, benefiting more people.

If you're interested in HAMi-DRA or GPU scheduling, we invite you to join the HAMi community and help us advance AI compute resource management on Kubernetes.
