
NVIDIA BlueField-3 DPU Architecture and Roadmap

Dec 30, 2025

Modern hyperscale cloud technologies are driving data centers toward new architectural paradigms. A new class of processors—specifically designed for data center infrastructure software—is being adopted to offload and accelerate the massive computational workloads generated by virtualization, networking, storage, security, and other cloud-native AI services. This class of products is represented by the BlueField DPU family.

NVIDIA BlueField-3 DPU

As illustrated in NVIDIA’s BlueField DPU product roadmap, the lineup includes the already available second-generation BlueField-2, the upcoming BlueField-3 DPU delivering 400 Gb/s throughput, and the future BlueField-4 DPU, which will integrate NVIDIA GPU capabilities and scale up to 800 Gb/s.

NVIDIA DPU Development Roadmap

BlueField-3 is the first 400 Gb/s DPU purpose-built for AI and accelerated computing. It enables enterprises to achieve industry-leading performance and data center-grade security across applications of any scale. A single BlueField-3 DPU can deliver data center services equivalent to what would otherwise require up to 300 CPU cores, thereby freeing valuable CPU resources to run mission-critical business applications. Optimized for multi-tenant and cloud-native environments, BlueField-3 provides software-defined, hardware-accelerated networking, storage, security, and management services at the data center level.

The introduction of BlueField-3 addresses long-standing industry challenges related to end-to-end data security. BlueField-3 fully inherits the advanced capabilities of BlueField-2 and significantly enhances and extends them in terms of performance and scalability, as shown in the figure below.

BlueField Architecture Overview

At its core, the BlueField architecture tightly integrates a network interface subsystem with a programmable data path, dedicated hardware accelerator subsystems for functions such as encryption and compression, and an ARM-based processor subsystem for control and management. In BlueField-3, the Data Path Accelerator (DPA) includes 16 processing cores capable of handling up to 256 concurrent threads.

BlueField-3 DPU Overall Architecture

The key technical features of BlueField-3 are described below across networking, security, and storage workloads.

1. Networking Workloads

For networking workloads, BlueField-3 further enhances technologies such as RDMA, connection tracking, and ASAP². It also delivers improved time-synchronization accuracy, enabling precise clock synchronization between data centers and edge environments. Key technologies are analyzed below.

RDMA Technology

RDMA (Remote Direct Memory Access) enables direct data exchange between memory spaces, providing excellent scalability, higher performance, and significant CPU offloading. The main advantages of RDMA include:

1. Zero-copy: Applications can perform data transfers directly without traversing the network software stack. Data is sent directly to application buffers or received directly from them, eliminating intermediate copies through network layers.

2. Kernel bypass: Applications can transfer data entirely in user space, avoiding costly context switches between kernel and user modes.

3. No CPU involvement: Applications can access the memory of a remote host without consuming any CPU resources on that host, enabling remote read and write operations transparently.

4. Message-based transactions: Data is processed as discrete messages rather than streams, removing the need for applications to segment streams into individual messages or transactions. Messages up to 2 GB in size are supported.

5. Native scatter/gather support: RDMA natively supports scatter/gather operations, allowing data to be read from multiple memory buffers and transmitted as a single message, or received as one message and written into multiple buffers.
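To make the scatter/gather idea concrete without RDMA hardware, the following Python sketch gathers several source buffers into one contiguous message and scatters it back into separate destination buffers. This is purely illustrative: in real RDMA, a verbs work request carries a list of scatter/gather entries and the NIC performs the assembly, with no CPU copies.

```python
# Illustrative sketch of scatter/gather semantics (not real RDMA).
# In RDMA verbs, a work request lists scatter/gather entries (SGEs);
# the NIC assembles them into a single wire message in hardware.

def gather(buffers):
    """Gather multiple source buffers into one contiguous message."""
    return b"".join(buffers)

def scatter(message, sizes):
    """Scatter one received message back into buffers of given sizes."""
    out, offset = [], 0
    for size in sizes:
        out.append(message[offset:offset + size])
        offset += size
    return out

# Three separate application buffers go out as one message...
bufs = [b"hdr:", b"payload-data", b"|crc"]
msg = gather(bufs)

# ...and the receiver splits it back into its own set of buffers.
parts = scatter(msg, [len(b) for b in bufs])
```

The benefit on real hardware is that the application never has to concatenate or split the buffers itself; the NIC does both at line rate.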

GPU-Direct RDMA (GDR)

GPU-Direct RDMA (GDR) allows the network adapter to read and write GPU memory directly, so the GPU of one system can exchange data with the GPU memory of another system without staging through host memory. Prior to GDR, data had to be copied from GPU memory to system memory before RDMA transmission, and then copied again from system memory into GPU memory on the destination system.

GDR significantly reduces data copy operations during GPU communication and further lowers latency. Mellanox network adapters already support GPUDirect RDMA for both InfiniBand and RoCE transports. Following NVIDIA’s acquisition of Mellanox, all NVIDIA network adapters now fully support GPU-Direct RDMA technology.

2. Security Workloads

In terms of security, BlueField-3 delivers full line-rate, on-the-fly encryption and decryption at 400 Gb/s across the IP, transport, and MAC layers. When combined with RegEx-accelerated deep packet inspection (DPI), throughput can reach up to 50 Gb/s. Key security features are described below.

IPSec Acceleration

BlueField-3 supports the IPSec protocol, providing encryption and decryption at the IP layer while maintaining network line-rate performance. IPSec (Internet Protocol Security) is a suite of open security standards developed by the IETF (Internet Engineering Task Force). Rather than a single protocol, IPSec comprises a set of protocols and services designed to secure IP communications, supporting both IPv4 and IPv6 networks.

IPSec primarily includes the AH (Authentication Header) and ESP (Encapsulating Security Payload) security protocols, the IKE (Internet Key Exchange) key management protocol, and various authentication and encryption algorithms. By combining encryption and authentication, IPSec provides comprehensive security services for IP packets.
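As a rough illustration of the authentication service that AH provides, the sketch below computes and verifies an HMAC tag over a packet payload using only the Python standard library. The shared key here is a hypothetical placeholder; in real IPSec, keys are negotiated via IKE and the processing happens on IP packets in the kernel or, with BlueField-3, in NIC hardware.

```python
import hmac
import hashlib

# Toy illustration of packet authentication, loosely analogous to the
# integrity service of IPSec AH. Key management (IKE) is out of scope;
# this hypothetical key stands in for one negotiated between peers.
KEY = b"shared-secret-from-ike"

def protect(payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag so the receiver can detect tampering."""
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return payload + tag

def verify(packet: bytes) -> bytes:
    """Check the trailing tag; raise if the packet was modified in transit."""
    payload, tag = packet[:-32], packet[-32:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return payload

pkt = protect(b"ip-payload")

# Flipping even one byte of the payload invalidates the tag:
tampered = b"IP" + pkt[2:]
```

ESP adds encryption of the payload on top of this integrity check; BlueField-3 performs both in hardware at line rate.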

BlueField-3 can achieve IPSec encryption and decryption speeds of up to 400 Gb/s. In comparison, CPU-based IPSec implementations paired with 100 Gb/s or 200 Gb/s networks typically deliver only 20–40 Gb/s and consume substantial CPU resources. Offloading IPSec to BlueField-3 releases this CPU capacity for application workloads.

TLS Acceleration

BlueField-3 also supports TLS at the TCP layer, securing data in transit. TLS is the encryption protocol underlying HTTPS, designed to mitigate three major risks of plaintext HTTP communication:

Eavesdropping, where third parties can intercept communication content

Tampering, where communication data can be modified

Impersonation, where malicious actors can masquerade as legitimate parties

Accordingly, TLS is designed to ensure that:

All transmitted data is encrypted, preventing eavesdropping

Integrity checks detect any data modification immediately

Digital certificates prevent identity spoofing

TLS relies on public-key cryptography: the client obtains the server’s public key and uses it to encrypt data that only the server’s private key can decrypt. In practice, this asymmetric step is used to establish a shared session key, after which bulk traffic is encrypted symmetrically. BlueField-3 can achieve TLS encryption and decryption speeds of up to 400 Gb/s, once again offloading significant computational overhead from the CPU.
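The public-key step described above can be sketched with textbook RSA on deliberately tiny numbers. This is purely didactic: real TLS uses vetted key-exchange algorithms and key sizes, and the symmetric bulk encryption that follows is the 400 Gb/s path BlueField-3 accelerates.

```python
# Textbook RSA on toy numbers -- illustration only, never use in practice.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
e = 17                     # public exponent  -> (n, e) is the public key
d = pow(e, -1, phi)        # private exponent -> (n, d) is the private key

# The client encrypts with the server's public key...
message = 65
ciphertext = pow(message, e, n)

# ...and only the server's private key can decrypt it.
decrypted = pow(ciphertext, d, n)
```

In a real handshake this asymmetric exchange only establishes the session key; all subsequent application data is encrypted symmetrically under that key.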

3. Storage Workloads

In storage workloads, BlueField-3 enables capabilities that were previously difficult or impossible to achieve. It can emulate block storage, file storage, object storage, and NVMe storage, while offloading encryption and decryption operations—such as AES-XTS—during data persistence. Even cryptographic signing operations can be offloaded to the DPU.

Its Elastic Block Storage (EBS) performance can reach up to 18 million IOPS for read and write operations, while virtualization I/O acceleration can achieve up to 80 million packets per second (Mpps).

BlueField SNAP Technology

BlueField SNAP is a software-defined network-accelerated processing technology that allows users to access remote NVMe storage connected to servers as if it were local storage. It combines the efficiency and manageability of remote storage with the simplicity of local storage access.

The NVIDIA BlueField SNAP solution eliminates dependency on local storage and addresses growing cloud demands for storage disaggregation and composable storage architectures. SNAP integrates seamlessly into nearly any server environment, regardless of operating system or hypervisor, enabling faster adoption of NVMe over Fabrics (NVMe-oF) across data centers.

Delivered as part of the BlueField PCIe DPU SmartNIC portfolio, BlueField SNAP virtualizes physical storage so that network-attached flash storage behaves like local NVMe storage. Today, all major operating systems and hypervisors support local NVMe SSDs.

NVIDIA BlueField-3 DPU

By leveraging existing NVMe interfaces and combining them with the performance, manageability, and software transparency of local SSDs, BlueField SNAP delivers composability and flexibility for network flash storage. When combined with BlueField’s powerful multi-core ARM processors, virtual switching, and RDMA offload engines, SNAP supports a broad range of accelerated storage, software-defined networking, and application solutions. Together with ARM processing, SNAP can also accelerate distributed file systems, compression, deduplication, big data analytics, AI workloads, load balancing, and security applications.

4. Development Ecosystem

For the development ecosystem, NVIDIA provides the DOCA (Data Center on a Chip Architecture) software development kit, designed to enable and support the BlueField partner ecosystem. Through DOCA, developers can implement software-defined networking, storage, and security, and directly access BlueField’s hardware acceleration engines.

The NVIDIA DOCA SDK offers a complete and open development platform for building software-defined and hardware-accelerated networking, storage, security, and management applications on BlueField DPUs. DOCA includes runtime environments for creating, compiling, and optimizing applications; orchestration tools for configuring, upgrading, and monitoring thousands of DPUs across a data center; and an expanding set of libraries, APIs, and applications such as deep packet inspection and load balancing.

DOCA is a framework composed of libraries, memory management, and services built on a mature driver stack. Some libraries are derived from open-source projects, while others are proprietary to NVIDIA. Similar to how CUDA abstracts GPU programming, DOCA abstracts DPU programming at a higher level. NVIDIA delivers a complete solution by combining developer-focused DOCA SDKs with DOCA management software for out-of-the-box deployment.

For example, ASAP² (Accelerated Switching and Packet Processing) offloads the network data path in hardware and is delivered in binary form. It enables network device emulation through virtio, while low-level APIs configure the flow-tracking and RegEx accelerators. Security drivers provide in-kernel TLS offload, and SNAP drivers enable NVMe virtualization for storage workloads.

DOCA maintains backward compatibility across generations. NVIDIA’s vision is to establish the DPU as the third pillar of heterogeneous computing—complementing CPUs and GPUs—and DOCA is essential to realizing this vision across a wide range of applications.

The Role and Value of the DPU

DPUs extend the capabilities of SmartNICs by inheriting features such as CPU offload, programmability, task acceleration, and traffic management, while enabling unified programmable acceleration across both the control plane and data plane.

Traditionally, data center operations—including both compute workloads and infrastructure tasks—have relied heavily on CPUs. As data processing demands continue to grow, CPU performance has reached practical limits, and the slowing of Moore’s Law has become increasingly evident. GPUs emerged to address compute bottlenecks, but the data center bottleneck has now shifted toward infrastructure tasks such as data storage, data validation, and network security.

The DPU addresses this need by accelerating general-purpose infrastructure workloads. In a DPU-centric architecture, DPUs form a powerful infrastructure layer, while CPUs and GPUs focus on application compute. Key characteristics of a DPU include:

1. Industry-standard, high-performance, software-programmable multi-core CPUs, typically based on widely adopted ARM architectures and tightly integrated with other SoC components.

2. High-performance networking interfaces capable of parsing, processing, and efficiently delivering data to GPUs and CPUs at line rate.

3. Rich, flexible, programmable acceleration engines that offload and accelerate AI and machine learning, security, telecommunications, storage, and virtualization workloads.

| Dimension | SmartNIC | DPU (Data Processing Unit) |
| --- | --- | --- |
| Positioning | Improves server performance in cloud and private data centers by offloading networking and other workloads from the server CPU. | A data center-level computing processor that can exist as the smallest node in a data center. |
| Main Features | Frees up CPU overhead and is programmable; features task acceleration and traffic management. | Dual-plane offloading and acceleration for both data and control planes; covers all SmartNIC functions; standard, high-performance, software-programmable multi-core CPUs plus a rich set of flexible, programmable acceleration engines. |
| Ecosystem | Fragmented ecosystem with non-unified standards; high development difficulty and poor project portability. | Standard ecosystem; some vendors provide dedicated development platforms with high-level standard interfaces (such as NVIDIA's DOCA SDK), lowering the barrier to entry and development effort. |
| Application Scenarios | Accelerates specialized services such as storage, security, and data compression. | Data centers and cloud computing; network security; high-performance computing (HPC) and AI; communications and edge computing; data storage; streaming media, etc. |
| Value | Handles specialized services with relatively narrow functionality within the data center; passive and dependent on other devices. | Can function as a standalone data center unit with rich, expandable functions; set to become a standard data center component and one of the three core pillars (CPU, GPU, DPU); active, capable of serving as a computing node, NIC, or acceleration engine, and can exist independently. |

The core mission of the DPU is data pre-processing and post-processing. This includes networking tasks (such as ALL-to-ALL and point-to-point communication acceleration, IPSec, TCP connection tracking, and RDMA), storage tasks (distributed storage, encryption and decryption at rest, compression, redundancy algorithms), virtualization acceleration (OVS and hypervisor offload, separation of control and data planes), and hardware-based security (such as Root of Trust).

From a cloud computing perspective, the DPU effectively offloads the entire IaaS service stack into hardware acceleration.

SmartNICs typically fall into FPGA-based and ARM-based categories. FPGA-based SmartNICs struggle with control-plane processing, while ARM-based SmartNICs can become overloaded when handling diverse workloads. By providing dual-plane acceleration for both data and control planes, DPUs overcome these limitations.

Moreover, unlike traditional SmartNICs, DPUs can function as the smallest autonomous node in a data center, integrating compute, networking, acceleration engines, and security. As a result, DPUs are expected to become a standard component of future data centers and one of the three core pillars alongside CPUs and GPUs.

NVIDIA BlueField-3 DPU FAQs

1. What is NVIDIA BlueField-3 DPU used for?

The NVIDIA BlueField-3 DPU is used to offload, accelerate, and secure data center infrastructure workloads such as networking, storage, security, and virtualization. By handling these tasks in hardware, BlueField-3 frees CPU resources for application processing and improves overall data center performance and isolation.

2. How is BlueField-3 different from BlueField-2?

BlueField-3 significantly increases performance and scalability compared to BlueField-2. It supports up to 400 Gb/s throughput, offers enhanced Data Path Accelerators (DPA), improved RDMA and security offload, and delivers data center services equivalent to hundreds of CPU cores. BlueField-2 is limited to lower bandwidth and earlier acceleration capabilities.

3. What makes BlueField-3 suitable for AI and accelerated computing?

BlueField-3 is designed to support AI and accelerated workloads by providing high-bandwidth networking (400 Gb/s), RDMA and GPU-Direct RDMA support, and hardware-accelerated security. These features reduce latency, minimize data movement overhead, and ensure that GPUs and CPUs are dedicated to AI computation rather than infrastructure tasks.

4. Does BlueField-3 support hardware security acceleration?

Yes. BlueField-3 provides full line-rate hardware acceleration for security protocols such as IPSec and TLS at up to 400 Gb/s. It also supports deep packet inspection (DPI), RegEx acceleration, and root-of-trust capabilities, enabling strong isolation and zero-trust security models in multi-tenant cloud environments.

5. How does BlueField-3 improve storage performance?

BlueField-3 accelerates storage by offloading NVMe, NVMe-oF, block, file, and object storage operations to the DPU. With BlueField SNAP, remote NVMe storage can be accessed as if it were local, while encryption, compression, and virtualization tasks are handled in hardware, resulting in higher IOPS and lower CPU overhead.

6. What is NVIDIA DOCA, and how does it relate to BlueField-3?

NVIDIA DOCA is a software development framework that allows developers to build and deploy networking, storage, security, and management applications on BlueField DPUs. DOCA provides APIs, libraries, and tools to directly access BlueField-3’s hardware acceleration engines, simplifying DPU programming and enabling portable, future-proof infrastructure applications.

