UALink, Ultra Ethernet, and PCIe: Transforming Next-Generation HPC and AI Workloads
Abstract
As HPC and AI workloads fundamentally transform data center architectures, demand is growing for heterogeneous compute infrastructures with unprecedented bandwidth, low latency, and massive parallelism. Legacy interconnect technologies such as InfiniBand and conventional Ethernet can no longer address the scale and diversity of the AI training and scientific simulation workloads deployed today. Three complementary, vendor-neutral open standards, UALink, Ultra Ethernet, and next-generation PCIe with Compute Express Link (CXL), form a hierarchy of interconnect technologies optimized for future compute ecosystems. UALink provides multi-terabit throughput at sub-microsecond latency across flexible, vendor-neutral topologies with optimal GPU placement and contention-free configurations. Ultra Ethernet extends Ethernet networking protocols with deterministic forwarding, advanced congestion control, and hardware-accelerated collective communication primitives, enabling commodity Ethernet to serve as a low-latency fabric for scale-out AI workloads while retaining protocol compatibility. PCIe evolution through CXL enables heterogeneous, cache-coherent memory architectures that incorporate Managed DRAM, ReRAM, and persistent non-volatile memory. Managed DRAM with hardware-managed tiering between memory classes has demonstrated order-of-magnitude performance improvements over software-managed tiered memory systems. Together, these interconnect technologies address green computing concerns across the hardware life cycle and enable the topology-aware scheduling, data migration, and resource composition that will be essential for future cloud platforms handling heterogeneous workloads.
The success of these converged architectures will ultimately determine whether cloud-based data center architectures can cope with the computational demands of future AI, natural language processing, and scientific discovery workloads such as climate modeling.
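To make the tiering claim concrete, the following is a minimal, illustrative sketch of hot-page promotion between a fast "near" tier (local DRAM) and a slower "far" tier (e.g. CXL-attached memory). It is not taken from the article or from any CXL implementation; the class, thresholds, and eviction policy are all hypothetical simplifications of what a hardware-managed tiering engine does transparently.

```python
# Hypothetical sketch of hot-page promotion between memory tiers.
# Names, capacities, and thresholds are illustrative assumptions,
# not from the article or the CXL specification.

NEAR_CAPACITY = 2      # pages that fit in the fast (DRAM) tier
PROMOTE_THRESHOLD = 3  # far-tier accesses before a page is promoted

class TieredMemory:
    def __init__(self):
        self.near = set()   # pages resident in fast local DRAM
        self.far = set()    # pages resident in slower CXL memory
        self.hotness = {}   # per-page access counter

    def allocate(self, page):
        # New pages land cold in the far tier by default.
        self.far.add(page)
        self.hotness[page] = 0

    def access(self, page):
        # Near-tier hits are served directly; far-tier hits
        # accumulate hotness until the page is promoted.
        if page in self.near:
            return "near"
        self.hotness[page] += 1
        if self.hotness[page] >= PROMOTE_THRESHOLD:
            self._promote(page)
        return "far"

    def _promote(self, page):
        # If the fast tier is full, demote its coldest page first.
        if len(self.near) >= NEAR_CAPACITY:
            victim = min(self.near, key=lambda p: self.hotness[p])
            self.near.remove(victim)
            self.far.add(victim)
            self.hotness[victim] = 0
        self.far.remove(page)
        self.near.add(page)

mem = TieredMemory()
for p in ("a", "b"):
    mem.allocate(p)
for _ in range(3):
    mem.access("a")     # "a" becomes hot and is promoted
print(mem.access("a"))  # subsequent accesses are served from the near tier
```

The point of the sketch is the division of labor: the promotion logic runs on every access with no application involvement, which is what lets hardware-managed tiering react at memory-access granularity rather than at the page-fault or hint granularity available to software-managed tiering.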