What I Learned Scaling Transformer Models at Cohere
Arthur Hanson Rasmusson
Introduction
Hello, I'm Arthur Rasmusson. I've contributed to open-source projects for GPU virtualization at LibVF.IO and Open-IOV.org. In 2023, I joined Cohere's model efficiency team. Over the past year, I dedicated myself to projects aimed at improving our GPU clusters for AI inference and large-scale machine learning workloads.
This is the story of my journey at Cohere, from initiating the GPUDirect Storage project to setting a new bar for efficiency and power in GPU & AI infrastructure.
50,000 Foot View
In the rapidly evolving field of artificial intelligence, the lines between foundation model training and efficient model serving are often perceived as distinct domains. However, these are two sides of the same coin, interconnected by the challenges and opportunities of resource utilization and scalability. This article explores how Direct Memory Access (DMA) technologies serve as a bridge for compute clusters and organizations, enabling seamless scaling of both training and inference workloads.
Industry leaders need to reassess the traditional divide between training teams and model efficiency/model serving teams (focused on inference). This separation can create barriers to problem-solving and lead to suboptimal resource allocation. While training teams are perpetually constrained by insufficient computational resources, inference teams often reserve more GPUs than necessary to ensure real-time responses to API requests, resulting in enormous numbers of GPUs sitting idle due to spiky traffic patterns.
By recognizing and addressing this imbalance, organizations can optimize their infrastructure, reduce costs, and enhance overall efficiency. Technologies like GPUDirect Storage, RDMA, and techniques like Just-In-Time-Inference (JITI) & Just-In-Time-Training (JITT) offer pathways to bridge this divide, leading to near-perfect utilization of resources.
The Issue With Multi-Cloud
When Cohere initially started, training infrastructure was built around a single cloud, leveraging cloud-native object storage for both training data and checkpoints. We used a Network File System (NFS) driver to download from the object storage. This setup worked efficiently within the same cloud environment.
However, as we expanded and adopted additional GPU clouds, we encountered significant challenges. The need to download training data and checkpoints from the cloud-native object storage system to GPUs in other clouds introduced several issues:
- Data Egress Costs: Every GPU node involved in a training or inferencing job separately downloading data from our object storage resulted in redundant egress charges from newly added GPU clouds and additional costs from the object storage provider for data retrieval.
- Data Transfer Bottlenecks: The lack of data locality meant that transferring large amounts of data across clouds was inefficient and slow.
- Rate Limiting: The object storage provider imposed rate limits when hundreds of nodes simultaneously attempted to download data, leading to delays and hindering our ability to scale quickly.
This situation highlighted the concept that "data has gravity". When substantial data resides within a particular cloud provider, it creates a gravitational pull that makes it costly and inefficient to move workloads elsewhere. The redundant egress and rate-limiting issues underscored the importance of data locality in large-scale AI operations.
DGX SuperPOD or Nvidia "Super Cluster"
Every cloud wants to cut corners on NVIDIA's DGX SuperPOD Reference Architecture, but sometimes there are consequences to that approach.
Nearly all cloud providers tend to overlook the DGX SuperPOD Reference Architecture because of its higher cost. However, this architecture is specifically designed for large clusters of GPUs and provides critical features as well as Quality of Service guarantees:
- RDMA Paths for Both Storage and Compute: It ensures Remote Direct Memory Access (RDMA) paths for both storage fabric traffic (a distributed filesystem capable of utilizing the cuFile API) and NCCL traffic (RDMA paths for training data traffic in multi-GPU jobs).
- Infiniband or Spectrum-X Network Fabric: The architecture includes a network fabric that minimizes latency and maximizes bandwidth, essential for high-performance training and inference.
Ignoring the DGX-certified reference architecture, as almost every cloud does, can lead to suboptimal performance, network congestion, and difficulties scaling AI workloads efficiently. Organizations that attempt to cut costs by deviating from this architecture may face significant challenges as their operations grow.
Equity for Compute is the Root Cause of Low Utilization
The current investment cycles in large language models (LLMs) have led to a phenomenon we can call "equity for compute," which is a byproduct of two main factors:
- GPU Shortage: There's a global shortage of GPUs, making it difficult for companies to acquire the computational resources they need.
- Large Clouds Investing in Language Models: Major cloud providers are investing in LLM companies, offering GPU capacity in exchange for equity stakes.
LLM companies often have to accept GPU availability wherever they can find it, leading them to take capacity from multiple cloud providers. This results in a multi-cloud hodgepodge of systems without thorough consideration of the long-term consequences.
Multi-Cloud Complexities
Operating across multiple cloud providers introduces several complexities:
- Data Locality Issues: As mentioned earlier, moving data between clouds incurs costs and latency.
- Infrastructure Inconsistencies: Different providers may have varying hardware, network configurations, and support levels.
- Operational Overhead: Managing and optimizing resources across multiple environments increases complexity for engineering teams.
The End of "GPU Patrons"?
As LLM unicorns mature, the "Who's got the GPUs? I'll pay you in equity" model will become unsustainable. Companies will transition to a mode where revenue, rather than investor dollars, needs to sustain operations. This shift means that:
- GPU Capacity Procurement Changes: Companies will need to invest in centralized GPU infrastructure rather than relying on cloud providers offering capacity for equity or seek acquisition by those cloud providers.
- Focus on Efficiency: There will be a greater emphasis on optimizing resource utilization to reduce costs.
- Strategic Planning: Long-term infrastructure strategies will become crucial, prioritizing data locality and scalability.
LLM companies will need to rethink their infrastructure choices, consolidating resources to reduce data movement costs and improve performance.
Organizational Structure & GPU Allocation
Many organizations are structured with separate teams for "inference" and "training," but this division can lead to inefficiencies and barriers to problem solving. In a multi-cloud environment, GPU allocations are determined by scale requirements:
- Inference [Scaling Down]: Inference workloads are thought to scale down because they run on individual nodes, with availability-based front-end routing of prompts that does not require a high-bandwidth interconnect. LLM companies therefore allocate GPU cloud instances without a high-bandwidth interconnect to serving teams, who are managed separately (making their problem space separate) and who run their own infrastructure in a physically separate cloud.
- Training [Scaling Up]: Training workloads are thought to scale up, requiring massive computational power for extended periods. Similarly, training teams have their own infrastructure and management, which isolates problem-solving that could otherwise cut across teams.
With the ability to perform Just-In-Time-Inference (JITI) and Just-In-Time-Training (JITT) [which we will discuss in this article], along with an impending need to rely on revenue rather than investment rounds, organizations must reconsider this divide. The "rubber will need to meet the road" regarding revenues and costs for large enterprise LLM unicorns like Mistral, Cohere, Anthropic, xAI, and OpenAI.
Serving Efficiency & Training Efficiency are Two Sides of the Same Coin
Instead of viewing inference and training as separate entities, organizations should consider them as two sides of the same coin: a hybrid problem space. By allocating infrastructure exactly when and where it's needed, we can achieve:
- Dynamic Resource Allocation: Training teams that are constantly constrained by insufficient GPUs can access resources freed up from inference during low-demand periods.
- Optimized Utilization: Inference groups, which often have excess resources due to bursty API traffic, can scale down when demand is low, allowing those GPUs to be used for training.
- Cost Savings: This approach reduces the need to over-provision resources for peak demand, saving millions in operational costs.
By breaking down the barriers between training and inference teams, we enable more flexible and efficient use of GPU clusters.
Identifying a Bottleneck
Early in my tenure at Cohere, I noticed significant inefficiencies in our GPU cluster related to training and inference workloads. The existing method for loading data into the GPU utilized a Network File System (NFS) driver that would mount cloud-native storage buckets from a popular virtualized GPU cloud provider and download the files. This process was layered on top of the slow POSIX API, requiring data to pass through the CPU before reaching the GPU. Specifically, the CPU would read data from storage into system memory and then use cudaMemcpy() to transfer the data from system memory to GPU memory.
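For illustration, the hot path looked roughly like the sketch below. This is a minimal reconstruction of the bounce-buffer pattern described above, not Cohere's actual loader; the function name and buffer handling are made up for the example.

```cpp
// Illustrative sketch of the legacy load path: read from an NFS-mounted
// bucket into host RAM, then copy that buffer to the GPU with cudaMemcpy().
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

bool load_weights_via_bounce_buffer(const char* path, void* d_weights, size_t nbytes) {
    // 1. The CPU reads the serialized weights from storage into system memory.
    std::vector<char> host_buffer(nbytes);
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    size_t read = std::fread(host_buffer.data(), 1, nbytes, f);
    std::fclose(f);
    if (read != nbytes) return false;

    // 2. A second copy moves the data from system memory to GPU memory across
    //    PCIe -- the "bounce buffer" step this article is about removing.
    return cudaMemcpy(d_weights, host_buffer.data(), nbytes,
                      cudaMemcpyHostToDevice) == cudaSuccess;
}
```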
This method led to several issues:
- Long Data Loading Times: Any new GPU instance took approximately ten minutes to load the necessary data to auto-scale an inference job.
- Underutilized Resources: GPU servers remained idle during data loading but still incurred operational costs.
- Scaling Limitations: The lengthy data loading times hindered our ability to scale efficiently to meet burst traffic demands.
- Increased Costs: Keeping many instances running continuously to handle potential spikes resulted in substantial expenses due to underutilized resources.
Recognizing these challenges, I conducted throughput tests with various methods of moving data to the GPU to quantify the impact, evaluating storage costs associated with bucket storage and identifying potential savings through alternative approaches. This analysis highlighted the need for a more efficient solution to improve performance and reduce costs.
Proposing a Solution: GPUDirect Storage RDMA
Motivated to improve the system, I initiated the integration of GPUDirect Storage into our infrastructure, a project aimed at enhancing performance. GPUDirect Storage allows GPUs to access data directly from storage devices over Remote Direct Memory Access (RDMA) networks, bypassing the CPU and system memory, thereby reducing latency and improving throughput.
I developed a detailed technical plan which outlined how integrating GPUDirect Storage could reduce the time-to-first-token when initializing inference from ten minutes to twelve seconds, a 98% improvement. It would enable dynamic auto-scaling to handle API bursts efficiently and potentially reduce monthly GPU instance expenses by three-quarters. Additionally, by minimizing reliance on bucket storage, we could significantly decrease associated storage costs by keeping our source of truth in the high-performance GPUDirect Storage distributed filesystem, using cloud storage buckets only for disaster recovery of essential files.
After presenting the analysis, I obtained buy-in from senior leadership and began working on the implementation. This marked the beginning of a transformative project to enhance our AI infrastructure.
Implementing the GPUDirect Storage Project
To fully leverage GPUDirect Storage, I selected a high-performance GPUDirect Storage distributed filesystem, which pools together all the NVMe storage of nodes in a GPU cluster into a single, large storage pool. When a node joins the pool, it instantly gains access to all other files in the storage cluster at speeds of up to 132 GiB/second. I chose this solution because it was one of the few with commercial support that could seamlessly integrate with our existing cluster in a bare-metal cloud GPU system without requiring the cloud provider to purchase additional hardware.
Replacing the Old Method
By implementing the high-performance GPUDirect Storage distributed filesystem, we replaced both the NFS download step and the slow POSIX API layer. Instead of mounting cloud-native storage buckets via NFS and relying on the CPU to read and transfer data to the GPU, the RDMA capabilities allowed the GPUs to directly access data from the distributed NVMe storage pool. This eliminated the need for cudaMemcpy() operations and the associated CPU overhead.
Enhancing Inference Engine Support
To fully utilize GPUDirect Storage, I focused on modifying the inference engine software to support this technology. Specifically, I wrote patches for Cohere's custom TensorRT branch, the back-end library (libnvinfer) for NVIDIA's inference server (TensorRT-LLM & tritonserver, of which Cohere's version also needed to be modified), to enable GPUDirect Storage loading of tensor weights.
LLM Inference Engine File Structure
To better understand the deserialization process, it's helpful to look at the structure of an LLM inference engine file.
The engine file contains:
- Tensor Types and Metadata: Definitions of tensor shapes, data types, and layer configurations.
- Engine Configuration Parameters: Settings that dictate how the engine should execute on the hardware.
- Serialized Weights and Biases: The numerical values that the model uses for inference.
In the traditional process of loading data into GPU memory, the engine's tensors are first transferred through the CPU and system RAM before being copied into GPU memory. This involves a PCIe bounce buffer, where the data is read from storage into system RAM and then copied from RAM into GPU memory through the PCIe bus. The issue here is twofold: the PCIe bus introduces bandwidth limitations, and the involvement of the CPU and system memory creates additional overhead, resulting in slower data transfers. This method is not only time-consuming but also needlessly inefficient for large-scale inference tasks where large amounts of data (such as tensor weights) need to be transferred quickly. The use of cudaMemcpy() further compounds this bottleneck, as it necessitates an additional memory copy step from RAM to GPU memory.
Optimizing Tensor Deserialization Using the cuFile API
To address this inefficiency, we made critical improvements to the deserialization pathway within open-source TensorRT. We modified TensorRT's deserialization logic to read only the necessary components for CPU execution, such as configuration parameters and metadata, into CPU memory. For the tensor data, which forms the bulk of the engine file, we utilized NVIDIA's cuFileRead() function from the cuFile API. This allowed us to bypass the CPU and system RAM entirely, reading data directly from NVMe storage into GPU memory.
By eliminating the need for the PCIe bounce buffer and cudaMemcpy(), we significantly reduced data loading times, from ten minutes down to twelve seconds. This optimization not only improved performance but also freed up CPU resources, leading to improved overall system responsiveness.
This functionality is included in recent nightly builds of TensorRT & TensorRT-LLM.
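For readers who want to see the shape of the change, here is a minimal sketch of the direct read path using the cuFile API. It is not the actual TensorRT patch: error handling is trimmed, and the file is assumed to be a single contiguous tensor blob on the GDS-enabled filesystem.

```cpp
// Minimal sketch of the cuFile read path: the tensor blob is read straight
// from the GDS-enabled filesystem into GPU memory, with no host bounce buffer
// and no cudaMemcpy(). Link with -lcufile.
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

void* load_tensor_direct(const char* path, size_t nbytes) {
    cuFileDriverOpen();                          // initialize the GDS driver

    int fd = open(path, O_RDONLY | O_DIRECT);    // O_DIRECT is required for GDS
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    void* d_ptr = nullptr;
    cudaMalloc(&d_ptr, nbytes);
    cuFileBufRegister(d_ptr, nbytes, 0);         // register the GPU buffer with GDS

    // DMA from NVMe / the RDMA fabric directly into GPU memory.
    cuFileRead(handle, d_ptr, nbytes, /*file_offset=*/0, /*dev_offset=*/0);

    cuFileBufDeregister(d_ptr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return d_ptr;
}
```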
Overcoming Technical Challenges
Implementing GPUDirect Storage required overcoming several technical challenges. Cloud infrastructure dependencies necessitated collaboration with engineers from our cloud provider to ensure that cuFile dependencies were met. This involved transitioning the training cluster from NVIDIA's proprietary kernel module to the new Open Kernel Modules (the open NVIDIA driver), which was essential to unlock the Linux kernel's DMABUF API and allow the Mellanox OpenFabrics Enterprise Distribution (OFED) driver to work with nvidia-fs for GPUDirect Storage.
Because the filesystem was based on RDMA over Converged Ethernet (RoCEv2) using switches not fully compliant with NVIDIA's DGX SuperPOD reference architecture, we needed to alter a number of settings within the distributed filesystem. We reconfigured the training cluster so that the ConnectX-7 NICs operated within the same subnet. Additionally, network configuration challenges arose as configurations utilized by the cloud provider's daemon needed to be intercepted to configure /etc/cufile.json on each node, since the RDMA NIC IP addresses would change from node to node and sometimes between reboots.
These challenges required close collaboration with multiple teams to ensure successful implementation.
Limitations in Virtualized Clouds Running KVM
In contrast to the bare-metal cloud GPU system, virtualized GPU cloud providers face limitations when implementing GPUDirect Storage. Virtualized environments often utilize the free Kernel-based Virtual Machine (KVM) hypervisor, where GPUs are passed through into the virtual machine using Virtual Function I/O (VFIO). Network Interface Cards (NICs) are typically provided via software-mode NICs like gVNIC or VirtIO-NIC.
The use of Input-Output Memory Management Unit (IOMMU) and VFIO introduces challenges. IOMMU interference occurs because the IOMMU maps device-visible virtual addresses to physical addresses, interfering with the direct memory access required for GPUDirect Storage. VFIO limitations arise since it relies on the IOMMU for memory isolation, preventing the direct GPU-to-NIC communication necessary for GPUDirect Storage. Furthermore, there's a lack of RDMA support because GPUs and NICs are virtualized and passed through separately, making direct RDMA between them unfeasible in the current virtualization architecture.
As a result, virtualized GPU cloud providers currently cannot fully support GPUDirect Storage for distributed filesystems. While some are working on solutions like GPUDirect RDMA over TCP, these efforts primarily focus on supporting NCCL uses of GPUDirect RDMA, which differ from GPUDirect Storage.
Understanding VFIO and IOMMU Limitations
When devices are passed through to virtual machines using VFIO, the IOMMU plays a critical role in mapping device addresses to physical memory addresses, ensuring isolation between devices and VMs. However, this setup introduces limitations. Device isolation prevents direct memory access between GPUs and NICs required for GPUDirect Storage. Interrupt handling is affected because VFIO relies on techniques incompatible with direct GPU-to-NIC communication paths needed for GPUDirect Storage. Additionally, software emulation of NICs lacks the necessary hardware support for RDMA.
Proposing a New IOMMU Pathway
To address these limitations, I propose exploring a new IOMMU pathway utilizing NVLink-C2C (Chip-to-Chip). NVLink-C2C allows high-speed, coherent communication between GPUs and other devices, such as NICs, even in virtualized environments.
By designing a secure interconnect that leverages NVLink-C2C, we could enable direct DPU-to-GPU RDMA in KVM virtualized contexts, allowing cloud providers that require virtualization to offer distributed filesystems with GPUDirect Storage capabilities without compromising security. Key considerations include security controls to enable or disable GPU-to-DPU or GPU-to-NIC RDMA as needed, preventing side-channel leaks for high-assurance workloads, and flexible configuration where administrators can enforce policies that route all I/O through the CPU and MMU controller when necessary.
Implementing such a pathway would require collaboration between hardware vendors, cloud providers, cloud users, and the open-source community to develop and adopt firmware enabling GPUDirect Storage in virtualized environments.
At Scale, Attention Is Not All You Need. You Need QoS Too.
As AI models continue to grow, the challenges of training these models at scale become more pronounced. One critical area impacting large language model (LLM) training is the traversal of NCCL traffic over RDMA NICs and network switches.
LLM Compute and Communication Profiling
In large-scale LLM compute environments (for training and finetuning), machine learning engineers often implement software that is suboptimal in its timing of cache data movement or instruction queueing; as a result, workloads end up bound either by memory or by compute. Consequently, the most sophisticated GPU programmers write their software with the help of profiling tools like NVIDIA Nsight Systems, which give developers of training and finetuning software fine-grained, cycle-by-cycle, memory-level measurements of how in-kernel operations are performing.
While Nsight Systems provides a great deal of value to the machine learning engineer (MLE), including monitoring of metrics like GPU Frequency, Memory Bandwidth Utilization, Memory Allocation Latency, GPU Compute Utilization, Time Spent Idling, NvLink/NvSwitch utilization, and much more, it has historically left MLEs partially blind to one of the most impactful metrics for diagnosing issues with compute clusters at scale: Quality of Service in the layer that carries inter-GPU memory traffic out of the node to other nodes in the cluster (i.e., InfiniBand/RoCEv2, not NVLink).
Since the time of writing, NVIDIA has incorporated a plugin system into Nsight Systems that bridges this gap. It now provides profiling of both the local DMA (Direct Memory Access) path for memory movement between GPUs (XBAR/NVSwitch) and the external RDMA (Remote Direct Memory Access) path between nodes (InfiniBand/RDMA over Converged Ethernet), but only on switches with profiling plugins (as of writing, NCCL profiling in the RDMA fabric is supported on Spectrum-X and Quantum brand switches from Mellanox, now owned by NVIDIA).
Challenges with NCCL Traffic over RDMA Networks
NCCL facilitates communication between GPUs in distributed training environments, leveraging RDMA networks for high-throughput, low-latency data transfers. However, as training scales up, network infrastructure can become a bottleneck.
Issues with Non-DGX Reference Architecture Switches
During large-scale training sessions, certain network switches not compliant with NVIDIA's DGX SuperPOD Reference Architecture exhibited issues. Packet loss and congestion occurred as switches struggled with the high volume of RDMA traffic. This led to inconsistent performance, with training jobs experiencing variable runtimes and failures due to communication timeouts.
In contrast, switches compliant with the DGX Reference Architecture did not exhibit these issues under similar loads, highlighting the importance of using appropriate infrastructure.
Importance of Quality of Service (QoS)
The key differentiator was the implementation of Quality of Service (QoS) features for RDMA packet delivery. QoS for RDMA traffic ensures that RDMA packets are prioritized and delivered reliably, crucial for synchronization in distributed training. Switch buffering and scheduling are enhanced in advanced switches, offering better buffering and packet scheduling to handle bursty traffic patterns common in LLM training.
Scaling Challenges at High GPU Counts
While network switches may perform adequately at smaller scales, they may fail at the higher scales required for top-tier LLM training. Inter-node communication grows steeply with GPU count, putting immense pressure on the network fabric. DGX reference architecture compliance (or something very close to it, a concept which can be stretched and nearly always is) becomes critical to ensure components are optimized for high-scale operations.
Lessons Learned
Several important lessons emerge from these observations. Infrastructure matters; investing in network infrastructure compliant with proven reference architectures is essential. QoS is crucial; proper QoS settings for RDMA traffic significantly impact reliability and performance. Testing at scale is necessary; success at lower scales doesn't guarantee success at higher scales.
Understanding and addressing these challenges prepares organizations to handle the demands of large-scale LLM training, ensuring efficient and reliable operations.
What's Next? PAoF: Paged Attention over Fabric
As AI models grow in size and complexity, efficiently serving them across multiple nodes becomes increasingly challenging. One area of focus is optimizing multi-node serving with Paged Attention over Fabric (PAoF).
The Challenge of KV Cache in Multi-Node Systems
In single-node serving systems, a Key-Value (KV) cache improves efficiency by storing intermediate computations from the model's attention mechanisms. However, in multi-node systems, cache locality issues arise because load balancers distribute requests across nodes, making it difficult to preserve the KV cache tied to a specific session. Inefficient regeneration occurs as each node must regenerate the KV cache for every request, increasing computation time and latency.
Paged Attention
The paged attention algorithm handles attention computation in fixed-size memory pages, reducing memory overhead and improving performance. By segmenting attention computations into pages, the algorithm limits the amount of data loaded into memory at any given time, optimizing resource utilization.
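To make the idea concrete, here is a conceptual sketch of the bookkeeping paged attention implies. It is my own simplification rather than vLLM's actual data structures: each sequence keeps a small page table mapping logical token positions to fixed-size physical pages drawn from a shared pool, so the cache grows in page-sized increments instead of one large contiguous allocation.

```cpp
// Conceptual sketch of paged KV-cache bookkeeping (a simplification, not a
// production implementation). Each sequence owns a page table mapping logical
// token positions to fixed-size physical pages in a shared pool.
#include <cstddef>
#include <utility>
#include <vector>

constexpr size_t kPageTokens = 16;      // tokens per KV page (illustrative)

struct KvPagePool {
    std::vector<int> free_pages;        // indices of unused physical pages
    int allocate() {                    // assumes the pool is not exhausted
        int p = free_pages.back();
        free_pages.pop_back();
        return p;
    }
    void release(int p) { free_pages.push_back(p); }
};

struct SequenceCache {
    std::vector<int> page_table;        // logical page -> physical page

    // Ensure page_table covers token positions [0, n_tokens).
    void grow_to(size_t n_tokens, KvPagePool& pool) {
        size_t pages_needed = (n_tokens + kPageTokens - 1) / kPageTokens;
        while (page_table.size() < pages_needed)
            page_table.push_back(pool.allocate());
    }

    // Translate a token position into (physical page, slot within page).
    std::pair<int, size_t> locate(size_t token_pos) const {
        return { page_table[token_pos / kPageTokens], token_pos % kPageTokens };
    }
};
```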
PAoF: Sharing KV Cache Over Fabric
To overcome multi-node challenges, we propose Paged Attention over Fabric (PAoF). By utilizing a shared, high-performance distributed filesystem (such as Weka, Vast, DDN Exascaler, Lustre, or NetApp), we can store the KV cache in a location accessible by all nodes in the cluster. Each KV cache is tagged with a unique conversation ID, allowing any node to retrieve the appropriate cache for a given session. This approach reduces latency by minimizing the need to regenerate the KV cache.
Integrating KV cache over GPUDirect Storage fabric offers several advantages. Reduced latency is achieved through direct memory access between GPUs and storage, allowing faster access to KV caches stored on the distributed filesystem. Improved scalability results from efficient cache sharing, enhancing multi-node performance. Enhanced throughput is possible due to decreased overhead in regenerating KV caches, leading to higher request processing capacity.
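A hypothetical sketch of the retrieval side follows: the KV cache for each conversation is stored as a file on the shared GPUDirect Storage filesystem, named by its conversation ID, so whichever node receives the next request can pull it straight into GPU memory. The mount point and naming scheme are made up for illustration, and the helper reuses the cuFile loader sketched earlier.

```cpp
// Hypothetical sketch of PAoF cache retrieval (not a shipped implementation).
// Each conversation's KV cache is a file on the shared GDS filesystem, named
// by conversation ID, so any node can read it directly into GPU memory.
#include <string>
#include <sys/stat.h>

void* load_tensor_direct(const char* path, size_t nbytes);   // from the cuFile sketch above

// Returns a device pointer to the restored KV cache, or nullptr on a cache miss.
void* fetch_kv_cache(const std::string& conversation_id, size_t& nbytes_out) {
    const std::string path = "/mnt/gds/kv-cache/" + conversation_id + ".kv";

    struct stat st{};
    if (stat(path.c_str(), &st) != 0) return nullptr;         // miss: prefill from scratch
    nbytes_out = static_cast<size_t>(st.st_size);

    return load_tensor_direct(path.c_str(), nbytes_out);      // direct NVMe/RDMA -> GPU
}
```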
Performance Benefits
Without a shared KV cache, every request that lands on a fresh node pays the full prefill cost to regenerate the cache before producing any tokens; with PAoF, that term is replaced by a fetch over the fabric. A rough cost model (my notation, illustrative rather than measured) is:
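$$T_{\text{query}}^{\text{no cache}} \approx T_{\text{prefill}}(n_{\text{ctx}}) + T_{\text{decode}}(n_{\text{out}}), \qquad T_{\text{query}}^{\text{PAoF}} \approx \frac{S_{\text{KV}}}{B_{\text{fabric}}} + T_{\text{prefill}}(n_{\text{new}}) + T_{\text{decode}}(n_{\text{out}})$$

where $n_{\text{ctx}}$ is the full conversation context, $n_{\text{new}}$ is only the newly arrived tokens, $S_{\text{KV}}$ is the size of the stored KV cache, and $B_{\text{fabric}}$ is the effective GPUDirect Storage read bandwidth. Whenever $S_{\text{KV}}/B_{\text{fabric}}$ is much smaller than $T_{\text{prefill}}(n_{\text{ctx}})$, which it typically is at the fabric speeds discussed above, the shared cache wins.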
Implementation
To realize PAoF, the following steps are essential. Integrate with inference servers by modifying servers like vLLM or TensorRT-LLM to support shared KV caches. Develop an efficient mechanism for tagging and accessing KV caches based on conversation IDs. Optimize the filesystem to ensure the distributed filesystem is configured for low latency and high throughput, leveraging technologies like RDMA and GPUDirect Storage.
Impact
The integration of PAoF represents a significant step forward in AI infrastructure. As models and datasets continue to grow, the ability to efficiently scale and serve AI applications across multiple nodes becomes increasingly critical, especially as the number of large businesses serving LLM APIs grows. PAoF enables enhanced user experience by providing faster, more responsive AI services, and optimized resource utilization by making better use of existing hardware, reducing costs.
Rethinking Assumptions About InfiniBand
Traditionally, high-bandwidth interconnects like InfiniBand have been considered valuable primarily for training large models. However, there's significant value in leveraging InfiniBand and RDMA technologies for inference workloads as well.
The Divide Between Inference and Training Teams
Companies are typically divided into two groups. Model efficiency/model serving teams focus on deploying and optimizing models for inference tasks and often receive nodes without high-bandwidth interconnects because such nodes are cheaper. Foundations/training teams concentrate on training large models and have access to nodes equipped with InfiniBand or RDMA-over-Ethernet to facilitate high-speed communication.
This divide creates barriers to problem-solving and can lead to suboptimal resource utilization. Foundations/training groups are constantly choked by having too few GPUs, while model efficiency/model serving teams have too many GPUs, leading to an imbalance and wasted resources. Hundreds or even thousands of GPUs can be sitting idle at any given time due to spiky traffic patterns on large LLM APIs. Industry leaders need to analyze this divide and consider it a byproduct of organizational structures that constrain problem-solving within their respective domains.
Reevaluating Infrastructure Allocation
By integrating high-bandwidth interconnects into inference clusters, companies can achieve several benefits. Just-In-Time-Inference (JITI) allows rapid auto-scaling of inference instances using high-bandwidth interconnects, enabling models to be loaded and served just in time to meet demand. Merged hardware resources combine the hardware used for inference/model serving and foundations/training, maximizing utilization and flexibility. This approach can result in cost savings, reducing idle resources and improving efficiency.
Foundations/training groups can benefit from access to additional GPUs when inference workloads are low, alleviating the constant shortage of computational resources. By breaking down the barriers between training and inference teams, organizations can ensure that GPUs are allocated where they are most needed, enhancing overall productivity.
Save Millions on Foundations Training and Inference with Near-Perfect Utilization
To further bridge the gap between training and inference workloads, we can leverage technologies like CUDA Checkpoint and GPUDirect Storage to achieve near-perfect utilization of GPU resources.
Just-In-Time-Training (JITT)
Building on the concept of JITI, Just-In-Time-Training (JITT) is the other side of the utilization coin. By allowing scheduled Kubernetes jobs for training to be paused when the cluster is fully saturated, resources can be reallocated for inference tasks during peak demand. Once the inference demand subsides, training can resume from where it left off.
Utilizing CUDA Checkpoint and GPUDirect Storage
CUDA Checkpoint is a utility that enables transparent checkpointing and restoring of CUDA state within a running Linux process. When combined with GPUDirect Storage, it allows for efficient saving and restoring of GPU memory to and from storage using the same cuFile APIs utilized for JITI.
By capturing all GPU memory and saving it to a file, training jobs can be paused and resumed without loss of progress. This flexibility allows for dynamic allocation of resources between training and inference workloads based on real-time demand.
How CUDA Checkpoint Works
The cuda-checkpoint utility exposes checkpoint and restore functionality for CUDA through a command-line interface. It can toggle the state of CUDA within a process between suspended and running; a minimal sketch of driving this toggle from a scheduler follows the two lists below.
When a process is suspended:
- CUDA driver APIs that launch work or manage resources are locked.
- Already-submitted CUDA work is completed.
- Device memory is copied to the host into allocations managed by the CUDA driver.
- All CUDA GPU resources are released.
When the process is resumed:
- GPUs are re-acquired by the process.
- Device memory is copied back to the GPU, and memory mappings are restored.
- CUDA objects like streams and contexts are restored.
- CUDA driver APIs are unlocked.
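To make the JITT flow concrete, here is a minimal sketch of driving that suspend/resume toggle from a scheduler process. It assumes NVIDIA's cuda-checkpoint binary is on the PATH and accepts a `--toggle --pid <pid>` invocation; treat the exact flags as an assumption and confirm them against the utility shipped with your driver version.

```cpp
// Minimal sketch of driving the suspend/resume cycle from a scheduler process.
// Assumes the cuda-checkpoint utility is invoked as
// `cuda-checkpoint --toggle --pid <pid>`; the exact flags are an assumption,
// so check the utility's help output on your driver version.
#include <cstdlib>
#include <string>
#include <sys/types.h>

// Toggle the CUDA state of a running training process between suspended and
// running. Returns true if the utility exited cleanly.
bool toggle_cuda_state(pid_t training_pid) {
    const std::string cmd =
        "cuda-checkpoint --toggle --pid " + std::to_string(training_pid);
    return std::system(cmd.c_str()) == 0;
}

// JITT in one sentence: suspend training when inference traffic spikes, then
// call the same toggle again to resume it once the spike subsides.
```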
Implementing JITT and JITI Together
By combining JITT and JITI, organizations can create a highly responsive and efficient GPU cluster. Training jobs can be paused and resumed as needed, while inference workloads can scale rapidly to meet demand without overprovisioning resources.
Key steps include:
- Integrate CUDA Checkpoint into training workflows.
- Leverage GPUDirect Storage for rapid loading and saving of models.
- Implement Kubernetes Scheduling Policies to manage resource allocation dynamically.
- Monitor Workload Demands to adjust training and inference priorities in real-time.
This holistic approach ensures that GPUs are utilized effectively, reducing idle time and maximizing return on investment.
Next Steps for Cluster Utilization
To capitalize on GPUDirect Storage, RDMA technologies, and CUDA Checkpoint, it's essential to update inference and training software. Users should run the latest builds of TensorRT and TensorRT-LLM, and integrate CUDA Checkpoint into their training pipelines.
Investing in high-bandwidth interconnects is crucial; equipping clusters with InfiniBand or RDMA-over-Ethernet allows leveraging these technologies. Optimizing infrastructure configurations ensures network settings are aligned for RDMA and GPUDirect Storage. Collaborating across teams by encouraging communication between training and inference groups helps share resources and best practices.
Reflections and Future Challenges
By integrating GPUDirect Storage, leveraging RDMA technologies, and implementing CUDA Checkpoint, we achieved significant reductions in data loading times and improved system efficiency. We reduced waiting times by 98%, improved resource utilization, and enabled more dynamic scaling of both training and inference workloads.
However, challenges remain. Infrastructure complexity arises as managing and configuring RDMA networks and checkpointing mechanisms requires specialized knowledge and can introduce complexity. Software compatibility is an ongoing effort, as keeping software up-to-date and compatible with the latest technologies is essential. Scalability issues persist; as models and datasets continue to grow, further optimizations and infrastructure investments will be necessary.
Solving these challenges will require continued innovation and collaboration across hardware and software domains. By breaking down organizational silos and viewing training and inference as two sides of the same coin, we can unlock new efficiencies and drive the field forward.
A Note of Appreciation
I want to express my sincere appreciation for the incredible innovations happening at Cohere. The team's dedication to pushing the boundaries of language AI is truly inspiring. I'm grateful for the opportunity I had to contribute and collaborate with such talented individuals, and I look forward to seeing their continued success in groundbreaking work.
Final Thoughts
As I move forward, I'm excited about the opportunities to further explore cutting-edge technologies and make meaningful contributions to the field. I look forward to applying my skills and passion for innovation to help drive advancements in GPU-accelerated computing, high-performance storage solutions, and scalable AI infrastructure.
Please reach out to me if you'd like to work with a specialist in AI, large training and inference clusters, GPUs, and IO virtualization.
You can reach me via email at arthur@vgpu.io