Arthur Rasmusson

Photo of Arthur Rasmusson

Hello! I'm Arthur Rasmusson, a specialist in AI, large training and inference clusters, GPUs, and IO virtualization. I've contributed code and libraries to open-source projects such as NVIDIA TensorRT-LLM and LMCache, the latter of which integrates with vLLM-based inference frameworks. I also started Open-IOV.org, a community for open- and closed-source GPU virtualization, where I have contributed open-source code and documentation. In 2023, I joined Cohere's Model Efficiency team as a Machine Learning Engineer, where I worked on projects to improve GPU cluster and AI inference software capabilities. In 2024, I joined Weka's CTO Office as Principal AI Engineer, where I worked on improving the efficiency of open-source inference servers running on top of NeuralMesh by Weka.

I'm passionate about pushing the boundaries of GPU performance and AI inference. Feel free to contact me if you're interested in collaborating or learning more about my work.

Blog Posts

Getting Started With DGX Spark

Lessons Learned Scaling LLM Training and Inference with Direct Memory Access (DMA) Part 1

Open Source

NVIDIA TensorRT-LLM Pull/3209 "feature: KV Cache GPUDirect Storage" (merged)

Python-Native-libCuFile

A modified version of my original Python-Native-libCuFile code is used in cufile-python, a dependency of LMCache pull/699, which added the GPUDirect Storage backend to LMCache for the vLLM ecosystem.
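For those curious what such a binding does under the hood, here is a minimal sketch (assuming a Linux host with GPUDirect Storage installed) of calling NVIDIA's cuFile C API from Python via ctypes to write a GPU buffer straight to an O_DIRECT file. The function names and struct layouts follow the public cufile.h, but the library path, buffer size, and error handling are simplified placeholders; this is not the packaged cufile-python code used by LMCache.

    import ctypes
    import os

    import cupy as cp  # used only to allocate a CUDA device buffer

    # Illustrative sketch: struct layouts mirror the public cufile.h.
    # The soname may differ per install (e.g. libcufile.so vs libcufile.so.0).
    libcufile = ctypes.CDLL("libcufile.so.0")

    CU_FILE_HANDLE_TYPE_OPAQUE_FD = 1  # from cufile.h

    class CUfileError(ctypes.Structure):          # mirrors CUfileError_t
        _fields_ = [("err", ctypes.c_int), ("cu_err", ctypes.c_int)]

    class _CUfileHandleUnion(ctypes.Union):
        _fields_ = [("fd", ctypes.c_int), ("handle", ctypes.c_void_p)]

    class CUfileDescr(ctypes.Structure):          # mirrors CUfileDescr_t
        _fields_ = [("type", ctypes.c_int),
                    ("handle", _CUfileHandleUnion),
                    ("fs_ops", ctypes.c_void_p)]

    libcufile.cuFileDriverOpen.restype = CUfileError
    libcufile.cuFileHandleRegister.restype = CUfileError
    libcufile.cuFileHandleRegister.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                               ctypes.POINTER(CUfileDescr)]
    libcufile.cuFileWrite.restype = ctypes.c_ssize_t
    libcufile.cuFileWrite.argtypes = [ctypes.c_void_p, ctypes.c_void_p,
                                      ctypes.c_size_t,
                                      ctypes.c_long, ctypes.c_long]  # off_t on Linux x86_64

    assert libcufile.cuFileDriverOpen().err == 0  # 0 == CU_FILE_SUCCESS

    # A 4 KiB GPU buffer standing in for one KV cache page.
    page = cp.zeros(4096, dtype=cp.uint8)

    # cuFile expects an O_DIRECT file descriptor.
    fd = os.open("kv_page.bin", os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o644)
    descr = CUfileDescr(type=CU_FILE_HANDLE_TYPE_OPAQUE_FD)
    descr.handle.fd = fd

    fh = ctypes.c_void_p()
    assert libcufile.cuFileHandleRegister(ctypes.byref(fh),
                                          ctypes.byref(descr)).err == 0

    # DMA the device buffer to storage with no bounce through host memory.
    written = libcufile.cuFileWrite(fh, ctypes.c_void_p(page.data.ptr),
                                    page.nbytes, 0, 0)
    assert written == page.nbytes

    libcufile.cuFileHandleDeregister(fh)
    os.close(fd)

A production binding would also use cuFileBufRegister, translate errors instead of asserting, and cover the read path; this sketch shows only a single write.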

Selected Work

CLSAC 2025 White Paper: The Future of Exabyte Scale Inference

CLSAC 2025 Slides: The Future of Exabyte Scale Inference


Abstract: Presented at the 2025 Chesapeake Large Scale Analytics Conference (CLSAC) under the theme “Grand Challenges Requiring Grand Solutions,” this talk walked through the lineage from early vector-space and bag-of-words models (Salton et al., 1975; Salton & Buckley, 1988) to word embeddings (Mikolov et al., 2013), the Transformer (Vaswani et al., 2017), and the now-standard practice of reusing decoder key/value tensors for fast autoregressive decoding (Ott et al., 2019, fairseq; Shazeer, 2019, MQA; Yan et al., 2021, EL-Attention). We then reviewed PagedAttention for serving (Kwon et al., 2023) and my open-source work on PagedAttention over RDMA (PAoR), implemented in the TensorRT-LLM Batch Manager using NVIDIA's public cuFile C++ API (TensorRT-LLM PR #3209; Archive) and in arthurrasmusson/Python-Native-libCuFile, a packaged fork of which is used in LMCache to interface CUDA Python with NVIDIA's cuFile userspace library for KV Cache GPUDirect Storage (LMCache GDS backend). LMCache release notes highlight the storage backend added in v0.3.1. We covered the code-level changes that followed the TensorRT-LLM source contributions: NVIDIA added NIXL support for GDS in commit ecc0e687 (“feat: NIXL support for GDS”), moving KV Cache GPUDirect Storage, previously implemented directly in the inference server, into NVIDIA's Inference Xfer Library (NIXL) in their AI-Dynamo GitHub organization. We also covered how inference server architecture changes now that KV Cache GPUDirect Storage lives outside the inference server, in NIXL's source files at src/plugins/cuda_gds and src/plugins/gds_mt.
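To make the end state concrete, here is a toy sketch of a paged KV cache whose cold pages are offloaded to a backing file and restored on demand. Plain buffered file I/O stands in for the GPUDirect Storage path discussed above, and the page size, tensor shapes, and the PagedKVCache class itself are illustrative assumptions, not TensorRT-LLM, NIXL, or LMCache internals.

    import torch

    PAGE_TOKENS = 16          # tokens per KV page (assumed)
    HEADS, HEAD_DIM = 8, 64   # per-layer attention geometry (assumed)

    class PagedKVCache:
        def __init__(self, path):
            self.pages = {}        # page_id -> resident page tensor
            self.offloaded = {}    # page_id -> byte offset in the backing file
            self._file = open(path, "wb+")

        def _new_page(self):
            # One page holds K and V for PAGE_TOKENS tokens.
            return torch.zeros(2, PAGE_TOKENS, HEADS, HEAD_DIM)

        def append(self, page_id, slot, k, v):
            page = self.pages.setdefault(page_id, self._new_page())
            page[0, slot] = k
            page[1, slot] = v

        def offload(self, page_id):
            # Evict a cold page to storage; with GDS this write would go
            # device -> NVMe without a host bounce buffer.
            page = self.pages.pop(page_id)
            self._file.seek(0, 2)                      # append at end of file
            self.offloaded[page_id] = self._file.tell()
            self._file.write(page.numpy().tobytes())
            self._file.flush()

        def restore(self, page_id):
            # Bring an offloaded page back into memory.
            offset = self.offloaded.pop(page_id)
            nbytes = 2 * PAGE_TOKENS * HEADS * HEAD_DIM * 4  # float32 bytes
            self._file.seek(offset)
            raw = bytearray(self._file.read(nbytes))
            page = torch.frombuffer(raw, dtype=torch.float32).view(
                2, PAGE_TOKENS, HEADS, HEAD_DIM)
            self.pages[page_id] = page
            return page

    cache = PagedKVCache("kv_pages.bin")
    k0, v0 = torch.randn(HEADS, HEAD_DIM), torch.randn(HEADS, HEAD_DIM)
    cache.append(page_id=0, slot=0, k=k0, v=v0)
    cache.offload(0)
    restored = cache.restore(0)
    assert torch.allclose(restored[0, 0], k0)

Moving this offload/restore responsibility out of the inference server and into a transfer library like NIXL is exactly the architectural shift the talk covered.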

I’m grateful to the organizers and attendees for the chance to share my work with the community. I’m happy to deliver expanded sessions (public or private) in the future.

TFEITool – IO Hierarchy and Binary Cache Analysis Tool (created for the CLSAC presentation): a standalone Python program that turns a real or synthetic inference run into a self-contained, presentation-ready PowerPoint deck. It explains how a text prompt flows through an OpenAI-compatible inference server (e.g., TensorRT-LLM), visualizes key/value (KV) cache pages and binary layouts, and renders timing diagrams for GPU kernels and I/O paths (including GPUDirect Storage and offload/restore events). The tool powered my CLSAC 2025 talk and helped me arrive with reproducible, data-backed visuals.
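To give a flavor of the deck generation, here is a minimal sketch that renders a handful of timing events as bars on a slide with python-pptx. The event list, layout choice, and one-inch-per-millisecond scaling are made-up placeholders for illustration; they are not TFEITool's actual data model or output format.

    from pptx import Presentation
    from pptx.enum.shapes import MSO_SHAPE
    from pptx.util import Inches, Pt

    # Hypothetical timing events: (label, start_ms, duration_ms).
    events = [
        ("prefill attention kernel", 0.0, 4.2),
        ("KV page offload (GDS write)", 4.2, 1.1),
        ("decode step 1", 5.3, 0.9),
        ("KV page restore (GDS read)", 6.2, 0.7),
    ]

    prs = Presentation()
    slide = prs.slides.add_slide(prs.slide_layouts[5])   # "Title Only" layout
    slide.shapes.title.text = "Inference timeline: GPU kernels and I/O"

    # Render each event as a proportional bar, one row per event.
    scale = Inches(1.0)   # one inch per millisecond, purely for readability
    for row, (label, start, duration) in enumerate(events):
        bar = slide.shapes.add_shape(
            MSO_SHAPE.RECTANGLE,
            Inches(1) + int(scale * start),    # left edge encodes start time
            Inches(1.5 + 0.6 * row),           # one row per event
            int(scale * duration),             # width encodes duration
            Inches(0.4),
        )
        bar.text_frame.text = label
        bar.text_frame.paragraphs[0].font.size = Pt(10)

    prs.save("timeline.pptx")

The real tool builds its events from an actual or synthetic inference run and adds the KV cache page and binary-layout views alongside the timeline.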

GTC 2025 Talk – "A Blueprint for Supercharging LLM Inference with PagedAttention over RDMA" (presentation on accelerating large-model inference using RDMA networking).

Open-IOV Community Calls – Regular community call series on open GPU virtualization (collaborative discussions and knowledge sharing).

World Summit AI Talk – In-Person Session at World Summit AI USA 2025 on boosting LLM inference throughput and reducing GPU bottlenecks.

Key Open-IOV Technical Articles: