Hello! I'm Arthur Rasmusson, a specialist in AI, large training and inference clusters, GPUs, and I/O virtualization. I've contributed code and libraries to open-source projects such as NVIDIA TensorRT-LLM and LMCache, which integrates with vLLM-based inference frameworks. I also started Open-IOV.org, a community for open- and closed-source GPU virtualization, where I have contributed open-source code and documentation. In 2023, I joined Cohere's Model Efficiency team as a Machine Learning Engineer, where I worked on projects to improve GPU cluster and AI inference software. In 2024, I joined Weka's CTO Office as Principal AI Engineer, where I worked on improving the efficiency of open-source inference servers running on NeuralMesh by Weka.
I'm passionate about pushing the boundaries of GPU performance and AI inference. Feel free to contact me if you're interested in collaborating or learning more about my work.
Getting Started With DGX Spark
Lessons Learned Scaling LLM Training and Inference with Direct Memory Access (DMA) Part 1
NVIDIA TensorRT-LLM Pull/3209 "feature: KV Cache GPUDirect Storage" (merged)
A modified version of my original Python-Native-libCuFile code is used in cufile-python, a dependency of LMCache pull/699, which added the GPUDirect Storage backend to LMCache for the vLLM ecosystem (see the sketch after this list).
CLSAC 2025 White Paper: The Future of Exabyte Scale Inference
CLSAC 2025 Slides: The Future of Exabyte Scale Inference
src/plugins/cuda_gds and src/plugins/gds_mt.
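To make the GPUDirect Storage work above concrete, here is a minimal sketch of GPU-direct file I/O from Python. It uses kvikio (NVIDIA's Python wrapper around libcufile) and CuPy as stand-ins, since the exact cufile-python API isn't reproduced here; the file path and page size are hypothetical.

```python
# Minimal sketch of GPUDirect Storage (GDS) I/O from Python, the pattern the
# cuFile-based contributions above build on. kvikio and CuPy are stand-ins;
# the cufile-python API differs. The path and sizes below are hypothetical.
import cupy as cp
import kvikio

KV_CACHE_FILE = "/mnt/nvme/kv_cache_layer0.bin"  # hypothetical offload target

# Allocate a buffer in GPU memory to hold one KV-cache page.
page = cp.empty(4 << 20, dtype=cp.uint8)  # 4 MiB page

# Write the GPU buffer straight to storage: with GDS, data moves between
# NVMe and GPU memory over DMA without a host (CPU) bounce buffer.
with kvikio.CuFile(KV_CACHE_FILE, "w") as f:
    n = f.write(page)  # returns the number of bytes written
    assert n == page.nbytes

# Restore the page directly into GPU memory later (e.g., on KV-cache reuse).
with kvikio.CuFile(KV_CACHE_FILE, "r") as f:
    f.read(page)
```

When GDS is unavailable, kvikio falls back to a compatibility mode that stages through host memory, so the same code still runs (more slowly) on machines without GPUDirect Storage.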
I’m grateful to the organizers and attendees for the chance to share my work with the community. I’m happy to deliver expanded sessions (public or private) in the future.
TFEITool: IO Hierarchy and Binary Cache Analysis Tool (created for my CLSAC presentation) — a standalone Python program that turns a real or synthetic inference run into a self-contained, presentation-ready PowerPoint deck. It explains how a text prompt flows through an OpenAI-compatible inference server (e.g., TensorRT-LLM), visualizes key/value (KV) cache pages and binary layouts, and renders timing diagrams for GPU kernels and I/O paths (including GPUDirect Storage and offload/restore events). The tool powered my CLSAC 2025 talk and let me arrive with reproducible, data-backed visuals; a sketch of the deck-generation step follows.
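TFEITool itself isn't reproduced here, but the kind of deck generation it automates can be sketched with python-pptx. This is an illustrative assumption about the mechanics, not the tool's actual code; the event names and timings below are hypothetical.

```python
# Minimal sketch of turning measured (or synthetic) inference timings into a
# PowerPoint slide with python-pptx. Event names and numbers are hypothetical.
from pptx import Presentation
from pptx.util import Inches, Pt

# Hypothetical per-phase timings from one inference run, in milliseconds.
timings_ms = {
    "tokenize prompt": 0.4,
    "KV-cache restore (GDS read)": 3.1,
    "prefill kernels": 12.7,
    "decode kernels (per token)": 8.9,
    "KV-cache offload (GDS write)": 2.6,
}

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[5])  # title-only layout
slide.shapes.title.text = "Inference Timing Breakdown"

# Render the timings as a simple text box; the real tool draws timing
# diagrams and binary-layout figures instead.
box = slide.shapes.add_textbox(Inches(1), Inches(2), Inches(8), Inches(4))
frame = box.text_frame
for name, ms in timings_ms.items():
    para = frame.add_paragraph()
    para.text = f"{name}: {ms:.1f} ms"
    para.font.size = Pt(18)

prs.save("inference_run_report.pptx")
```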
GTC 2025 Talk – "A Blueprint for Supercharging LLM Inference with PagedAttention over RDMA" (presentation on accelerating large-model inference using RDMA networking).
Open-IOV Community Calls – Regular community call series on open GPU virtualization (collaborative discussions and knowledge sharing).
World Summit AI Talk – In-Person Session at World Summit AI USA 2025 on boosting LLM inference throughput and reducing GPU bottlenecks.
GPU Driver Internals – Documentation of GPU driver internals as they relate to virtualization.
OpenRM – Analysis of NVIDIA’s open-sourced GPU Resource Manager API and RM Core.
GPU Firmware – Documentation of GPU embedded firmware and virtualization support.
LIME Is Mediated Emulation – LibVF.IO feature that runs Windows applications via GPU virtualization.