Arthur Rasmusson

[Photo of Arthur Rasmusson]

Hello! I'm Arthur Rasmusson, a specialist in AI, large training and inference clusters, GPUs, and I/O virtualization. I've contributed to open-source GPU virtualization projects at LibVF.IO and Open-IOV.org. In 2023, I joined Cohere's Model Efficiency team as a Machine Learning Engineer, where I worked on GPU cluster and AI inference software. In 2024, I joined Weka's CTO Office as Principal AI Engineer, where I worked on improving the efficiency of open-source inference servers running on top of NeuralMesh by Weka.

I'm passionate about pushing the boundaries of GPU performance and AI inference. Feel free to contact me if you're interested in collaborating or learning more about my work.

Blog Posts

Paged Attention over RDMA (PAoR)

Lessons Learned Scaling LLM Training and Inference with Direct Memory Access (DMA)

Open Source

NVIDIA TensorRT-LLM Pull/3209 "feature: KV Cache GPUDirect Storage" (merged)

Python-Native-libCuFile

A modified version of my original Python-Native-libCuFile code is used in cufile-python, a dependency of LMCache pull/699, the pull request that added a GPUDirect Storage backend to LMCache for the vLLM ecosystem. A rough sketch of the underlying cuFile calls follows below.
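For readers curious what driving GPUDirect Storage from Python involves, here is a minimal sketch of calling NVIDIA's cuFile API through ctypes. This is an illustrative approximation, not the actual Python-Native-libCuFile or cufile-python code: the C entry points and struct layouts follow NVIDIA's cufile.h, while the gds_read helper and the Python class names are hypothetical, and dev_ptr is assumed to be a device pointer obtained from cudaMalloc via any CUDA Python binding.

    # Minimal sketch: GPUDirect Storage from Python with ctypes.
    # cuFile entry points and struct layouts mirror NVIDIA's cufile.h;
    # gds_read is a hypothetical helper, not the real binding's API.
    import ctypes
    import os

    libcufile = ctypes.CDLL("libcufile.so")  # may be libcufile.so.0 on some systems

    CU_FILE_HANDLE_TYPE_OPAQUE_FD = 1  # from cufile.h

    class CUfileDescrHandle(ctypes.Union):
        _fields_ = [("fd", ctypes.c_int), ("handle", ctypes.c_void_p)]

    class CUfileDescr(ctypes.Structure):
        _fields_ = [("type", ctypes.c_int),
                    ("handle", CUfileDescrHandle),
                    ("fs_ops", ctypes.c_void_p)]

    class CUfileError(ctypes.Structure):
        _fields_ = [("err", ctypes.c_int), ("cu_err", ctypes.c_int)]

    libcufile.cuFileDriverOpen.restype = CUfileError
    libcufile.cuFileDriverClose.restype = CUfileError
    libcufile.cuFileHandleRegister.restype = CUfileError
    libcufile.cuFileHandleRegister.argtypes = [
        ctypes.POINTER(ctypes.c_void_p), ctypes.POINTER(CUfileDescr)]
    libcufile.cuFileHandleDeregister.restype = None
    libcufile.cuFileHandleDeregister.argtypes = [ctypes.c_void_p]
    libcufile.cuFileRead.restype = ctypes.c_ssize_t
    libcufile.cuFileRead.argtypes = [ctypes.c_void_p, ctypes.c_void_p,
                                     ctypes.c_size_t, ctypes.c_longlong,
                                     ctypes.c_longlong]

    def gds_read(path, dev_ptr, nbytes):
        """DMA nbytes from a file straight into GPU memory at dev_ptr,
        bypassing a CPU bounce buffer. dev_ptr is assumed to be a
        cudaMalloc'd device pointer from any CUDA Python binding."""
        status = libcufile.cuFileDriverOpen()
        assert status.err == 0, f"cuFileDriverOpen failed: {status.err}"
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # GDS requires O_DIRECT
        try:
            descr = CUfileDescr()
            descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD
            descr.handle.fd = fd
            fh = ctypes.c_void_p()
            status = libcufile.cuFileHandleRegister(ctypes.byref(fh),
                                                    ctypes.byref(descr))
            assert status.err == 0, f"cuFileHandleRegister failed: {status.err}"
            # Storage-to-GPU DMA; file offset and device-buffer offset are 0.
            n = libcufile.cuFileRead(fh, ctypes.c_void_p(dev_ptr), nbytes, 0, 0)
            libcufile.cuFileHandleDeregister(fh)
            return n
        finally:
            os.close(fd)
            libcufile.cuFileDriverClose()

In a real binding you would open the driver once per process and optionally pin the device buffer with cuFileBufRegister before issuing reads; cuFile error reporting is also richer than the single err field checked here.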

Selected Work

GTC 2025 Talk – "A Blueprint for Supercharging LLM Inference with PagedAttention over RDMA" (presentation on accelerating large-model inference using RDMA networking).

Open-IOV Community Calls – Regular community call series on open GPU virtualization (collaborative discussions and knowledge sharing).

World Summit AI Talk – In-Person Session at World Summit AI USA 2025 on boosting LLM inference throughput and reducing GPU bottlenecks.

Key Open-IOV Technical Articles: