Distributed Inference for Fun and Profit

Ever wonder how large models are served at scale? Or how a query actually becomes an answer? Over the course of this article, we will look at several approaches to inference and explore their tradeoffs from a technical perspective.

We assume that the reader has basic knowledge of ML concepts and of how Transformers work. All of the work here is done on a single NVIDIA RTX 3090 GPU with the drivers and accompanying tooling installed (nvidia-smi, nvidia-ctk, etc.). The purpose of this article is to learn how to set up and run a local Kubernetes cluster with GPU support, GPU splitting, and more.

Prerequisites