A Dive Into GPU Math
Have you ever wondered what goes on when you ask ChatGPT a question and how that response is served? Or what people mean when they talk about training a model on an A100 and how long that takes? Or how many levels of abstraction sit between the model and the hardware? This article aims to shed light on many of the concepts related to GPUs and the math behind them.
We assume that the reader has a basic understanding of how recent LLM technologies work.
Common Data Types
To begin, we first look at some common data types used in training workloads. Tensors make up the backbone of all calculations, storing weights, gradients, and other data. These tensors are composed of floating point numbers, a way of approximating real numbers with an exponent and a number of significant digits. Over time, there have been many developments in floating point formats, which we will explore and use as the foundation for further calculations.
| Type | Description |
|---|---|
| float32 | Float32 is the default data type in much of scientific computing. It uses 32 bits to represent a float: one bit for the sign (positive or negative), eight bits for the exponent, and 23 bits for the fraction. At first glance this seems great, since we get better accuracy, which carries through to inference. However, the increased precision means more bits to store every value, which can prove unwieldy for larger models. |
| float16 | Float16 follows many of the same concepts as float32: one bit for the sign, 5 bits for the exponent, and 10 bits for the fraction. Although we cut the size in half, float16 brings its own challenges, namely vanishing gradients. Because of the reduced range and precision, small gradients can underflow to zero and the model can stop learning entirely. |
| fp8 | FP8 support was introduced with the Nvidia H100 GPU and cuts the size down even further than the previous formats. We will leave the details of fp8 to the official guide provided by Nvidia. |
We now have two competing problems:
- If we use higher precision, we get better results but the computations utilize more compute and memory.
- If we use lower precision, we get worse results but the computations utilize less compute and less memory.
To address this, we take the best of both worlds and use mixed-precision training: keep types such as float32 for values that require more accuracy (parameters, gradients) and use types like float16 or fp8 for operations such as the forward pass. We do not cover an in-depth analysis of mixed-precision training here and mention it only for reader awareness; a minimal sketch is shown below. With the data types in hand, we can start thinking about actually operating on them on a GPU.
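To make the idea concrete, here is a minimal sketch of mixed-precision training using PyTorch's automatic mixed precision (AMP) utilities. The tiny linear model, optimizer, and random data are placeholders (and a CUDA device is assumed); this is an illustration of the pattern, not a real training recipe.

```python
import torch

# Placeholder model, optimizer, and data -- stand-ins for a real training setup.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients do not underflow

for step in range(10):
    x = torch.randn(8, 4096, device="cuda")
    target = torch.randn(8, 4096, device="cuda")
    optimizer.zero_grad()
    # The forward pass runs in float16 where it is safe to do so;
    # the master weights and optimizer state stay in float32.
    with torch.autocast("cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()
```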
Size of Data Types
| Data Type | Bytes per Parameter |
|---|---|
| FP32 | 4 bytes |
| FP16 | 2 bytes |
| BF16 | 2 bytes |
| INT8 | 1 byte |
| INT4 | 0.5 bytes |
What is a FLOP?
Before we get started, it is good to briefly cover the term FLOP. A FLOP (FLoating Point Operation) is a single unit of computation such as an addition or a multiplication. The more FLOPs the underlying hardware can perform per second, the faster training and inference workloads will run.
FLOP/s measures how fast the hardware performs. This is different from FLOPs, which counts the number of operations a calculation requires.
GPU This, GPU That
Now this is where the fun begins. For the rest of the article, we will walk through choosing a model and hardware, optimizations, and related concepts that lend themselves to this topic. Let's say we want to run the 7B model from the Llama 3 family and we have a single A100 GPU. While the spec sheet has many cool facts and figures, we pull out the ones most relevant to this article:
| Specification | A100 |
|---|---|
| FP64 | 9.7 TFLOPS |
| FP64 Tensor Core | 19.5 TFLOPS |
| FP32 | 19.5 TFLOPS |
| Tensor Float 32 (TF32) | 156 TFLOPS (312 TFLOPS with sparsity) |
| BFLOAT16 Tensor Core | 312 TFLOPS (624 TFLOPS with sparsity) |
| FP16 Tensor Core | 312 TFLOPS (624 TFLOPS with sparsity) |
| INT8 Tensor Core | 624 TOPS (1,248 TOPS with sparsity) |
| GPU Memory | 40GB HBM2 / 80GB HBM2e |
| GPU Memory Bandwidth | 1,555GB/s - 2,039GB/s |
We care about three rows in this table for the purposes of this article: GPU Memory, GPU Memory Bandwidth, and FP16 Tensor Core. These terms mean:
- GPU Memory is the amount of total memory on the chip itself. This comes in handy when doing quick math on the size of the model and supported chips.
- GPU Memory Bandwidth is how fast we can move information from memory to the processing chips.
- FP16 Tensor Core is the peak compute throughput available when running models at float16 precision.
Calculating Memory Requirements
In this section, we walk through choosing a model and a GPU and performing a set of calculations that tell us where to optimize. Assume that we are running Llama3 7B on an Nvidia A100 GPU.
Model Memory Requirements
An approximation for memory usage is to multiply the number of parameters in billions by 2. In this case, running the Llama3 7B model will need about 14 GB of GPU memory to fit the whole model, leaving 26 GB free on a 40 GB A100 for activations and the KV cache. However, this assumes the model is stored at float16 precision. A more general rule is to multiply by the number of bytes per parameter instead. So in this case:
float32: 7B * 4 bytes = 28 GB
float16: 7B * 2 bytes = 14 GB
int8: 7B * 1 byte = 7 GB
int4: 7B * 0.5 bytes = 3.5 GB
However, each step down in precision comes with some loss in accuracy and output quality.
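As a quick sanity check, the rule of thumb fits in a few lines of Python; the function name and dtype labels here are just illustrative.

```python
# Bytes per parameter, mirroring the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"7B @ {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
# 7B @ fp32: 28.0 GB, fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```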
KV Cache Memory Requirements
The key-value (KV) vectors are used to calculate attention scores. Because autoregressive models predict one token at a time, the keys and values for every previous token would otherwise be recomputed at each step. A KV cache stores these vectors so they can be reused without recomputing them. This form of efficient computing matters because it removes redundant work and decreases the time it takes to produce each token. For a deeper understanding of the KV cache, see this article here.
Knowing how the KV cache works, we can develop a rough estimate of the amount of memory it consumes:
| Parameter | Explanation |
|---|---|
| b | Batch size (No of sequences processed simultaneously). Essential for efficient compute and memory utilization, throughput, latency and time to first token |
| 2 | For both K & V caches - represents the two separate caches needed (one for Keys, one for Values) in the attention mechanism |
| n_layers | No of layers in the neural network, KV cache is per layer - each transformer layer maintains its own KV cache |
| n_heads | No of attention heads per layer - multi-head attention requires separate KV storage for each head |
| d_head | Dimension of each attention head - the size of the key/value vectors for each attention head (typically d_model/n_heads) |
| t_seq_len | Total sequence length (No of input and output tokens) - determines how many token positions need to be cached |
| p_a | No of bytes per parameter - typically 2 bytes for fp16/bf16 or 4 bytes for fp32, determines the memory footprint per value |
Now that we know the parameters used to estimate the size of a KV cache, we can write a rough formula:
kv_cache_size = b * (2 * n_layers * n_heads * d_head * t_seq_len * p_a)
Thus, we can estimate the KV cache sizes of some common models. Going back to the example of Llama 3-8B, we can pull the relevant figures from the model card and papers:
| Factor | Llama 3-8B | Description |
|---|---|---|
| n_layers | 32 | Number of transformer blocks (each containing attention and feed-forward layers). |
| n_heads | 32 | Number of attention heads per layer. |
| d_head | 128 (4096 / 32) | Hidden dimension per head (total embedding dimension divided by heads). |
| t_seq_len | 8192 tokens | Training and inference context length (input + output). |
| p_a (bytes per parameter) | Typically 2 bytes (bfloat16) for precision | Byte size per parameter used for KV and weight memory calculation. |
| Batch size (B) | Tunable depending on GPU memory; optimized via mixed precision and FlashAttention 2 | Number of token sequences processed simultaneously. |
So, for example, let's say we want a batch size of 1 at fp16 precision; we can then estimate the size of the KV cache with the following approximation. Remember that batch size denotes how many sequences are processed simultaneously: a batch size of 1 keeps per-query latency low, but queries have to be processed one after another.
kv_cache_size = 1 * (2 * 32 * 32 * 128 * 8192 * 2) = 4,294,967,296 bytes ≈ 4.3 GB
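Here is the same estimate as a small Python helper; the argument names follow the parameter table above. Note that this charges every attention head its own KV entries, so for models that use grouped-query attention it is an upper bound.

```python
def kv_cache_bytes(b: int, n_layers: int, n_heads: int, d_head: int,
                   t_seq_len: int, p_a: int) -> int:
    """Bytes needed to cache keys and values for b sequences of t_seq_len tokens."""
    return b * 2 * n_layers * n_heads * d_head * t_seq_len * p_a

# Llama 3-8B figures from the table: fp16 cache, batch size 1, full 8,192-token context.
size = kv_cache_bytes(b=1, n_layers=32, n_heads=32, d_head=128, t_seq_len=8192, p_a=2)
print(f"{size:,} bytes ≈ {size / 1e9:.1f} GB")  # 4,294,967,296 bytes ≈ 4.3 GB
```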
A Cheat Code with Memory/Token
Sometimes you may not have the time to calculate all of these metrics, or the architectural details may not be available to you. A shortcut is to estimate total memory usage with a memory-per-token figure; roughly 1 MB of memory per token is a reasonable estimate here. This gives us the formula:
memory = weights + no. of tokens * 1 MB
where no. of tokens is equal to the batch size * sequence length. So if we are using Llama 3-8B with a batch size of 1 and a sequence length of 8192, this gives us:
memory = (8B * 2 bytes) + (8192 * 1 MB) = 16 GB + 8.2 GB ≈ 24.2 GB
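The shortcut is easy to turn into a throwaway calculator; the 1 MB/token figure is the same rough estimate as above, not a measured value.

```python
def total_memory_gb(params_billions: float, batch_size: int, seq_len: int,
                    bytes_per_param: float = 2.0, mb_per_token: float = 1.0) -> float:
    """Weights plus ~1 MB of KV-cache memory per token in flight."""
    weights_gb = params_billions * bytes_per_param      # e.g. 8B params * 2 bytes = 16 GB
    kv_gb = batch_size * seq_len * mb_per_token / 1000  # tokens * ~1 MB each
    return weights_gb + kv_gb

print(total_memory_gb(8, batch_size=1, seq_len=8192))  # ~24.2 GB
```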
We can also calculate something called time per token to see how long it takes to generate each individual token.
Why do we care about this?
Knowing the limitations of hardware and how they interact with software becomes important when designing products that use LLMs (beyond plug-and-play APIs). For example, added latency in a consumer application may reduce customer retention because users expect near-instant results. Another scenario is scaling model serving to many consumers of your service: how do you properly batch and provision for that? There are three main metrics that we will cover:
| Metric | Explanation |
|---|---|
| Latency | The total time from when a user sends a request until they receive the complete response. For LLMs, this includes both the time to process the input prompt and generate all output. A lower latency means better UX. |
| Time To First Token (TTFT) | While we do not dive deep into the mechanics here, TTFT measures the time between sending the request (the user hits enter) and receiving the very first token of a streamed response. A low TTFT makes the system feel responsive even if the total latency is longer. |
| Throughput | The number of tokens a system can process per unit of time, measured as tokens/second or queries/second. Higher throughput means more users can use our models on the same hardware. |
Before we move on to calculating these values, a really quick sidebar on prefill time. When you enter a query, the prefill stage is the time the model spends processing the entire prompt (and filling the KV cache) before it can generate the first token of the response. We can reason that decreasing prefill time will decrease latency and increase responsiveness.
The following metrics are calculated by:
Latency = prefill time + (# tokens * time/token)
TTFT = prefill time + first token time
Throughput = batch size / time per token
You may have noticed that we have mentioned the term time/token multiple times but have not yet provided a method of calculating it. For a GPU, there are two potential bottlenecks: memory bandwidth and compute. In practice, token-by-token generation is usually memory bound, since every weight has to be streamed from memory for each generated token. To simplify our calculations, we can approximate time/token as:
time per token ≈ (parameters in billions * bytes per parameter) / memory bandwidth in GB/s
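Putting the formulas above together, here is a rough calculator. The prefill time is an assumed input (0.2 s in the example), and the per-token time uses the memory-bound approximation just described; the function names are illustrative.

```python
def time_per_token_s(params_billions: float, bytes_per_param: float,
                     bandwidth_gb_s: float) -> float:
    """Memory-bound estimate: every weight byte is streamed once per generated token."""
    return params_billions * bytes_per_param / bandwidth_gb_s

def latency_s(prefill_s: float, n_output_tokens: int, tpt_s: float) -> float:
    return prefill_s + n_output_tokens * tpt_s

def throughput_tokens_s(batch_size: int, tpt_s: float) -> float:
    return batch_size / tpt_s

# Llama 3 7B at fp16 on a 40 GB A100 (1,555 GB/s), assuming ~0.2 s of prefill.
tpt = time_per_token_s(7, 2, 1555)
print(f"{tpt * 1000:.0f} ms/token")                               # ~9 ms/token
print(f"{latency_s(0.2, 500, tpt):.1f} s for 500 output tokens")  # ~4.7 s
print(f"{throughput_tokens_s(1, tpt):.0f} tokens/s at batch size 1")  # ~111 tokens/s
```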
On the engineering side, we want to measure a key metric: how well we utilize our hardware. If our hardware is not being effectively utilized, we are missing out on potential upside and scale; if we overutilize it, we may see the three metrics above degrade.
To reason about utilization, we calculate the hardware's operations-per-byte ratio, i.e., the number of peak FLOPs available for every byte of memory moved:
ops per byte = peak FLOP/s for the data type / memory bandwidth in bytes per second
So for a Nvidia A100 GPU, we have:
- 312 TFLOPS at peak for FP16 Tensor Cores
- 40GB of HBM2 memory with around 1,555 GB/s of bandwidth
which gives us
312 TFLOP/s / 1,555 GB/s ≈ 201 FLOPs per byte
meaning that for every byte of memory we move, we need to perform about 201 FLOPs in order to reach peak utilization.
What happens if we are performing under or over that number? That is how we can identify bottlenecks for the model running on hardware. Namely, there are two types of bottlenecks:
- Compute Bound: We are compute bound when the workload needs to perform more than 201 FLOPs for every byte of data it moves. We are not constrained by memory or bandwidth; instead, we are constrained by the number of Tensor Cores the chip possesses.
- Memory Bound: We are memory bound when we perform fewer than 201 FLOPs per byte moved and are constrained by the rate at which we can transfer data between memory and the compute units.
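Here is a small sketch of the ops-per-byte calculation and the bound check, using the A100 FP16 figures quoted above. The single-token decode example at the end assumes roughly 2 FLOPs per parameter and one read of each fp16 weight byte.

```python
PEAK_FLOPS = 312e12   # A100 FP16 Tensor Core peak, FLOP/s
BANDWIDTH = 1555e9    # 40 GB A100 memory bandwidth, bytes/s

hw_ops_per_byte = PEAK_FLOPS / BANDWIDTH
print(f"{hw_ops_per_byte:.0f} FLOPs per byte")  # ~201

def classify(workload_flops: float, bytes_moved: float) -> str:
    """Compare a workload's arithmetic intensity to the hardware's ops:byte ratio."""
    intensity = workload_flops / bytes_moved
    return "compute bound" if intensity > hw_ops_per_byte else "memory bound"

# Decoding one token of a 7B fp16 model: ~2 FLOPs/parameter, each weight byte read once.
print(classify(workload_flops=2 * 7e9, bytes_moved=2 * 7e9))  # memory bound (~1 FLOP/byte)
```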
Case Study: Llama 3.1-27B
Let's bring all of these concepts together to study how we would design and run our own cluster given certain user constraints. We won't build an exhaustive plan, but rather use the topics above to illustrate how to get ballpark values.
Assume that we are designing a knowledge base app that will respond to user queries given some text from an internal system. We expect a population of users that will not exceed 50 queries per second. Each query carries around 500-600 words of context in addition to a 1,000-word prompt, and each generated response is around 400 words. Our model of choice is Llama3.1-27B, and we have flexibility in what hardware to run our service on.
Also, the IT people left a few A100s sitting out for us which gives us some hardware to get started!
A rough conversion is dividing the number of words by 0.75 to get the token count. Our conditions are listed in the table below:
| Metric | Value |
|---|---|
| Model | Llama3.1-27B |
| Input Tokens | ~2,100 tokens |
| Output Tokens | ~500 tokens |
| Total Sequence Length | ~2,600 tokens |
From these metrics, we can first approximate the amount of memory the model weights take up at float16 precision. For a 27B parameter model, that will be around 27B * 2 bytes = 54 GB. At int8 we only need 27 GB of memory (since int8 takes 1 byte per parameter), but that comes with an accuracy compromise.
The next thing we want to consider is the throughput metrics that tell us how long a user will have to wait for a request to be served and how to structure our hardware. First, let us calculate the TTFT:
Prefill compute = 27B params * 2 FLOPs * 2,100 tokens ≈ 113 TFLOPs
Prefill time on an A100 (624 TOPS at INT8) = 113 / 624 ≈ 0.18s
Weight memory transfer = 27GB / 2,039 GB/s ≈ 0.013s
TTFT ≈ 0.2s
Now, we can calculate how long it will take us to generate each token:
Compute per token: 27B params * 2 FLOPs = 54 GFLOPs
Compute time = 54 / 624,000 ≈ 0.09ms
Memory time = 27GB / 2,039 GB/s ≈ 13ms
Time/token ≈ 13ms/token (memory bound)
Thus, we can see that the total request time will be:
200ms + 500 * 13ms = 6.7s per request.
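The same case-study arithmetic as a short script, assuming int8 weights (27 GB), the 624 TOPS int8 Tensor Core peak, and 2,039 GB/s of bandwidth; small differences from the rounded figures above are expected.

```python
PARAMS = 27e9            # parameters
INPUT_TOKENS = 2100
OUTPUT_TOKENS = 500
PEAK_OPS = 624e12        # int8 Tensor Core peak, ops/s
BANDWIDTH = 2039e9       # 80 GB A100 HBM2e, bytes/s
WEIGHT_BYTES = 27e9      # 27B parameters at 1 byte each (int8)

prefill_s = 2 * PARAMS * INPUT_TOKENS / PEAK_OPS  # ~0.18 s of prompt compute
weight_load_s = WEIGHT_BYTES / BANDWIDTH          # ~0.013 s to stream the weights once
ttft_s = prefill_s + weight_load_s                # ~0.2 s
time_per_token_s = WEIGHT_BYTES / BANDWIDTH       # memory bound: ~13 ms/token
total_s = ttft_s + OUTPUT_TOKENS * time_per_token_s
print(f"TTFT ~{ttft_s:.2f} s, {time_per_token_s * 1e3:.0f} ms/token, ~{total_s:.1f} s per request")
```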
Now we can calculate the KV cache size for each request. We know that our Llama3.1 model has the following characteristics:
| Metric | Value |
|---|---|
| n_layers | 40 |
| n_heads | 40 |
| d_head | 128 |
| Max Sequence | 2,600 tokens |
| Precision | 2 Bytes |
We want to use FP16 for our KV cache in order to preserve accuracy and partially offset the quantization of the model weights to INT8. Using the formula from above, we can see that for each request the KV cache at batch size 1 will take up:
kv cache size = 1 * (2 * 40 * 40 * 128 * 2,600 * 2) = 2.1 GB per request
However, this is horribly inefficient! Imagine being able to serve only one request at a time, in sequential order; your users will be waiting ages for their results. Let's assume that we receive 50 queries per second, each generating 500 tokens. Knowing that in practice it takes around 20-30ms to generate a token on an A100 (our 13ms figure is an idealized, bandwidth-only estimate), a 500-token response takes around 10 seconds (taking the lower bound), and at 50 requests per second that means 50 * 10 = 500 concurrent requests.
With 500 concurrent requests at a batch size of 1, we would need around 1 TB of memory for the KV caches alone, which is not remotely cost efficient. This is where batching comes in. Batching allows us to group multiple requests into a single forward pass on modern GPUs and compute them all in parallel. For our case, since we calculated that each request takes around 7 seconds, at 50 requests per second we get 350 concurrent requests. That still doesn't get us anywhere on its own.
To illustrate, let's take a batch size of 16. The memory required for the KV caches will be roughly 16 * 2.1 = 33.6 GB (one cache per sequence in the batch). Adding the KV cache to the memory for the model weights, we get roughly 88 GB needed per batch. With 350 concurrent requests, we will need 350 / 16 ≈ 22 parallel batches, and thus 22 * 88 GB = 1,936 GB of total memory. You will need far more than one GPU; this is still extremely inefficient.
This obviously won't fit into a single A100, which is limited to 80 GB of memory. This is where we quantize the model weights to a lower precision: instead of FP16, we load the weights for the forward pass at int8, which cuts the weight memory in half! Now each batch takes up about 61 GB (27 GB of weights + 34 GB of KV cache), which performs a little better: 22 * 61 GB = 1,342 GB.
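Finally, here is a sketch of the capacity math for the two weight precisions. The constants mirror the figures above, each parallel batch is assumed to hold its own copy of the weights, and the totals match the ~1.9 TB and ~1.3 TB numbers up to rounding.

```python
import math

CONCURRENT_REQUESTS = 350   # 50 requests/s * ~7 s per request
KV_PER_REQUEST_GB = 2.1
BATCH_SIZE = 16
A100_GB = 80

def cluster_memory_gb(weight_gb: float) -> float:
    """Total memory if each parallel batch holds its own copy of the weights."""
    batches = math.ceil(CONCURRENT_REQUESTS / BATCH_SIZE)     # 22 parallel batches
    per_batch_gb = weight_gb + BATCH_SIZE * KV_PER_REQUEST_GB # weights + 16 KV caches
    return batches * per_batch_gb

for label, weight_gb in (("fp16 weights", 54.0), ("int8 weights", 27.0)):
    total = cluster_memory_gb(weight_gb)
    print(f"{label}: ~{total:.0f} GB total, ~{math.ceil(total / A100_GB)} A100s")
```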
Overall this is still impractical (roughly 1,342 GB / 80 GB ≈ 17 A100 GPUs). We can do many different things to optimize further, such as quantizing the weights even more aggressively, using shorter context windows, and using more efficient attention implementations, but we will not cover those methods in this article. In conclusion, the technologies we see and interact with every day require far more horsepower and compute than meets the eye. Stay tuned for more articles on infrastructure and model serving.