GPU Computing: Parallelism vs Latency

Hi guys! I am currently learning GPU computing and would love to share what I've learned with you. Of course, this won't be an in-depth course, but more of an overview of the parts I find interesting.

You've heard it so many times before: "GPUs hide latency with parallelism". But what does it really mean? Perhaps those among you with a good hardware background already know the answer, but I didn't!

Von Neumann Architecture

The von Neumann architecture, which is used in basically every modern CPU and GPU, is bottlenecked by its memory bus. In short, the part of the processor that actually does the computing, called the arithmetic logic unit (ALU), is prone to stalling while waiting for data to arrive, and those stalls bottleneck the whole processor.

The ALU is ready to process data every cycle, but fetching data from cache, RAM or disk takes many more cycles. Now this is what I found interesting: hyperthreading (or SMT on AMD) is a technique to make the ALU stall less. By having two (or more!) threads assigned to one ALU, it is less likely to stall, because if one thread is stalled, the ALU can just compute the other thread. But having too many threads on one ALU creates resource contention [ref].
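To build some intuition for why stalls hurt and why extra independent work helps, here is a small CPU-side sketch of my own. It is not SMT itself (that is a hardware feature), but the same principle applied inside one thread: a single dependent pointer chase stalls on every cache miss, while interleaving two independent chases gives the core something to do while a load is still in flight.

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const size_t n = 1 << 24;                     // ~16M entries, far larger than the caches
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::shuffle(next.begin(), next.end(), std::mt19937{42});

    // One dependent chain: every load must finish before the next can even start,
    // so the core spends most of its time stalled on memory.
    size_t a = 0;
    for (size_t i = 0; i < n; ++i) a = next[a];

    // Two independent chains: while one load is waiting on memory, the core can
    // issue the other one, so part of the latency is hidden by extra parallelism.
    size_t b = 0, c = 1;
    for (size_t i = 0; i < n / 2; ++i) { b = next[b]; c = next[c]; }

    std::printf("%zu %zu %zu\n", a, b, c);        // keep the results alive
    return 0;
}

Both loops perform the same number of loads, but timing them should show the second one finishing noticeably faster on most machines. SMT does the same thing, only across threads instead of within one.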

GPU Architecture

The same concept applies to the GPU. Instead of making sure we've finished one task before moving on to another, we just spam the ALUs with data from many different threads. This is realised in CUDA with blocks and threads.

__global__
void cuda_vector_add(int n, float *res, float *x, float *y)
{
    // Each thread computes its own global index and handles exactly one element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) // guard threads in the last block that fall past the end of the vector
        res[i] = x[i] + y[i];
}


// Invoke the CUDA function (kernel) from CPU (host)

int blockSize = 256; // number of threads in one block
int numBlocks = (N + blockSize - 1) / blockSize;
cuda_vector_add<<<numBlocks, blockSize>>>(N, res, x, y);

To run a CUDA function (a kernel), we need to define how many blocks to launch and how many threads run in one block. In the above example, we add two vectors together and each GPU thread processes exactly one element of the vector. This means the number of blocks depends on the length of the vector, and it can be massive: a vector of a billion elements already needs about a million blocks of 1024 threads each. In fact, the maximum number of blocks allowed along the x-dimension of a single kernel launch is 2^31 - 1, and since each block can have at most 1024 threads, that translates to a maximum of 1024 * (2^31 - 1) threads in one launch.
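For completeness, here is a minimal sketch of how the full host side might look. This is my own illustration, not part of the original snippet: I'm assuming unified (managed) memory via cudaMallocManaged so the same pointers are valid on the CPU and the GPU, and the kernel is the cuda_vector_add defined earlier.

#include <cstdio>
#include <cuda_runtime.h>

// (cuda_vector_add kernel from above goes here)

int main()
{
    const int N = 1 << 20;                        // one million elements (arbitrary choice)
    float *x, *y, *res;
    cudaMallocManaged(&x,   N * sizeof(float));   // unified memory: visible to CPU and GPU
    cudaMallocManaged(&y,   N * sizeof(float));
    cudaMallocManaged(&res, N * sizeof(float));

    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 256;                               // threads per block
    int numBlocks = (N + blockSize - 1) / blockSize;   // round up so every element gets a thread
    cuda_vector_add<<<numBlocks, blockSize>>>(N, res, x, y);

    cudaDeviceSynchronize();               // kernel launches are asynchronous; wait for the result
    std::printf("res[0] = %f\n", res[0]);  // expect 3.0

    cudaFree(x); cudaFree(y); cudaFree(res);
    return 0;
}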

How it works

So we've established that the number of blocks and threads can be massive. However, the GPU hardware itself is limited, right? For example, the NVIDIA RTX 4060 Ti has 34 streaming multiprocessors (SMs), and each SM can keep at most 1536 threads resident, for a total of 34 * 1536 = 52,224 threads in flight at once. So how can the GPU compute all those blocks?

Well, they queue the blocks! CUDA assigns blocks to SMs, and as long as an SM still has resources (registers, shared memory) to host another block, it takes one. CUDA does not guarantee which block runs first or which one finishes first; there is simply no guaranteed execution order between blocks. This is exactly because we just want to feed the ALUs data as fast as possible, and they process whichever data is ready first.
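If you're curious, here is a toy sketch of my own (not an official CUDA sample) that makes the lack of ordering visible: thread 0 of every block atomically grabs a "start ticket", and on most GPUs the mapping from block index to ticket will not come out as a neat 0, 1, 2, ... sequence.

#include <cstdio>
#include <cuda_runtime.h>

// Thread 0 of each block records the order in which its block actually started.
__global__ void block_start_order(int *ticket_counter, int *ticket_of_block)
{
    if (threadIdx.x == 0) {
        int ticket = atomicAdd(ticket_counter, 1);   // global "I started now" counter
        ticket_of_block[blockIdx.x] = ticket;
    }
}

int main()
{
    const int numBlocks = 1024;
    int *counter, *tickets;
    cudaMallocManaged(&counter, sizeof(int));
    cudaMallocManaged(&tickets, numBlocks * sizeof(int));
    *counter = 0;

    block_start_order<<<numBlocks, 128>>>(counter, tickets);
    cudaDeviceSynchronize();

    // Print the start ticket of the first few blocks; the order depends on how
    // the hardware happened to schedule them, not on the block index.
    for (int i = 0; i < 8; ++i)
        std::printf("block %d started as #%d\n", i, tickets[i]);

    cudaFree(counter);
    cudaFree(tickets);
    return 0;
}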

This kind of processor can be thought of as a "throughput-oriented processor": the goal is to keep the ALUs from ever stalling, even at the cost of not caring about execution order. The CPU, on the other hand, tries to finish a single task as quickly as possible, so it can be thought of as a "latency-oriented processor". This difference in goals also makes CPU and GPU designs differ in things like cache size, register file size, and pipeline depth.

Conclusion

That is where "hides latency with parallelism" comes from. Since we don't care about execution order on the GPU, the execution time of a single individual thread is longer than it would be on a CPU, because it has to wait its turn alongside many other threads. We can think of this as batch processing. Since GPUs were originally designed to render graphics, which come in batches, they prioritize throughput over individual thread timing: we don't care when a single pixel gets processed, we only care about rendering the entire screen (a batch of pixels).

This is my first post about GPU computing and I will probably post more! Subscribe so you don't miss it, or comment down below if you have any questions ☺️