I have been writing CUDA kernels for a while now. Between my GSoC work at CERN on GPU backends (CUDA, ROCm, Alpaka) and contributing to vLLM, I have had the chance to actually feel the architectural differences between NVIDIA generations in a way that spec sheets alone never convey. When you are staring at an Nsight Compute profile and trying to figure out why your kernel is hitting 40% occupancy, you develop opinions about hardware. Strong ones.
So this is my attempt at walking through the last four major NVIDIA architectures from the perspective of someone who actually writes the kernels that run on them. Not a marketing overview, not a benchmark roundup. Just what matters when you are the person writing __global__ functions and fighting with shared memory bank conflicts at 2 AM.
The Quick Reference
Before diving in, here is the table I wish someone had given me when I started. These are the four chips I will be covering.
| Architecture | Year | Memory | Bandwidth | SMs | Tensor Core Gen | Key Feature |
|---|---|---|---|---|---|---|
| Ampere (A100) | 2020 | 80 GB HBM2e | 2 TB/s | 108 | 3rd gen | TF32, MIG |
| Ada Lovelace (RTX 4090) | 2022 | 24 GB GDDR6X | ~1 TB/s | 128 | 4th gen | FP8 tensor cores |
| Hopper (H100 SXM) | 2022 | 80 GB HBM3 | 3.35 TB/s | 132 | 4th gen | Transformer Engine, TMA |
| Blackwell (B200) | 2024 | 192 GB HBM3e | 8 TB/s | N/A (chiplet) | 5th gen | FP4, two-die design |
Now let me actually talk about what these numbers mean when you are writing kernels.
Ampere: Where Modern GPU Computing Really Started
I know Volta and Turing existed, and yes, tensor cores were introduced on Volta. But honestly, Ampere is where things clicked for the broader ML ecosystem. The A100 was the GPU that made large scale training feel accessible to research labs, not just hyperscalers.
The A100 gives you 80 GB of HBM2e at 2 TB/s of memory bandwidth, 108 SMs, 6912 CUDA cores, and 432 third generation Tensor Cores. On paper those are just numbers. In practice, if you have ever profiled a kernel you know that the 2 TB/s bandwidth figure is the one that actually matters for most workloads. The raw compute is 9.7 TFLOPS at FP64 (19.5 TFLOPS if you run FP64 through the tensor cores), 156 TFLOPS at TF32, and 312 TFLOPS at FP16 on the tensor cores.
The thing is, the real story of Ampere is TF32. NVIDIA introduced this 19 bit floating point format (8 bit exponent like FP32, but 10 bit mantissa like FP16) that could run on tensor cores while giving you near FP32 accuracy. Before TF32, you had to explicitly convert your training pipeline to mixed precision with FP16, deal with loss scaling, worry about overflow. TF32 just worked. You kept your FP32 code, the tensor cores handled TF32 internally, and training accuracy barely moved. This was huge for adoption.
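To make that concrete, here is a minimal sketch of what opting into TF32 looks like if you go through cuBLAS rather than writing the MMA instructions yourself: the data stays plain FP32 in memory, and you just flip the handle's math mode (this assumes cuBLAS 11 or newer, where CUBLAS_TF32_TENSOR_OP_MATH is the relevant mode).

```cuda
#include <cublas_v2.h>

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Let FP32 GEMMs route through the tensor cores as TF32.
    // Inputs and outputs stay FP32 in memory; only the internal math changes.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    // ... cublasSgemm(...) calls issued on this handle now use TF32 ...

    cublasDestroy(handle);
    return 0;
}
```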
Then there was structured sparsity. Ampere's tensor cores support 2:4 sparsity, meaning if exactly 2 out of every 4 weights are zero, the tensor cores can skip those multiplications and give you 2x throughput. From a kernel developer's perspective, this is interesting because it is a hardware constraint that your pruning algorithm has to respect. You cannot just randomly zero out weights. You need the 2:4 pattern. In practice, getting models to converge well with this constraint is still an active research area, but when it works, the speedup is real.
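Because the constraint is purely structural, it is easy to sanity check before you hand a pruned weight matrix to the sparse tensor core path. A minimal host-side sketch (the function name and the row-major layout are my own illustrative assumptions, not any library's API):

```cuda
#include <cstddef>

// Returns true if every contiguous group of 4 weights in each row has at
// most 2 nonzeros, which is what the 2:4 sparse tensor core path expects.
bool is_2_4_sparse(const float* w, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c + 4 <= cols; c += 4) {
            int nonzeros = 0;
            for (int k = 0; k < 4; ++k)
                nonzeros += (w[r * cols + c + k] != 0.0f);
            if (nonzeros > 2) return false;  // pattern violated
        }
    }
    return true;
}
```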
MIG (Multi-Instance GPU) was the other big Ampere feature. You could partition a single A100 into up to 7 isolated GPU instances, each with dedicated memory bandwidth and compute. For inference serving this was a game changer. Instead of wasting an entire 80 GB A100 on a small model, you slice it up. From a kernel perspective, MIG does not change how you write code, but it changes how your code gets scheduled and what resources it can see.
Connectivity: PCIe Gen4 and NVLink 3.0 at 600 GB/s. For multi-GPU kernels and all-reduce operations, that NVLink bandwidth matters a lot.
Ada Lovelace: The Consumer Card That Punches Up
The RTX 4090 is not a datacenter GPU. It has 24 GB of GDDR6X instead of HBM, roughly 1 TB/s of bandwidth, and sits on a TSMC 4N process. But with 128 SMs, 16384 CUDA cores, and 512 fourth generation Tensor Cores, it is a serious piece of hardware for researchers and hobbyists who cannot get H100 allocations.
The big deal for kernel developers: FP8 tensor cores. Ada was the first consumer architecture to bring FP8 (both E4M3 and E5M2 formats) to tensor cores. If you have been following the quantization literature, FP8 is increasingly the sweet spot for inference. You get massive throughput gains over FP16 with surprisingly little accuracy loss for most models. Writing kernels that actually exploit FP8 on Ada is a different experience than on Hopper though, because Ada lacks the Transformer Engine that automates the scaling factors.
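If you want to play with the formats directly, the FP8 types ship in <cuda_fp8.h> (CUDA 11.8 and newer). A minimal round-trip sketch that puts the two formats side by side:

```cuda
#include <cuda_fp8.h>
#include <cstdio>

int main() {
    float x = 3.14159f;

    __nv_fp8_e4m3 a(x);  // 4 exponent bits, 3 mantissa bits: more precision, less range
    __nv_fp8_e5m2 b(x);  // 5 exponent bits, 2 mantissa bits: more range, less precision

    // Converting back to float shows the rounding each format introduces.
    printf("original %f  e4m3 %f  e5m2 %f\n", x, float(a), float(b));
    return 0;
}
```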
Ada also introduced Shader Execution Reordering (SER) and DLSS 3 with optical flow based frame generation. These are graphics features and mostly irrelevant if you are writing compute kernels, but SER is architecturally interesting because it lets the hardware reorder thread execution for better coherence. The general principle of execution reordering for better memory access patterns is something kernel developers think about constantly.
Honestly, the RTX 4090 occupies a weird but important niche. It is the GPU that lets independent researchers and startups actually experiment with quantized inference, fine tuning smaller models, and developing kernels before they get datacenter access. A lot of the early vLLM development and testing happened on consumer cards like this. The 24 GB memory limit forces you to be creative with memory management, which, arguably, makes you a better kernel developer.
Hopper: The Architecture That Changed How I Write Kernels
Hopper is where things got really interesting. The H100 SXM packs 80 GB of HBM3 at 3.35 TB/s bandwidth, 132 SMs, 16896 CUDA cores, and 528 fourth generation Tensor Cores. The headline compute numbers are staggering: 3958 TFLOPS at FP8, 1979 TFLOPS at FP16, 989 TFLOPS at TF32 (those are the with-sparsity figures NVIDIA quotes; halve them for dense math).
But the raw numbers are not why Hopper changed things. It is the new programming abstractions.
The Transformer Engine
The Transformer Engine is hardware level support for automatic mixed precision specifically targeting transformer architectures. It dynamically switches between FP8 and FP16 on a per layer basis, managing scaling factors automatically. If you have ever hand tuned FP8 quantization and dealt with the pain of per tensor vs per channel scaling, you understand why having this in hardware is a massive quality of life improvement. The Transformer Engine looks at the statistics of each layer's activations and picks the right precision on the fly.
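To see why this is worth putting in hardware, here is a hedged sketch of the software side of per-tensor FP8 scaling: find the activation's absolute maximum, then derive a scale that maps it onto E4M3's representable range (448 is E4M3's maximum magnitude; the kernel name is mine, it assumes a 256-thread block and an amax buffer initialized to zero). The Transformer Engine does the equivalent bookkeeping for you, per layer, on the fly.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One pass over the tensor to find max(|x|). Launch with 256 threads per block.
__global__ void amax_kernel(const float* __restrict__ x, int n, float* amax) {
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local = fmaxf(local, fabsf(x[i]));

    __shared__ float smem[256];
    smem[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    // Non-negative floats compare correctly as ints, so atomicMax works here.
    if (threadIdx.x == 0) atomicMax((int*)amax, __float_as_int(smem[0]));
}

// Host side, after the kernel: scale = 448.0f / amax, applied before the cast
// to E4M3 (and its reciprocal applied when dequantizing the GEMM output).
```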
For kernel developers working on transformer inference (which is basically everyone in LLM land right now), this means you can write your GEMM calls targeting FP8 and trust the hardware to handle the precision management that used to require careful software engineering.
Tensor Memory Accelerator (TMA)
The TMA is, honestly, one of my favorite Hopper features. It is a dedicated hardware unit for asynchronous bulk data movement between global memory and shared memory. Before TMA, doing efficient async copies meant using cp.async instructions and managing completion tracking yourself. TMA handles multi-dimensional tensor copies in hardware, including boundary checks and padding.
If you have ever written a tiled GEMM kernel and spent hours getting the shared memory loading right with proper double buffering, TMA makes that dramatically simpler. You describe the tensor layout, tell TMA to prefetch a tile, and it handles the rest. This is not just a convenience feature. It actually frees up warps that were previously dedicated to data movement, letting you use more of your SM budget for actual compute.
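For contrast, here is a hedged sketch of the pre-TMA pattern described above: an asynchronous global-to-shared tile copy through cooperative_groups::memcpy_async (which lowers to cp.async on Ampere and later), with completion tracked by hand. The TILE size and the assumption that n is a multiple of it are illustrative only.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 128;

__global__ void tile_consumer(const float* __restrict__ gmem, int n) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        // Kick off the copy; warps are free to do independent work meanwhile.
        cg::memcpy_async(block, tile, gmem + base, sizeof(float) * TILE);

        // ... independent computation could overlap with the copy here ...

        cg::wait(block);  // all copies have landed, block is synchronized

        // ... compute on tile[] ...

        block.sync();     // don't start the next copy while threads still read tile[]
    }
}
```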
Thread Block Clusters
Thread Block Clusters are a new level in the CUDA programming hierarchy. Before Hopper, the hierarchy was threads, warps, thread blocks, and the grid. Clusters add a grouping of thread blocks that are guaranteed to be co-scheduled on nearby SMs and can communicate through distributed shared memory.
This matters a lot for algorithms that need cooperation beyond a single SM. Before clusters, cross-block communication meant going through global memory (slow) or using cooperative groups with limited guarantees. With clusters, you can do direct loads from another block's shared memory using the mapa PTX instruction, which opens up new algorithmic possibilities for things like fused attention kernels and multi-stage pipelines.
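A hedged sketch of what that looks like through the cooperative groups cluster API (CUDA 12, sm_90): two blocks per cluster, each reading directly out of its peer's shared memory. The kernel name and the 256-thread block size are assumptions of the example.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Compile-time cluster of 2 thread blocks; launch with 256 threads per block.
__global__ void __cluster_dims__(2, 1, 1) swap_tiles(float* out) {
    __shared__ float smem[256];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();

    smem[threadIdx.x] = static_cast<float>(rank * blockDim.x + threadIdx.x);
    cluster.sync();  // make every block's shared memory writes visible cluster-wide

    // Read straight from the peer block's shared memory; pre-Hopper this
    // round trip would have gone through global memory.
    float* peer_smem = cluster.map_shared_rank(smem, rank ^ 1);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer_smem[threadIdx.x];

    cluster.sync();  // keep both blocks alive until the cross-block reads finish
}
```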
DPX instructions are the other addition worth mentioning. These accelerate dynamic programming algorithms (think Smith-Waterman for genomics, Viterbi decoding, graph shortest paths) by providing hardware acceleration for the inner loop comparison operations. Niche, but genuinely useful for bioinformatics and similar fields.
NVLink 4.0 bumps inter-GPU bandwidth to 900 GB/s and adds PCIe Gen5 support. For the multi-node training setups that LLM training demands, every bit of interconnect bandwidth matters.
Blackwell: The Chiplet Era Begins
Blackwell is the one I am most excited about architecturally, even though I have not had as much hands on time with it. The B200 packs 192 GB of HBM3e at a frankly absurd 8 TB/s of memory bandwidth. But the real story is the design itself.
The B200 is NVIDIA's first chiplet GPU. It is a two die design connected by a 10 TB/s die-to-die interconnect. This is significant for the entire industry because it means NVIDIA hit the practical limits of monolithic die scaling and had to go multi-die. The fact that they achieved 10 TB/s between the dies means that from a programmer's perspective, it should mostly look like a single GPU. Mostly. There will likely be NUMA-like effects that kernel developers need to think about.
Fifth generation Tensor Cores bring FP4 support. Think about that progression: FP32 to TF32 to FP16/BF16 to FP8 to FP4, each generation pushing the precision lower while the hardware gets smarter about maintaining accuracy. The second generation Transformer Engine now handles FP4 precision management automatically. For inference on trillion parameter models, FP4 could be transformative. You are looking at up to 9 PFLOPS of FP4 throughput and 4.5 PFLOPS at FP8.
NVLink 5.0 pushes to 1.8 TB/s per GPU. For context, that is 3x what Ampere's NVLink provided. When you are doing all-reduce across 8 GPUs during training, that bandwidth directly translates to less time waiting on communication.
The 192 GB of HBM3e is particularly interesting for LLM inference. A 70B parameter model in FP16 needs about 140 GB of memory. On an H100, you need at least two GPUs for that. On a B200, it fits in one. Fewer GPUs means less inter-GPU communication, means simpler kernels, means lower latency. Sometimes the best kernel optimization is not needing the kernel at all.
What This All Means If You Write Kernels
After working across these architectures, here is what I keep coming back to.
Memory bandwidth is the bottleneck, not compute
This has been true for a while, but each generation makes it more true. Look at the compute to memory bandwidth ratio across generations. The A100 does 312 TFLOPS of dense FP16 with 2 TB/s of bandwidth. The H100 does 989 TFLOPS of dense FP16 with 3.35 TB/s. Compute went up roughly 3x, and far more than that if you drop to FP8. Bandwidth went up roughly 1.7x. The arithmetic intensity you need to be compute bound keeps going up.
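Put numbers on it (dense FP16 tensor throughput over HBM bandwidth) and the trend is hard to miss. A back-of-the-envelope sketch:

```cuda
#include <cstdio>

int main() {
    // Peak dense FP16 tensor TFLOPS and HBM TB/s for each part.
    struct Gpu { const char* name; double tflops; double tbps; };
    Gpu gpus[] = {
        {"A100", 312.0, 2.0},
        {"H100", 989.0, 3.35},
    };
    for (const Gpu& g : gpus) {
        // FLOP per byte you need to sustain before the tensor cores,
        // rather than HBM, become the limiter.
        double ridge = (g.tflops * 1e12) / (g.tbps * 1e12);
        printf("%s: ~%.0f FLOP/byte to be compute bound\n", g.name, ridge);
    }
    return 0;
}
// Roughly 156 FLOP/byte on A100 versus ~295 on H100: anything below that
// ridge point is bandwidth limited no matter how clever the math is.
```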
What this means in practice: if your kernel is memory bound (and most kernels are), adding more FLOPS does nothing for you. You need to minimize data movement. Fused kernels, better tiling, keeping data in shared memory and registers as long as possible. The Hopper TMA and Thread Block Clusters are direct responses to this trend, giving you better tools to manage the memory hierarchy.
The precision ladder keeps going down
FP32 to TF32 to FP16 to BF16 to FP8 to FP4. Each step down roughly doubles your throughput because you are moving half the bytes and the tensor cores process more elements per cycle. Training still needs higher precision (BF16 or TF32 is the sweet spot right now), but inference is rapidly moving to FP8, and FP4 is on the horizon with Blackwell.
For kernel developers, this means you need to understand quantization at a deep level. Not just the math, but the hardware implications. FP8 has two formats (E4M3 for weights, E5M2 for gradients) and choosing the wrong one tanks your accuracy. FP4 will presumably have similar gotchas. Every new precision format is a new set of tradeoffs to internalize.
Understanding the memory hierarchy matters more than raw FLOPS
If you are profiling a kernel and it is running at 60% of theoretical peak bandwidth, the answer is almost never "I need a faster GPU." The answer is usually: your access pattern has bank conflicts in shared memory, you are not coalescing global memory loads, your occupancy is low because you are using too many registers, or your kernel is not overlapping computation with data fetching.
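The first item on that list has a fix so standard it is worth showing: pad the shared memory tile by one column so the transposed read pattern does not hammer a single bank. A minimal sketch (assumes a 32x32 thread block; the kernel name is mine):

```cuda
constexpr int TILE = 32;

__global__ void transpose_padded(float* __restrict__ out,
                                 const float* __restrict__ in,
                                 int width, int height) {
    // The "+ 1" skews each row by one bank, so reading a column of the tile
    // (which the transpose needs) no longer causes a 32-way bank conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    // Coalesced store of the transposed tile; the shared memory read walks a
    // column, which the padding makes conflict free.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```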
Every architecture I have discussed gives you more compute per byte of memory bandwidth. This means the penalty for inefficient memory access gets worse each generation. An L2 cache miss on Blackwell wastes more potential FLOPS than an L2 cache miss on Ampere, simply because there are more FLOPS available to waste.
Training vs inference: different optimization targets
Training cares about throughput and precision. You want BF16 or TF32 on tensor cores, large batch sizes to amortize communication overhead, and you are usually compute bound on the GEMMs. Inference cares about latency and memory. You want the lowest precision that maintains quality (FP8, soon FP4), small batch sizes (sometimes batch size 1 for interactive applications), and you are almost always memory bandwidth bound because you are reading the entire model for every token.
This is why Blackwell's combination of 8 TB/s bandwidth and FP4 support is so interesting for inference. It attacks both sides of the inference bottleneck simultaneously: more bandwidth to feed the model weights faster, and lower precision to make those weights smaller.
Looking Forward
If you are getting into GPU kernel development now, my honest advice is to spend less time memorizing TFLOPS numbers and more time understanding memory hierarchies. Learn to read Nsight Compute profiles. Understand what a warp stall looks like and why it happens. Get comfortable with shared memory tiling and async copies. These skills transfer across every NVIDIA architecture, and they become more important, not less, as the hardware gets faster.
The trend is clear: NVIDIA is building increasingly specialized hardware for transformer workloads (Transformer Engine, TMA, Thread Block Clusters for distributed attention). If you are writing kernels for LLM inference or training, learning to exploit these features is not optional. The gap between a naive kernel and one that uses the hardware properly is easily 5 to 10x on Hopper, and it will only get wider on Blackwell.
The architectures keep getting more complex, but the fundamental question stays the same: am I compute bound or memory bound, and what do I do about it? Everything else follows from there.