The Problem
Every team writing GPU kernels has the same ritual. You write a Triton or CUDA kernel, call torch.allclose with some copy-pasted tolerance values, get a boolean back, and call it a day. If it fails, you stare at False with no idea what went wrong or where. If it passes, you have no idea whether your tolerances were just loose enough to hide real bugs. I got tired of this, so I built gpucheck: a pytest plugin designed specifically for GPU kernel testing.
What It Does
The core idea is dtype-aware assertions. When you call assert_close(result, expected), gpucheck auto-selects tolerances based on the actual dtype of the tensors. float32, float16, bfloat16, float8_e5m2, and float8_e4m3fn all have different mantissa widths and therefore different numerical precision. Using the same atol and rtol for all of them is not testing, it is theater. gpucheck's tolerance tables are informed by failure analysis across hundreds of configurations.
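To make that concrete, here is a minimal sketch of dtype-keyed tolerance selection. The table values and the helper name are placeholders of my own, not gpucheck's actual internals:

import torch

# Illustrative per-dtype tolerances; gpucheck's real tables come from its
# failure analysis and will differ from these placeholder numbers.
# The float8 entries assume a recent PyTorch that defines those dtypes.
_TOLERANCES = {
    torch.float32:       dict(rtol=1.3e-6, atol=1e-5),
    torch.float16:       dict(rtol=1e-3,   atol=1e-4),
    torch.bfloat16:      dict(rtol=1.6e-2, atol=1e-4),
    torch.float8_e5m2:   dict(rtol=1e-1,   atol=1e-2),
    torch.float8_e4m3fn: dict(rtol=6e-2,   atol=1e-2),
}

def assert_close_sketch(result, expected):
    # Pick tolerances from the result's dtype instead of hard-coding one pair.
    tols = _TOLERANCES[result.dtype]
    # Compare in float32 so low-precision dtypes (fp16/bf16/fp8) are handled uniformly.
    torch.testing.assert_close(result.float(), expected.float(), **tols)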
On top of that, three decorators handle parametric testing: @dtypes, @shapes, and @devices. Stack them on a single test function and they generate the full Cartesian product. Here is what it looks like in practice:
import torch
import pytest
from gpucheck import assert_close, dtypes, shapes, devices
@pytest.mark.gpu
@dtypes("float16", "bfloat16", "float32")
@shapes((128, 128), (371, 17), (1024, 1024))
@devices("cuda:0")
def test_softmax_kernel(dtype, shape, device):
    x = torch.randn(shape, dtype=dtype, device=device)
    result = my_triton_softmax(x)        # the kernel under test, defined elsewhere
    expected = torch.softmax(x, dim=-1)  # PyTorch reference
    assert_close(result, expected)
Nine configurations from one function. Scale to five dtypes, ten shapes, and two devices and you get 100 tests with zero copy-paste. The decorators plug directly into pytest's parametrize machinery, so you keep proper test IDs, selective reruns, and all the pytest features you already know.
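If you are curious how decorators like these can sit on top of pytest, a rough reimplementation built directly on pytest.mark.parametrize might look like this. The dtype map and ID choices are my own guesses, not gpucheck's code:

import pytest
import torch

# Map the string names used in tests to actual torch dtypes.
_DTYPE_MAP = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
}

def dtypes(*names):
    # Resolve names to torch dtypes and parametrize the "dtype" argument.
    return pytest.mark.parametrize(
        "dtype", [_DTYPE_MAP[n] for n in names], ids=list(names)
    )

def shapes(*dims):
    # Each shape becomes one parametrized case with a readable test ID.
    return pytest.mark.parametrize("shape", list(dims), ids=[str(d) for d in dims])

def devices(*devs):
    return pytest.mark.parametrize("device", list(devs))

Stacked parametrize marks compose multiplicatively, so pytest itself generates the Cartesian product and readable per-case IDs for free.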
Rich Error Reports
When an assertion fails, you do not get a boolean. You get a Rich-formatted mismatch report: max absolute error, max relative error, mean error, percentage of elements exceeding tolerance, an ASCII histogram of the error distribution, and the exact index of the worst element. That is actionable debugging information instead of a naked False.
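As a sketch of the statistics involved (not gpucheck's actual report code), the numbers in such a report can be computed in a few lines:

import torch

def mismatch_stats(result, expected, atol, rtol):
    # Summarize a failed comparison the way a mismatch report might.
    r, e = result.float(), expected.float()
    abs_err = (r - e).abs()
    rel_err = abs_err / e.abs().clamp_min(1e-12)
    exceeds = abs_err > (atol + rtol * e.abs())
    return {
        "max_abs_err": abs_err.max().item(),
        "max_rel_err": rel_err.max().item(),
        "mean_abs_err": abs_err.mean().item(),
        "pct_exceeding": 100.0 * exceeds.float().mean().item(),
        # N-D index of (the first) worst-offending element.
        "worst_index": torch.nonzero(abs_err == abs_err.max())[0].tolist(),
    }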
Finding Real Bugs
I ran gpucheck against Triton's official tutorials with 511 test configurations. It found 8 real bugs. The most severe: Triton's layer normalization kernel produces 83% error when given non-power-of-2 dimensions. Not a rounding issue, a correctness bug. I reported it as triton#9838. Without shape parametrization covering non-standard dimensions, this would have stayed hidden in tutorial code that thousands of developers copy into production.
Second notable find: FP16 accumulation drift in Triton's matmul tutorial. Accumulating partial products in float16 instead of float32 causes errors to compound across the K dimension. Reported as triton#9839. This is exactly the bug that loose tolerances hide at small sizes but that blows up at scale.
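You can see the same mechanism outside Triton with a plain PyTorch dot product. This is an illustration of the accumulation effect, not the tutorial kernel itself:

import torch

torch.manual_seed(0)
K = 4096
# Inputs stored in float16, as a matmul kernel would receive them.
a = torch.randn(K).half()
b = torch.randn(K).half()

reference = torch.dot(a.double(), b.double())   # high-precision reference

acc16 = torch.zeros((), dtype=torch.float16)    # accumulator kept in fp16
acc32 = torch.zeros((), dtype=torch.float32)    # accumulator kept in fp32
for p in a * b:
    acc16 = acc16 + p            # rounded back to fp16 after every add
    acc32 = acc32 + p.float()    # rounding does not compound the same way

print("fp16 accumulation error:", (acc16.double() - reference).abs().item())
print("fp32 accumulation error:", (acc32.double() - reference).abs().item())

The fp16 accumulator drifts far more than the fp32 one, and the gap widens as K grows, which is exactly what a tolerance loose enough to pass at small sizes fails to catch.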
The Full Toolkit
Beyond assertions and parametrization, gpucheck includes a gpu_benchmark fixture for CUDA event timing with warmup and statistical summaries, a memory_tracker fixture that catches VRAM leaks between tests, and fuzz_shapes for generating random valid tensor dimensions within constraints. The bugs that matter are almost never at neat powers of 2. They are at (371, 17) or (1, 8193).
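To give a flavor of the timing side, here is a minimal sketch of CUDA-event timing with warmup, roughly the pattern a gpu_benchmark-style fixture would use under the hood. The function name and return format are illustrative, not gpucheck's API:

import torch

def cuda_time(fn, *args, warmup=10, iters=100):
    # Warm up so compilation and cache effects do not pollute the measurement.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        end.synchronize()                         # wait for the kernel to finish
        times_ms.append(start.elapsed_time(end))  # elapsed time in milliseconds

    times_ms.sort()
    return {
        "min_ms": times_ms[0],
        "median_ms": times_ms[len(times_ms) // 2],
        "max_ms": times_ms[-1],
    }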
gpucheck is on PyPI. pip install gpucheck and you are set. It is a pytest plugin, so fixtures register automatically. The tolerance tables, the error reporting, the parametric decorators, all of it is available out of the box. If you write GPU kernels, stop guessing tolerances and start testing properly.