The Benchmark
There are a lot of sparse attention papers. Each claims speedups on different models, different sequence lengths, different metrics. Comparing them fairly is nearly impossible from reading papers alone. So I built a unified benchmark that evaluates three sparse attention methods as drop-in replacements for Llama's attention layer: HASTE v2, BLASST, and SparQ. Same model, same data, same hardware, same evaluation pipeline. No more apples-to-oranges comparisons.
The evaluation covers three axes: accuracy on reasoning benchmarks (MATH500, AIME, GPQA), perplexity on language modeling (Wikipedia, C4), and latency across sequence lengths from 1K to 128K tokens. Each method is tested as a literal attention layer swap. The Llama model loads normally, and the attention module gets replaced with the sparse variant. Everything else stays identical: weights, KV cache layout, sampling parameters.
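To make the swap concrete, here is a minimal sketch of what it looks like against a Hugging Face Llama checkpoint. The SparseAttention class, its from_dense constructor, and the checkpoint name are illustrative placeholders, not the benchmark's actual API; the only assumption is that the replacement module exposes the same forward signature as LlamaAttention.

```python
# Minimal sketch of the attention-layer swap, assuming a hypothetical
# SparseAttention module that mirrors LlamaAttention's forward signature.
from transformers import AutoModelForCausalLM

def swap_attention(model, make_sparse_attn):
    """Replace every self-attention module in a Llama model with a sparse variant."""
    for layer in model.model.layers:
        # The factory receives the dense module so it can copy its weights,
        # config, and layer index into the sparse replacement.
        layer.self_attn = make_sparse_attn(layer.self_attn)
    return model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# model = swap_attention(model, SparseAttention.from_dense)  # hypothetical constructor
```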
The Three Methods
HASTE v2 is my own method, building on the original HASTE cascade filter. The v2 additions are Quest-style interval arithmetic for tighter block importance bounds, SVD residual bounds that exploit low-rank structure in the attention matrix, and an entropy-guided budget that allocates more compute to high-entropy heads (where attention is spread out) and less to low-entropy heads (where attention is concentrated). The entropy guidance is the key insight: not all heads need the same sparsity budget, and the optimal budget is predictable from the softmax entropy of recent tokens.
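As a sketch of the entropy-guided allocation, here is one way per-head budgets could be derived from the softmax entropy of a recent query window. The function name, tensor shapes, and the proportional allocation rule are assumptions for illustration, not HASTE v2's exact implementation.

```python
import torch

def entropy_guided_budgets(recent_attn: torch.Tensor, total_budget: int,
                           min_blocks: int = 1) -> torch.Tensor:
    """Split a total KV-block budget across heads in proportion to entropy.

    recent_attn: [num_heads, num_recent_queries, kv_len] attention probabilities
    total_budget: total number of KV blocks to distribute across all heads
    """
    # Mean softmax entropy per head over the recent query window.
    probs = recent_attn.clamp_min(1e-9)
    entropy = -(probs * probs.log()).sum(dim=-1).mean(dim=-1)  # [num_heads]

    # High-entropy heads (diffuse attention) get a larger share of the budget;
    # low-entropy heads (peaked attention) get by with fewer blocks.
    weights = entropy / entropy.sum()
    budgets = (weights * total_budget).floor().long().clamp_min(min_blocks)
    return budgets
```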
BLASST (Block-Level Approximate Sparse Self-attention with Thresholding) uses a block-level scoring function based on accumulated query-key statistics. It maintains running statistics per KV block and thresholds blocks based on their historical contribution. It is simpler than HASTE but effective at moderate sparsity levels.
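The description above leaves the exact statistic open, so the sketch below uses a simple exponential moving average of per-block attention mass as a stand-in for BLASST's accumulated query-key statistics; the class, its parameters, and the threshold value are hypothetical.

```python
import torch
import torch.nn.functional as F

class BlockScoreTracker:
    """Illustrative running score per KV block: an EMA of the attention mass
    each block received on recent queries. Blocks whose score falls below the
    threshold are skipped on the next step."""

    def __init__(self, max_blocks: int, decay: float = 0.9, threshold: float = 0.01):
        self.scores = torch.ones(max_blocks)
        self.decay = decay
        self.threshold = threshold

    def update(self, attn_probs: torch.Tensor, block_size: int) -> None:
        # attn_probs: [num_queries, kv_len] probabilities from the latest step.
        kv_len = attn_probs.shape[-1]
        num_blocks = (kv_len + block_size - 1) // block_size
        padded = F.pad(attn_probs, (0, num_blocks * block_size - kv_len))
        block_mass = padded.view(-1, num_blocks, block_size).sum(-1).mean(0)
        self.scores[:num_blocks] = (self.decay * self.scores[:num_blocks]
                                    + (1 - self.decay) * block_mass)

    def active_blocks(self) -> torch.Tensor:
        # Indices of KV blocks whose running contribution clears the threshold.
        return (self.scores >= self.threshold).nonzero(as_tuple=True)[0]
```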
SparQ takes a different approach: it sparsifies the query rather than the keys. Instead of selecting which KV blocks to attend to, it selects the dimensions of the query vector that contribute most to the attention scores and computes a low-rank approximation of the scores using only those dimensions. This is orthogonal to KV sparsity and in principle can be combined with it.
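A sketch of the idea for a single query vector, following the published SparQ recipe (top-r query dimensions, approximate scores, then exact attention over the top-k keys); the variable names and the two budget parameters r and k are illustrative.

```python
import torch

def sparq_select(q: torch.Tensor, K: torch.Tensor, r: int, k: int) -> torch.Tensor:
    """Pick the top-k keys for one query using only its r largest dimensions.

    q: [head_dim] query vector, K: [kv_len, head_dim] cached keys.
    """
    # 1. Keep the r query dimensions that dominate the dot products.
    top_dims = q.abs().topk(r).indices
    # 2. Approximate the attention scores using only those dimensions,
    #    so only a slice of the key cache has to be read.
    approx_scores = K[:, top_dims] @ q[top_dims]
    # 3. Select the k keys with the highest approximate scores; exact
    #    attention is then computed over just these keys and their values.
    return approx_scores.topk(k).indices
```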
Evaluation Results
On MATH500, HASTE v2 at 60% sparsity retains 99.2% of dense accuracy. BLASST retains 98.1% at 50% sparsity but degrades faster beyond that. SparQ holds 97.5% at 45% sparsity. On AIME (harder competition math), the gap widens: HASTE v2 loses 1.8% at 60% sparsity, while BLASST loses 4.2% and SparQ loses 5.7%. The reasoning-heavy benchmarks expose quality differences that perplexity alone does not capture.
Perplexity on Wikipedia and C4 tells a different story. All three methods are within 0.1 perplexity points of dense at their respective operating sparsities. Perplexity is not a sensitive enough metric to distinguish between these methods. This is why I included reasoning benchmarks: they reveal failure modes that aggregate language modeling metrics smooth over.
Latency is where things get practical. At 32K tokens, HASTE v2 at 60% sparsity gives 1.7x speedup over dense FlashAttention 2. BLASST at 50% gives 1.4x. SparQ at 45% gives 1.3x. At 128K tokens, HASTE v2 hits 2.3x because memory bandwidth savings scale with sequence length. The cascade overhead is amortized over more blocks, and the EMA temporal coherence becomes more effective with longer histories.
LaTeX Tables and Reproducibility
The benchmark suite generates publication-ready LaTeX tables automatically. Every run produces a JSON results file with raw numbers, and the table generator formats them into accuracy tables, perplexity tables, and latency comparison tables with proper standard deviations and confidence intervals. I built this because I was tired of manually formatting benchmark results every time I tweaked a parameter.
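As an illustration of the JSON-to-LaTeX step, here is a compact generator for an accuracy table. The field names (method, sparsity, accuracy_mean, accuracy_std) are an assumed schema, not necessarily the one the suite emits, and the output uses booktabs rules.

```python
import json

def results_to_latex(path: str) -> str:
    """Format a JSON results file (a list of per-method records) as a LaTeX table."""
    with open(path) as f:
        results = json.load(f)

    lines = [
        r"\begin{tabular}{lrr}",
        r"\toprule",
        r"Method & Sparsity & Accuracy \\",
        r"\midrule",
    ]
    for row in results:
        lines.append(
            f"{row['method']} & {row['sparsity'] * 100:.0f}\\% & "
            f"${row['accuracy_mean']:.1f} \\pm {row['accuracy_std']:.1f}$ \\\\"
        )
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)
```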
All configurations are specified in YAML. You define the model, the attention methods to compare, the sparsity budgets, the evaluation tasks, and the sequence lengths. One command runs everything and produces the full comparison. The idea is that anyone can reproduce the exact numbers or swap in their own method with a single config change. If you have a sparse attention method and want to see how it stacks up, the benchmark is ready for it.
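As a sketch of what the driver loop behind that one command might look like: the config keys below (model, methods, sparsities, tasks, seq_lens) are an assumption about the schema, not the benchmark's exact format.

```python
import itertools
import yaml  # pip install pyyaml

def run_benchmark(config_path: str) -> None:
    """Illustrative driver: sweep every (method, sparsity, task, length) combination."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    for method, sparsity, task, seq_len in itertools.product(
        cfg["methods"], cfg["sparsities"], cfg["tasks"], cfg["seq_lens"]
    ):
        print(f"evaluating {method} @ {sparsity:.0%} on {task} ({seq_len} tokens)")
        # load cfg["model"], swap in the method's attention module,
        # run the task, and append a record to the JSON results file
```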