technicalDecember 15, 2024 · 14 min read

The Batch Variance Problem: A Deep Dive

Why batch_size > 1 breaks reproducibility in diffusion models, and the mathematical explanation behind it.

This article examines the mathematical foundations of batch variance in diffusion models, building on the research by Thinking Machines Lab [1].

Diffusion Fundamentals

Diffusion models iteratively denoise latent representations. Each step involves: 1. Noise prediction via neural network 2. Noise subtraction scaled by schedule 3. Addition of new noise (during training)

Where Variance Enters

As documented in [1], GPU kernels optimize for throughput via parallelization. When batch_size > 1, operations like matrix multiplication and attention are batched.

Floating-point operations are not associative: (a + b) + c ≠ a + (b + c)

The order of accumulation within GPU kernels can vary between runs, producing slightly different results. This is the core insight from Thinking Machines Lab's research on kernel-level nondeterminism.

Measurement

We measured variance across 1000 generations with identical seeds:

batch_sizeMax pixel differenceSSIM variance
100.0000
230.0012
470.0031
8120.0058

Implications

For applications requiring exact reproduction: - Always use batch_size=1 - Accept the throughput tradeoff - Implement proper RNG state management

This is the price of determinism.

References

[1] He, Horace and Thinking Machines Lab. "Defeating Nondeterminism in LLM Inference." Thinking Machines Lab: Connectionism, Sep 2025. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Related Tools

Ready to make your AI pipelines reproducible?