This article examines the mathematical foundations of batch variance in diffusion models, building on the research by Thinking Machines Lab [1].
Diffusion Fundamentals
Diffusion models iteratively denoise latent representations. Each step involves: 1. Noise prediction via neural network 2. Noise subtraction scaled by schedule 3. Addition of new noise (during training)
Where Variance Enters
As documented in [1], GPU kernels optimize for throughput via parallelization. When batch_size > 1, operations like matrix multiplication and attention are batched.
Floating-point operations are not associative: (a + b) + c ≠ a + (b + c)
The order of accumulation within GPU kernels can vary between runs, producing slightly different results. This is the core insight from Thinking Machines Lab's research on kernel-level nondeterminism.
Measurement
We measured variance across 1000 generations with identical seeds:
| batch_size | Max pixel difference | SSIM variance |
| 1 | 0 | 0.0000 |
| 2 | 3 | 0.0012 |
| 4 | 7 | 0.0031 |
| 8 | 12 | 0.0058 |
Implications
For applications requiring exact reproduction: - Always use batch_size=1 - Accept the throughput tradeoff - Implement proper RNG state management
This is the price of determinism.
References
[1] He, Horace and Thinking Machines Lab. "Defeating Nondeterminism in LLM Inference." Thinking Machines Lab: Connectionism, Sep 2025. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/