Diffusion, Flow and Stochastic Interpolants

Summary

The generative modeling landscape has evolved from time-discretized diffusion (DDPM) to continuous score-based models, and more recently to Flow Matching and Stochastic Interpolants. These frameworks are all connected: they all estimate the same instantaneous velocity field that transports one probability distribution into another. Once the connection is clear, we can use the unifying stochastic interpolant framework to understand and interpret parameterizations like x-space, v-space and ϵ-space, and to dispel myths such as "we estimate noise in diffusion models."

Diffusion Model

Let’s start from the score-based generative modeling paper [2]. Song et al. demonstrated that DDPM [1] is a time-discretized form of a continuous probability evolution. Using stochastic differential equations (SDEs), we can generalize this evolution to continuous time. We begin with a data distribution $p_0$ and make it noisy through an SDE:

(1) $dx = -\tfrac{\beta}{2}\, x\, dt + \sqrt{\beta}\, dW$

where $dW$ is standard Brownian motion. For simplicity, I set $\beta$ to a constant over time $t$. The beauty here is that the reverse SDE is also analytically tractable:

(2) $dx = \left(-\tfrac{\beta}{2}\, x - \beta\, \nabla_x \log p_t(x)\right) dt + \sqrt{\beta}\, d\tilde{W}$

This reverse SDE paves the way to generate data from noise. The forward SDE goes from the data distribution to noise, while the reverse goes from noise to data. So, if we can model the score function $\nabla_x \log p_t(x)$, then we have all the terms needed to build a model that generates the data distribution from pure noise.
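To make the forward process concrete, here is a minimal numpy sketch (β, the time horizon, and the toy 1-D "data" distribution are all my own illustrative choices) that simulates equation (1) with Euler–Maruyama and checks that the marginal drifts toward a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, n_steps, n_samples = 1.0, 8.0, 800, 10_000
dt = T / n_steps

# Toy 1-D "data" distribution: a narrow Gaussian centered at 3
x = 3.0 + 0.1 * rng.standard_normal(n_samples)

# Euler-Maruyama discretization of dx = -(beta/2) x dt + sqrt(beta) dW
for _ in range(n_steps):
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(n_samples)

# After enough time the marginal is (almost) N(0, 1), regardless of the data
print(round(x.mean(), 2), round(x.std(), 2))
```

Whatever distribution we start from, the forward SDE washes it out into the same Gaussian prior, which is exactly what makes it a usable starting point for generation.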

The Probability Flow ODE

A surprising and crucial result is the existence of a reverse ordinary differential equation (ODE) that corresponds to the same marginal probability densities as the SDE:

(3) $dx = \left(-\tfrac{\beta}{2}\, x - \tfrac{\beta}{2}\, \nabla_x \log p_t(x)\right) dt$

Equations (1), (2) and (3) satisfy the same Fokker–Planck equation. What that means is that the forward SDE, the reverse SDE and the reverse ODE all share the same marginal distribution $p_t$ for all $t$. When I was new to diffusion models, it was unclear to me why that is the case. If we compare equations (2) and (3), they differ by a $\tfrac{\beta}{2}\, \nabla_x \log p_t(x)$ term and the stochastic term. The equivalence implies that the stochasticity of the Brownian motion is, in effect, canceled by that extra $\tfrac{\beta}{2}\, \nabla_x \log p_t(x)$ in the drift. This is actually true, and I will talk about it in detail in another blog post. For now, let’s take it as a fact. So, we have a reverse ODE which looks like:

(4) $dx = -\tfrac{\beta}{2}\, (x + s)\, dt$

where $s(x,t) = \nabla_x \log p_t(x)$ is the score function and $v(x,t) = -\tfrac{\beta}{2}\, (x + s)$ is the velocity field of the probability flow. To unify the three frameworks, we need to show that we arrive at this exact equation in both the Flow Matching formulation and the Stochastic Interpolant formulation.
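For a Gaussian toy distribution the score is available in closed form, so we can integrate this ODE backwards and watch noise turn into data. The sketch below uses assumed toy values (β = 1, $p_0 = \mathcal{N}(2, 0.5^2)$, my choices); the analytic `score` function stands in for a learned network:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, n_steps = 1.0, 8.0, 2000
mu0, sig0 = 2.0, 0.5              # toy data distribution p0 = N(mu0, sig0^2)

def score(x, t):
    # Analytic score of p_t when p0 is Gaussian (stand-in for a trained model)
    a = np.exp(-0.5 * beta * t)
    var = a**2 * sig0**2 + (1.0 - a**2)
    return -(x - a * mu0) / var

dt = T / n_steps
x = rng.standard_normal(10_000)   # x_T ~ N(0, 1), approximately p_T
for i in range(n_steps):          # Euler steps on dx/dt = -(beta/2)(x + s),
    t = T - i * dt                # integrated backwards from t = T to t = 0
    x += 0.5 * beta * (x + score(x, t)) * dt

print(round(x.mean(), 2), round(x.std(), 2))  # close to (mu0, sig0)
```

Note there is no randomness after the initial draw: the deterministic probability-flow ODE alone transports the noise samples back to (approximately) the data distribution.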

Flow Matching

The idea of flow matching is to find the velocity field that transports a standard normal distribution to the data distribution. To do this, one first constructs a flow map that carries noise $\epsilon$ to a sample $x_t$ (conditioned on a data point $x_0$):

$x_t = \psi_t(\epsilon) = a(t)\, x_0 + b(t)\, \epsilon$

Then the conditional velocity field at any time $t$ for a sample $x$ is given by the time derivative of the flow map:

$u_t(x \mid x_0) = \tfrac{d}{dt}\psi_t(\epsilon) = \dot{a}(t)\, x_0 + \dot{b}(t)\, \epsilon = \dot{a}(t)\, x_0 + \tfrac{\dot{b}(t)}{b(t)}\left(x - a(t)\, x_0\right)$

The average velocity for the whole distribution at time $t$ is obtained by taking the expectation over $x_0$, conditioned on $x_t = x$:

(5) $v(x,t) = \mathbb{E}\left[u_t(x \mid x_0) \mid x_t = x\right] = \tfrac{\dot{b}(t)}{b(t)}\left(x - a(t)\, \mathbb{E}[x_0 \mid x_t = x]\right) + \dot{a}(t)\, \mathbb{E}[x_0 \mid x_t = x]$
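We can sanity-check equation (5) numerically. The sketch below assumes a linear path $a(t) = 1 - t$, $b(t) = t$ and 1-D Gaussian data (my illustrative choices): it estimates the conditional average of $u_t$ by a kernel-weighted Monte Carlo average and compares it to the closed-form expression.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sig0, t = 1.0, 0.8, 0.4
a, b, adot, bdot = 1 - t, t, -1.0, 1.0   # linear path x_t = (1-t) x0 + t eps

# Sample (x0, eps) pairs and their conditional velocities u = adot*x0 + bdot*eps
n = 1_000_000
x0 = mu0 + sig0 * rng.standard_normal(n)
eps = rng.standard_normal(n)
xt = a * x0 + b * eps
u = adot * x0 + bdot * eps

# Monte Carlo estimate of E[u | x_t = x]: average u over pairs whose
# interpolant lands near the query point x (narrow Gaussian kernel)
x = 0.7
w = np.exp(-0.5 * ((xt - x) / 0.01) ** 2)
v_mc = (w * u).sum() / w.sum()

# Closed form for Gaussian data: E[x0 | x_t = x] from the joint Gaussian,
# then plug into equation (5)
var_t = a**2 * sig0**2 + b**2
x0_hat = mu0 + a * sig0**2 / var_t * (x - a * mu0)
v_exact = (bdot / b) * (x - a * x0_hat) + adot * x0_hat

print(v_mc, v_exact)  # the two estimates agree
```

The kernel average is exactly the "average over all (x0, ϵ) pairs that land at x" that the conditional expectation in (5) describes.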

Stochastic Interpolants

Another seminal idea is that of stochastic interpolants, due to Albergo et al. [7], which establishes a unified analysis generalizing diffusion and flow methods.

The Stochastic interpolant recipe is as follows:

  1. Express $x_t = a(t)\, x_0 + b(t)\, \epsilon$
  2. Velocity: $v(x,t) = \dot{a}(t)\, \mathbb{E}[x_0 \mid x_t = x] + \dot{b}(t)\, \mathbb{E}[\epsilon \mid x_t = x]$
  3. Score: $s(x,t) = -b(t)^{-1}\, \mathbb{E}[\epsilon \mid x_t = x]$

The first thing I want to do here is to show that we obtain the exact same velocity field that we obtained before from Flow Matching.

Stochastic Interpolant is equivalent to Flow Matching

(6) $v(x,t) = \dot{a}(t)\, \mathbb{E}[x_0 \mid x_t = x] + \dot{b}(t)\, \mathbb{E}[\epsilon \mid x_t = x]$

Taking the conditional expectation given $x_t = x$ on both sides of the interpolant:

$\mathbb{E}[x_t \mid x_t = x] = a(t)\, \mathbb{E}[x_0 \mid x_t = x] + b(t)\, \mathbb{E}[\epsilon \mid x_t = x]$

(7) $x = a(t)\, \mathbb{E}[x_0 \mid x_t = x] + b(t)\, \mathbb{E}[\epsilon \mid x_t = x]$

Substituting $\mathbb{E}[\epsilon \mid x_t = x] = \tfrac{1}{b(t)}\left(x - a(t)\, \mathbb{E}[x_0 \mid x_t = x]\right)$ into (6), we obtain:

$v(x,t) = \tfrac{\dot{b}(t)}{b(t)}\left(x - a(t)\, \mathbb{E}[x_0 \mid x_t = x]\right) + \dot{a}(t)\, \mathbb{E}[x_0 \mid x_t = x]$

We have shown that the velocity field derived from Flow Matching is exactly the same as that from the stochastic interpolant.
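The equivalence is easy to verify mechanically: for any coefficients and any estimate of $\mathbb{E}[x_0 \mid x_t = x]$, the constraint $x = a(t)\,\mathbb{E}[x_0 \mid x] + b(t)\,\mathbb{E}[\epsilon \mid x]$ forces the two velocity expressions to coincide. A quick check with arbitrary numbers (nothing here is specific to any particular schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, adot, bdot = 0.6, 0.4, -1.0, 1.0   # arbitrary coefficients at some t
x = rng.standard_normal(5)
x0_hat = rng.standard_normal(5)          # stand-in for E[x0 | x_t = x]
eps_hat = (x - a * x0_hat) / b           # forced by x = a*E[x0|x] + b*E[eps|x]

v_interp = adot * x0_hat + bdot * eps_hat             # interpolant form (6)
v_fm = (bdot / b) * (x - a * x0_hat) + adot * x0_hat  # flow-matching form

print(np.allclose(v_interp, v_fm))  # True
```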

Stochastic Interpolant is equivalent to Diffusion Model

Expressing $x_t$ as an interpolant

For a linear SDE of the form $dx = F x\, dt + L\, dW$, the solution $x_t$ is Gaussian. Its mean $m(t)$ and covariance $P(t)$ evolve as follows [6]:

(8) $m(t) = e^{F(t - t_0)}\, m(t_0)$

(9) $P(t) = e^{F(t - t_0)}\, P(t_0)\, e^{F^\top (t - t_0)} + \int_{t_0}^{t} e^{F(t-u)}\, L Q L^\top\, e^{F^\top (t-u)}\, du$

where $Q$ is the spectral density of the driving white noise ($Q = I$ for standard Brownian motion).

Intuition

What these equations mean is: suppose you start from a Gaussian distribution with mean $m(t_0)$ and covariance $P(t_0)$, and the system follows the linear SDE above. Then at any time $t$, the state is Gaussian with the mean and covariance given above.

For our case, since we start from a single sample, $m(t_0) = x_0$ and $P(t_0) = 0$, with $F = -\tfrac{\beta}{2} I$ and $L = \sqrt{\beta}\, I$.

Plugging these in, we obtain:

$m(t) = x_0\, e^{-\beta t / 2}$

$P(t) = \left(\int_0^t \beta\, e^{-\beta (t-u)}\, du\right) I = \left(1 - e^{-\beta t}\right) I$

Since $x_t$ is Gaussian distributed, a sample can be written as:

(10) $x_t = e^{-\beta t / 2}\, x_0 + \sqrt{1 - e^{-\beta t}}\; \epsilon = a(t)\, x_0 + b(t)\, \epsilon$

Thus, we have written $x_t$ in interpolant form, with $a(t) = e^{-\beta t / 2}$ and $b(t) = \sqrt{1 - e^{-\beta t}}$.
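A quick simulation confirms these coefficients: starting every trajectory from the same point $x_0$ and running the forward SDE, the empirical mean and standard deviation match $a(t)\, x_0$ and $b(t)$. (The values of β, $x_0$ and $t$ below are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
beta, t_end, n_steps, n = 1.0, 2.0, 1000, 50_000
dt = t_end / n_steps

x0 = 1.5
x = np.full(n, x0)            # every trajectory starts at the same sample
for _ in range(n_steps):      # Euler-Maruyama on dx = -(beta/2)x dt + sqrt(beta) dW
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(n)

a_t = np.exp(-0.5 * beta * t_end)          # predicted a(t)
b_t = np.sqrt(1 - np.exp(-beta * t_end))   # predicted b(t)
print(round(x.mean(), 2), round(a_t * x0, 2))  # empirical vs predicted mean
print(round(x.std(), 2), round(b_t, 2))        # empirical vs predicted std
```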

Calculating the velocity field

Substituting the score identity

(11) $\mathbb{E}[\epsilon \mid x_t = x] = -b(t)\, s(x,t)$

into the velocity field $v(x,t) = \dot{a}(t)\, \mathbb{E}[x_0 \mid x_t = x] + \dot{b}(t)\, \mathbb{E}[\epsilon \mid x_t = x]$, we obtain:

(12) $v(x,t) = \dot{a}(t)\, \mathbb{E}[x_0 \mid x_t = x] - \dot{b}(t)\, b(t)\, s(x,t)$

Now, we compute the time derivatives.

Let’s introduce a new variable $\tau = e^{-\beta t / 2}$. Then,

$a(t) = \tau, \qquad b(t) = \sqrt{1 - \tau^2}$

With some algebra, we obtain:

$\dot{a}(t) = -\tfrac{\beta}{2}\, \tau, \qquad \dot{b}(t)\, b(t) = \tfrac{\beta}{2}\, \tau^2$
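These derivatives are easy to get wrong, so here is a central finite-difference check (β and t are arbitrary values):

```python
import numpy as np

beta, t, h = 0.7, 1.3, 1e-6
a = lambda s: np.exp(-0.5 * beta * s)           # a(t) = exp(-beta t / 2)
b = lambda s: np.sqrt(1 - np.exp(-beta * s))    # b(t) = sqrt(1 - exp(-beta t))
tau = a(t)

adot = (a(t + h) - a(t - h)) / (2 * h)   # central finite differences
bdot = (b(t + h) - b(t - h)) / (2 * h)

print(np.isclose(adot, -0.5 * beta * tau))           # adot   = -(beta/2) tau
print(np.isclose(bdot * b(t), 0.5 * beta * tau**2))  # bdot*b =  (beta/2) tau^2
```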

Therefore, the velocity field is given by:

(13) $v(x,t) = -\tfrac{\beta}{2}\, \tau \left( \mathbb{E}[x_0 \mid x_t = x] + \tau\, s \right)$

There is another critical equation that helps us express the equivalence. Taking the conditional expectation given $x_t = x$ on both sides of the interpolant:

$\mathbb{E}[x_t \mid x_t = x] = a(t)\, \mathbb{E}[x_0 \mid x_t = x] + b(t)\, \mathbb{E}[\epsilon \mid x_t = x]$

(14) $x = a(t)\, \mathbb{E}[x_0 \mid x_t = x] + b(t)\, \mathbb{E}[\epsilon \mid x_t = x]$

Substituting

$\mathbb{E}[x_0 \mid x_t = x] = \dfrac{x - b(t)\, \mathbb{E}[\epsilon \mid x_t = x]}{a(t)}$

into (13), and using $\mathbb{E}[\epsilon \mid x_t = x] = -b(t)\, s$ together with $b(t)^2 + \tau^2 = 1$, we obtain:

(15) $v(x,t) = -\tfrac{\beta}{2}\, (x + s)$
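As a numerical sanity check on the whole derivation, we can take a Gaussian $p_0$ (where both conditional expectations and the score have closed forms; the constants below are my toy choices) and verify that the interpolant velocity equals $-\tfrac{\beta}{2}(x + s)$ at arbitrary points:

```python
import numpy as np

beta, t, mu0, sig0 = 1.0, 0.9, 2.0, 0.5   # toy Gaussian data p0 = N(mu0, sig0^2)
x = np.linspace(-3.0, 3.0, 7)

a = np.exp(-0.5 * beta * t)               # a(t) = tau
b = np.sqrt(1 - a**2)                     # b(t)
adot = -0.5 * beta * a
bdot = 0.5 * beta * a**2 / b

# Closed-form conditionals for Gaussian data
var_t = a**2 * sig0**2 + b**2
x0_hat = mu0 + a * sig0**2 / var_t * (x - a * mu0)   # E[x0  | x_t = x]
eps_hat = (x - a * x0_hat) / b                        # E[eps | x_t = x]
s = -eps_hat / b                                      # score

v_interp = adot * x0_hat + bdot * eps_hat   # stochastic-interpolant velocity
v_ode = -0.5 * beta * (x + s)               # probability-flow ODE velocity
print(np.allclose(v_interp, v_ode))         # True
```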

Conclusion

Flow Matching, Stochastic Interpolants, and diffusion models all target the same underlying velocity field.

Myth: We are estimating noise ϵ

One of the most common misconceptions in the diffusion literature is that we estimate the noise ϵ during training and then go from noise to image by subtracting the estimated noise. This gives the impression that we can subtract noise from a noisy image to obtain a clean image. That is quite incorrect.

Yes, we are averaging over noise values, but not arbitrary ones. The meaning of $\mathbb{E}[\epsilon \mid x_t = x]$ is that we average over only those noise values that lead to the construction of that precise $x_t = x$. So, first, it is an average of noise, and second, not just any average: there is a special condition filtering which noise values contribute. Therefore, our estimated ϵ is very different from the noise.

A more accurate interpretation is that we are estimating a scaled (negative) score, $\mathbb{E}[\epsilon \mid x_t = x] = -b(t)\, \nabla_x \log p_t(x)$, and that is very closely related to the velocity along which we need to move to reach an image from noise. So, the whole point of diffusion modeling is to identify the distribution-level velocity field, such that we can make incremental movements along those directions until we reach the final destination: restored images.

Prediction in x0 space, ϵ space and v space

In the earlier diffusion days, papers like DDIM [3] and others introduced the concept of directly predicting $x_0$ from $x_t$. Somehow, directly predicting $x_0$, adding noise to bring it back to an intermediate time, and then predicting $x_0$ again produced better samples. However, it was hard for me to understand what exactly was going on.

The idea was something like this. Since $x_t = a(t)\, x_0 + b(t)\, \epsilon$, we can write $x_0 = \tfrac{x_t - b(t)\, \epsilon}{a(t)}$. Now, we replace ϵ with the $\epsilon_\theta$ obtained from training, use this $\hat{x}_0$ to construct a noisy $x_t$ at a slightly earlier (less noisy) time, and repeat. If you used an auxiliary network to estimate $x_0$, you could generate better samples. Since $\epsilon_\theta$ is not noise, it was not clear to me what $\hat{x}_0$ was.
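To see the mechanics, here is a runnable sketch of that $x_0$-prediction loop on Gaussian toy data, where the exact $\mathbb{E}[x_0 \mid x_t = x]$ stands in for the trained network (all constants are my own choices; a real sampler would call $x_\theta$ or $\epsilon_\theta$ here):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, mu0, sig0 = 1.0, 2.0, 0.5   # toy data p0 = N(mu0, sig0^2)

def coeffs(t):
    a = np.exp(-0.5 * beta * t)
    return a, np.sqrt(1 - a**2)

def x0_pred(x, t):
    # Stand-in for the trained network: exact E[x0 | x_t = x] for Gaussian data
    a, b = coeffs(t)
    var_t = a**2 * sig0**2 + b**2
    return mu0 + a * sig0**2 / var_t * (x - a * mu0)

# Deterministic DDIM-style loop: predict x0, then re-noise to an earlier time
ts = np.linspace(8.0, 1e-3, 1000)
x = rng.standard_normal(50_000)           # start from pure noise
for t_cur, t_next in zip(ts[:-1], ts[1:]):
    a, b = coeffs(t_cur)
    x0_hat = x0_pred(x, t_cur)
    eps_hat = (x - a * x0_hat) / b        # the eps implied by the x0 prediction
    a2, b2 = coeffs(t_next)
    x = a2 * x0_hat + b2 * eps_hat        # jump to the less-noisy time

print(round(x.mean(), 1), round(x.std(), 1))  # near (mu0, sig0)
```

Seen through the interpolant lens, each step re-expresses the current point through the model's conditional expectations and moves it one notch along the same probability-flow trajectory as before.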

There was another paper, on progressive distillation [4] of diffusion models, which introduced the idea of training in v-space, where v represents velocity, defined as the time derivative of $x_t$. In their case, they parameterized $x_t = \cos(\phi)\, x_0 + \sin(\phi)\, \epsilon$ and used $v = \tfrac{d}{d\phi} x_t$. Their ϕ is a reparameterization of time $t$, after which their definition of v-space coincides with the velocity we have been discussing.

The latest paper that talks about x-space prediction is Just Image Transformers (JiT) [5]. They use equations like these:

$x_\theta = \mathrm{net}_\theta(x_t), \qquad x_t = t\, x_\theta + (1 - t)\, \epsilon_\theta, \qquad v_\theta = x_\theta - \epsilon_\theta$

Without the theory of stochastic interpolants, it is hard to interpret these equations. With it, we can begin to explain the core ideas behind these space transformations. The key idea is that we can take the conditional expectation of the interpolant equation $x_t = a(t)\, x_0 + b(t)\, \epsilon$, which gives us

$\mathbb{E}[x_t \mid x_t = x] = a(t)\, \mathbb{E}[x_0 \mid x_t = x] + b(t)\, \mathbb{E}[\epsilon \mid x_t = x]$

$x = a(t)\, x_\theta + b(t)\, \epsilon_\theta$, which is true because of the linearity of conditional expectation. In light of this, we can safely say the quantities that the model learns are actually

(16) $x_\theta = \mathbb{E}[x_0 \mid x_t = x], \qquad \epsilon_\theta = \mathbb{E}[\epsilon \mid x_t = x]$

i.e., the expected $x_0$ and the expected ϵ. We take this together with the definition of the velocity field for interpolants:

$v = \dot{a}(t)\, x_\theta + \dot{b}(t)\, \epsilon_\theta$

That unifies all the v-space, x-space, and ϵ-space modeling from DDIM to JiT.
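One practical consequence of this unification: at a known $(x, t)$, any one of $x_\theta$, $\epsilon_\theta$, $v_\theta$ (or the score) determines the others through the two linear equations $x = a\, x_\theta + b\, \epsilon_\theta$ and $v = \dot{a}\, x_\theta + \dot{b}\, \epsilon_\theta$. A small sketch with arbitrary coefficients (pretend $x_\theta$ is a network output):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, adot, bdot = 0.3, 0.7, -1.0, 1.0   # interpolant coefficients at some t
x = rng.standard_normal(5)
x_theta = rng.standard_normal(5)              # pretend x-space network output
eps_theta = (x - a * x_theta) / b             # eps-space, from x = a*x0 + b*eps
v_theta = adot * x_theta + bdot * eps_theta   # v-space
s_theta = -eps_theta / b                      # score

# Going the other way: recover (x_theta, eps_theta) from (x, v_theta) by
# solving the 2x2 linear system [a b; adot bdot] [x0; eps] = [x; v]
det = a * bdot - b * adot
x_rec = (bdot * x - b * v_theta) / det
e_rec = (a * v_theta - adot * x) / det
print(np.allclose(x_rec, x_theta), np.allclose(e_rec, eps_theta))  # True True
```

So the choice of prediction space changes the regression target and its conditioning, but not the information the network ultimately carries.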


References

  1. Ho, J., Jain, A. and Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, pp.6840-6851.

  2. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. and Poole, B., 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations.

  3. Song, J., Meng, C. and Ermon, S., 2021. Denoising Diffusion Implicit Models. In International Conference on Learning Representations.

  4. Salimans, T. and Ho, J., 2022. Progressive Distillation for Fast Sampling of Diffusion Models. In International Conference on Learning Representations.

  5. Li, T. and He, K., 2025. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720.

  6. Särkkä, S. and Solin, A. (2019) Applied Stochastic Differential Equations. Cambridge: Cambridge University Press.

  7. Albergo, M., Boffi, N.M. and Vanden-Eijnden, E., 2025. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209), pp.1-80.

Sandesh Ghimire
Staff Researcher and Engineer
