
Libres pensées d'un mathématicien ordinaire Posts

Wasserstein distance between two Gaussians

Leonid Vitaliyevich Kantorovich (1912 – 1986)

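The $W_2$ Wasserstein coupling distance between two probability measures $\mu$ and $\nu$ on $\mathbb{R}^n$ is

$$W_2(\mu;\nu):=\inf\mathbb{E}(\|X-Y\|_2^2)^{1/2}$$

where the infimum runs over all random vectors $(X,Y)$ of $\mathbb{R}^n\times\mathbb{R}^n$ with $X\sim\mu$ and $Y\sim\nu$. It turns out that we have the following nice formula for $d:=W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))$:

$$d^2=\|m_1-m_2\|_2^2+\mathrm{Tr}\bigl(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\bigr).\qquad(1)$$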
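Formula (1) is easy to evaluate numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `psd_sqrt` and `w2_squared_gaussian` are illustrative and not part of any library, and the square roots are computed by eigendecomposition, which is one choice among several.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues from round-off
    return (V * np.sqrt(w)) @ V.T

def w2_squared_gaussian(m1, S1, m2, S2):
    """Squared W2 distance between N(m1, S1) and N(m2, S2), following formula (1)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    r1 = psd_sqrt(S1)                # Sigma_1^{1/2}
    cross = psd_sqrt(r1 @ S2 @ r1)   # (Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2}
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

# One-dimensional sanity check: N(0, 1) versus N(3, 4).
# Formula (1) gives 3**2 + (1 + 4 - 2 * 2) = 10, up to floating-point error.
print(w2_squared_gaussian([0.0], [[1.0]], [3.0], [[4.0]]))
```

In dimension one the covariance term reduces to $(\sigma_1-\sigma_2)^2$, which gives a quick way to check the implementation by hand.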

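This formula interested several authors including Givens and Shortt, Knott and Smith, Olkin and Pukelsheim, and Dowson and Landau. Note in particular that we have

$$\mathrm{Tr}\bigl((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\bigr)=\mathrm{Tr}\bigl((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\bigr).$$

In the commutative case where $\Sigma_1\Sigma_2=\Sigma_2\Sigma_1$, the formula (1) boils down simply to

$$W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))^2=\|m_1-m_2\|_2^2+\|\Sigma_1^{1/2}-\Sigma_2^{1/2}\|_{\mathrm{Frobenius}}^2.$$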
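The simplest commuting case is that of diagonal covariance matrices, where the Frobenius term is just a sum of squared differences of standard deviations. Here is a small sketch in plain Python under that assumption; the function name `w2_squared_diagonal` is hypothetical, and the inputs `var1`, `var2` are taken to be the diagonal entries of $\Sigma_1$ and $\Sigma_2$.

```python
import math

def w2_squared_diagonal(m1, var1, m2, var2):
    """Squared W2 distance between Gaussians with diagonal covariances.

    var1 and var2 hold the diagonal entries (variances) of Sigma_1 and Sigma_2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(m1, m2))
    # Squared Frobenius norm of Sigma_1^{1/2} - Sigma_2^{1/2} for diagonal matrices
    cov_term = sum((math.sqrt(u) - math.sqrt(v)) ** 2 for u, v in zip(var1, var2))
    return mean_term + cov_term

# Means differ by (3, 0); standard deviations are (1, 2) versus (2, 3).
print(w2_squared_diagonal([0.0, 0.0], [1.0, 4.0], [3.0, 0.0], [4.0, 9.0]))  # 11.0
```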

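To prove (1), one can first reduce to the centered case $m_1=m_2=0$. Next, if $(X,Y)$ is a random vector (Gaussian or not) of $\mathbb{R}^n\times\mathbb{R}^n$ with covariance matrix

$$\Gamma=\begin{pmatrix}\Sigma_1 & C\\ C^\top & \Sigma_2\end{pmatrix}$$

then the quantity

$$\mathbb{E}(\|X-Y\|_2^2)=\mathrm{Tr}(\Sigma_1+\Sigma_2-2C)$$

depends only on $\Gamma$. Also, when $\mu=\mathcal{N}(0,\Sigma_1)$ and $\nu=\mathcal{N}(0,\Sigma_2)$, one can restrict the infimum which defines $W_2$ to run over Gaussian laws $\mathcal{N}(0,\Gamma)$ on $\mathbb{R}^n\times\mathbb{R}^n$ with covariance matrix $\Gamma$ structured as above. The sole constraint on $C$ is the Schur complement constraint:

$$\Sigma_1-C\Sigma_2^{-1}C^\top\succeq0.$$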

The minimization of the function

$$C\mapsto-2\mathrm{Tr}(C)$$

under the constraint above leads to (1). A detailed proof is given by Givens and Shortt. Alternatively, one may find an optimal transportation map, as Knott and Smith do. It turns out that $\mathcal{N}(m_2,\Sigma_2)$ is the image law of $\mathcal{N}(m_1,\Sigma_1)$ under the affine map

$$x\mapsto m_2+A(x-m_1)$$

where

$$A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=A^\top.$$

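To check that this maps $\mathcal{N}(m_1,\Sigma_1)$ to $\mathcal{N}(m_2,\Sigma_2)$, say in the case $m_1=m_2=0$ for simplicity, one may define the random column vectors $X\sim\mathcal{N}(m_1,\Sigma_1)$ and $Y=AX$ and write

$$\mathbb{E}(YY^\top)=A\,\mathbb{E}(XX^\top)A^\top=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=\Sigma_2.$$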
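The covariance identity above can be checked numerically. This is a sketch only, assuming NumPy and an invertible $\Sigma_1$; the names `psd_sqrt` and `transport_matrix` and the two example covariance matrices are made up for illustration.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def transport_matrix(S1, S2):
    """A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}; assumes S1 is invertible."""
    r1 = psd_sqrt(S1)
    r1_inv = np.linalg.inv(r1)
    return r1_inv @ psd_sqrt(r1 @ S2 @ r1) @ r1_inv

# Two arbitrary positive definite matrices that do not commute (illustrative values).
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

A = transport_matrix(S1, S2)
pushforward_cov = A @ S1 @ A.T  # covariance of Y = A X when X has covariance S1

print(np.allclose(A, A.T))               # A is symmetric
print(np.allclose(pushforward_cov, S2))  # Y = A X has covariance Sigma_2
```

Both checks should hold up to floating-point tolerance, which mirrors the one-line computation of $\mathbb{E}(YY^\top)$.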

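To check that the map is optimal, one may use

$$\mathbb{E}(\|X-Y\|_2^2)=\mathbb{E}(\|X\|_2^2)+\mathbb{E}(\|Y\|_2^2)-2\mathbb{E}(\langle X,Y\rangle)=\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathbb{E}(\langle X,AX\rangle)=\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathrm{Tr}(\Sigma_1A)$$

and observe that by the cyclic property of the trace,

$$\mathrm{Tr}(\Sigma_1A)=\mathrm{Tr}\bigl((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\bigr).$$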
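As a numerical sanity check of the trace identity and of the resulting transport cost, here is a short sketch assuming NumPy, the centered case, and an invertible $\Sigma_1$. The matrices and variable names are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

# Illustrative positive definite covariances (centered case m1 = m2 = 0).
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

r1 = psd_sqrt(S1)
M_half = psd_sqrt(r1 @ S2 @ r1)       # (Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2}
r1_inv = np.linalg.inv(r1)
A = r1_inv @ M_half @ r1_inv          # matrix of the transport map

trace_lhs = np.trace(S1 @ A)          # Tr(Sigma_1 A)
trace_rhs = np.trace(M_half)          # Tr((Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2})

# Cost of the coupling (X, AX) versus the covariance part of formula (1).
coupling_cost = np.trace(S1) + np.trace(S2) - 2.0 * trace_lhs
formula_value = np.trace(S1 + S2 - 2.0 * M_half)

print(np.isclose(trace_lhs, trace_rhs), np.isclose(coupling_cost, formula_value))
```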

The generalizations to elliptic families of distributions and to infinite dimensional Hilbert spaces are probably easy. Some more "geometric" properties of Gaussians with respect to such distances were studied more recently by Takatsu and by Takatsu and Yokota.

The optimal transport map looks strange. Actually, the uniqueness in the Brenier theorem states that if a map $T$ maps $\mu$ to $\nu$ and is the gradient of a convex function, then it is the optimal transport map that maps $\mu$ to $\nu$ and

$$W_2^2(\mu,\nu)=\int|T(x)-x|^2\,\mathrm{d}\mu(x).$$

On the other hand, the affine map $x\in\mathbb{R}^d\mapsto m+Ax\in\mathbb{R}^d$ is the gradient of a convex function if and only if the $d\times d$ matrix $A$ is symmetric positive semidefinite. As a direct application, since

$$x\mapsto m_2+\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}(x-m_1)$$

is the gradient of a convex function and maps $\mathcal{N}(m_1,\Sigma_1)$ to $\mathcal{N}(m_2,\Sigma_2)$, it is the optimal transport map, and we get the formula for $W_2$! This is an alternative to Givens and Shortt.
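The criterion used here is concrete: when $A$ is symmetric, $x\mapsto m+Ax$ is the gradient of $x\mapsto\langle m,x\rangle+\frac12\langle Ax,x\rangle$, which is convex exactly when $A$ is positive semidefinite. A rough numerical test of that condition might look as follows, assuming NumPy; the function name `is_symmetric_psd` and the tolerance are arbitrary illustrative choices.

```python
import numpy as np

def is_symmetric_psd(A, tol=1e-10):
    """Return True if A is symmetric with all eigenvalues >= -tol."""
    A = np.asarray(A, float)
    if not np.allclose(A, A.T, atol=tol):
        return False  # a non-symmetric linear map is not a gradient field
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

print(is_symmetric_psd([[2.0, 1.0], [1.0, 2.0]]))   # True: eigenvalues 1 and 3
print(is_symmetric_psd([[0.0, -1.0], [1.0, 0.0]]))  # False: a rotation, not symmetric
print(is_symmetric_psd([[1.0, 2.0], [2.0, 1.0]]))   # False: eigenvalues -1 and 3
```

A rotation pushes a standard Gaussian to itself but is not the gradient of a convex function, which is why symmetry matters in the argument and not only the pushforward property.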
