
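The $W_2$ Wasserstein coupling distance between two probability measures $\mu$ and $\nu$ on $\mathbb{R}^n$ is
$$W_2(\mu;\nu):=\inf\mathbb{E}(\|X-Y\|_2^2)^{1/2}$$
where the infimum runs over all random vectors $(X,Y)$ of $\mathbb{R}^n\times\mathbb{R}^n$ with $X\sim\mu$ and $Y\sim\nu$. It turns out that we have the following nice formula for $d:=W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))$:
$$d^2=\|m_1-m_2\|_2^2+\mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \qquad (1)$$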
This formula interested several authors including Givens and Shortt, Knott and Smith, Olkin and Pukelsheim, and Dowson and Landau. Note in particular that we have
$$\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})=\mathrm{Tr}((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}).$$
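As a numerical illustration of formula (1) and of the trace identity just stated, here is a minimal Python sketch, assuming NumPy is available. The helper names `psd_sqrt` and `w2_gaussian` are hypothetical and chosen only for this example, and the test covariances are random matrices made positive definite by construction.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def w2_gaussian(m1, S1, m2, S2):
    """W2 distance between N(m1, S1) and N(m2, S2), computed from formula (1)."""
    r1 = psd_sqrt(S1)
    cross = psd_sqrt(r1 @ S2 @ r1)
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(max(d2, 0.0))  # clip rounding noise below zero

# Random positive definite covariances and random means (illustrative data).
rng = np.random.default_rng(0)
G1, G2 = rng.standard_normal((2, 3, 3))
S1 = G1 @ G1.T + np.eye(3)
S2 = G2 @ G2.T + np.eye(3)
m1, m2 = rng.standard_normal((2, 3))

d12 = w2_gaussian(m1, S1, m2, S2)
d21 = w2_gaussian(m2, S2, m1, S1)  # roles of the two Gaussians swapped
print(d12, d21)
```

The two printed values should agree up to rounding, which is exactly what the trace identity predicts.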
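In the commutative case where $\Sigma_1\Sigma_2=\Sigma_2\Sigma_1$, the formula (1) boils down simply to
$$W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))^2=\|m_1-m_2\|_2^2+\|\Sigma_1^{1/2}-\Sigma_2^{1/2}\|_{\mathrm{Frobenius}}^2.$$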
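The commutative case can be checked numerically. The sketch below, assuming NumPy, builds two covariance matrices that share an orthonormal eigenbasis (so that they commute) and compares the covariance part of formula (1) with the squared Frobenius norm. The helper `psd_sqrt` and all variable names are hypothetical.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # common orthonormal eigenbasis
a = rng.uniform(0.5, 2.0, size=4)                 # eigenvalues of S1
b = rng.uniform(0.5, 2.0, size=4)                 # eigenvalues of S2
S1 = (Q * a) @ Q.T
S2 = (Q * b) @ Q.T                                # S1 and S2 commute

r1, r2 = psd_sqrt(S1), psd_sqrt(S2)
general = np.trace(S1 + S2 - 2.0 * psd_sqrt(r1 @ S2 @ r1))  # covariance part of (1)
commutative = np.linalg.norm(r1 - r2, "fro") ** 2           # Frobenius formula
eigen = np.sum((np.sqrt(a) - np.sqrt(b)) ** 2)              # same, via eigenvalues
print(general, commutative, eigen)
```

In the shared eigenbasis both expressions reduce to the sum of the squared differences of the square roots of the eigenvalues, which is the third printed value.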
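To prove (1), one can first reduce to the centered case $m_1=m_2=0$. Next, if $(X,Y)$ is a random vector (Gaussian or not) of $\mathbb{R}^n\times\mathbb{R}^n$ with covariance matrix
$$\Gamma=\begin{pmatrix}\Sigma_1 & C\\ C^\top & \Sigma_2\end{pmatrix}$$
then the quantity
$$\mathbb{E}(\|X-Y\|_2^2)=\mathrm{Tr}(\Sigma_1+\Sigma_2-2C)$$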
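depends only on $\Gamma$. Also, when $\mu=\mathcal{N}(0,\Sigma_1)$ and $\nu=\mathcal{N}(0,\Sigma_2)$, one can restrict the infimum which defines $W_2$ to run over Gaussian laws $\mathcal{N}(0,\Gamma)$ on $\mathbb{R}^n\times\mathbb{R}^n$ with covariance matrix $\Gamma$ structured as above. The sole constraint on $C$ is the Schur complement constraint:
$$\Sigma_1-C\Sigma_2^{-1}C^\top\succeq0.$$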
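To see what this constraint says, it may help to look at dimension $n=1$. The scalars $\sigma_1,\sigma_2>0$ and $c$ below are notation introduced only for this sketch, with $\Sigma_1=\sigma_1^2$, $\Sigma_2=\sigma_2^2$ and $C=c$:

$$\sigma_1^2-\frac{c^2}{\sigma_2^2}\geq0\quad\Longleftrightarrow\quad|c|\leq\sigma_1\sigma_2,$$

$$\mathrm{Tr}(\Sigma_1+\Sigma_2-2C)=\sigma_1^2+\sigma_2^2-2c\ \geq\ \sigma_1^2+\sigma_2^2-2\sigma_1\sigma_2=(\sigma_1-\sigma_2)^2.$$

The bound is attained at $c=\sigma_1\sigma_2$, which is the value of $(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}$ in dimension one, in agreement with (1).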
The minimization of the function
$$C\mapsto-2\mathrm{Tr}(C)$$
under the constraint above leads to (1). A detailed proof is given by Givens and Shortt. Alternatively, one may find an optimal transportation map, as Knott and Smith do. It turns out that $\mathcal{N}(m_2,\Sigma_2)$ is the image law of $\mathcal{N}(m_1,\Sigma_1)$ under the affine map
$$x\mapsto m_2+A(x-m_1)$$
where
$$A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=A^\top.$$
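To check that this maps $\mathcal{N}(m_1,\Sigma_1)$ to $\mathcal{N}(m_2,\Sigma_2)$, say in the case $m_1=m_2=0$ for simplicity, one may define the random column vectors $X\sim\mathcal{N}(m_1,\Sigma_1)$ and $Y=AX$ and write
$$\mathbb{E}(YY^\top)=A\,\mathbb{E}(XX^\top)A^\top=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=\Sigma_2.$$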
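The same covariance computation can be checked numerically. This is a minimal sketch, assuming NumPy and positive definite covariances so that the inverse square root exists. The function names `psd_sqrt` and `transport_matrix` are hypothetical.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def transport_matrix(S1, S2):
    """Matrix A of the map x -> m2 + A (x - m1); requires S1 invertible."""
    r1 = psd_sqrt(S1)
    r1_inv = np.linalg.inv(r1)
    return r1_inv @ psd_sqrt(r1 @ S2 @ r1) @ r1_inv

rng = np.random.default_rng(2)
G1, G2 = rng.standard_normal((2, 3, 3))
S1 = G1 @ G1.T + np.eye(3)
S2 = G2 @ G2.T + np.eye(3)

A = transport_matrix(S1, S2)
pushed = A @ S1 @ A.T  # covariance of Y = A X when X has covariance S1

print(np.max(np.abs(pushed - S2)))  # difference from S2, expected at rounding level
print(np.max(np.abs(A - A.T)))      # asymmetry of A, expected at rounding level
```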
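To check that the map is optimal, one may use
$$\begin{aligned}\mathbb{E}(\|X-Y\|_2^2)&=\mathbb{E}(\|X\|_2^2)+\mathbb{E}(\|Y\|_2^2)-2\mathbb{E}(\langle X,Y\rangle)\\&=\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathbb{E}(\langle X,AX\rangle)\\&=\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathrm{Tr}(\Sigma_1A)\end{aligned}$$
and observe that by the cyclic property of the trace,
$$\mathrm{Tr}(\Sigma_1A)=\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}).$$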
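As a sanity check of this trace identity and of the resulting transport cost, one can compare the closed form with a Monte Carlo estimate in the centered case. The sketch below assumes NumPy. The sample size, the seed, and all names are arbitrary choices made for illustration.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(3)
G1, G2 = rng.standard_normal((2, 3, 3))
S1 = G1 @ G1.T + np.eye(3)
S2 = G2 @ G2.T + np.eye(3)

r1 = psd_sqrt(S1)
root = psd_sqrt(r1 @ S2 @ r1)  # (S1^{1/2} S2 S1^{1/2})^{1/2}
r1_inv = np.linalg.inv(r1)
A = r1_inv @ root @ r1_inv

lhs = np.trace(S1 @ A)         # Tr(S1 A)
rhs = np.trace(root)           # Tr of the matrix square root
exact_cost = np.trace(S1) + np.trace(S2) - 2.0 * rhs

# Monte Carlo estimate of E||X - AX||^2 with X ~ N(0, S1).
X = rng.multivariate_normal(np.zeros(3), S1, size=200_000)
Y = X @ A.T                    # each row is A x for the corresponding row x of X
mc_cost = np.mean(np.sum((X - Y) ** 2, axis=1))
print(lhs, rhs, exact_cost, mc_cost)
```

The first two printed values should coincide up to rounding, while the Monte Carlo cost only matches the closed form up to sampling error.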
The generalizations to elliptic families of distributions and to infinite dimensional Hilbert spaces are probably easy. Some more "geometric" properties of Gaussians with respect to such distances were studied more recently by Takatsu and by Takatsu and Yokota.
The optimal transport map looks strange. Actually, the uniqueness in the Brenier theorem states that if a map $T$ maps $\mu$ to $\nu$ and is the gradient of a convex function, then it is the optimal transport map that maps $\mu$ to $\nu$ and
$$W_2^2(\mu,\nu)=\int|T(x)-x|^2\,d\mu(x).$$
On the other hand, the affine map $x\in\mathbb{R}^d\mapsto m+Ax\in\mathbb{R}^d$ is the gradient of a convex function if and only if the $d\times d$ matrix $A$ is symmetric positive semidefinite. As a direct application, since
$$x\mapsto m_2+\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}(x-m_1)$$
is the gradient of a convex function and maps $\mathcal{N}(m_1,\Sigma_1)$ to $\mathcal{N}(m_2,\Sigma_2)$, it is the optimal transport map, and we get the formula for $W_2$! This is an alternative to Givens and Shortt.
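To spell out the two facts used in this argument, here is a short sketch. The potential $\varphi$ and the shorthand $A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}$ are notation introduced only for this illustration. The map is the gradient of

$$\varphi(x):=\langle m_2,x\rangle+\tfrac{1}{2}\langle A(x-m_1),x-m_1\rangle,\qquad\nabla\varphi(x)=m_2+A(x-m_1),\qquad\nabla^2\varphi=A,$$

and $\varphi$ is convex because $A$ is symmetric with

$$\langle Ax,x\rangle=\langle(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}x,\Sigma_1^{-1/2}x\rangle\geq0.$$

Writing $T=\nabla\varphi$ and $\mu=\mathcal{N}(m_1,\Sigma_1)$, the Brenier formula then gives, using $A\Sigma_1A=\Sigma_2$,

$$\int|T(x)-x|^2\,d\mu(x)=\|m_1-m_2\|_2^2+\mathrm{Tr}((A-I)\Sigma_1(A-I))=\|m_1-m_2\|_2^2+\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathrm{Tr}(\Sigma_1A),$$

which is (1) once $\mathrm{Tr}(\Sigma_1A)$ is rewritten with the cyclic property of the trace.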