
Libres pensées d'un mathématicien ordinaire Posts

Wasserstein distance between two Gaussians

Leonid Vitaliyevich Kantorovich (1912 – 1986)

The \( {W_2} \) Wasserstein coupling distance between two probability measures \( {\mu} \) and \( {\nu} \) on \( {\mathbb{R}^n} \) is

\[ W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2} \]

where the infimum runs over all random vectors \( {(X,Y)} \) of \( {\mathbb{R}^n\times\mathbb{R}^n} \) with \( {X\sim\mu} \) and \( {Y\sim\nu} \). It turns out that we have the following nice formula for \( {d:=W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))} \):

\[ d^2=\Vert m_1-m_2\Vert_2^2 +\mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \ \ \ \ \ (1) \]

This formula interested several authors including Givens and Shortt, Knott and Smith, Olkin and Pukelsheim, and Dowson and Landau. Note in particular that we have

\[ \mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})= \mathrm{Tr}((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}). \]
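
Formula (1) and the trace identity above are easy to evaluate numerically. Here is a minimal sketch in Python with NumPy; the helper names `psd_sqrt` and `gaussian_w2_squared` and the example matrices are illustrative choices, not taken from the references, and the symmetric square root is computed by eigendecomposition.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues due to round-off
    return (V * np.sqrt(w)) @ V.T  # V diag(sqrt(w)) V^T

def gaussian_w2_squared(m1, S1, m2, S2):
    """Squared W_2 distance between N(m1, S1) and N(m2, S2), following (1)."""
    r1 = psd_sqrt(S1)
    cross = psd_sqrt(r1 @ S2 @ r1)
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

# Arbitrary example data (assumption: any symmetric positive definite pair works).
m1, m2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

d2 = gaussian_w2_squared(m1, S1, m2, S2)

# The two traces in the identity above, computed with the roles of S1 and S2 swapped.
r1, r2 = psd_sqrt(S1), psd_sqrt(S2)
trace_12 = np.trace(psd_sqrt(r1 @ S2 @ r1))
trace_21 = np.trace(psd_sqrt(r2 @ S1 @ r2))
```

On this example `trace_12` and `trace_21` agree up to round-off, so the value of `d2` does not depend on which Gaussian is listed first.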

In the commutative case where \( {\Sigma_1\Sigma_2=\Sigma_2\Sigma_1} \), the formula (1) boils down simply to

\[ W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))^2 =\Vert m_1-m_2\Vert_2^2 +\Vert\Sigma_1^{1/2}-\Sigma_2^{1/2}\Vert_{\mathrm{Frobenius}}^2. \]
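
As a quick sanity check, consider dimension one, where everything commutes. Writing \( {\Sigma_1=\sigma_1^2} \) and \( {\Sigma_2=\sigma_2^2} \) with \( {\sigma_1,\sigma_2\geq0} \) (notation introduced here only for this example), formula (1) gives

\[ W_2(\mathcal{N}(m_1,\sigma_1^2);\mathcal{N}(m_2,\sigma_2^2))^2 =(m_1-m_2)^2+\sigma_1^2+\sigma_2^2-2\sigma_1\sigma_2 =(m_1-m_2)^2+(\sigma_1-\sigma_2)^2. \]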

To prove (1), one can first reduce to the centered case \( {m_1=m_2=0} \). Next, if \( {(X,Y)} \) is a random vector (Gaussian or not) of \( {\mathbb{R}^n\times\mathbb{R}^n} \) with covariance matrix

\[ \Gamma= \begin{pmatrix} \Sigma_1 & C\\ C^\top&\Sigma_2 \end{pmatrix} \]

then the quantity

\[ \mathbb{E}(\Vert X-Y\Vert_2^2)=\mathrm{Tr}(\Sigma_1+\Sigma_2-2C) \]

depends only on \( {\Gamma} \). Also, when \( {\mu=\mathcal{N}(0,\Sigma_1)} \) and \( {\nu=\mathcal{N}(0,\Sigma_2)} \), one can restrict the infimum which defines \( {W_2} \) to run over Gaussian laws \( {\mathcal{N}(0,\Gamma)} \) on \( {\mathbb{R}^n\times\mathbb{R}^n} \) with covariance matrix \( {\Gamma} \) structured as above. Assuming that \( {\Sigma_2} \) is invertible, the sole constraint on \( {C} \) is the Schur complement constraint:

\[ \Sigma_1-C\Sigma_2^{-1}C^\top\succeq0. \]
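
The two facts just stated can be checked on a small example. The sketch below, in Python with NumPy, picks one admissible cross-covariance \( {C} \) purely for illustration (this choice of `C`, the helper `psd_sqrt` and the example matrices are assumptions), then compares the cost computed from \( {\Gamma} \) with the trace formula and inspects the Schur complement.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

# One admissible cross-covariance, chosen only for illustration.
C = 0.5 * psd_sqrt(S1) @ psd_sqrt(S2)
Gamma = np.block([[S1, C], [C.T, S2]])

# X - Y = D (X, Y) with D = [I, -I], hence E||X - Y||^2 = Tr(D Gamma D^T)
# for centered (X, Y) with covariance Gamma.
n = S1.shape[0]
D = np.hstack([np.eye(n), -np.eye(n)])
cost_from_gamma = np.trace(D @ Gamma @ D.T)
cost_from_trace = np.trace(S1 + S2 - 2.0 * C)

# Schur complement constraint: S1 - C S2^{-1} C^T must be positive semidefinite.
schur = S1 - C @ np.linalg.inv(S2) @ C.T
min_eig_schur = np.linalg.eigvalsh(schur).min()
min_eig_gamma = np.linalg.eigvalsh(Gamma).min()
```

For this choice of `C` the Schur complement equals \( {\tfrac{3}{4}\Sigma_1} \), so both minimal eigenvalues are nonnegative and \( {\Gamma} \) is a valid covariance matrix.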

The minimization of the function

\[ C\mapsto-2\mathrm{Tr}(C) \]

under the constraint above leads to (1). A detailed proof is given by Givens and Shortt. Alternatively, one may find an optimal transport map, as Knott and Smith do. It turns out that \( {\mathcal{N}(m_2,\Sigma_2)} \) is the image law of \( {\mathcal{N}(m_1,\Sigma_1)} \) under the affine map

\[ x\mapsto m_2+A(x-m_1) \]

where

\[ A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=A^\top. \]

To check that this maps \( {\mathcal{N}(m_1,\Sigma_1)} \) to \( {\mathcal{N}(m_2,\Sigma_2)} \), say in the case \( {m_1=m_2=0} \) for simplicity, one may define the random column vectors \( {X\sim\mathcal{N}(0,\Sigma_1)} \) and \( {Y=AX} \) and write

\[ \begin{array}{rcl} \mathbb{E}(YY^\top) &=& A \mathbb{E}(XX^\top) A^\top\\ &=& \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2} (\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}\\ &=& \Sigma_2. \end{array} \]
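
The same computation can be reproduced numerically. The following Python sketch builds \( {A} \) for an arbitrary pair of covariance matrices and checks that \( {A\Sigma_1A^\top} \) recovers \( {\Sigma_2} \); the function names and the example matrices are hypothetical, and \( {\Sigma_1} \) is assumed to be invertible.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def transport_matrix(S1, S2):
    """A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}; assumes S1 invertible."""
    r1 = psd_sqrt(S1)
    r1_inv = np.linalg.inv(r1)
    return r1_inv @ psd_sqrt(r1 @ S2 @ r1) @ r1_inv

S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

A = transport_matrix(S1, S2)
pushforward_cov = A @ S1 @ A.T  # covariance of Y = A X when X has covariance S1

is_symmetric = np.allclose(A, A.T)
maps_to_S2 = np.allclose(pushforward_cov, S2)
```

Both flags come out `True` on this example, up to the default tolerances of `np.allclose`.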

To check that the map is optimal, one may write

\[ \begin{array}{rcl} \mathbb{E}(\|X-Y\|_2^2) &=&\mathbb{E}(\|X\|_2^2)+\mathbb{E}(\|Y\|_2^2)-2\mathbb{E}(\left<X,Y\right>) \\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathbb{E}(\left<X,AX\right>)\\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathrm{Tr}(\Sigma_1A) \end{array} \]

and observe that by the cyclic property of the trace,

\[ \mathrm{Tr}(\Sigma_1 A) =\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \]
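
A numerical check of this last identity, and of the resulting cost, may look as follows in Python. It is only a sketch under the assumption that \( {\Sigma_1} \) is invertible; the helper `psd_sqrt`, the example matrices, and the comparison with the independent coupling (\( {C=0} \)) are illustrative additions.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

r1 = psd_sqrt(S1)
r1_inv = np.linalg.inv(r1)          # assumes S1 is invertible
cross = psd_sqrt(r1 @ S2 @ r1)      # (S1^{1/2} S2 S1^{1/2})^{1/2}
A = r1_inv @ cross @ r1_inv

trace_S1A = np.trace(S1 @ A)
trace_cross = np.trace(cross)

# Cost of the coupling (X, AX) in the centered case, versus the trace term of (1).
cost_map = np.trace(S1) + np.trace(S2) - 2.0 * trace_S1A
cost_formula = np.trace(S1 + S2 - 2.0 * cross)

# For comparison: the independent coupling has C = 0 and cost Tr(S1 + S2).
cost_independent = np.trace(S1 + S2)
```

Here `cost_map` matches `cost_formula` and is no larger than `cost_independent`, as expected from a minimizer.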

The generalizations to elliptic families of distributions and to infinite dimensional Hilbert spaces are probably easy. Some more “geometric” properties of Gaussians with respect to such distances were studied more recently by Takatsu and by Takatsu and Yokota.

The optimal transport map looks strange. Actually, the uniqueness in the Brenier theorem states that if a map \( {T} \) maps \( {\mu} \) to \( {\nu} \) and is the gradient of a convex function, then it is the optimal transport map that maps \( {\mu} \) to \( {\nu} \) and

\[ W_2^2(\mu,\nu)=\int|T(x)-x|^2\mathrm{d}\mu. \]

On the other hand, the affine map \( {x\in\mathbb{R}^d\mapsto m+Ax\in\mathbb{R}^d} \) is the gradient of a convex function if and only if the \( {d\times d} \) matrix \( {A} \) is symmetric positive semidefinite. As a direct application, since

\[ x\mapsto m_2+\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}(x-m_1) \]

is the gradient of a convex function and maps \( {\mathcal{N}(m_1,\Sigma_1)} \) to \( {\mathcal{N}(m_2,\Sigma_2)} \), it is the optimal transport map, and we get the formula for \( {W_2} \)! This is an alternative to the proof of Givens and Shortt.
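
For concreteness, one possible convex potential can be written explicitly (the symbol \( {\varphi} \) is introduced here only for illustration). With \( {A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}} \), set

\[ \varphi(x):=\left<m_2,x\right>+\frac{1}{2}\left<x-m_1,A(x-m_1)\right>, \quad\text{so that}\quad \nabla\varphi(x)=m_2+A(x-m_1) \quad\text{and}\quad \nabla^2\varphi(x)=A. \]

Since \( {A} \) is symmetric positive semidefinite, \( {\varphi} \) is convex, and its gradient is exactly the affine map above.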
