Wasserstein distance between two Gaussians – Libres pensées d'un mathématicien ordinaire

Leonid Vitaliyevich Kantorovich (1912 – 1986)

The $ {W_2} $ Wasserstein coupling distance between two probability measures $ {\mu} $ and $ {\nu} $ on $ {\mathbb{R}^n} $ is

\[ W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2} \]

where the infimum runs over all random vectors $ {(X,Y)} $ of $ {\mathbb{R}^n\times\mathbb{R}^n} $ with $ {X\sim\mu} $ and $ {Y\sim\nu} $. It turns out that we have the following nice formula for $ {d:=W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))} $:

\[ d^2=\Vert m_1-m_2\Vert_2^2 +\mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \ \ \ \ \ (1) \]

This formula interested several authors including Givens and Shortt, Knott and Smith, Olkin and Pukelsheim, and Dowson and Landau. Note in particular that we have

\[ \mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})= \mathrm{Tr}((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}). \]

In the commutative case where $ {\Sigma_1\Sigma_2=\Sigma_2\Sigma_1} $, the formula (1) boils down simply to

\[ W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))^2 =\Vert m_1-m_2\Vert_2^2 +\Vert\Sigma_1^{1/2}-\Sigma_2^{1/2}\Vert_{Frobenius}^2. \]

To prove (1), one can first reduce to the centered case $ {m_1=m_2=0} $. Next, if $ {(X,Y)} $ is a random vector (Gaussian or not) of $ {\mathbb{R}^n\times\mathbb{R}^n} $ with covariance matrix

\[ \Gamma= \begin{pmatrix} \Sigma_1 & C\\ C^\top&\Sigma_2 \end{pmatrix} \]

then the quantity

\[ \mathbb{E}(\Vert X-Y\Vert_2^2)=\mathrm{Tr}(\Sigma_1+\Sigma_2-2C) \]

depends only on $ {\Gamma} $. Also, when $ {\mu=\mathcal{N}(0,\Sigma_1)} $ and $ {\nu=\mathcal{N}(0,\Sigma_2)} $, one can restrict the infimum which defines $ {W_2} $ to run over Gaussian laws $ {\mathcal{N}(0,\Gamma)} $ on $ {\mathbb{R}^n\times\mathbb{R}^n} $ with covariance matrix $ {\Gamma} $ structured as above. The sole constrain on $ {C} $ is the Schur complement constraint:

\[ \Sigma_1-C\Sigma_2^{-1}C^\top\succeq0. \]

The minimization of the function

\[ C\mapsto-2\mathrm{Tr}(C) \]

under the constraint above leads to (1). A detailed proof is given by Givens and Shortt. Alternatively, one may find an optimal transportation map as Knott and Smith. It turns out that $ {\mathcal{N}(m_2,\Sigma_2)} $ is the image law of $ {\mathcal{N}(m_1,\Sigma_1)} $ with the linear map

\[ x\mapsto m_2+A(x-m_1) \]

where

\[ A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=A^\top. \]

To check that this maps $ {\mathcal{N}(m_1,\Sigma_1)} $ to $ {\mathcal{N}(m_2,\Sigma_2)} $, say in the case $ {m_1=m_2=0} $ for simplicity, one may define the random column vectors $ {X\sim\mathcal{N}(m_1,\Sigma_1)} $ and $ {Y=AX} $ and write

\[ \begin{array}{rcl} \mathbb{E}(YY^\top) &=& A \mathbb{E}(XX^\top) A^\top\\ &=& \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2} (\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}\\ &=& \Sigma_2. \end{array} \]

To check that the map is optimal, one may use,

\[ \begin{array}{rcl} \mathbb{E}(\|X-Y\|_2^2) &=&\mathbb{E}(\|X\|_2^2)+\mathbb{E}(\|Y\|_2^2)-2\mathbb{E}(\left<X,Y\right>) \\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathbb{E}(\left<X,AX\right>)\\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathrm{Tr}(\Sigma_1A) \end{array} \]

and observe that by the cyclic property of the trace,

\[ \mathrm{Tr}(\Sigma_1 A) =\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \]

The generalizations to elliptic families of distributions and to infinite dimensional Hilbert spaces is probably easy. Some more “geometric” properties of Gaussians with respect to such distances where studied more recently by Takastu and Takastu and Yokota.

The optimal transport map looks strange. Actually, the uniqueness in the Brenier theorem states that if a map $ {T} $ maps $ {\mu} $ to $ {\nu} $ and is the gradient of a convex function, then it is the optimal transport map that maps $ {\mu} $ to $ {\nu} $ and

\[ W_2^2(\mu,\nu)=\int|T(x)-x|^2\mathrm{d}\mu. \]

On the other hand, the affine map $ {x\in\mathbb{R}^d\rightarrow m+Ax\in\mathbb{R}^d} $ is the gradient of a convex function if and only if the $ {d\times d} $ matrix $ {A} $ is semidefinite symmetric. As a direct application, since

\[ x\mapsto m_2+\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}(x-m_1) \]

is the gradient of a convex function and maps $ {\mathcal{N}(m_1,\Sigma_1)} $ to $ {\mathcal{N}(m_2,\Sigma_2)} $, it is the optimal transport map, and we get the formula for $ {W_2} $ ! This is an alternative to Givens and Shortt.

Some other posts:

24 Comments

Djalil Chafaï 2011-11-29
I’ve just corrected a typo pointed out by Pierre-André Zitt. Thanks, Mr PAZ!
Pierre-André Zitt 2011-12-28
You’re welcome, Mr DjaC!
For the record, the generalizations to elliptically symmetric distributions and infinite dimensional Hilbert spaces may be found in this paper by Gelbrich.
—
Mr PAZ
Gentil 2012-10-12
Voila justement la formule que je cherchais. Merci!
abhishek 2013-11-18
can you please tell me what does “tr” stand for ?
Djalil Chafaï 2013-11-18
Tr is a standard notation for Trace, the sum of the diagonal terms.
Djalil Chafaï 2013-12-16
Fixed a bug (pointed out by Nicolas F. by email) in the formula of the optimal transportation map.
Djalil Chafaï 2013-12-16
Note that this distance is also known as the Fréchet or Mallows or Kantorovitch distance in certain communities.
Djalil Chafaï 2014-10-28
It seems that the expression of the W2 distance between two Gaussian laws is called the Bure metric. See for instance the article arXiv:1410.6883 by Peter Forrester and Mario Kieburg entitled “Relating the Bures measure to the Cauchy two-matrix model”.
Luc, Chen 2018-01-16
Hi,
Just curious: Is that possible to get the W_2 distance of two centered Gaussian vectors with covariance matrices K, C in terms of operator norm of C – K ?
Thanks in advance.
Best, Luc
Kay P 2018-04-05
Thanks for the great post ! Can you provide some explanation of the statement ? I know cyclic property of trace but there is a trace of square root here.
$\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})= \mathrm{Tr}((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2})$.
Djalil Chafaï 2018-04-06
Yes. This identity comes from the fact that both terms are equal to $d^2$ which is a symmetric expression in $A$ and $B$. The argument that you mention with the cyclicity of the trace works when $A$ and $B$ commute.
Ting Pan 2018-04-10
Hi,
I am really interested in this topic. And I have also read the paper of Givens and Shortt. But I have a question about the derivation process in that paper, why the authors said this distance could only be used for gaussian distribution? I think that no propertise of gaussian distribution are used during the derivation process.
Thanks in advance.
Best, Ting
Djalil Chafaï 2018-04-10
Good question: actually when $X$ and $Y$ are Gaussian we know how to construct a couple $(X,Y)$ with fixed marginals and covariance (we just take a Gaussian), while when $X$ and $Y$ are not Gaussian, we don’t.
Kay P 2018-04-15
I am also wondering if the Wasserstein distance between two Gaussians is $d^{2}$ is negative definite distance. Meaning if $d^2:\mathbb{R}^{d}\times\mathbb{R}^{d} \mapsto \mathbb{R}$ , for any $x_{1},…x_n$ belonging to $\mathbb{R}^{d}$ and any real numbers such that $\sum_{j=1}^{n} c_{j}=0$ the following inequality holds :
$\sum_{i=1}^{n}\sum_{j=1}^{n} d\left(x_{i},x_{j}\right) c_{i}c_{j} \leq 0$
Djalil Chafaï 2018-04-15
No idea for now. This reminds me of course the Schoenberg theorem and the Bochner-Minlos theorem.
Giovanni Peccati 2018-06-08
Dear Djalil, do we know anything about optimal coupling of two Gaussian vectors when the Euclidean norm is replaced by the sup norm? (of course one can obtain bounds via the equivalence of norms, but my feeling is that the dependence on the dimension in such bounds would be suboptimal). The question is even more relevant in the multidimensional case (in a framework similar to the paper by Gelbrich). For instance, if we have sharp uniform bounds on the difference between the covariance functions of two smooth Gaussian fields on a compact set, is there anything known about how to couple them so that the expectation of the sup-norm of the difference is minimised ? I could not find anything general in the literature, and typically these couplings are made on a case-by-case basis. Thank you !
Djalil Chafaï 2018-06-08
Dear Giovanni, up to now, it is unclear to me. Best.
Gabriel 2021-02-20
What about a explicit (not bounds) formula for the 1-Wasserstein distance ?
Djalil Chafaï 2021-02-20
Dear Gabriel, an explicit formula for $W_2$ is natural for Gaussians due to their Hilbertian nature, their link with second moment. I do not see how to proceed for $W_1$, neither for Gaussians nor for other well known multivariate parametric distributions. But if you have an idea, intuition, or reference, please let me know! Note that in dimension 1, due to the monotony of transportation, there are exact formulas for $W_p$ in terms of $L^p$ norm of cumulative distribution function (or its inverse, the quantile function), but this is not that explicit:
$$
W_1(\mu_1,\mu_2)=\int_{-\infty}^{+\infty}|F_1(x)-F_2(x)|\mathrm{d}x
\quad\mbox{and}\quad
W_p^p(\mu_1,\mu_2)=\int_0^1|F^{-1}_1(t)-F^{-1}_2(t)|^p\mathrm{d}t.
$$
In particular we get the following explicit formula for the $W_1$ distance between exponential laws:
$$
W_1(\mathrm{Exp}(\lambda_1),\mathrm{Exp}(\lambda_2))
=\int_0^\infty(\mathrm{e}^{-(\lambda_1\wedge\lambda_2)x}-\mathrm{e}^{-(\lambda_1\vee\lambda_2)x})\mathrm{d}x
=\Bigr|\frac{1}{\lambda_1}-\frac{1}{\lambda_2}\Bigr|=\frac{|\lambda_1-\lambda_2|}{\lambda_1\lambda_2}.
$$
This is the famous engineer distance $|\mathbb{E}(X_1)-\mathbb{E}(X_2)|$.
William 2022-06-21
Hi Djalil,
Can I ask why we can restrict the infimum of W2 between two Gaussians over Gaussian laws? I mean whether it is possible that a non-Gaussian coupling between two Gaussians produces smaller W2 value.
Thanks!!
Djalil Chafaï 2022-06-22
Consider an arbitrary couple (X,Y), then E(|X-Y|^2) depends only of the covariance matrix of (X,Y). Therefore, if X and Y are Gaussian, we can find a Gaussian couple (X’,Y’) with same marginal laws and covariance matrix and such that E(|X’-Y’|^2)=E(|X-Y|^2). We can thus restrict the problem to Gaussian couples.
Herbert 2022-08-01
What is the definition of matrix square root X^(1/2)=sqrt(X) here ?
Is it X= sqrt(X) * sqrt(X) or is it X=sqrt(X)^T * sqrt(X) ?
Djalil Chafaï 2022-08-18
The standard definition of the square root of a positive semidefinite symmetric matrix $X = ODO^\top$ is $\sqrt{X} := O\sqrt{D} O^\top$ where $\sqrt{D}$ is the diagonal matrix formed by the square root of the entries of $D$.
Jacob Antony 2024-05-10
I am interested in applications for topological data analysis in machine learning. Would you please turn on rss feed for your blog ?

This site uses Akismet to reduce spam. Learn how your comment data is processed.