Wasserstein distance between two Gaussians

The \( {W_2} \) Wasserstein coupling distance between two probability measures \( {\mu} \) and \( {\nu} \) on \( {\mathbb{R}^n} \) is

\[ W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2} \]

where the infimum runs over all random vectors \( {(X,Y)} \) of \( {\mathbb{R}^n\times\mathbb{R}^n} \) with \( {X\sim\mu} \) and \( {Y\sim\nu} \). It turns out that we have the following nice formula for \( {d:=W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))} \):

\[ d^2=\Vert m_1-m_2\Vert_2^2 +\mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \ \ \ \ \ (1) \]
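
Formula (1) is easy to evaluate numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `psd_sqrt` and `gaussian_w2_squared` are hypothetical and not part of any library. The matrix square roots are computed from an eigendecomposition, which is one possible choice among others.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues due to round-off
    return (V * np.sqrt(w)) @ V.T

def gaussian_w2_squared(m1, S1, m2, S2):
    """Squared W_2 distance between N(m1, S1) and N(m2, S2), following formula (1)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    r1 = psd_sqrt(S1)                # S1^{1/2}
    cross = psd_sqrt(r1 @ S2 @ r1)   # (S1^{1/2} S2 S1^{1/2})^{1/2}
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return max(float(d2), 0.0)       # guard against a slightly negative round-off value

# Example with diagonal (hence commuting) covariance matrices.
d2 = gaussian_w2_squared([0, 0], np.diag([1.0, 4.0]), [1, 2], np.diag([9.0, 1.0]))
print(d2, np.sqrt(d2))
```

The distance itself is the square root of the returned value.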

This formula interested several authors including Givens and Shortt, Knott and Smith, Olkin and Pukelsheim, and Dowson and Landau. Note in particular that we have

\[ \mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})= \mathrm{Tr}((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}). \]
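
One way to see this identity, sketched here with the auxiliary matrix \( {M:=\Sigma_1^{1/2}\Sigma_2^{1/2}} \) (a notation introduced only for this remark), is to write

\[ \Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}=MM^\top \quad\text{and}\quad \Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}=M^\top M. \]

The square matrices \( {MM^\top} \) and \( {M^\top M} \) have the same eigenvalues, hence so do their positive semidefinite square roots, and both traces are equal to the sum of the singular values of \( {M} \).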

In the commutative case where \( {\Sigma_1\Sigma_2=\Sigma_2\Sigma_1} \), the formula (1) boils down simply to

\[ W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))^2 =\Vert m_1-m_2\Vert_2^2 +\Vert\Sigma_1^{1/2}-\Sigma_2^{1/2}\Vert_{Frobenius}^2. \]
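
To see why, note that when \( {\Sigma_1} \) and \( {\Sigma_2} \) commute, so do their square roots, so that \( {\Sigma_1^{1/2}\Sigma_2^{1/2}} \) is symmetric positive semidefinite with square \( {\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}} \). Therefore

\[ \mathrm{Tr}(\Sigma_1+\Sigma_2-2\Sigma_1^{1/2}\Sigma_2^{1/2}) =\mathrm{Tr}((\Sigma_1^{1/2}-\Sigma_2^{1/2})^2) =\Vert\Sigma_1^{1/2}-\Sigma_2^{1/2}\Vert_{Frobenius}^2. \]

For instance, in dimension \( {n=1} \), writing \( {\sigma_1^2} \) and \( {\sigma_2^2} \) for the two variances (a notation used only in this example), this gives

\[ W_2(\mathcal{N}(m_1,\sigma_1^2);\mathcal{N}(m_2,\sigma_2^2))^2=(m_1-m_2)^2+(\sigma_1-\sigma_2)^2. \]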

To prove (1), one can first reduce to the centered case \( {m_1=m_2=0} \). Next, if \( {(X,Y)} \) is a random vector (Gaussian or not) of \( {\mathbb{R}^n\times\mathbb{R}^n} \) with covariance matrix

\[ \Gamma= \begin{pmatrix} \Sigma_1 & C\\ C^\top&\Sigma_2 \end{pmatrix} \]

then the quantity

\[ \mathbb{E}(\Vert X-Y\Vert_2^2)=\mathrm{Tr}(\Sigma_1+\Sigma_2-2C) \]

depends only on \( {\Gamma} \). Also, when \( {\mu=\mathcal{N}(0,\Sigma_1)} \) and \( {\nu=\mathcal{N}(0,\Sigma_2)} \), one can restrict the infimum which defines \( {W_2} \) to run over Gaussian laws \( {\mathcal{N}(0,\Gamma)} \) on \( {\mathbb{R}^n\times\mathbb{R}^n} \) with covariance matrix \( {\Gamma} \) structured as above. The sole constraint on \( {C} \) is the Schur complement constraint:

\[ \Sigma_1-C\Sigma_2^{-1}C^\top\succeq0. \]
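
For the record, the trace expression of \( {\mathbb{E}(\Vert X-Y\Vert_2^2)} \) used above follows, in the centered case and with \( {C=\mathbb{E}(XY^\top)} \), from a direct expansion:

\[ \begin{array}{rcl} \mathbb{E}(\Vert X-Y\Vert_2^2) &=& \mathrm{Tr}(\mathbb{E}((X-Y)(X-Y)^\top))\\ &=& \mathrm{Tr}(\Sigma_1-C-C^\top+\Sigma_2)\\ &=& \mathrm{Tr}(\Sigma_1+\Sigma_2-2C), \end{array} \]

where the last step uses \( {\mathrm{Tr}(C^\top)=\mathrm{Tr}(C)} \).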

The minimization of the function

\[ C\mapsto-2\mathrm{Tr}(C) \]

under the constraint above leads to (1). A detailed proof is given by Givens and Shortt. Alternatively, one may find an optimal transportation map, as Knott and Smith did. It turns out that \( {\mathcal{N}(m_2,\Sigma_2)} \) is the image law of \( {\mathcal{N}(m_1,\Sigma_1)} \) under the affine map

\[ x\mapsto m_2+A(x-m_1) \]

where, assuming that \( {\Sigma_1} \) is invertible,

\[ A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=A^\top. \]

To check that this maps \( {\mathcal{N}(m_1,\Sigma_1)} \) to \( {\mathcal{N}(m_2,\Sigma_2)} \), say in the case \( {m_1=m_2=0} \) for simplicity, one may define the random column vectors \( {X\sim\mathcal{N}(0,\Sigma_1)} \) and \( {Y=AX} \) and write

\[ \begin{array}{rcl} \mathbb{E}(YY^\top) &=& A \mathbb{E}(XX^\top) A^\top\\ &=& \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2} (\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}\\ &=& \Sigma_2. \end{array} \]
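
The computation above can be checked numerically. Here is a minimal sketch, assuming NumPy and an invertible \( {\Sigma_1} \); the helper names `psd_sqrt` and `transport_matrix` and the two test matrices are hypothetical choices made for illustration.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def transport_matrix(S1, S2):
    """A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}, assuming S1 is invertible."""
    r1 = psd_sqrt(S1)
    r1_inv = np.linalg.inv(r1)
    return r1_inv @ psd_sqrt(r1 @ S2 @ r1) @ r1_inv

# Two covariance matrices that do not commute.
S1 = np.array([[2.0, 1.0], [1.0, 2.0]])
S2 = np.array([[1.0, 0.0], [0.0, 3.0]])

A = transport_matrix(S1, S2)
print(np.allclose(A, A.T))            # checks that A is symmetric
print(np.allclose(A @ S1 @ A.T, S2))  # checks that Y = AX has covariance S2
```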

To check that the map is optimal, one may use

\[ \begin{array}{rcl} \mathbb{E}(\|X-Y\|_2^2) &=&\mathbb{E}(\|X\|_2^2)+\mathbb{E}(\|Y\|_2^2)-2\mathbb{E}(\left<X,Y\right>) \\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathbb{E}(\left<X,AX\right>)\\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathrm{Tr}(\Sigma_1A) \end{array} \]

and observe that by the cyclic property of the trace,

\[ \mathrm{Tr}(\Sigma_1 A) =\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \]
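
As a numerical illustration, the sketch below (assuming NumPy, centered laws and an invertible \( {\Sigma_1} \)) compares the cost of the coupling \( {(X,AX)} \) with the right hand side of (1), and with the cost of another linear map sending \( {\mathcal{N}(0,\Sigma_1)} \) to \( {\mathcal{N}(0,\Sigma_2)} \). The matrix `B` is a hypothetical comparison map chosen for this example; it does not appear in the references above.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

S1 = np.array([[2.0, 1.0], [1.0, 2.0]])
S2 = np.array([[1.0, 0.0], [0.0, 3.0]])

r1, r2 = psd_sqrt(S1), psd_sqrt(S2)
r1_inv = np.linalg.inv(r1)
cross = psd_sqrt(r1 @ S2 @ r1)  # (S1^{1/2} S2 S1^{1/2})^{1/2}

A = r1_inv @ cross @ r1_inv     # the symmetric map of the post
B = r2 @ r1_inv                 # comparison map: B S1 B^T = S2, but B is not symmetric here

def cost(T):
    """E||X - TX||^2 for X ~ N(0, S1), assuming T S1 T^T = S2."""
    return np.trace(S1) + np.trace(S2) - 2.0 * np.trace(S1 @ T)

formula = np.trace(S1 + S2 - 2.0 * cross)  # right hand side of (1) with m1 = m2 = 0
print(cost(A), formula, cost(B))
```

On this example the cost of `A` matches the formula, while the cost of `B` is larger, in accordance with the optimality of `A`.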

The generalizations to elliptic families of distributions and to infinite dimensional Hilbert spaces are probably easy. Some more “geometric” properties of Gaussians with respect to such distances were studied more recently by Takatsu and by Takatsu and Yokota.


  1. Djalil Chafaï 2011-11-29

    I’ve just corrected a typo pointed out by Pierre-André Zitt. Thanks, Mr PAZ!

  2. Pierre-André Zitt 2011-12-28

    You’re welcome, Mr DjaC!

    For the record, the generalizations to elliptically symmetric distributions and infinite dimensional Hilbert spaces may be found in this paper by Gelbrich.

    Mr PAZ

  3. Gentil 2012-10-12

    This is exactly the formula I was looking for. Thank you!

  4. abhishek 2013-11-18

    can you please tell me what does “tr” stand for ?

  5. Djalil Chafaï 2013-11-18

    Tr is a standard notation for Trace, the sum of the diagonal terms.

  6. Djalil Chafaï 2013-12-16

    Fixed a bug (pointed out by Nicolas F. by email) in the formula of the optimal transportation map.

  7. Djalil Chafaï 2013-12-16

    Note that this distance is also known as the Fréchet or Mallows or Kantorovitch distance in certain communities.

  8. Djalil Chafaï 2014-10-28

    It seems that the expression of the W2 distance between two Gaussian laws is called the Bures metric. See for instance the article arXiv:1410.6883 by Peter Forrester and Mario Kieburg entitled “Relating the Bures measure to the Cauchy two-matrix model”.

  9. Luc, Chen 2018-01-16


    Just curious: is it possible to express the W_2 distance between two centered Gaussian vectors with covariance matrices K and C in terms of the operator norm of C - K?

    Thanks in advance.

    Best, Luc

  10. Kay P 2018-04-05

    Thanks for the great post! Can you explain the following statement? I know the cyclic property of the trace, but there is a trace of a square root here:
    $\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})= \mathrm{Tr}((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2})$.

  11. Djalil Chafaï 2018-04-06

    Yes. This identity comes from the fact that $d^2$ is a symmetric expression in $\Sigma_1$ and $\Sigma_2$, so exchanging their roles in (1) does not change its value. The argument that you mention with the cyclicity of the trace works when $\Sigma_1$ and $\Sigma_2$ commute.

  12. Ting Pan 2018-04-10

    I am really interested in this topic, and I have also read the paper of Givens and Shortt. But I have a question about the derivation in that paper: why do the authors say that this distance can only be used for Gaussian distributions? I think that no properties of the Gaussian distribution are used in the derivation.

    Thanks in advance.
    Best, Ting

  13. Djalil Chafaï 2018-04-10

    Good question: actually when $X$ and $Y$ are Gaussian we know how to construct a couple $(X,Y)$ with fixed marginals and covariance (we just take a Gaussian), while when $X$ and $Y$ are not Gaussian, we don’t.

  14. Kay P 2018-04-15

    I am also wondering whether the squared Wasserstein distance $d^{2}$ between two Gaussians is a negative definite distance, meaning that if $d^2:\mathbb{R}^{d}\times\mathbb{R}^{d} \mapsto \mathbb{R}$, then for any $x_{1},\ldots,x_n$ belonging to $\mathbb{R}^{d}$ and any real numbers $c_1,\ldots,c_n$ such that $\sum_{j=1}^{n} c_{j}=0$, the following inequality holds:
    $\sum_{i=1}^{n}\sum_{j=1}^{n} d\left(x_{i},x_{j}\right) c_{i}c_{j} \leq 0$
