Press "Enter" to skip to content

Month: April 2010

Nonnegative Matrix Factorization

Have you heard about the Nonnegative Matrix Factorization (NMF)? It has some similarities with the Principal Component Analysis (PCA) and belongs to the rank or dimension reduction techniques. The NMF incorporates a sign constraint, and is not based on the Singular Value Decomposition (SVD). It can be used with a Bregman entropic pseudo-distance. I have heard about NMF during a talk on speach recognition in INRIA Bordeaux I. It seems that NMF has some success in various domains of applications. It is quite natural to think about some adaptive sparse NMF, an analogue of the sparse PCA.

Leave a Comment

Wasserstein distance between two Gaussians

Leonid Vitaliyevich Kantorovich (1912 – 1986)
Leonid Vitaliyevich Kantorovich (1912 – 1986)

The \( {W_2} \) Wasserstein coupling distance between two probability measures \( {\mu} \) and \( {\nu} \) on \( {\mathbb{R}^n} \) is

\[ W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2} \]

where the infimum runs over all random vectors \( {(X,Y)} \) of \( {\mathbb{R}^n\times\mathbb{R}^n} \) with \( {X\sim\mu} \) and \( {Y\sim\nu} \). It turns out that we have the following nice formula for \( {d:=W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))} \):

\[ d^2=\Vert m_1-m_2\Vert_2^2 +\mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \ \ \ \ \ (1) \]

This formula interested several authors including Givens and Shortt, Knott and Smith, Olkin and Pukelsheim, and Dowson and Landau. Note in particular that we have

\[ \mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})= \mathrm{Tr}((\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}). \]

In the commutative case where \( {\Sigma_1\Sigma_2=\Sigma_2\Sigma_1} \), the formula (1) boils down simply to

\[ W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))^2 =\Vert m_1-m_2\Vert_2^2 +\Vert\Sigma_1^{1/2}-\Sigma_2^{1/2}\Vert_{Frobenius}^2. \]

To prove (1), one can first reduce to the centered case \( {m_1=m_2=0} \). Next, if \( {(X,Y)} \) is a random vector (Gaussian or not) of \( {\mathbb{R}^n\times\mathbb{R}^n} \) with covariance matrix

\[ \Gamma= \begin{pmatrix} \Sigma_1 & C\\ C^\top&\Sigma_2 \end{pmatrix} \]

then the quantity

\[ \mathbb{E}(\Vert X-Y\Vert_2^2)=\mathrm{Tr}(\Sigma_1+\Sigma_2-2C) \]

depends only on \( {\Gamma} \). Also, when \( {\mu=\mathcal{N}(0,\Sigma_1)} \) and \( {\nu=\mathcal{N}(0,\Sigma_2)} \), one can restrict the infimum which defines \( {W_2} \) to run over Gaussian laws \( {\mathcal{N}(0,\Gamma)} \) on \( {\mathbb{R}^n\times\mathbb{R}^n} \) with covariance matrix \( {\Gamma} \) structured as above. The sole constrain on \( {C} \) is the Schur complement constraint:

\[ \Sigma_1-C\Sigma_2^{-1}C^\top\succeq0. \]

The minimization of the function

\[ C\mapsto-2\mathrm{Tr}(C) \]

under the constraint above leads to (1). A detailed proof is given by Givens and Shortt. Alternatively, one may find an optimal transportation map as Knott and Smith. It turns out that \( {\mathcal{N}(m_2,\Sigma_2)} \) is the image law of \( {\mathcal{N}(m_1,\Sigma_1)} \) with the linear map

\[ x\mapsto m_2+A(x-m_1) \]


\[ A=\Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}=A^\top. \]

To check that this maps \( {\mathcal{N}(m_1,\Sigma_1)} \) to \( {\mathcal{N}(m_2,\Sigma_2)} \), say in the case \( {m_1=m_2=0} \) for simplicity, one may define the random column vectors \( {X\sim\mathcal{N}(m_1,\Sigma_1)} \) and \( {Y=AX} \) and write

\[ \begin{array}{rcl} \mathbb{E}(YY^\top) &=& A \mathbb{E}(XX^\top) A^\top\\ &=& \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2} (\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}\\ &=& \Sigma_2. \end{array} \]

To check that the map is optimal, one may use,

\[ \begin{array}{rcl} \mathbb{E}(\|X-Y\|_2^2) &=&\mathbb{E}(\|X\|_2^2)+\mathbb{E}(\|Y\|_2^2)-2\mathbb{E}(\left<X,Y\right>) \\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathbb{E}(\left<X,AX\right>)\\ &=&\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\mathrm{Tr}(\Sigma_1A) \end{array} \]

and observe that by the cyclic property of the trace,

\[ \mathrm{Tr}(\Sigma_1 A) =\mathrm{Tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \]

The generalizations to elliptic families of distributions and to infinite dimensional Hilbert spaces is probably easy. Some more “geometric” properties of Gaussians with respect to such distances where studied more recently by Takastu and Takastu and Yokota.

Last Updated on 2021-02-25


Exponential mixtures of exponentials are Pareto

Did you know that if $X\sim\mathcal{E}(\lambda)$ and $\mathcal{L}(Y\,\vert\,X=x)=\mathcal{E}(x)$ for all $x\geq0$ then $Y$ follows a Pareto distribution with  probability density function $x\mapsto 1/(\lambda+x)^2$? Funny!

Consider now the kinetic diffusion process $(X_t,Y_t)_{t\geq0}$ on $\mathbb{R}^2$ where

$$\begin{cases}dX_t&=dB_t-s(X_t)\lambda dt\\dY_t&=dW_t-s(Y_t)|X_t|dt\end{cases}$$ where $(B_t)_{t\geq0}$ and $(W_t)_{t\geq0}$ are independent standard Brownian motions and $s$ is the sign function… Can you guess the invariant measure and control the speed of convergence?

Last Updated on 2019-11-10



The Portmanteau theorem gives several statements equivalent to the narrow convergence i.e. the weak convergence of probability measures with respect to continuous bounded functions. I wonder if Portmanteau was a mathematician or if this name is just due to the fact that the theorem is a portmanteau for several statements.

Last Updated on 2010-08-22

Leave a Comment
Syntax · Style · .