Press "Enter" to skip to content


Singular values and row distances

Any matrix \( {M\in\mathcal{M}_{m,n}} \) maps the unit sphere to an ellipsoid. The singular values of \( {M} \) are the half-lengths of the principal axes of this ellipsoid. We denote them

\[ s_1(M)\geq\cdots\geq s_{\min(n,m)}(M) \]

with the convention \( {s_k(M)=0} \) for \( {k>\min(n,m)} \). If \( {M^*} \) is the conjugate transpose of \( {M} \), then the positive semidefinite Hermitian matrices \( {\sqrt{MM^*}} \) and \( {\sqrt{M^*M}} \) share the same spectrum up to the multiplicity of \( {0} \), and this common spectrum is given by the singular values of \( {M} \). The singular value decomposition (SVD) theorem states that there exist unitary matrices \( {U} \) (\( {m\times m} \)) and \( {V} \) (\( {n\times n} \)) such that

\[ UMV=\mathrm{diag}(s_1(M),\ldots,s_{\min(n,m)}(M)) \]

where we pad with as many zeros as needed. The largest singular value

\[ s_1(M)=\max_{\left|x\right|_2=1}\left|Mx\right|_2=\left|M\right|_2 \]

is the operator norm of \( {M} \). For the smallest, when \( {n\leq m} \), we have

\[ s_{\min(n,m)}(M)=\min_{\left|x\right|_2=1}\left|Mx\right|_2. \]

The matrix \( {M} \) has \( {\mathrm{rank}(M)} \) nonzero singular values. The matrix \( {M} \) is invertible if and only if \( {m=n} \) and \( {s_n(M)>0} \), and in this case \( {s_n(M)=\left|M^{-1}\right|_2^{-1}} \). If \( {m=n} \) and \( {M} \) is normal (i.e. \( {MM^*=M^*M} \)) then the singular values of \( {M} \) are the moduli of the eigenvalues of \( {M} \).
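As a quick numerical illustration of these facts, here is a minimal numpy sketch on a random complex square matrix (the size and the seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

    s = np.linalg.svd(M, compute_uv=False)           # s_1 >= ... >= s_4
    assert np.isclose(s[0], np.linalg.norm(M, 2))    # largest singular value = operator norm
    assert np.isclose(s[-1], 1 / np.linalg.norm(np.linalg.inv(M), 2))   # s_n = |M^{-1}|_2^{-1}

    # the singular values are the eigenvalues of sqrt(M M^*) (equivalently of sqrt(M^* M))
    eig = np.linalg.eigvalsh(M @ M.conj().T)
    assert np.allclose(np.sort(s), np.sqrt(np.clip(eig, 0, None)))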

If \( {m=n} \) and \( {M} \) has rows \( {R_1,\ldots,R_n} \), then, by viewing \( {\left|\det(M)\right|} \) as the volume of the parallelepiped spanned by the rows,

\[ |\det(M)|=\prod_{k=1}^ns_k(M) =\prod_{k=1}^n\mathrm{dist}(R_k,\mathrm{span}\{R_1,\ldots,R_{k-1}\}). \]
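This double identity can be checked numerically; in the following sketch the row distances are computed by an ad hoc least-squares projection (numpy assumed, size and seed arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5
    M = rng.standard_normal((n, n))

    s = np.linalg.svd(M, compute_uv=False)

    # dist(R_k, span{R_1, ..., R_{k-1}}) via least-squares projection onto the previous rows
    dists = []
    for k in range(n):
        B = M[:k]                                    # the k previous rows (empty when k = 0)
        if B.size == 0:
            dists.append(np.linalg.norm(M[k]))
        else:
            coef, *_ = np.linalg.lstsq(B.T, M[k], rcond=None)
            dists.append(np.linalg.norm(M[k] - B.T @ coef))

    assert np.isclose(abs(np.linalg.det(M)), np.prod(s))
    assert np.isclose(abs(np.linalg.det(M)), np.prod(dists))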

Here are two additional nice facts relating singular values to row distances, due to Rudelson and Vershynin [CPAM 2009] and to Tao and Vu [AoP 2010].

Lemma 1 (Rudelson and Vershynin) Let \( {1\leq m\leq n} \). If \( {M\in\mathcal{M}_{m,n}(K)} \) has rows \( {R_1,\ldots,R_m} \), then, denoting \( {R_{-i}=\mathrm{span}\{R_j:j\neq i\}} \), we have

\[ m^{-1/2}\min_{1\leq i\leq m}\mathrm{dist}_2(R_i,R_{-i}) \leq s_{\min(m,n)}(M) \leq \min_{1\leq i\leq m}\mathrm{dist}_2(R_i,R_{-i}). \]

Proof: Since \( {M} \) and \( {M^\top} \) have the same singular values, we may prove the statement for the columns \( {C_1,\ldots,C_n} \) of \( {M} \), with \( {C_{-i}=\mathrm{span}\{C_j:j\neq i\}} \). For every vector \( {x\in K^n} \) and every \( {i\in\{1,\ldots,n\}} \), the triangle inequality and the identity \( {Mx=x_1C_1+\cdots+x_n C_n} \) give

\[ \left|Mx\right|_2 \geq \mathrm{dist}_2(Mx,C_{-i}) =\min_{y\in C_{-i}} \left|Mx-y\right|_2 =\min_{y\in C_{-i}} \left|x_iC_i-y\right|_2 =\vert x_i\vert\mathrm{dist}_2(C_i,C_{-i}). \]

If \( {\left|x\right|_2 =1} \) then necessarily \( {\vert x_i\vert \geq n^{-1/2}} \) for some \( {i\in\{1,\ldots,n\}} \), and therefore

\[ s_{\min(m,n)}(M) =\min_{\left|x\right|_2 =1}\left|Mx\right|_2 \geq n^{-1/2}\min_{1\leq i\leq n}\mathrm{dist}_2(C_i,C_{-i}). \]

Conversely, for any \( {i\in\{1,\ldots,n\}} \), there exists a vector \( {y\in K^n} \) with \( {y_i=1} \) such that

\[ \mathrm{dist}_2(C_i,C_{-i}) =\left|y_1C_1+\cdots+y_nC_n\right|_2 =\left|My\right|_2 \geq \left|y\right|_2 \min_{\left|x\right|_2=1}\left|Mx\right|_2 \geq s_{\min(m,n)}(M) \]

where we used the fact that \( {\left|y\right|^2_2 = |y_1|^2+\cdots+|y_n|^2\geq |y_i|^2=1} \). ☐
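Here is a small numerical check of Lemma 1 on a random rectangular matrix with \( {m\leq n} \); the helper dist_to_span below is an ad hoc least-squares projection, not part of the lemma:

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 4, 6
    M = rng.standard_normal((m, n))

    def dist_to_span(v, W):
        # Euclidean distance from the vector v to the span of the rows of W
        coef, *_ = np.linalg.lstsq(W.T, v, rcond=None)
        return np.linalg.norm(v - W.T @ coef)

    dists = np.array([dist_to_span(M[i], np.delete(M, i, axis=0)) for i in range(m)])
    s_min = np.linalg.svd(M, compute_uv=False)[-1]   # s_{min(m,n)}(M)

    assert dists.min() / np.sqrt(m) <= s_min <= dists.min()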

Lemma 2 (Tao and Vu) Let \( {1\leq m\leq n} \) and \( {M\in\mathcal{M}_{m,n}(K)} \) with rows \( {R_1,\ldots,R_m} \). If \( {\mathrm{rank}(M)=m} \) then, denoting \( {R_{-i}=\mathrm{span}\{R_j:j\neq i\}} \), we have

\[ \sum_{i=1}^ms_i^{-2}(M)=\sum_{i=1}^m\mathrm{dist}_2(R_i,R_{-i})^{-2}. \]

Proof: The orthogonal projection of \( {R_i} \) on \( {R_{-i}} \) is \( {B^*(BB^*)^{-1}BR_i^*} \) where \( {B} \) is the \( {(m-1)\times n} \) matrix obtained from \( {M} \) by removing the row \( {R_i} \). In particular, we have

\[ \left| R_i\right|_2^2-\mathrm{dist}_2(R_i,R_{-i})^2 =\left| B^*(BB^*)^{-1}BR_i^*\right|_2^2 = (BR_i^*)^*(BB^*)^{-1}BR_i^* \]

by the Pythagorean theorem. On the other hand, the Schur block inversion formula states that if \( {A} \) is an invertible \( {m\times m} \) matrix then for every partition \( {\{1,\ldots,m\}=I\cup I^c} \),

\[ (A^{-1})_{I,I}=(A_{I,I}-A_{I,I^c}(A_{I^c,I^c})^{-1}A_{I^c,I})^{-1}. \]

Now we take \( {A=MM^*} \) and \( {I=\{i\}} \), and we note that \( {(MM^*)_{i,j}=R_i\cdot R_j} \), which gives

\[ ((MM^*)^{-1})_{i,i}=(R_i\cdot R_i-(BR_i^*)^*(BB^*)^{-1}BR_i^*)^{-1} =\mathrm{dist}_2(R_i,R_{-i})^{-2}. \]

The desired formula follows by taking the sum over \( {i\in\{1,\ldots,m\}} \), since \( {\sum_{i=1}^m((MM^*)^{-1})_{i,i}=\mathrm{Tr}((MM^*)^{-1})=\sum_{i=1}^ms_i^{-2}(M)} \). ☐
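Similarly, here is a minimal sketch checking the identity of Lemma 2 on a random full-rank matrix (numpy assumed, same ad hoc projection helper as above):

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 4, 7                           # m <= n, full rank almost surely
    M = rng.standard_normal((m, n))

    def dist_to_span(v, W):
        coef, *_ = np.linalg.lstsq(W.T, v, rcond=None)
        return np.linalg.norm(v - W.T @ coef)

    s = np.linalg.svd(M, compute_uv=False)
    lhs = np.sum(s ** -2.0)
    rhs = sum(dist_to_span(M[i], np.delete(M, i, axis=0)) ** -2.0 for i in range(m))
    assert np.isclose(lhs, rhs)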


The Marchenko-Pastur law

Marchenko-Pastur distributions

The Marchenko-Pastur law \( {\mu_\rho} \) with shape parameter \( {\rho>0} \) is the probability distribution

\[ \left(1-\frac{1}{\rho}\right)_+\delta_0+\frac{1}{\rho 2\pi x}\sqrt{(b-x)(x-a)}\,\mathbf{1}_{[a,b]}(x)dx \]

where

\[ a=(1-\sqrt{\rho})^2 \quad\text{and}\quad b=(1+\sqrt{\rho})^2. \]

It has mean \( {1} \) and variance \( {\rho} \). It has an atom at point \( {0} \) if and only if \( {\rho>1} \).

Lemma 1 (Moments of the Marchenko-Pastur law) For every integer \( {r\geq1} \),

\[ \int\!x^r\,d\mu_\rho(x) =\sum_{k=0}^{r-1}\frac{\rho^k}{k+1}\binom{r}{k}\binom{r-1}{k}. \]

Proof: Since \( {a+b=2(1+\rho)} \) and \( {ab=(1-\rho)^2} \) we have

\[ \sqrt{(b-x)(x-a)} =\sqrt{\frac{(a+b)^2}{4}-ab-\left(x-\frac{a+b}{2}\right)^2} =\sqrt{4\rho-(x-(1+\rho))^2}. \]

The change of variable \( {y=(x-(1+\rho))/\sqrt{\rho}} \) gives

\[ \int\!x^r\,d\mu_\rho(x) =\frac{1}{2\pi\rho}\int_a^b\!x^{r-1}\sqrt{(b-x)(x-a)}\,dx =\frac{1}{2\pi} \int_{-2}^2\!(\sqrt{\rho}y+1+\rho)^{r-1}\sqrt{4-y^2}\,dy. \]

Recall that the moments of the semicircle law are given by the Catalan numbers:

\[ \frac{1}{2\pi}\int_{-2}^2\!y^{2k+1}\,\sqrt{4-y^2}\,dy =0 \quad\text{and}\quad \frac{1}{2\pi}\int_{-2}^2\!y^{2k}\,\sqrt{4-y^2}\,dy =\frac{1}{1+k}\binom{2k}{k}. \]

This gives, by using binomial expansions and Vandermonde's identity,

\[ \begin{array}{rcl} \int\!x^r\,d\mu_\rho(x) &=&\sum_{k=0}^{\lfloor(r-1)/2\rfloor} \rho^k(1+\rho)^{r-1-2k}\binom{r-1}{2k}\binom{2k}{k}\frac{1}{1+k}\\ &=&\sum_{k=0}^{\lfloor(r-1)/2\rfloor} \rho^k(1+\rho)^{r-1-2k}\frac{(r-1)!}{(r-1-2k)!k!(k+1)!}\\ &=&\sum_{k=0}^{\lfloor(r-1)/2\rfloor} \sum_{s=0}^{r-1-2k} \rho^{k+s}\frac{(r-1)!}{k!(k+1)!(r-1-2k-s)!s!}\\ &=&\sum_{t=0}^{r-1}\rho^t\sum_{k=0}^{\min(t,r-1-t)} \frac{(r-1)!}{k!(k+1)!(r-1-t-k)!(t-k)!}\\ &=&\frac{1}{r}\sum_{t=0}^{r-1}\rho^t\binom{r}{t}\sum_{k=0}^{\min(t,r-1-t)} \binom{t}{k}\binom{r-t}{k+1}\\ &=&\frac{1}{r}\sum_{t=0}^{r-1}\rho^t\binom{r}{t}\binom{r}{t+1} \\ &=&\sum_{t=0}^{r-1}\frac{\rho^t}{t+1}\binom{r}{t}\binom{r-1}{t}. \end{array} \]

☐
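As a sanity check, the moment formula can be compared with a direct numerical evaluation of the integral above; here is a minimal numpy sketch (the substitution \( {y=2\sin\theta} \) and the trapezoidal rule are arbitrary implementation choices):

    import numpy as np
    from math import comb

    def mp_moment_formula(r, rho):
        return sum(rho ** k / (k + 1) * comb(r, k) * comb(r - 1, k) for k in range(r))

    def mp_moment_numeric(r, rho, npts=100_001):
        # int x^r dmu_rho = (1/(2 pi)) int_{-2}^{2} (sqrt(rho) y + 1 + rho)^(r-1) sqrt(4 - y^2) dy,
        # evaluated after the substitution y = 2 sin(theta), which smooths out the endpoints
        theta = np.linspace(-np.pi / 2, np.pi / 2, npts)
        f = (2 * np.sqrt(rho) * np.sin(theta) + 1 + rho) ** (r - 1) * 4 * np.cos(theta) ** 2
        return np.sum((f[:-1] + f[1:]) / 2) * (theta[1] - theta[0]) / (2 * np.pi)

    for rho in (0.3, 1.0, 2.5):
        for r in range(1, 6):
            assert np.isclose(mp_moment_formula(r, rho), mp_moment_numeric(r, rho))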

The Marchenko-Pastur law appears in the famous Marchenko-Pastur theorem. Namely, let \( {X_1,\ldots,X_n} \) be independent and identically distributed random column vectors of \( {\mathbb{R}^m} \), with mean \( {0} \) and covariance matrix \( {I_m} \). Let us consider the \( {m\times m} \) empirical covariance matrix

\[ \frac{1}{n}\sum_{i=1}^nX_iX_i^\top=\frac{1}{n}YY^\top \quad\text{where}\quad Y=\left(X_1\cdots X_n\right). \]

We have \( {\mathbb{E}(\frac{1}{n}YY^\top)=I_m} \) and the strong law of large numbers indicates that with probability one,

\[ \lim_{n\rightarrow\infty}\frac{1}{n}YY^\top = I_m. \]

What happens if \( {m} \) grows with \( {n} \)? Let us assume for simplicity that the components of \( {X_1} \) are independent (automatic if \( {X_1} \) is Gaussian).

Theorem 2 (Marchenko-Pastur) Suppose that \( {m=m_n} \) depends on \( {n} \) in such a way that

\[ \lim_{n\rightarrow\infty}\frac{m_n}{n}=\rho\in(0,\infty). \]

Let \( {\lambda_{n,1},\ldots,\lambda_{n,m_n}\in[0,\infty)} \) be the eigenvalues of the \( {m_n\times m_n} \) empirical covariance matrix \( {\frac{1}{n}YY^\top} \), and let us consider the empirical spectral distribution

\[ \mu_n:=\frac{1}{m_n}\sum_{k=1}^{m_n}\delta_{\lambda_{n,k}}. \]

Then with probability one, for every bounded continuous \( {f:\mathbb{R}\rightarrow\mathbb{R}} \),

\[ \lim_{n\rightarrow\infty}\int\!f\,d\mu_n=\int\!f\,d\mu_\rho. \]

In other words, with probability one, for every \( {x\in\mathbb{R}} \) (\( {x\neq0} \) if \( {\rho>1} \)), denoting \( {I=(-\infty,x]} \),

\[ \lim_{n\rightarrow\infty}\mu_n(I)=\mu_\rho(I). \]
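Here is a minimal simulation sketch of the theorem with Gaussian entries (the sizes, seed, and binning below are arbitrary choices): the histogram of the eigenvalues of \( {\frac{1}{n}YY^\top} \) should be close to the Marchenko-Pastur density.

    import numpy as np

    rng = np.random.default_rng(4)
    n, rho = 2000, 0.5
    m = int(rho * n)

    Y = rng.standard_normal((m, n))              # the columns play the role of X_1, ..., X_n
    eigs = np.linalg.eigvalsh(Y @ Y.T / n)       # spectrum of the empirical covariance matrix

    a, b = (1 - np.sqrt(rho)) ** 2, (1 + np.sqrt(rho)) ** 2
    hist, edges = np.histogram(eigs, bins=50, range=(a, b), density=True)
    midpoints = (edges[:-1] + edges[1:]) / 2
    density = np.sqrt((b - midpoints) * (midpoints - a)) / (2 * np.pi * rho * midpoints)

    print(np.max(np.abs(hist - density)))        # gets small as n grows: mu_n is close to mu_rho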

Note that

\[ \int\!x\,d\mu_n=\frac{1}{m_n}\sum_{k=1}^{m_n}\lambda_{n,k} =\frac{1}{m_n}\mathrm{Tr}\left(\frac{1}{n}YY^\top\right) =\frac{1}{nm_n}\sum_{\substack{1\leq i\leq m_n\\1\leq j\leq n}}Y_{ij}^2 \]

and thus, by the strong law of large numbers, with probability one,

\[ \lim_{n\rightarrow\infty} \int\!x\,d\mu_n = 1 =\int\!x\,d\mu_\rho(x). \]

This shows the almost sure convergence of the first moment. It also shows that the sequence \( {{(\mu_n)}_{n\geq1}} \) is tight with probability one, since by Markov's inequality, for all \( {r>0} \),

\[ \mu_n([0,r]^c) \leq \frac{1}{r}\int\!x\,d\mu_n. \]
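The trace identity above is straightforward to check numerically; here is a minimal sketch with Gaussian entries (parameters arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    n, m = 2000, 1000
    Y = rng.standard_normal((m, n))
    eigs = np.linalg.eigvalsh(Y @ Y.T / n)

    # mean of mu_n = (1/m) Tr((1/n) Y Y^T) = (1/(n m)) * sum of the squared entries of Y
    assert np.isclose(eigs.mean(), (Y ** 2).sum() / (n * m))
    print(eigs.mean())                           # close to 1, as the strong law of large numbers suggests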

The Marchenko-Pastur theorem can be proved by using the method of moments or the Cauchy-Stieltjes resolvent method. In the Gaussian case, one may also use orthogonal polynomials (in this case the empirical covariance matrix belongs to the Laguerre Ensemble and its law is Wishart). Useful references include the recent book by Bai and Silverstein and the forthcoming book of Pastur and Shcherbina. Regarding orthogonal polynomials, there is a differential operator approach due to Haagerup and Thorbjørnsen, developed by Ledoux, in which the Marchenko-Pastur distribution is recovered as a mixture of uniform and arcsine distributions. In free probability theory, the Marchenko-Pastur law is known as the free Poisson law, see for instance the book by Nica and Speicher or the book by Hiai and Petz.

If \( {\rho=1} \) then \( {\frac{1}{n}YY^\top} \) is asymptotically square, \( {a=0} \), \( {b=4} \), and the change of variable \( {t\mapsto\sqrt{t}} \) shows that with probability one, the empirical spectral distribution of

\[ \sqrt{\frac{1}{n}YY^\top} \]

converges as \( {n\rightarrow\infty} \) to the quarter circle law with density

\[ \frac{1}{\pi}\sqrt{4-x^2}\,\mathbf{1}_{[0,2]}(x)dx. \]
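A similar sketch illustrates the quarter circle limit in the square Gaussian case (sizes and binning arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 2000
    Y = rng.standard_normal((n, n))                          # square case, rho = 1

    sv = np.linalg.svd(Y / np.sqrt(n), compute_uv=False)     # spectrum of sqrt((1/n) Y Y^T)

    hist, edges = np.histogram(sv, bins=40, range=(0, 2), density=True)
    midpoints = (edges[:-1] + edges[1:]) / 2
    quarter_circle = np.sqrt(4 - midpoints ** 2) / np.pi
    print(np.max(np.abs(hist - quarter_circle)))             # gets small as n grows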

Concerning the support of the spectrum, one may first order the spectrum of \( {\frac{1}{n}YY^\top} \) as follows:

\[ \lambda_{n,1}\geq\cdots\geq\lambda_{n,m}. \]

Note that \( {\lambda_{n,k}=0} \) if \( {k>\min(m,n)} \). We have then the following:

Theorem 3 (Bai-Yin) Under the assumptions of the Marchenko-Pastur theorem, and if additionally \( {X_{11}} \) has a finite fourth moment, then with probability one

\[ \lim_{n\rightarrow\infty}\lambda_{n,\min(m,n)}=a \quad\text{and}\quad \lim_{n\rightarrow\infty}\lambda_{n,1}=b. \]
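Numerically, in the Gaussian case (which has a finite fourth moment), the extreme eigenvalues indeed get close to the edges \( {a} \) and \( {b} \); a minimal sketch:

    import numpy as np

    rng = np.random.default_rng(7)
    n, rho = 4000, 0.25
    m = int(rho * n)
    Y = rng.standard_normal((m, n))              # Gaussian entries have a finite fourth moment
    eigs = np.linalg.eigvalsh(Y @ Y.T / n)

    a, b = (1 - np.sqrt(rho)) ** 2, (1 + np.sqrt(rho)) ** 2
    print(eigs.min(), a)                         # smallest eigenvalue, close to a = (1 - sqrt(rho))^2
    print(eigs.max(), b)                         # largest eigenvalue, close to b = (1 + sqrt(rho))^2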

The Marchenko-Pastur theorem remains valid even if \( {X_{11}} \) does not have a zero mean, in contrast with the Bai-Yin theorem, which fails in that case. The edge of the spectrum of \( {\frac{1}{n}YY^\top} \) is sensitive to finite-rank perturbations of \( {Y} \), while the bulk of the spectrum is not, asymptotically.

Vladimir Marchenko and Leonid Pastur

Curvature on discrete spaces

Let us consider an infinite continuous or discrete metric space \((S,d)\) equipped with a probability measure \(\mu\), for instance a Riemannian manifold with a potential, such as the sphere equipped with its uniform measure, or \(\mathbb{R}^n\) equipped with the Gaussian measure. Among the most studied problems are the concentration of measure phenomenon for \(\mu\) and the numerical simulation of \(\mu\). To address these problems, one may introduce an auxiliary Markov process \((X_t)_{t\geq0}\) on \(S\) which admits \(\mu\) as an invariant distribution (e.g. the Brownian motion on the sphere and the Ornstein-Uhlenbeck process on \(\mathbb{R}^n\)). Perhaps the most famous instance of this approach is the Metropolis-Hastings simulation algorithm (Markov chain Monte Carlo).
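To fix ideas, here is a minimal random-walk Metropolis-Hastings sketch targeting the standard Gaussian measure on \(\mathbb{R}\) (the target, the proposal, and the chain length are arbitrary illustrative choices, not tied to a specific model):

    import numpy as np

    rng = np.random.default_rng(8)

    def log_density(x):
        return -0.5 * x ** 2                     # standard Gaussian target mu, up to a constant

    # random-walk Metropolis-Hastings chain: mu is invariant for this dynamics
    x, chain = 0.0, []
    for _ in range(200_000):
        y = x + rng.normal()                     # symmetric proposal
        if np.log(rng.uniform()) < log_density(y) - log_density(x):
            x = y                                # accept, otherwise stay at x
        chain.append(x)

    print(np.mean(chain), np.var(chain))         # close to 0 and 1, the moments of mu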

The speed of convergence of \((X_t)_{t\geq0}\) to \(\mu\) in entropy can be related to the concentration of measure of \(\mu\). In the case of a Riemannian manifold, this can be related to a notion of curvature taking into account both the geometry (Ricci) of the manifold and the Markov dynamics. It is known as the Bakry-Émery curvature. By using Wasserstein coupling instead of entropy, Ollivier extended this notion of curvature to discrete spaces (there are some connections with ideas of M.-F. Chen, Joulin, Villani, Lott, and Sturm). This primarily allows one to control the concentration of measure of \(\mu\). From this point of view the Markov process is instrumental.

If, however, one has a specific Markov process of interest (coming from computer science, mathematical biology, or mathematical physics) for which the question is to study the speed of convergence to equilibrium in entropy, then the curvature of Ollivier seems useless, because Wasserstein coupling decay does not give entropy decay in general (it does for elliptic Markov diffusions). To our knowledge, for birth and death processes, a good notion of curvature for the decay of entropy was developed by Caputo, Dai Pra, and Posta, and relies only on commutation and convexity, as in the Bakry-Émery approach. A good notion of curvature for the decay of entropy for multidimensional discrete processes still seems to be lacking.
