Press "Enter" to skip to content

Euclidean kernel random matrices in high dimension

This post gathers the notes of a talk given on February 12, 2013, for our Tuesday morning working seminar ASPro (Analyse, Statistique, et Probabilités) in Marne-la-Vallée. It concerns the preprint arXiv:1209.5888 by Charles Bordenave, entitled On Euclidean random matrices in high dimension. We follow the text of Bordenave very closely, adding some comments and skipping some details. This short paper is of particular interest for our group, which comprises experts in log-concave distributions, random matrices, statistics, and optimal transportation. The first part of the talk was devoted to the positivity of deterministic kernels, essentially the Bochner and Schoenberg theorems, following a previous post.

Let \( {X} \) be a random vector of \( {\mathbb{R}^p} \) with mean zero and covariance matrix \( {\frac{1}{p}I} \):

\[ \mathbb{E}(X)=(\mathbb{E}(X_k))_{1\leq k\leq p}=0, \quad \mathbb{E}(X\otimes X)=(\mathbb{E}(X_jX_k))_{1\leq j,k\leq p}=\frac{1}{p}I. \]

If \( {X} \) is seen as a column vector then \( {X\otimes X=XX^\top} \) is the empirical covariance matrix of a sample of size \( {1} \). In the jargon, the random vector \( {\sqrt{p}X} \) is isotropic. Let \( {X_1,\ldots,X_n} \) be i.i.d. copies of \( {X} \). Let \( {f:\mathbb{R}_+\rightarrow\mathbb{R}} \) be a measurable function. We are interested in the Euclidean random kernel \( {n\times n} \) matrix \( {A} \) defined by

\[ A_{i,j}:=f(\left\Vert X_i-X_j\right\Vert_2^2). \]
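For concreteness, here is a minimal numerical sketch, not taken from the paper, that builds such a matrix for a Gaussian choice of \( {X} \), namely \( {X\sim\mathcal{N}(0,\frac{1}{p}I)} \), which is indeed centered, log-concave, and has covariance \( {\frac{1}{p}I} \); the kernel \( {f(x)=\sqrt{x}} \) is an arbitrary illustrative choice.

```python
import numpy as np

# Minimal sketch (not from the paper): build the Euclidean kernel matrix
# A_{i,j} = f(||X_i - X_j||_2^2) for X ~ N(0, I/p), an illustrative centered
# log-concave distribution with covariance I/p.
rng = np.random.default_rng(0)
n, p = 500, 250                                      # here p/n -> y = 1/2
X = rng.normal(scale=1.0 / np.sqrt(p), size=(n, p))  # rows are X_1, ..., X_n

sq = (X ** 2).sum(axis=1)                            # ||X_i||_2^2
D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T       # ||X_i - X_j||_2^2

f = np.sqrt                                          # illustrative kernel f(x) = sqrt(x)
A = f(np.maximum(D2, 0.0))                           # clip rounding noise on the diagonal
spectrum = np.linalg.eigvalsh(A)                     # eigenvalues of the symmetric matrix A
```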

The notation \( {A} \) (as in Adjacency), taken from the paper, perhaps reveals the author's tropism for (random) graph theory. Kernel matrices appear in many areas of pure and applied mathematics, and every statistician knows something about kernel methods. We have already studied the closely related Bochner, Schoenberg, and Bernstein theorems in a previous post. Actually, the study of random kernel matrices is quite natural from various points of view. We refer in particular to Mézard, Parisi, and Zee for statistical physics, to Vershik for random metric spaces, to El Karoui for mathematical statistics, and to Lovász-Szegedy, Bordenave, and Vu and Do for random graphs and combinatorial optimization.

Main result. Here is the main result discussed in this post: suppose that

  • \( {X} \) has a log-concave distribution;
  • \( {f} \) is three times differentiable at point \( {2} \);
  • \( {p=p_n} \) is such that \( {p_n/n\rightarrow y\in(0,\infty)} \) as \( {n\rightarrow\infty} \).

Then almost surely, the empirical spectral distribution

\[ \mu_A:=\frac{1}{n}\sum_{k=1}^n\delta_{\lambda_k(A)} \]

converges weakly as \( {n\rightarrow\infty} \) to the law \( {\mu} \) of

\[ f(0)-f(2)+2f'(2)(1-S) \]

where \( {S} \) follows the Marchenko-Pastur distribution given by (here \( {y_\pm:=(1\pm\frac{1}{\sqrt{y}})^2} \)),

\[ \mu_{MP}(dx):=(1-y)^+\delta_0(dx) +\frac{y}{2\pi x}\sqrt{(y_+-x)(x-y_-)}\mathbf{1}_{[y_-,y_+]}(x)\,dx. \]
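As a sanity check of the statement, here is a hedged numerical sketch, not from the paper: it compares the spectrum of \( {A} \) (Gaussian \( {X} \) and kernel \( {f(x)=\sqrt{x}} \), both illustrative choices) with the predicted limit, where the Marchenko-Pastur variable \( {S} \) is approximated by the eigenvalues of \( {Y^\top Y} \) built from an independent Gaussian sample, in line with the Pajor-Pastur convergence recalled below.

```python
import numpy as np

# Hedged sanity check (illustrative choices, not from the paper): compare the
# empirical spectral distribution of A with the predicted limit
#   f(0) - f(2) + 2 f'(2) (1 - S),   S ~ Marchenko-Pastur,
# where S is approximated by the spectrum of Y^T Y for an independent sample.
rng = np.random.default_rng(1)
n = 2000
y = 0.5
p = int(y * n)                                        # p/n -> y

f = np.sqrt
f0, f2, df2 = 0.0, np.sqrt(2.0), 1.0 / (2.0 * np.sqrt(2.0))  # f(0), f(2), f'(2)

X = rng.normal(scale=1.0 / np.sqrt(p), size=(n, p))   # rows X_1, ..., X_n
sq = (X ** 2).sum(axis=1)
A = f(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
spec_A = np.sort(np.linalg.eigvalsh(A))

Z = rng.normal(scale=1.0 / np.sqrt(p), size=(n, p))   # independent copy of the sample
S = np.linalg.eigvalsh(Z @ Z.T)                       # spectrum of Y^T Y (n x n Gram matrix)
spec_limit = np.sort(f0 - f2 + 2.0 * df2 * (1.0 - S))

# Kolmogorov distance between the two empirical distributions (both are step CDFs).
grid = np.sort(np.concatenate([spec_A, spec_limit]))
F_A = np.searchsorted(spec_A, grid, side="right") / n
F_L = np.searchsorted(spec_limit, grid, side="right") / n
print("Kolmogorov distance to the predicted limit:", np.abs(F_A - F_L).max())
```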

Exercise: take 1) \( {f(x)=x} \) and 2) \( {f(x)=\sqrt{x}} \).
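For instance, a quick check of the exercise under the stated hypotheses: in case 1), \( {f(0)=0} \), \( {f(2)=2} \), \( {f'(2)=1} \), so the limiting law is that of \( {-2S} \); this is consistent with the exact identity \( {A=-2Y^\top Y+R} \) where \( {\mathrm{rank}(R)\leq2} \) (with \( {Y} \) as below). In case 2), \( {f(0)=0} \), \( {f(2)=\sqrt{2}} \), \( {f'(2)=\frac{1}{2\sqrt{2}}} \), so the limiting law is that of \( {-\frac{1+S}{\sqrt{2}}} \), with \( {S\sim\mu_{MP}} \) in both cases.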

Rough idea of the proof. The idea is to start from the formula

\[ \left\Vert X_i-X_j\right\Vert_2^2 =\left\Vert X_i\right\Vert_2^2 + \left\Vert X_j\right\Vert_2^2 - 2\left<X_i,X_j\right>. \]

Now, following Guédon and Milman, there exist universal constants \( {c_0,c_1>0} \) such that for any \( {t\geq0} \),

\[ \mathbb{P}(\left|\left\Vert X\right\Vert_2-1\right|\geq t) \leq c_1\exp\left(-c_0\sqrt{p}\min(t,t^3)\right). \]

It follows that \( {\left\Vert X\right\Vert_2\approx1} \) with high probability as \( {p\gg1} \). Consequently, using a Taylor expansion at point \( {2} \), we get

\[ A_{i,j} =f(\left\Vert X_i-X_j\right\Vert_2^2) \approx \mathbf{1}_{i=j}f(0)+\mathbf{1}_{i\neq j}(f(2)-2f'(2)\left<X_i,X_j\right>). \]

This can be rewritten in matrix form as

\[ A\approx M:=(f(0)-f(2)+2f'(2))I+f(2)J-2f'(2)Y^\top Y \]

where \( {J} \) is the \( {n\times n} \) matrix full of ones, and where \( {Y} \) is the \( {p\times n} \) matrix with columns \( {X_1,\ldots,X_n} \). The \( {p\times p} \) matrix \( {YY^\top=X_1X_1^\top+\cdots+X_nX_n^\top} \) is the (non-normalized) empirical covariance matrix of the i.i.d. sample \( {X_1,\ldots,X_n} \). It is well known that \( {YY^\top} \) and \( {Y^\top Y} \) have identical eigenvalues, up to the multiplicity of the eigenvalue zero (which contributes a Dirac mass at zero to the empirical spectral distribution).
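Indeed, if \( {Y^\top Yv=\lambda v} \) with \( {\lambda\neq0} \) then \( {Yv\neq0} \) and \( {YY^\top(Yv)=\lambda(Yv)} \), so the nonzero eigenvalues of \( {YY^\top} \) and \( {Y^\top Y} \) coincide, with multiplicities. Counting dimensions, when \( {n\geq p} \),

\[ \mu_{Y^\top Y} =\Bigl(1-\frac{p}{n}\Bigr)\delta_0+\frac{p}{n}\,\mu_{YY^\top}. \]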

We can now give a rough intuitive proof of the result: since \( {J} \) has rank one,

\[ \mu_A-\mu_{(f(0)-f(2)+2f'(2))I-2f'(2)Y^\top Y}\rightarrow0 \]

weakly. Now, by a result of Pajor and Pastur (see also Adamczak),

\[ \mu_{Y^\top Y}\rightarrow\mu_{MP} \]

weakly. Therefore \( {\mu_A\rightarrow\mu} \) weakly, where \( {\mu} \) is the law of

\[ f(0)-f(2)+2f'(2)-2f'(2)S,\quad S\sim\mu_{MP}. \]

Conjectures. The thin-shell conjecture formulated by Anttila, Ball, and Perissinaki (see also Bobkov and Koldobsky) states that with high probability \( {X} \) belongs to a shell whose width is negligible with respect to its radius. In other words, the norm \( {\left\Vert X\right\Vert_2} \) is well concentrated around, say, \( {1=\mathbb{E}(\left\Vert X\right\Vert_2^2)^{1/2}\geq\mathbb{E}(\left\Vert X\right\Vert_2)} \): for some \( {\varepsilon_p\rightarrow0} \),

\[ \mathbb{P}(|\left\Vert X\right\Vert_2-1|\geq\varepsilon_p) \leq\varepsilon_p. \]

The thin-shell property is essential in the available proofs of the central limit theorem for convex bodies, and is naturally linked with the famous Kannan-Lovász-Simonovits (KLS) spectral gap conjecture. A thin-shell property in the form \( {\mathbb{E}((\left\Vert X\right\Vert_2-1)^2)=\mathcal{O}(1/p)} \) has been proved by Klartag under the additional assumption of unconditionality (symmetries of the cube: the law of \( {X} \) is invariant under i.i.d. sign flips of the coordinates). The result of Guédon and Milman mentioned earlier is a weak instance of the thin-shell property. Following Bordenave, a strong thin-shell property allows one to show that the result above on \( {A} \) remains valid if we only assume that \( {f} \) is differentiable at the point \( {2} \), a result conjectured by Vu and Do.

More precisely, starting from \( {A} \), the proof proceeds by successive Taylor formulas replacing norms with constants, each time producing a comparison with a simpler matrix, the last matrix being \( {M} \). At each step, the error is controlled by using suitable distances on the empirical spectral distributions of random matrices, and is shown to be asymptotically negligible by using concentration or a rank bound.

Proof 0/7. Functional distances and spectral distributions. Recall that the variation of \( {f:\mathbb{R}\rightarrow\mathbb{R}} \) is

\[ \left\Vert f\right\Vert_V:=\sup_{(x_k)_{k\in\mathbb{Z}}\nearrow}\sum_k|f(x_{k+1})-f(x_k)| \]

where the supremum runs over all real increasing sequences \( {(x_k)_{k\in\mathbb{Z}}} \). When \( {f} \) is \( {\mathcal{C}^1} \) then \( {\left\Vert f\right\Vert_V=\int\!|f'(x)|\,dx=\left\Vert f'\right\Vert_{\mathrm{L}^1}} \) while if \( {f=\mathbf{1}_{(-\infty,t]}} \) then \( {\left\Vert f\right\Vert_V=1} \). The Kolmogorov-Smirnov distance between two probability measures \( {\mu} \) and \( {\nu} \) is

\[ d_{\mathrm{K}}(\mu,\nu):= \sup_{\left\Vert f\right\Vert_V\leq 1}\int\!f\,d(\mu-\nu). \]

Convergence in the \( {d_{\mathrm{K}}} \) distance implies weak convergence. Another useful distance implying weak convergence is the Wasserstein coupling distance of order \( {p\geq1} \), defined by

\[ W_p(\mu,\nu):=\inf_{(U,V)}\mathbb{E}(|U-V|^p)^{1/p} \]

where the infimum runs over the set of couples \( {(U,V)} \) of random variables with \( {U\sim\mu} \) and \( {V\sim\nu} \). The map \( {p\mapsto W_p} \) is non-decreasing, and in particular \( {W_1\leq W_2} \). Now, thanks to a duality result due to Kantorovich and Rubinstein,

\[ W_1(\mu,\nu) =\sup_{\left\Vert f\right\Vert_{\mathrm{Lip}}\leq1}\int\!f\,d(\mu-\nu). \]

The following mixed metric also characterizes weak convergence:

\[ d_{\mathrm{KW}}(\mu,\nu):= \sup_{\max(\left\Vert f\right\Vert_V,\left\Vert f\right\Vert_{\mathrm{Lip}})\leq1} \int\!f\,d(\mu-\nu) \leq\min(d_{\mathrm{K}},W_1)\leq\min(d_{\mathrm{K}},W_2). \]
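Note that neither of \( {d_{\mathrm{K}}} \) and \( {W_1} \) dominates the other: for instance \( {d_{\mathrm{K}}(\delta_0,\delta_\varepsilon)=1} \) for every \( {\varepsilon>0} \) while \( {W_1(\delta_0,\delta_\varepsilon)=\varepsilon} \), and conversely \( {W_1(\delta_0,\delta_t)=t} \) is large for large \( {t} \) while \( {d_{\mathrm{K}}(\delta_0,\delta_t)=1} \). This explains the interest of the mixed metric \( {d_{\mathrm{KW}}} \), which is controlled by both.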

This is reminiscent of the well-known Fortet-Mourier distance

\[ d_{\mathrm{F}}(\mu,\nu):= \sup_{\max(\left\Vert f\right\Vert_\infty,\left\Vert f\right\Vert_{\mathrm{Lip}})\leq1} \int\!f\,d(\mu-\nu). \]

By the Hoffman-Wielandt inequality, for all \( {n\times n} \) Hermitian matrices \( {H_1} \) and \( {H_2} \),

\[ nW_2(\mu_{H_1},\mu_{H_2})^2 =\min_{\sigma\in\mathcal{S}_n}\sum_{k=1}^n(\lambda_k(H_1)-\lambda_{\sigma(k)}(H_2))^2 \leq \mathrm{Tr}((H_1-H_2)^2)=\left\Vert H_1-H_2\right\Vert_{\mathrm{HS}}^2. \]

On the other hand, by the Courant-Fischer min-max variational formulas for the eigenvalues, for all \( {n\times n} \) Hermitian matrices \( {H_1} \) and \( {H_2} \):

\[ d_{\mathrm{K}}(\mu_{H_1},\mu_{H_2}) \leq \frac{\mathrm{rank}(H_1-H_2)}{n}. \]

Combining the two, we finally obtain that for all \( {n\times n} \) Hermitian matrices \( {H_1} \) and \( {H_2} \),

\[ d_{\mathrm{KW}}(\mu_{H_1},\mu_{H_2}) \leq \min\left( \sqrt{\frac{\mathrm{Tr}((H_1-H_2)^2)}{n}}, \frac{\mathrm{rank}(H_1-H_2)}{n}\right). \]

In random matrix theory, this is maybe the most useful global perturbation inequality.
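For illustration, here is a hedged numerical check, not from the paper, of the two ingredients on a Gaussian example: the Hoffman-Wielandt control of \( {W_2} \) by the Hilbert-Schmidt norm, and the rank bound on \( {d_{\mathrm{K}}} \).

```python
import numpy as np

# Hedged numerical check (not from the paper) of the two perturbation bounds:
#   W_2(mu_{H1}, mu_{H2})^2 <= Tr((H1 - H2)^2) / n      (Hoffman-Wielandt),
#   d_K(mu_{H1}, mu_{H2})   <= rank(H1 - H2) / n        (Courant-Fischer).
rng = np.random.default_rng(2)
n = 400

def esd_W2(H1, H2):
    # W_2 between two n-atom empirical spectral distributions: sorted matching.
    l1, l2 = np.sort(np.linalg.eigvalsh(H1)), np.sort(np.linalg.eigvalsh(H2))
    return np.sqrt(np.mean((l1 - l2) ** 2))

def esd_dK(H1, H2):
    # Kolmogorov distance: maximal gap between the two empirical CDFs.
    l1, l2 = np.sort(np.linalg.eigvalsh(H1)), np.sort(np.linalg.eigvalsh(H2))
    grid = np.sort(np.concatenate([l1, l2]))
    F1 = np.searchsorted(l1, grid, side="right") / n
    F2 = np.searchsorted(l2, grid, side="right") / n
    return np.abs(F1 - F2).max()

G = rng.normal(size=(n, n))
H = (G + G.T) / np.sqrt(2.0 * n)                  # a symmetric test matrix

E = rng.normal(size=(n, n))
E = (E + E.T) / n                                 # small Hilbert-Schmidt perturbation
print(esd_W2(H, H + E), "<=", np.sqrt(np.trace(E @ E) / n))

U = rng.normal(size=(n, 3))
R = U @ U.T / n                                   # perturbation of rank at most 3
print(esd_dK(H, H + R), "<=", np.linalg.matrix_rank(R) / n)
```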

Proof 1/7. Reduction to expectation by concentration. Let us see our random matrix \( {A=A(X_1,\ldots,X_n)} \) as a measurable function of \( {X_1,\ldots,X_n} \). Now, if \( {x_1,\ldots,x_n} \) and \( {x_1',\ldots,x_n'} \) differ in only one coordinate then

\[ \mathrm{rank}\left(A(x_1,\ldots,x_n)-A(x_1',\ldots,x_n')\right)\leq 2. \]
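Indeed, replacing \( {x_i} \) by \( {x_i'} \) modifies only the \( {i} \)-th row and the \( {i} \)-th column of \( {A} \): every column of the difference with index \( {j\neq i} \) is proportional to \( {e_i} \), so the column space of the difference is spanned by \( {e_i} \) and its \( {i} \)-th column, whence the rank bound.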

By combining this with the rank inequality above and the Azuma-Hoeffding bounded differences martingale method, we obtain the following concentration inequality, valid for every \( {f:\mathbb{R}\rightarrow\mathbb{R}} \) of finite variation and every real number \( {r\geq0} \),

\[ \mathbb{P}\left(\int\!f\,d\mu_A-\mathbb{E}\int\!f\,d\mu_A\geq r\right) \leq \exp\left(-\frac{nr^2}{8\left\Vert f\right\Vert_V^2}\right). \]

By the first Borel-Cantelli lemma, applied for each \( {f} \) in a countable family of test functions rich enough to detect weak convergence, it follows that almost surely \( {\mu_A-\mathbb{E}\mu_A\rightarrow0} \) weakly. This reduces the analysis to the expected spectral distribution \( {\mathbb{E}\mu_A} \).

Proof 2/7. Reduction to linearized version. Recall that by the Pajor-Pastur theorem, \( {\mathbb{E}\mu_{Y^\top Y}\rightarrow\mu_{MP}} \) weakly. Since \( {J} \) has rank one, it follows from the rank inequality above that \( {\mathbb{E}\mu_M\rightarrow\mu} \) weakly. Therefore it suffices to prove that \( {\mathbb{E}\mu_A-\mathbb{E}\mu_M\rightarrow0} \), and thus for example to prove that

\[ d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_M)\rightarrow0. \]

The method consists in performing Taylor expansions, and in using concentration.

Proof 3/7. Concentration of norms. We need to control the concentration of \( {\left\Vert X_i-X_j\right\Vert_2^2} \) around \( {2} \) and of \( {\left\Vert X_i\right\Vert_2^2} \) around \( {1} \). There exists \( {\delta>0} \) such that \( {f} \) is \( {\mathcal{C}^1} \) on \( {K:=(2-\delta,2+\delta)} \) and for every \( {x\in K} \),

\[ f(x)=f(2)+f'(2)(x-2)+\frac{f''(2)}{2}(x-2)^2+\frac{f'''(2)}{6}(x-2)^3(1+o(1)). \]

Since the random vector \( {\sqrt{\frac{p}{2}}(X_i-X_j)} \) is log-concave and isotropic for every \( {i\neq j} \), it follows from the Guédon-Milman result mentioned earlier that as \( {n\rightarrow\infty} \), with \( {\varepsilon_n:=\min(n^{-\kappa},\delta/2)} \) for a small enough \( {\kappa>0} \), we have \( {\mathbb{P}(\mathcal{E})\rightarrow1} \) where

\[ \mathcal{E}:=\left\{ \max_{1\leq i,j\leq n}|\left\Vert X_i-X_j\right\Vert_2^2-2|\leq\varepsilon_n; \max_{1\leq i\leq n}|\left\Vert X_i\right\Vert_2^2-1|\leq\varepsilon_n\right\}. \]

Proof 4/7. First Taylor expansion. Since \( {f} \) is \( {\mathcal{C}^1} \) in \( {K} \), we perform a Taylor expansion around \( {\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2} \), which belongs to \( {K} \) on the event \( {\mathcal{E}} \). This leads to compare \( {A_{i,j}=f(\left\Vert X_i-X_j\right\Vert_2^2)} \) with

\[ B_{i,j}:=\mathbf{1}_{i=j}f(0) + \mathbf{1}_{i\neq j}\left[f(\left\Vert X_i\right\Vert_2^2 +\left\Vert X_j\right\Vert_2^2) -2f'(\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2) \left<X_i,X_j\right>\right]. \]

Now, on the event \( {\mathcal{E}} \), for some \( {\delta_n\rightarrow0} \),

\[ |A_{i,j}-B_{i,j}| =o\left(\left|\left\Vert X_i-X_j\right\Vert_2^2-\left\Vert X_i\right\Vert_2^2-\left\Vert X_j\right\Vert_2^2\right|\right) \leq \delta_n|\left<X_i,X_j\right>|. \]

Consequently, we have

\[ d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_B) \leq \mathbb{E}(d_{\mathrm{KW}}(\mu_A,\mu_B)) \leq \mathbb{P}(\mathcal{E}^c) + \delta_n\left(\frac{1}{n}\sum_{i\neq j}\mathbb{E}(|\left<X_i,X_j\right>|^2\mathbf{1}_{\mathcal{E}})\right)^{1/2}. \]

Thus

\[ d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_B) \leq o(1)+\delta_n\left(n\mathbb{E}(|\left<X_1,X_2\right>|^2)\right)^{1/2}. \]

Now, by isotropy and by the independence of \( {X_1} \) and \( {X_2} \), writing \( {X_{k,i}} \) for the \( {k} \)-th coordinate of \( {X_i} \),

\[ \mathbb{E}(|\left<X_1,X_2\right>|^2) =\sum_{k,k'=1}^p\mathbb{E}(X_{k,1}X_{k,2}X_{k',1}X_{k',2}) =\sum_{k,k'=1}^p(\mathbb{E}(X_{k}X_{k'}))^2 =\frac{1}{p}. \]

Since \( {p/n=p_n/n\rightarrow y\in(0,\infty)} \) as \( {n\rightarrow\infty} \), the right-hand side above is \( {o(1)+\delta_n(n/p)^{1/2}\rightarrow0} \), and therefore \( {d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_B)\rightarrow0} \) as \( {n\rightarrow\infty} \).

Proof 5/7. Second Taylor expansion. It now suffices to show that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_B,\mu)\rightarrow0} \) as \( {n\rightarrow\infty} \). The formula for \( {B} \) suggests performing a Taylor expansion at the point \( {2} \) in the \( {f'} \) term, and comparing \( {B} \) with the matrix

\[ C_{i,j}=\mathbf{1}_{i=j}f(0) +\mathbf{1}_{i\neq j}\left[f(\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2) -2f'(2)\left<X_i,X_j\right>\right]. \]

Now an analysis similar to the previous step gives \( {d_{\mathrm{KW}}(\mathbb{E}\mu_B,\mathbb{E}\mu_C)\rightarrow0} \) as \( {n\rightarrow\infty} \). The details are in the paper.

Proof 6/7. Third Taylor expansion. It now suffices to show that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_C,\mu)\rightarrow0} \) as \( {n\rightarrow\infty} \). The formula for \( {C} \) suggests performing a Taylor expansion at the point \( {2} \) in the \( {f} \) term, and comparing \( {C} \) with the matrix

\[ D_{i,j} =\mathbf{1}_{i=j}f(0) +\mathbf{1}_{i\neq j} \left[ f(2)+f'(2)\Delta +\frac{f''(2)}{2}\Delta^2 +\frac{f'''(2)}{6}\Delta^3-2f'(2)\left<X_i,X_j\right>\right] \]

where \( {\Delta:=\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2-2} \). The control of \( {d_{\mathrm{KW}}(\mathbb{E}\mu_C,\mathbb{E}\mu_D)} \) relies on the fact that

\[ \mathbb{E}((\left\Vert X\right\Vert_2^2-1)^6\mathbf{1}_{\mathcal{E}}) =\mathcal{O}(1/n), \]

which follows from the Guédon-Milman thin-shell property mentioned earlier (using the fact that \( {p=p_n=(y+o(1))n} \)). The details are in the paper.

Proof 7/7. End of the proof. It now suffices to show that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_D,\mu)\rightarrow0} \) as \( {n\rightarrow\infty} \). This follows by very similar arguments showing that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_D,\mathbb{E}\mu_M)\rightarrow0} \). The details are in the paper.

