
Month: February 2013

Mean of a random variable on a metric space

In this short post, we recall the pleasant notion of the Fréchet mean (or Karcher mean) of a probability measure on a metric space, a concept already considered in an older post. Let \( {(E,d)} \) be a metric space, such as a graph (with its vertices and edges) or a Riemannian manifold, equipped with its Borel \( {\sigma} \)-field. Let \( {\mu} \) be a probability measure on \( {E} \). How can we define the mean \( {m_\mu} \) and the variance \( {v_\mu} \) of \( {\mu} \)? A very natural idea is to use the variational definition:

\[ v_\mu:=\inf_{x\in E}\mathbb{E}(d(x,Y)^2), \]

where \( {Y\sim\mu} \). The set \( {m_\mu:=\arg\inf_{x\in E}\mathbb{E}(d(x,Y)^2)} \) where this infimum is achieved plays the role of a mean (which is not necessarily unique), while the value of the infimum plays the role of the variance. Note that

\[v_\mu=\inf_{x\in E}W_2(\delta_x,\mu)^2\]

where $W_2$ is the so-called Wasserstein-Fréchet-Kantorovich coupling distance

\[ W_2(\nu,\mu)^2=\inf_{(X,Y),X\sim\nu,Y\sim\mu}\mathbb{E}(d(X,Y)^2).\]

From this observation, the Fréchet means of $\mu$ are the atoms of the Dirac masses closest to $\mu$ in the $W_2$ distance. We may replace the exponent \( {2} \) by any real \( {p\geq1} \) to get a more general notion of moments of \( {\mu} \) (this leads, by the way, to moment problems on metric spaces!). The notion of Fréchet mean is so natural that it has been studied by various authors in pure and applied contexts.
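As a toy numerical illustration of the variational definition (a minimal sketch with a made-up metric space and made-up weights, not taken from any reference), the infimum can be evaluated by brute force on a finite metric space:

```python
import numpy as np

# Hypothetical example: the cycle graph C_4 with its shortest-path metric.
n = 4
D = np.array([[min(abs(i - j), n - abs(i - j)) for j in range(n)]
              for i in range(n)], dtype=float)

# A probability measure mu on the four vertices (made-up weights).
mu = np.array([0.5, 0.25, 0.0, 0.25])

# costs[x] = E(d(x, Y)^2) for Y ~ mu; the infimum is the variance v_mu
# and the argmin set is the (possibly non-unique) Frechet mean m_mu.
costs = (D ** 2) @ mu
v_mu = costs.min()
m_mu = np.flatnonzero(np.isclose(costs, v_mu))
```

Here one finds \( {v_\mu=0.5} \) and \( {m_\mu=\{0\}} \): the mean is unique in this example, but taking for instance the uniform measure on two antipodal vertices of the cycle yields two Fréchet means.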


Théorème vivant

I have just finished reading Théorème vivant, by Cédric Villani, published by Grasset in 2012. And no, it is not popular science, and even less so literature. It is rather an autobiographical narrative written in the first person, a sort of lively illustrated diary, sprinkled with formulas, anecdotes, e-mails, small portraits, a few drawings, and various sociological, historical, and psychological considerations. An important thread of the text is the account of a scientific collaboration on a problem from plasma physics (Landau damping). Although this kind of mathematics sometimes leaves me rather cold(*), I greatly enjoyed this book, which I could not put down! One discovers in it various facets of a famous personality, but also, more generally, the world of mathematicians and the nature of their work. Cédric Villani has really done something useful by writing this book. And yet some of my colleagues, annoyed, find it somewhat pretentious. Is it because it confronts them with their own complexes? It is true that modesty is not his strong suit, but neither is false modesty. Other geniuses of the moment, and not the least of them, no doubt express their narcissism less. Is this the price of panache? Does Villani annoy as much as Sarkozy? In any case his book is fresh, wholesome, fair, and sparkling, without literary or pedagogical pretension, and can really bring something to the sensitive reader, mathematician or not. Few would have pulled off such an exercise so well!

(*) With Boltzmann and Landau, the physics sometimes seems a bit more beautiful and refined than the mathematics derived from it. Is this really less the case with Einstein and Wigner?


Euclidean kernel random matrices in high dimension

This post consists of the notes of a talk given on February 12, 2013 at our Tuesday morning working seminar ASPro (Analyse, Statistique, et Probabilités) in Marne-la-Vallée. It concerns a preprint by Charles Bordenave, arXiv:1209.5888, entitled On Euclidean random matrices in high dimension. We follow the text of Bordenave very closely, adding some comments and skipping some details. This short paper is of particular interest for our group, which comprises experts in log-concave distributions, random matrices, statistics, and optimal transportation. The first part of the talk was devoted to the positivity of deterministic kernels, essentially the Bochner and Schoenberg theorems, following a previous post.

Let \( {X} \) be a random vector of \( {\mathbb{R}^p} \) with mean zero and covariance matrix \( {\frac{1}{p}I} \):

\[ \mathbb{E}(X)=(\mathbb{E}(X_k))_{1\leq k\leq p}=0, \quad \mathbb{E}(X\otimes X)=(\mathbb{E}(X_jX_k))_{1\leq j,k\leq p}=\frac{1}{p}I. \]

If \( {X} \) is seen as a column vector then \( {X\otimes X=XX^\top} \) is an empirical covariance matrix of a sample of size \( {1} \). In the jargon, the random vector \( {\sqrt{p}X} \) is isotropic. Let \( {X_1,\ldots,X_n} \) be i.i.d. copies of \( {X} \). Let \( {f:\mathbb{R}_+\rightarrow\mathbb{R}} \) be a measurable function. We are interested in the Euclidean random kernel \( {n\times n} \) matrix \( {A} \) defined by

\[ A_{i,j}:=f(\left\Vert X_i-X_j\right\Vert_2^2). \]

The notation \( {A} \) (as in Adjacency), taken from the paper, perhaps betrays the author's tropism for (random) graph theory. Kernel matrices appear in many areas of pure and applied mathematics, and every statistician knows something about kernel methods. We have already studied the closely related Bochner, Schoenberg, and Bernstein theorems in a previous post. Actually, the study of random kernel matrices is quite natural from various points of view. We refer in particular to Mézard, Parisi, and Zee for statistical physics, to Vershik for random metric spaces, to El Karoui for mathematical statistics, and to Lovasz-Szegedy, Bordenave, and Vu and Do for random graphs and combinatorial optimization.

Main result. Here is the main result discussed in this post: suppose that

  • \( {X} \) has a log-concave distribution;
  • \( {f} \) is three times differentiable at point \( {2} \);
  • \( {p=p_n} \) is such that \( {p_n/n\rightarrow y\in(0,\infty)} \) as \( {n\rightarrow\infty} \).

Then almost surely, the empirical spectral distribution

\[ \mu_A:=\frac{1}{n}\sum_{k=1}^n\delta_{\lambda_k(A)} \]

converges weakly as \( {n\rightarrow\infty} \) to the law \( {\mu} \) of

\[ f(0)-f(2)+2f'(2)(1-S) \]

where \( {S} \) follows the Marchenko-Pastur distribution given by (here \( {y_\pm:=(1\pm\frac{1}{\sqrt{y}})^2} \)),

\[ \mu_{MP}(dx):=(1-y)^+\delta_0(dx) +\frac{y}{2\pi x}\sqrt{(y_+-x)(x-y_-)}\mathbf{1}_{[y_-,y_+]}(x)\,dx. \]
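As a sanity check on this formula (a numerical sketch; the values \( {y=1/2} \) and \( {y=2} \) and the grid size are arbitrary choices), the continuous part should have total mass \( {\min(1,y)} \), and the whole law, atom included, should have mean \( {1} \):

```python
import numpy as np

def mp_mass_and_mean(y, grid=400001):
    """Integrate numerically the continuous part of mu_MP for y = lim p/n."""
    ym = (1 - 1 / np.sqrt(y)) ** 2
    yp = (1 + 1 / np.sqrt(y)) ** 2
    x = np.linspace(ym, yp, grid)
    dens = np.zeros_like(x)
    inside = (x > ym) & (x < yp)          # density vanishes at the edges
    xi = x[inside]
    dens[inside] = y / (2 * np.pi * xi) * np.sqrt((yp - xi) * (xi - ym))
    dx = np.diff(x)
    trap = lambda v: float(np.sum((v[1:] + v[:-1]) / 2 * dx))
    return trap(dens), trap(x * dens)     # (total mass, first moment)

mass_half, mean_half = mp_mass_and_mean(0.5)  # continuous mass min(1, 1/2) = 1/2
mass_two, mean_two = mp_mass_and_mean(2.0)    # continuous mass min(1, 2) = 1
# The atom at 0 contributes nothing to the mean, so the full law has mean 1
# in both cases, consistent with Tr(Y^T Y)/n = (1/n) sum_i ||X_i||^2 -> 1.
```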

Exercise: take 1) \( {f(x)=x} \) and 2) \( {f(x)=\sqrt{x}} \).
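As a worked sketch of this exercise (using only the statement above): for \( {f(x)=x} \) we have \( {f(0)=0} \), \( {f(2)=2} \), \( {f'(2)=1} \), so the empirical spectral distribution of the matrix of squared distances converges to the law of

\[ f(0)-f(2)+2f'(2)(1-S)=-2S, \]

while for \( {f(x)=\sqrt{x}} \) we have \( {f(0)=0} \), \( {f(2)=\sqrt{2}} \), \( {f'(2)=\frac{1}{2\sqrt{2}}} \), so the spectrum of the distance matrix itself converges to the law of

\[ -\sqrt{2}+\frac{1}{\sqrt{2}}(1-S)=-\frac{1+S}{\sqrt{2}}. \]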

Rough idea of the proof. The idea is to start from the formula

\[ \left\Vert X_i-X_j\right\Vert_2^2 =\left\Vert X_i\right\Vert_2^2 + \left\Vert X_j\right\Vert_2^2 - 2\left<X_i,X_j\right>. \]

Now, following Guédon and Milman, there exist universal constants \( {c_0,c_1>0} \) such that for every \( {t\geq0} \),

\[ \mathbb{P}(\left|\left\Vert X\right\Vert_2-1\right|\geq t) \leq c_1\exp\left(-c_0\sqrt{p}\min(t,t^3)\right). \]

It follows that \( {\left\Vert X\right\Vert_2\approx1} \) with high probability as \( {p\gg1} \). Consequently, using a Taylor expansion at point \( {2} \), we get
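This concentration is easy to visualize numerically (a sketch assuming the simplest log-concave isotropic model, \( {X\sim\mathcal{N}(0,\frac{1}{p}I)} \); the dimensions and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 5000, 100
# Gaussian sketch (an assumption for this illustration): X ~ N(0, I/p),
# so that sqrt(p) X is isotropic and log-concave.
X = rng.standard_normal((m, p)) / np.sqrt(p)
norms = np.linalg.norm(X, axis=1)

# ||X||_2 concentrates around 1; fluctuations are of order p^{-1/2}.
max_dev = float(np.abs(norms - 1).max())
```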

\[ A_{i,j} =f(\left\Vert X_i-X_j\right\Vert_2^2) \approx \mathbf{1}_{i=j}f(0)+\mathbf{1}_{i\neq j}(f(2)-2f'(2)\left<X_i,X_j\right>). \]

This can be rewritten in matrix form as

\[ A\approx M:=(f(0)-f(2)+2f'(2))I+f(2)J-2f'(2)Y^\top Y \]

where \( {J} \) is the \( {n\times n} \) matrix full of ones, and where \( {Y} \) is the \( {p\times n} \) matrix with columns \( {X_1,\ldots,X_n} \). The \( {p\times p} \) matrix \( {YY^\top=X_1X_1^\top+\cdots+X_nX_n^\top} \) is the (unnormalized) empirical covariance matrix of the i.i.d. sample \( {X_1,\ldots,X_n} \). It is well known that \( {YY^\top} \) and \( {Y^\top Y} \) have the same eigenvalues up to the multiplicity of zero (which only contributes to the Dirac mass at zero in the empirical spectral distribution).
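The linearization can be made concrete by a small simulation (a sketch with made-up choices: Gaussian columns and the kernel \( {f(x)=e^{-x}} \); only structural properties are asserted, while the spectral closeness of \( {A} \) and \( {M} \) improves as \( {n,p} \) grow):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 300                       # arbitrary sizes, aspect ratio y = p/n = 1
Y = rng.standard_normal((p, n)) / np.sqrt(p)   # columns X_1,...,X_n ~ N(0, I/p)

def f(x):                             # made-up smooth kernel, f(0) = 1
    return np.exp(-x)

f0, f2, fp2 = 1.0, np.exp(-2.0), -np.exp(-2.0)   # f(0), f(2), f'(2)

sq = np.sum(Y ** 2, axis=0)           # ||X_i||_2^2
G = Y.T @ Y                           # Gram matrix <X_i, X_j> = (Y^T Y)_{ij}
D2 = sq[:, None] + sq[None, :] - 2 * G   # squared distances ||X_i - X_j||_2^2
np.fill_diagonal(D2, 0.0)             # kill roundoff on the diagonal
A = f(D2)

# Linearized matrix M = (f(0)-f(2)+2f'(2)) I + f(2) J - 2 f'(2) Y^T Y.
# By Hoffman-Wielandt, the W_2 distance between the two spectra is controlled
# by ||A - M||_HS / sqrt(n), which vanishes in the high-dimensional limit.
M = (f0 - f2 + 2 * fp2) * np.eye(n) + f2 * np.ones((n, n)) - 2 * fp2 * G
```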

We can now give a rough intuitive proof of the result: since \( {J} \) has rank one,

\[ \mu_A-\mu_{(f(0)-f(2)+2f'(2))I-2f'(2)Y^\top Y}\rightarrow0 \]

weakly. Now, by a result of Pajor and Pastur (see also Adamczak),

\[ \mu_{Y^\top Y}\rightarrow\mu_{MP} \]

weakly. Therefore \( {\mu_A\rightarrow\mu} \) weakly, where \( {\mu} \) is the law of

\[ f(0)-f(2)+2f'(2)-2f'(2)S,\quad S\sim\mu_{MP}. \]

Conjectures. The thin-shell conjecture formulated by Anttila, Ball, and Perissinaki (see also Bobkov and Koldobsky) states that with high probability \( {X} \) belongs to a shell of thin width, negligible with respect to the radius of the shell. In other words, the norm \( {\left\Vert X\right\Vert_2} \) is well concentrated around say \( {1=\mathbb{E}(\left\Vert X\right\Vert_2^2)^{1/2}\geq\mathbb{E}(\left\Vert X\right\Vert_2)} \):

\[ \mathbb{P}(|\left\Vert X\right\Vert_2-1|\geq\varepsilon_p) \leq\varepsilon_p. \]

The thin-shell property is essential in the available proofs of the central limit theorem for convex bodies, and is naturally linked with the famous Kannan-Lovász-Simonovits (KLS) spectral gap conjecture. A thin-shell property in the form \( {\mathbb{E}((\left\Vert X\right\Vert_2-1)^2)=\mathcal{O}(1/p)} \) has been proved by Klartag under the additional assumption of unconditionality (symmetries of the cube: the law of \( {X} \) is invariant under i.i.d. sign flips of the coordinates). The result of Guédon and Milman mentioned earlier is a weak instance of the thin-shell property. Following Bordenave, a strong thin-shell property allows one to show that the result above on \( {A} \) remains valid if we assume only that \( {f} \) is differentiable at point \( {2} \), a result conjectured by Vu and Do.

More precisely, starting from \( {A} \), the proof proceeds by successive Taylor expansions replacing norms with constants, each producing a comparison with a simpler matrix, the last matrix being \( {M} \). At each step, the error is controlled using suitable distances on the empirical spectral distributions of random matrices, and is shown to be asymptotically negligible by using concentration or rank bounds.

Proof 0/7. Functional distances and spectral distributions. Recall that the variation of \( {f:\mathbb{R}\rightarrow\mathbb{R}} \) is

\[ \left\Vert f\right\Vert_V:=\sup_{(x_k)_{k\in\mathbb{Z}}\nearrow}\sum_k|f(x_{k+1})-f(x_k)| \]

where the supremum runs over all real increasing sequences \( {(x_k)_{k\in\mathbb{Z}}} \). When \( {f} \) is \( {\mathcal{C}^1} \) then \( {\left\Vert f\right\Vert_V=\int\!|f'(x)|\,dx=\left\Vert f'\right\Vert_{\mathrm{L}^1}} \), while if \( {f=\mathbf{1}_{(-\infty,t]}} \) then \( {\left\Vert f\right\Vert_V=1} \). The Kolmogorov-Smirnov distance between two probability measures \( {\mu} \) and \( {\nu} \) is

\[ d_{\mathrm{K}}(\mu,\nu):= \sup_{\left\Vert f\right\Vert_V\leq 1}\int\!f\,d(\mu-\nu). \]

Convergence in the \( {d_{\mathrm{K}}} \) distance implies weak convergence. Another useful distance implying weak convergence is the Wasserstein coupling distance of order \( {p\geq1} \) defined by

\[ W_p(\mu,\nu):=\inf_{(U,V)}\mathbb{E}(|U-V|^p)^{1/p} \]

where the infimum runs over the set of couples \( {(U,V)} \) of random variables with \( {U\sim\mu} \) and \( {V\sim\nu} \). The map \( {p\mapsto W_p} \) is non-decreasing, in particular \( {W_1\leq W_2} \). Now, thanks to a duality result due to Kantorovich and Rubinstein,

\[ W_1(\mu,\nu) =\sup_{\left\Vert f\right\Vert_{\mathrm{Lip}}\leq1}\int\!f\,d(\mu-\nu). \]
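For two empirical measures on \( {\mathbb{R}} \) with the same number of atoms, the coupling infimum is attained by matching order statistics, which gives a short generic implementation (a sketch, not code from the paper):

```python
import numpy as np

def wasserstein(xs, ys, p=1):
    """W_p between the empirical measures of two samples of equal size:
    on the real line the optimal coupling pairs the order statistics."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    return float(np.mean(np.abs(xs - ys) ** p) ** (1 / p))

# Hypothetical example: mu = (delta_0 + delta_1)/2 and nu = (delta_0 + delta_3)/2.
w1 = wasserstein([0.0, 1.0], [0.0, 3.0], p=1)   # (0 + 2)/2 = 1
w2 = wasserstein([0.0, 1.0], [0.0, 3.0], p=2)   # ((0 + 4)/2)^{1/2} = sqrt(2)
```

The example also illustrates the monotonicity \( {W_1\leq W_2} \) mentioned above.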

The following mixed metric also characterizes weak convergence:

\[ d_{\mathrm{KW}}(\mu,\nu):= \sup_{\max(\left\Vert f\right\Vert_V,\left\Vert f\right\Vert_{\mathrm{Lip}})\leq1} \int\!f\,d(\mu-\nu) \leq\min(d_{\mathrm{K}},W_1)\leq\min(d_{\mathrm{K}},W_2). \]

This is reminiscent of the well-known Fortet-Mourier distance

\[ d_{\mathrm{F}}(\mu,\nu):= \sup_{\max(\left\Vert f\right\Vert_\infty,\left\Vert f\right\Vert_{\mathrm{Lip}})\leq1} \int\!f\,d(\mu-\nu). \]

By the Hoffman-Wielandt inequality, for all \( {n\times n} \) Hermitian matrices \( {H_1} \) and \( {H_2} \),

\[ nW_2(\mu_{H_1},\mu_{H_2})^2 =\min_{\sigma\in\mathcal{S}_n}\sum_{k=1}^n(\lambda_k(H_1)-\lambda_{\sigma(k)}(H_2))^2 \leq \mathrm{Tr}((H_1-H_2)^2)=\left\Vert H_1-H_2\right\Vert_{\mathrm{HS}}^2. \]

On the other hand, by the Courant-Fischer min-max variational formulas for the eigenvalues, for all \( {n\times n} \) Hermitian matrices \( {H_1} \) and \( {H_2} \):

\[ d_{\mathrm{K}}(\mu_{H_1},\mu_{H_2}) \leq \frac{\mathrm{rank}(H_1-H_2)}{n}. \]

We finally obtain that for all \( {n\times n} \) Hermitian matrices \( {H_1} \) and \( {H_2} \),

\[ d_{\mathrm{KW}}(\mu_{H_1},\mu_{H_2}) \leq \min\left( \sqrt{\frac{\mathrm{Tr}((H_1-H_2)^2)}{n}}, \frac{\mathrm{rank}(H_1-H_2)}{n}\right). \]

In random matrix theory, this is perhaps the most useful global perturbation inequality.
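Both perturbation bounds are easy to test numerically on random symmetric matrices (a sketch; the sizes and the rank of the perturbation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 200, 3
H1 = rng.standard_normal((n, n))
H1 = (H1 + H1.T) / 2                      # random symmetric matrix
U = rng.standard_normal((n, r))
H2 = H1 + U @ U.T / n                     # perturbation of rank r

l1 = np.sort(np.linalg.eigvalsh(H1))
l2 = np.sort(np.linalg.eigvalsh(H2))

# Hoffman-Wielandt: n W_2(mu_{H1}, mu_{H2})^2 <= ||H1 - H2||_HS^2,
# the minimum over permutations being attained by sorting both spectra.
hw_lhs = float(np.sum((l1 - l2) ** 2))
hw_rhs = float(np.sum((H1 - H2) ** 2))

# Rank bound: d_K(mu_{H1}, mu_{H2}) <= rank(H1 - H2)/n, where d_K is the
# sup-distance between the two empirical spectral CDFs (attained at atoms).
grid = np.concatenate([l1, l2])
dK = float(np.max(np.abs(np.searchsorted(l1, grid, side='right')
                         - np.searchsorted(l2, grid, side='right'))) / n)
```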

Proof 1/7. Reduction to expectation by concentration. Let us view our random matrix \( {A=A(X_1,\ldots,X_n)} \) as a measurable function of \( {X_1,\ldots,X_n} \). Now, if \( {x_1,\ldots,x_n} \) and \( {x_1',\ldots,x_n'} \) differ in only one coordinate then

\[ \mathrm{rank}\left(A(x_1,\ldots,x_n)-A(x_1',\ldots,x_n')\right)\leq 2. \]

By combining this with the rank inequality above and the Azuma-Hoeffding bounded differences martingale method, we obtain the following concentration inequality, valid for every function \( {f} \) of finite variation and every real number \( {r\geq0} \):

\[ \mathbb{P}\left(\int\!f\,d\mu_A-\mathbb{E}\int\!f\,d\mu_A\geq r\right) \leq \exp\left(-\frac{nr^2}{8\left\Vert f\right\Vert_V^2}\right). \]

By the first Borel-Cantelli lemma, it then follows that almost surely \( {\mu_A-\mathbb{E}\mu_A\rightarrow0} \) weakly. This reduces the analysis to the expected spectral distribution \( {\mathbb{E}\mu_A} \).

Proof 2/7. Reduction to linearized version. Recall that by the Pajor-Pastur theorem, \( {\mathbb{E}\mu_{Y^\top Y}\rightarrow\mu_{MP}} \) weakly. Since \( {J} \) has rank one, it follows from the rank inequality above that \( {\mathbb{E}\mu_M\rightarrow\mu} \) weakly. Therefore it suffices to prove that \( {\mathbb{E}\mu_A-\mathbb{E}\mu_M\rightarrow0} \), and thus for example to prove that

\[ d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_M)\rightarrow0. \]

The method consists in performing Taylor expansions, and in using concentration.

Proof 3/7. Concentration of norms. We need to control the concentration of \( {\left\Vert X_i-X_j\right\Vert_2^2} \) around \( {2} \) and of \( {\left\Vert X_i\right\Vert_2^2} \) around \( {1} \). There exists \( {\delta>0} \) such that \( {f} \) is \( {\mathcal{C}^1} \) on \( {K:=(2-\delta,2+\delta)} \) and for every \( {x\in K} \),

\[ f(x)=f(2)+f'(2)(x-2)+\frac{f''(2)}{2}(x-2)^2+\frac{f'''(2)}{6}(x-2)^3(1+o(1)). \]

Since the random vector \( {\sqrt{\frac{p}{2}}(X_i-X_j)} \) is log-concave and isotropic for every \( {i\neq j} \), it follows from the Guédon-Milman result mentioned earlier that as \( {n\rightarrow\infty} \), with \( {\varepsilon_n:=\min(n^{-\kappa},\delta/2)} \) for a small enough \( {\kappa>0} \), we have \( {\mathbb{P}(\mathcal{E})\rightarrow1} \) where

\[ \mathcal{E}:=\left\{ \max_{1\leq i,j\leq n}|\left\Vert X_i-X_j\right\Vert_2^2-2|\leq\varepsilon_n; \max_{1\leq i\leq n}|\left\Vert X_i\right\Vert_2^2-1|\leq\varepsilon_n\right\}. \]

Proof 4/7. First Taylor expansion. Since \( {f} \) is \( {\mathcal{C}^1} \) in \( {K} \), we perform a Taylor expansion around \( {\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2} \), which belongs to \( {K} \) on the event \( {\mathcal{E}} \). This leads to compare \( {A_{i,j}=f(\left\Vert X_i-X_j\right\Vert_2^2)} \) with

\[ B_{i,j}:=\mathbf{1}_{i=j}f(0) + \mathbf{1}_{i\neq j}\left[f(\left\Vert X_i\right\Vert_2^2 +\left\Vert X_j\right\Vert_2^2) -2f'(\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2) \left<X_i,X_j\right>\right]. \]

Now, on the event \( {\mathcal{E}} \), for some \( {\delta_n\rightarrow0} \),

\[ |A_{i,j}-B_{i,j}| =o\left(\left|\left\Vert X_i-X_j\right\Vert_2^2-\left\Vert X_i\right\Vert_2^2-\left\Vert X_j\right\Vert_2^2\right|\right) \leq \delta_n|\left<X_i,X_j\right>|, \]

since \( {\left\Vert X_i-X_j\right\Vert_2^2-\left\Vert X_i\right\Vert_2^2-\left\Vert X_j\right\Vert_2^2 =-2\left<X_i,X_j\right>} \).

Consequently, we have

\[ d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_B) \leq \mathbb{E}(d_{\mathrm{KW}}(\mu_A,\mu_B)) \leq \mathbb{P}(\mathcal{E}^c) + \left(\frac{1}{n}\sum_{i\neq j}\mathbb{E}(|A_{i,j}-B_{i,j}|^2\mathbf{1}_{\mathcal{E}})\right)^{1/2}. \]

Thus

\[ d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_B) \leq o(1)+\delta_n\left(n\mathbb{E}(|\left<X_1,X_2\right>|^2)\right)^{1/2}. \]

Now by isotropy

\[ \mathbb{E}(|\left<X_1,X_2\right>|^2) =\sum_{k,k’=1}^p\mathbb{E}(X_{k,1}X_{k,2}X_{k’,1}X_{k’,2}) =\sum_{k,k’=1}^p\mathbb{E}(X_{k}X_{k’})^2 =\frac{1}{p}. \]

Since \( {p/n=p_n/n\rightarrow y} \) as \( {n\rightarrow\infty} \), it follows that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_A,\mathbb{E}\mu_B)\rightarrow0} \) as \( {n\rightarrow\infty} \).
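The isotropy identity used above can be checked by a quick Monte Carlo experiment (a sketch assuming Gaussian \( {X\sim\mathcal{N}(0,\frac{1}{p}I)} \); dimension and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 200, 20000
# Gaussian sketch (an assumption; any model with covariance I/p gives
# the same expectation E<X_1, X_2>^2 = 1/p by the computation above).
X1 = rng.standard_normal((m, p)) / np.sqrt(p)
X2 = rng.standard_normal((m, p)) / np.sqrt(p)
inner = np.einsum('ij,ij->i', X1, X2)   # m independent copies of <X_1, X_2>
est = float(np.mean(inner ** 2))        # Monte Carlo estimate of 1/p
```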

Proof 5/7. Second Taylor expansion. It suffices now to show that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_B,\mu)\rightarrow0} \) as \( {n\rightarrow\infty} \). The formula for \( {B} \) suggests performing a Taylor expansion at point \( {2} \) in the \( {f'} \) term, and comparing \( {B} \) with the matrix

\[ C_{i,j}=\mathbf{1}_{i=j}f(0) +\mathbf{1}_{i\neq j}f(\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2) -2f'(2)\left<X_i,X_j\right>. \]

Now an analysis similar to the previous step gives \( {d_{\mathrm{KW}}(\mathbb{E}\mu_B,\mathbb{E}\mu_C)\rightarrow0} \) as \( {n\rightarrow\infty} \). The details are in the paper.

Proof 6/7. Third Taylor expansion. It suffices now to show that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_C,\mu)\rightarrow0} \) as \( {n\rightarrow\infty} \). The formula for \( {C} \) suggests performing a Taylor expansion at point \( {2} \) in the \( {f} \) term, and comparing \( {C} \) with the matrix

\[ D_{i,j} =\mathbf{1}_{i=j}f(0) +\mathbf{1}_{i\neq j} \left[ f(2)+f'(2)\Delta +\frac{f''(2)}{2}\Delta^2 +\frac{f'''(2)}{6}\Delta^3-2f'(2)\left<X_i,X_j\right> \right] \]

where \( {\Delta:=\left\Vert X_i\right\Vert_2^2+\left\Vert X_j\right\Vert_2^2-2} \). The control of \( {d_{\mathrm{KW}}(\mathbb{E}\mu_C,\mathbb{E}\mu_D)} \) leads one to use the fact that

\[ \mathbb{E}((\left\Vert X\right\Vert_2^2-1)^6\mathbf{1}_{\mathcal{E}}) =\mathcal{O}(1/n), \]

which follows from the Guédon-Milman thin-shell property mentioned earlier (using the fact that \( {p=p_n=(y+o(1))n} \)). The details are in the paper.

Proof 7/7. End of the proof. It suffices now to show that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_D,\mu)\rightarrow0} \) as \( {n\rightarrow\infty} \). This follows by very similar arguments, showing that \( {d_{\mathrm{KW}}(\mathbb{E}\mu_D,\mathbb{E}\mu_M)\rightarrow0} \). The details are in the paper.
