Azuma-Hoeffding concentration and random matrices – Libres pensées d'un mathématicien ordinaire

In honor of Bernard Maurey — Bernard Maurey

A general principle in high dimensional geometric analysis states that many interesting stochastic empirical quantities are well concentrated around their mean, reducing the large dimensional analysis to an averaged quantity. The most elementary instance of such a principle is provided by the concentration of the empirical mean around the actual mean in the law of large numbers (note the presence of a martingale here).

The concentration of measure was initially studied by mathematicians such as Chebyshev, Markov, Chernoff, Bernstein, and Hoeffding, among others. Further progresses where made in the past decades, following the works of Gromov, Milman, Schechtman, Maurey, McDiarmid, Pisier, Talagrand, Ledoux, and Massart. Among various settings, one can distinguish for its usefulness and simplicity the one involving sums of bounded (martingale) differences, which generalizes a work of Azuma (not Azéma!) and Hoeffding. We recommend the reading of McDiarmid who contributed to the development of these techniques.

Lemma 1 (Azuma-Hoeffding) If \( {X\in\mathrm{L}^1(\Omega,\mathcal{F},\mathbb{P},\mathbb{R})} \) then for every \( {r\geq0} \),
\[ \mathbb{P}\left(|X-\mathbb{E}(X)|\geq r\right) \leq 2\exp\left(-\frac{2r^2}{\mathrm{osc}\left(d_1\right)^2+\cdots+\mathrm{osc}\left(d_n\right)^2}\right) \]
where \( {d_k=\mathbb{E}(X\,|\,\mathcal{F}_k)-\mathbb{E}(X\,|\,\mathcal{F}_{k-1})} \) for an arbitrary filtration
\[ \{\varnothing,\Omega\} =\mathcal{F}_0\subset\mathcal{F}_1\subset\cdots\subset\mathcal{F}_n =\mathcal{F}. \]

Note that \( {\mathrm{osc}(f):=\sup f-\inf f=\mathrm{diam}(\mathrm{supp}(f))\leq2\left\Vert f\right\Vert_\infty} \).

Proof: If \( {U} \) is a random variable with \( {\mathbb{E}(U)=0} \) and \( {a\leq U\leq b} \), then, by convexity, \( {e^{tx}\leq\frac{x-a}{b-a}e^{tb}+\frac{b-x}{b-a}e^{ta}} \) for all \( {t\geq0} \) and \( {a\leq x\leq b} \), which gives after some analysis

\[ \mathbb{E}(e^{tU})\leq \frac{b}{b-a}e^{ta}-\frac{a}{b-a}e^{tb} \leq e^{\frac{t^2}{8}(b-a)^2}. \]

Used with \( {U=d_k=\mathbb{E}(X\,|\,\mathcal{F}_k)-\mathbb{E}(X\,|\,\mathcal{F}_{k-1})} \) conditional on \( {\mathcal{F}_{k-1}} \), this gives

\[ \mathbb{E}(e^{td_k}\,|\,\mathcal{F}_{k-1}) \leq e^{\frac{t^2}{8}\mathrm{osc}(d_k)^2}. \]

By writing the Doob martingale telescopic sum \( {X-\mathbb{E}(X)=d_n+\cdots+d_1} \) we get

\[ \mathbb{E}(e^{t(X-\mathbb{E}(X))}) =\mathbb{E}(e^{t(d_{n-1}+\cdots+d_1)}\mathbb{E}(e^{td_n}\,|\,\mathcal{F}_{n-1})) \leq\cdots\leq e^{\frac{t^2}{8}(\mathrm{osc}(d_1)^2+\cdots+\mathrm{osc}(d_n)^2)}. \]

Now the desired result follows from Markov's inequality and an optimization of \( {t} \). ☐

Definition 2 (Total variation of a function) The total variation \( {f:\mathbb{R}\rightarrow\mathbb{R} } \) is
\[ \left\Vert f\right\Vert_{\mathrm{TV}} :=\sup_{(x_k)_{k\in\mathbb{Z}}} \sum_{k \in \mathbb{Z}} | f(x_{k+1})-f(x_k) |\in[0,+\infty] \]
where the supremum runs over all non decreasing sequences \( {(x_k)_{k \in \mathbb{Z}}} \). If \( {f'\in\mathrm{L}^1(\mathbb{R})} \) then \( {\left\Vert f\right\Vert_{\mathrm{TV}}=\left\Vert f'\right\Vert_1} \). If \( {f = \mathbf{1}_{(-\infty,s]}} \) for a real \( {s} \) then \( {\left\Vert f\right\Vert_{\mathrm{TV}}=1} \).

Theorem 3 (Concentration for empirical spectral distributions) If \( {H} \) is an \( {n\times n} \) random Hermitian matrix, and if the vectors \( {{(H_i)}_{1\leq i\leq n}} \), where \( {H_i={(H_{ij})}_{1\leq j\leq n}\in\mathbb{C}^i} \), are independent, then for any bounded measurable \( {f:\mathbb{R} \rightarrow\mathbb{R}} \) and any real number \( {r\geq0} \), the counting probability measure \( {\mu_H:=\frac{1}{n}\sum_{k=1}^n\delta_{\lambda_k(H)}} \) satisfies to the concentration inequality
\[ \mathbb{P}\left( \left|\int\!f\,d\mu_H -\mathbb{E}\int\!f\,d\mu_H\right| \geq r \right) \leq 2 \exp\left(- \frac{nr^2}{8V(f)^2}\right). \]
Similarly, if \( {M} \) is an \( {m\times n} \) complex random matrix with independent rows then for any bounded measurable \( {f:\mathbb{R} \rightarrow\mathbb{R}} \) and any real number \( {r\geq0} \), the counting probability measure \( {\mu_M:=\frac{1}{m}\sum_{k=1}^m\delta_{\lambda_k(MM^*)}} \) satisfies to the concentration inequality
\[ \mathbb{P}\left( \left|\int\!f\,d\mu_M -\mathbb{E}\int\!f\,d\mu_M\right| \geq r \right) \leq 2 \exp\left(-\frac{mr^2}{2V(f)^2}\right). \]

Proof: We give the proof for the singular values, the Hermitian case being entirely similar, up to a factor \( {2} \) (see below). If \( {A} \) and \( {B} \) are two \( {m\times n} \) complex matrices and if \( {F_A} \) and \( {F_B} \) are the cumulative distribution functions of the counting probability measures \( {\mu_A:=\frac{1}{m}\sum_{k=1}^m\delta_{\lambda_k(AA^*)}} \) and \( {\mu_B:=\frac{1}{m}\sum_{k=1}^m\delta_{\lambda_k(BB^*)}} \) then it is easily seen from the Courant-Fischer minimax variational formulas that

\[ \left\Vert F_A- F_B\right\Vert_\infty\leq\frac{\mathrm{rank}(A-B)}{m}. \]

Now for a smooth \( {f:\mathbb{R}\rightarrow\mathbb{R}} \), we get, by integrating by parts,

\[ \left|\int\!f\,d\mu_A-\int\!f\,d\mu_B\right| =\left|\int_\mathbb{R} \!f'(t)(F_A(t)-F_B(t))\,dt\right| \leq \frac{\mathrm{rank}(A-B)}{m}\int_\mathbb{R}\!|f'(t) |\,dt. \]

Since the left hand side depends on at most \( {2m} \) points, we get, by approximation, for every measurable function \( {f:\mathbb{R}\rightarrow\mathbb{R}} \) with \( {\left\Vert f\right\Vert_{\mathrm{TV}} \leq 1} \),

\[ \left|\int\!f\,d\mu_A-\int\!f\,d\mu_{B}\right| \leq V(f)\frac{\mathrm{rank}(A-B)}{m}. \]

From now on, \( {f:\mathbb{R}\rightarrow\mathbb{R}} \) is a fixed measurable function with \( {\left\Vert f\right\Vert_{\mathrm{TV}}\leq 1} \). For every row vectors \( {x_1,\ldots,x_m} \) in \( {\mathbb{C}^n} \), we denote by \( {A(x_1,\ldots,x_m)} \) the \( {m\times n} \) matrix with rows \( {x_1,\ldots,x_m} \) and we define \( {F:(\mathbb{C}^n)^m\rightarrow\mathbb{R}} \) by

\[ F(x_1,\ldots,x_m):=\int\!f\,d\mu_{A(x_1,\ldots,x_m)}. \]

For any \( {i\in\{1,\ldots,m\}} \) and any row vectors \( {x_1,\ldots,x_m,x_i'} \) of \( {\mathbb{C}^n} \), we have (\( {2} \) instead of \( {1} \) in the Hermitian case)

\[ \mathrm{rank}\left\{A(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_m) -A(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_m)\right\} \leq 1 \]

and thus

\[ |F(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_m) -F(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_m)| \leq \frac{V(f)}{m}. \]

Let us define \( {X=F(R_1,\ldots,R_m)} \) where \( {R_1,\ldots,R_m} \) are the rows of \( {M} \). If \( {R'_1,\ldots,R'_n} \) is an independent copy of \( {R_1,\ldots,R_n} \) and \( {\mathcal{F}_k=\sigma(R_1,\ldots,R_k)} \) and \( {\mathcal{F}_0=\{\varnothing,\Omega\}} \) then for every \( {1\leq k\leq n} \),

\[ \mathbb{E}(F(R_1,\ldots,R_i,\ldots,R_n)\,|\,\mathcal{F}_{k-1}) = \mathbb{E}(F(R_1,\ldots,R'_i,\ldots,R_n)\,|\,\mathcal{F}_k). \]

The result follows then from the Azuma-Hoeffding lemma 1 since

\[ d_k=\mathbb{E}(F(R_1,\ldots,R_i,\ldots,R_n)-F(R_1,\ldots,R'_i,\ldots,R_n)\,|\,\mathcal{F}_k) \]

which gives \( {\mathrm{osc}(d_k)\leq2\left\Vert d_k\right\Vert_\infty\leq 2V(f)/m} \). ☐

It is remarkable that such a concentration holds under almost minimal assumptions on the entries of the matrix, and relies on the martingale dependency structure based on rows/columns additions. It allows for instance to simplify the proof of the almost sure Wigner and Marchenko-Pastur theorems, and may serve actually far beyond these two classical cases. The martingale structure hidden in random matrices with independent entries is also at the heart of the so called REFORM (REsolvent FORmula and Martingales) of Girko, so useful for the derivation of Central Limit Theorems for empirical spectral distributions via Lyapunov or Lindeberg criteria.

Concentration for the resolvent. Recall that the resolvent of \( {H} \) at point \( {z\in\{z\in\mathbb{C}:\Im(z)>0\}} \) is \( {R_H(z):=(H-zI)^{-1}} \). The normalized trace of the resolvent is the so-called Cauchy-Stieltjes transform:

\[ S_{\mu_H}(z)=\int\!\frac{d\mu_H(\lambda)}{\lambda-z} =\frac{1}{n}\mathrm{Tr}(R_H(z)). \]

The function \( {\lambda\mapsto 1/(\lambda-z)} \) is bounded in module by \( {1/\Im(z)} \), and the real and imaginary parts have finite variation. On can then obtain the concentration of \( {S_{\mu_H}} \) around its expectation by an application of the preceding concentration inequality, up to a loss in the constants. Alternatively, one may reuse the argument on more general Lipschitz functions of the diagonal of the resolvent (we recover the normalized trace of the resolvent when the function is the identity).

Theorem 4 (Concentration for the resolvent) If \( {H} \) is an \( {n\times n} \) random Hermitian matrix and if the vectors \( {{(H_i)}_{1\leq i\leq n}} \), where \( {H_i={(H_{ij})}_{1\leq j\leq n}\in\mathbb{C}^i} \), are independent, then for any Lipschitz function \( {f:\mathbb{C}\rightarrow\mathbb{R}} \), every complex number \( {z\in\{z\in\mathbb{C}:\Im(z)>0\}} \) and every real number \( {r\geq0} \), the diagonal entries of the resolvent \( {R_H:=(H-zI)^{-1}} \) satisfy to the concentration inequality
\[ \mathbb{P}\left( \left|\frac{1}{n}\sum_{k=1}^nf(R_H(z)_{kk})-\mathbb{E}\left(\frac{1}{n}\sum_{k=1}^nf(R_H(z)_{kk})\right)\right| \geq r \right) \leq 2 \exp\left(-\frac{nr^2\Im(z)^2}{8\|f\|_L^2}\right). \]
In particular, taking for \( {f} \) the identity function, we get a concentration around the mean for the Cauchy-Stieltjes transform \( {S_{\mu_H}(z):=\frac{1}{n}\mathrm{Tr}(R_H(z))} \).

Proof: For any \( {n\times n} \) Hermitian matrices \( {A} \) and \( {B} \) and any Lipschitz function \( {f:\mathbb{C}\rightarrow\mathbb{C}} \),

\[ \left| \frac{1}{n}\sum_{k=1}^nf(R_A(z)_{kk})-\frac{1}{n}\sum_{k=1}^nf(R_B(z)_{kk}) \right| \leq \frac{{\|f\|}_L}{n}\sum_{k=1}^n\left|R_A(z)_{kk}-R_B(z)_{kk}\right|. \]

Now for any \( {n\times n} \) matrix \( {M} \), the singular values decomposition \( {M=V^*DU} \) gives \( {M_{kk}=\left<V_{\bullet k},U_{\bullet k}\right>s_k} \), and thus, denoting \( {\|\cdot\|} \) the operator norm,

\[ \sum_{k=1}^n|M_{kk}| \leq \left(\max_{1\leq k\leq n}s_k\right)\mathrm{card}\{1\leq k\leq n:s_k\neq0\} = \|M\|\mathrm{rank}(M). \]

To use it with \( {M=R_A(z)-R_B(z)} \), we observe that the resolvent identity \( {R_A-R_B=R_A(B-A)R_B} \) gives \( {\mathrm{rank}(R_A(z)-R_B(z))\leq\mathrm{rank}(A-B)} \), while we have \( {\|R_A(z)-R_B(z)\|\leq\|R_A(z)\|+\|R_B(z)\|\leq\frac{2}{\Im(z)}} \). Putting all together we get

\[ \left| \frac{1}{n}\sum_{k=1}^nf(R_A(z)_{kk})-\frac{1}{n}\sum_{k=1}^nf(R_B(z)_{kk}) \right| \leq \frac{2{\|f\|}_L\mathrm{rank}(A-B)}{n\Im(z)}. \]

It remains now to follow the strategy of bounded martingale differences as before. ☐

A very similar result holds for \( {R_{MM^*}} \) where \( {M} \) is a rectangular random matrix with i.i.d. rows. We leave the proof as an exercise. More generally, various other instances of the same idea are available.

Open problem. Let \( {H} \) be an \( {n\times n} \) Hermitian matrix with real positive i.i.d. entries. Define the random Markov (or stochastic) \( {n\times n} \) matrix \( {M} \) by \( {M_{ij}:=H_{ij}/\sum_{k=1}^nH_{ik}} \). The random matrix \( {M} \) has a real spectrum, included in \( {[-1,1]} \), since \( {M} \) is reversible. Does the empirical spectral distribution \( {\mu_M} \) of \( {M} \) concentrates around its mean \( {\mathbb{E}\mu_M} \) as for \( {H} \)? The difficulty here comes from the rows normalization in the definition of \( {M} \), which produces a propagation of information from local to global in the matrix.

Note. We learned the concentration for spectral distributions from Charles Bordenave (appears in arXiv:1006.1713 and arXiv:0907.4244, search for "Azuma"). More recently, Charles discovered that the result was actually obtained independently by Guntuboyina and Leeb in their 2009 ECP paper. Actually the argument is so natural that it was reinvented many times by different authors independently, including Adamczak, El Karoui, etc. It seems actually that the essence of the argument can be traced back to the works of Girko in the 1970s. Every relevant references are welcome! Such a concentration inequality is helpful mainly to show that \( {\mu_H-\mathbb{E}\mu_H\rightarrow0} \) almost surely as \( {n\rightarrow\infty} \), via the first Borel-Cantelli lemma. Alternatively, instead of concentration, one may use Burkholder type inequalities for sums of martingales differences at the level of the Cauchy-Stieltjes transform, see for instance Bai and Zhou. Both approaches rely on martingale differences, and require only independent rows without any moment assumptions on the matrix entries.

Beyond Azuma-Hoeffding inequalities, the concentration for empirical spectral distributions of random matrices was also considered, differently, by Guionnet and Zeitouni, Ledoux, Davidson and Szarek, Chatterjee and Bose, Bobkov, Götze, and Tikhomirov, and Tropp among others.

Some other posts:

2 Comments

Djalil Chafaï 2012-06-05
I've written some talk notes (in French) on this.
Djalil Chafaï 2013-04-15
Regarding Azuma-Hoeffding concentration for empirical spectral distributions, there is also a paper by El Karoui, published in 2009, entitled Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond. Good ideas are often discovered independently by multiple people...

This site uses Akismet to reduce spam. Learn how your comment data is processed.