Press "Enter" to skip to content

Back to basics: Johnson-Lindenstrauss lemma

Joram Lindenstrauss (1936 - 2012)

This short post is about the Johnson-Lindenstrauss lemma for dimension reduction. It is an emblematic basic result of high-dimensional probability and asymptotic geometric analysis. It provides a low-dimensional quasi-isometric embedding of a finite set of points. Its proof involves a random matrix and the concentration of measure phenomenon.

Johnson-Lindenstrauss lemma for dimension reduction. Let us denote by $|v|$ the Euclidean norm of $v$. For all $n,N\geq1$, all finite $S\subset\mathbb{R}^n$ of cardinality $N$, all $0 < \varepsilon<1$, and all $k > \frac{24}{\varepsilon^2}\log(N)$, there exists a $k\times n$ real matrix $A$ such that for all $x,y\in S$, $x\neq y$, \[ \sqrt{1-\varepsilon}\leq\frac{|Ax-Ay|}{|x-y|}\leq\sqrt{1+\varepsilon}. \] In other words, it is possible to quasi-isometrically embed $N$ points of $\mathbb{R}^n$ into $\mathbb{R}^{O(\log(N))}$.
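To fix ideas, here is an illustrative numerical instance (the numbers are an arbitrary choice, not part of the statement): to embed $N=10^6$ points with distortion $\varepsilon=0.1$, it suffices to take \[ k>\frac{24}{(0.1)^2}\log(10^6)=2400\times6\log(10)\approx 33\,158, \] whatever the ambient dimension $n$ may be, be it thousands or millions.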

Proof. The desired property can be rewritten as \[ \bigl||A(x-y)|^2-|x-y|^2\bigr|\leq\varepsilon|x-y|^2. \] If $R_1,\ldots,R_k$ are the rows of $\sqrt{k}A$, then \[ |A(x-y)|^2=\frac{1}{k}\sum_{i=1}^k\langle R_i,x-y\rangle^2. \] The idea now is that if $R_1,\ldots,R_k$ are taken at random, i.i.d. of law $\mathcal{N}(0,I_n)$, then $\langle R_i,x-y\rangle\sim\mathcal{N}(0,|x-y|^2)$, making $|A(x-y)|^2$ an average of $k$ i.i.d. random variables, each distributed as $|x-y|^2$ times a $\chi^2$ variable with one degree of freedom, of overall mean $|x-y|^2$. This suggests using the concentration of measure phenomenon around the mean $|x-y|^2$, for small deviations up to $\pm\varepsilon|x-y|^2$.
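As a numerical sanity check of this construction, here is a minimal sketch in Python with NumPy (the dimensions $n$ and $k$ and the seed are arbitrary choices, not part of the lemma):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 400  # ambient and target dimensions (illustrative choices)

# Rows of sqrt(k)*A are i.i.d. N(0, I_n), i.e. A has i.i.d. N(0, 1/k) entries.
A = rng.standard_normal((k, n)) / np.sqrt(k)

x = rng.standard_normal(n)
ratio = np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2
print(ratio)  # close to 1, with fluctuations of order 1/sqrt(k)
```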

Let $Z_1,\ldots,Z_k$ be i.i.d. of law $\mathcal{N}(0,\sigma^2)$. For all $\theta < \frac{1}{2\sigma^2}$, \[ \mathbb{E}(\mathrm{e}^{\theta(Z_i^2-\sigma^2)}) =\frac{\mathrm{e}^{-\theta\sigma^2}}{\sqrt{2\pi}\sigma} \int\mathrm{e}^{\theta u^2-\frac{u^2}{2\sigma^2}}\,\mathrm{d}u =\frac{\mathrm{e}^{-\theta\sigma^2}}{\sqrt{1-2\theta\sigma^2}}. \] The Markov inequality gives, for all $\varepsilon > 0$ and all $0 < \theta<\frac{1}{2\sigma^2}$, \[ \mathbb{P}\Bigl(\frac{1}{k}\sum_{i=1}^k(Z_i^2-\sigma^2)\geq \varepsilon\sigma^2\Bigr) \leq\mathrm{e}^{-k(\theta\sigma^2(\varepsilon+1)+\log\sqrt{1-2\theta\sigma^2})}. \] Taking $\theta=\frac{\varepsilon}{2\sigma^2(1+\varepsilon)} < \frac{1}{2\sigma^2}$ and using $\varepsilon-\log(1+\varepsilon)\geq\frac{\varepsilon^2}{4}$ for $0 < \varepsilon<1$ gives \[ \mathbb{P}\Bigl(\frac{1}{k}\sum_{i=1}^k(Z_i^2-\sigma^2)\geq \varepsilon\sigma^2\Bigr) \leq\mathrm{e}^{-\frac{k}{8}\varepsilon^2}. \] By the same method for $-(Z_i^2-\sigma^2)$, with $\theta=\frac{\varepsilon}{2\sigma^2(1-\varepsilon)}$ and using $-\varepsilon-\log(1-\varepsilon)\geq\frac{\varepsilon^2}{4}$ for $0 < \varepsilon<1$, \[ \mathbb{P}\Bigl(-\frac{1}{k}\sum_{i=1}^k(Z_i^2-\sigma^2)\geq\varepsilon\sigma^2\Bigr) \leq\mathrm{e}^{-\frac{k}{8}\varepsilon^2}. \] Finally, by the union bound, we get the following $\sigma$-free upper bound, for all $0 < \varepsilon<1$: \[ \mathbb{P}\Bigl(\Bigl|\frac{1}{k}\sum_{i=1}^kZ_i^2-\sigma^2\Bigr|\geq\varepsilon\sigma^2\Bigr) \leq2\mathrm{e}^{-\frac{k}{8}\varepsilon^2}. \]

But $R_1,\ldots,R_k$ are i.i.d. $\mathcal{N}(0,I_n)$. Thus, for all $x\in\mathbb{R}^n$, $|Ax|^2=\frac{1}{k}\sum_{i=1}^k\langle R_i,x\rangle^2$ has the same law as $\frac{1}{k}\sum_{i=1}^kZ_i^2$ with $\sigma^2=|x|^2$, hence, for all $0 < \varepsilon<1$, \[ \mathbb{P}\bigl(|\Delta(x)|\geq\varepsilon|x|^2\bigr) \leq2\mathrm{e}^{-\frac{k}{8}\varepsilon^2} \quad\text{where}\quad \Delta(x):=|Ax|^2-|x|^2. \] This bound is $x$-free. Thus, by the union bound, for all $0 < \varepsilon<1$, \begin{align*} \mathbb{P}\bigl(\exists x,y\in S:|\Delta(x-y)|>\varepsilon|x-y|^2\bigr) &\leq|S|^2\max_{x,y\in S}\mathbb{P}\bigl(|\Delta(x-y)|>\varepsilon|x-y|^2\bigr)\\ &\leq2N^2\mathrm{e}^{-\frac{k}{8}\varepsilon^2}. \end{align*} If $k > \frac{8\log(2N^2)}{\varepsilon^2}$, then this bound is strictly smaller than $1$, hence there exists a realization $A=A(\omega)$ such that $||A(x-y)|^2-|x-y|^2|\leq\varepsilon|x-y|^2$ for all $x,y\in S$. Since $8\log(2N^2)=16\log(N)+8\log(2)\leq24\log(N)$ as soon as $N\geq2$, this proves the lemma. This probabilistic method becomes constructive by using Monte Carlo. The probability of finding such an $\omega$ decreases when $\varepsilon$ or $k$ decreases, or when $N$ increases, which makes sense.
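Here is a minimal Monte Carlo sketch of this constructive reading, in the same Python setting (the point cloud $S$ and the parameters are illustrative assumptions): resample $A$ until every pairwise distortion is at most $\varepsilon$, each trial failing with probability at most $2N^2\mathrm{e}^{-k\varepsilon^2/8}<1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, eps = 1000, 50, 0.5
k = int(24 * np.log(N) / eps**2) + 1   # k > (24 / eps^2) log N, as in the lemma

S = rng.standard_normal((N, n))        # an illustrative cloud of N points in R^n
i, j = np.triu_indices(N, 1)
D = S[i] - S[j]                        # the N(N-1)/2 difference vectors x - y

# Monte Carlo: each draw fails with probability at most 2 N^2 exp(-k eps^2 / 8).
while True:
    A = rng.standard_normal((k, n)) / np.sqrt(k)
    ratios = np.sum((D @ A.T) ** 2, axis=1) / np.sum(D**2, axis=1)
    if np.max(np.abs(ratios - 1)) <= eps:  # ||A(x-y)|^2 - |x-y|^2| <= eps |x-y|^2
        break
print(k, np.max(np.abs(ratios - 1)))
```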

Variants. What is important is the exponential concentration of $\frac{1}{k}\sum_{i=1}^k\langle R_i,x\rangle^2$ around its mean $|x|^2$, for small left and right deviations around the mean. The result remains true, up to the value of the constant, if we replace the Gaussian law $\mathcal{N}(0,1)$ by the discrete uniform law on $\{-1,1\}$ (Rademacher law) for the entries of $\sqrt{k}A$. This can be written in terms of $\psi$-norms, see [CGLP, Lemma 1.3.1 p. 34].
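In the sketches above, this variant amounts to changing one line, the distribution of the entries (again an illustration, leaving the constant untouched):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10_000, 400  # illustrative dimensions, as before

# Rademacher variant: the entries of sqrt(k)*A are i.i.d. uniform on {-1, +1}.
A = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)

x = rng.standard_normal(n)
print(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)  # again close to 1
```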

Comments. This dimension reduction method can be interpreted as a projection onto a random subspace. The matrix $A$ does not depend on the data $S$, contrary to other dimension reduction methods such as Principal Component Analysis (PCA) or the Singular Value Decomposition (SVD).

Further reading.

  • [JL] William B. Johnson and Joram Lindenstrauss
    Extensions of Lipschitz mappings into a Hilbert space
    Conference in modern analysis and probability (New Haven, Conn., 1982)
    Contemp. Math., vol. 26, Amer. Math. Soc., 1984, p. 189-206.
  • [CGLP] Djalil Chafaï, Olivier Guédon, Guillaume Lecué, and Alain Pajor
    Interactions between compressed sensing, random matrices, and high-dimensional geometry
    Panoramas et Synthèses, vol. 37, Société Mathématique de France (SMF), 2012, 182 p.
  • [JS] Compiled by William B. Johnson and Gideon Schechtman
    Joram Lindenstrauss, in Memoriam
    Notices of the American Mathematical Society 62(1) (2015)

Joram’s and my mathematical collaborations ranged from nonseparable Banach spaces to the geometry of finite metric spaces. To Joram these were not very different. He viewed his migrations from topic to topic as natural. Our early research was in the linear world, although even in the 1970s Joram tried to interest me in the nonlinear geometry of Banach spaces. I knew well his landmark 1964 paper, in which he laid out a blueprint for what nonlinear Banach space theory should be, but I thought I had no intuition for the topic. That changed in 1981 during our sabbaticals. Marcus and Pisier, as a consequence of their work on stochastic processes, proved a seemingly unrelated result on the extension of Lipschitz mappings from finite subsets of $L_p$, $1 < p < 2$, into a Hilbert space. Their theorem suggested a general result, where $L_p$ is replaced by a general Banach space, but because of the nature of their proof, they did not get a result even for the Banach space $L_1$. Joram and I realized that we could solve the problem if we could prove a dimension reduction lemma in Hilbert space. After formulating the lemma (now called the Johnson-Lindenstrauss Lemma) we proved it in fifteen minutes. Still, we knew it was a neat result because it not only allowed us to solve the problem of Marcus and Pisier but also eliminated the “curse of dimensionality” in certain high-dimensional pattern recognition problems. Of course, we had no idea that this lemma would become the most quoted result of either of us, having 1,000+ references according to Google Scholar, more than three times the references for all of our other research articles combined, and getting 134,000 hits when Googling “Johnson-Lindenstrauss lemma.” Joram’s appreciation of the J-L Lemma is revealed by looking at the list of his selected publications that Joram drew up in the year before his death when he knew that the end was near; that is, the paper containing the lemma is not among the twenty-six articles he selected! Actually, I was not surprised by that; Joram put a premium on difficulty and was not very comfortable with the attention the J-L Lemma received.
William B. Johnson, excerpt from [JS].

William (Bill) Buhmann Johnson (1944 - )