A probabilistic proof of the Schoenberg theorem

This post is mainly devoted to a probabilistic proof of a famous theorem due to Schoenberg on radial positive definite functions. Let us begin with a general notion: we say that \( {K:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}} \) is a positive definite kernel when

\[ \forall n\geq1, \forall x_1,\ldots,x_n\in\mathbb{R}^d, \forall c\in\mathbb{C}^n, \quad\sum_{i=1}^n\sum_{j=1}^nc_iK(x_i,x_j)\bar{c}_j\geq0. \]

When \( {K} \) is symmetric, i.e. \( {K(x,y)=K(y,x)} \) for all \( {x,y\in\mathbb{R}^d} \), this means that the symmetric matrix \( {(K(x_i,x_j))_{1\leq i,j\leq n}} \) has spectrum in \( {\mathbb{R}_+:=[0,\infty)} \) for every \( {n\geq1} \) and \( {x_1,\ldots,x_n\in\mathbb{R}^d} \). Contrary to matrices, the terminology is not fully standard and some authors use instead the terms positive, positive semidefinite, non-negative definite, etc. This notion plays an important role in various areas of pure and applied mathematics, including for instance moments problems of classical analysis and probability theory, harmonic analysis, functional analysis, machine learning, data analysis in statistics, etc. We say that a function \( {f:\mathbb{R}^d\rightarrow\mathbb{R}} \) is a positive definite function when \( {(x,y)\mapsto K(x,y):=f(x-y)} \) is a positive definite kernel. The famous Bochner theorem on positive definite functions states that for every measurable \( {f:\mathbb{R}^d\rightarrow\mathbb{R}} \), the following are equivalent:

(i) \( {f} \) is continuous bounded and positive definite:
\[ \forall n\geq1, \forall x_1,\ldots,x_n\in\mathbb{R}^d, \forall c\in\mathbb{C}^n, \quad\sum_{i=1}^n\sum_{j=1}^nc_if(x_i-x_j)\bar{c}_j\geq0; \]
(ii) \( {f} \) is the Fourier transform of a finite Borel measure \( {\mu} \) on \( {\mathbb{R}^d} \):
\[ \forall x\in\mathbb{R}^d, \quad f(x)=\int\!e^{i\left<x,y\right>}\,d\mu(y). \]

The function \( {f} \) is even iff the measure \( {\mu} \) is symmetric. The boundedness of \( {f} \) is related to the finiteness of \( {\mu} \). Since the product of Fourier transforms is the Fourier transform of a convolution, and since for Borel measures, finiteness is stable by convolution, it follows that positive definiteness is stable by product: \( {fg} \) is positive definite as soon as \( {f} \) and \( {g} \) are positive definite. If \( {f} \) is positive definite then \( {f(0)\geq0} \) with equality iff \( {f=0} \). Moreover, \( {f(0)=1} \) iff \( {\mu} \) is a probability measure. In probability theory, the Bochner theorem characterizes the set of characteristic functions of probability measures as being the set of continuous bounded positive definite functions. The Bochner theorem is a nice piece of harmonic analysis relating positivity with trigonometry. To understand how to obtain (i) from (ii), just write simply

\[ \begin{array}{rcl} \sum_{j=1}^n\sum_{k=1}^nc_j\bar{c}_kf(x_j-x_k) &=&\sum_{j=1}^n\sum_{k=1}^nc_j\bar{c}_k\int\!e^{i\left<x_j-x_k,y\right>}\,d\mu(y)\\ &=&\int\!\sum_{j=1}^nc_je^{i\left<x_j,y\right>}\sum_{k=1}^n\bar{c}_ke^{-i\left<x_k,y\right>}\,d\mu(y)\\ &=&\int\!|\sum_{j=1}^nc_je^{i\left<x_j,y\right>}|^2\,d\mu(y)\geq0. \end{array} \]

A basic yet widely used example is given by the Gaussian function \( {f_c(x)=e^{-c\left\Vert x\right\Vert_2^2}} \) on \( {\mathbb{R}^d} \), \( {c>0} \), which is itself the Fourier transform of a Gaussian:

\[ \int_{\mathbb{R}^d}\!f_c(x)e^{i\left<x,y\right>}\,dy = \int_{\mathbb{R}^d}\!e^{i\left<x,y\right>}e^{-c\left\Vert y\right\Vert_2^2}\,dy = \left(\frac{\pi}{c}\right)^{d/2}e^{-\frac{\left\Vert x\right\Vert_2^2}{4c}} = \left(\frac{\pi}{c}\right)^{d/2} f_{1/4c}(x). \]

The Bochner theorem can be seen as a characterization of the Fourier transforms of finite Borel measures on \( {\mathbb{R}^d} \). By taking the (inverse) Fourier transform, it can also be seen as a characterization of functions with non-negative Fourier transform. We will not prove the Bochner theorem in this post (it remains to prove that (i) implies (ii)). We will rather be interested by the Schoenberg theorem on positive definite radial functions, which states that for any measurable \( {f:\mathbb{R}_+\rightarrow\mathbb{R}_+} \), the following are equivalent:

(j) \( {x\in\mathbb{R}^d\rightarrow f(\left\Vert x\right\Vert_2)} \) is continuous bounded positive definite for every \( {d\geq1} \):
\[ \forall d\geq1, \forall n\geq1, \forall x_1,\ldots,x_n\in\mathbb{R}^d, \forall c\in\mathbb{C}^n, \quad \sum_{i=1}^n\sum_{j=1}^nc_if(\left\Vert x_i-x_j\right\Vert_2)\bar{c}_j\geq0; \]
(jj) \( {t\in\mathbb{R}_+\mapsto f(\sqrt{t})} \) is the Laplace transform of a finite Borel measure \( {\mu} \) on \( {\mathbb{R}_+} \):
\[ \forall t\in\mathbb{R}_+, \quad f(\sqrt{t})=\int_0^\infty\!e^{-tx}\,d\mu(x). \]

Probabilistic proof of the Schoenberg theorem. By the Bochner theorem, (j) is equivalent to

(j') for every \( {d\geq1} \), there exists a finite Borel measure \( {\mu_d} \) on \( {\mathbb{R}^d} \) such that
\[ \forall x\in\mathbb{R}^d, \quad f(\left\Vert x\right\Vert_2)=\int_{\mathbb{R}^d}\!e^{i\left<x,y\right>}\,d\mu_d(y). \]

On the other hand, (jj) is equivalent to say that \( {f} \) is a (scale) mixture of Gaussians:

(jj') there exists a finite Borel measure \( {\mu} \) on \( {\mathbb{R}_+} \) such that
\[ \forall t\in\mathbb{R}_+, \quad f(t)=\int_0^\infty\!e^{-t^2s}\,d\mu(s). \]

Now to prove that (jj') implies (j') we write using (jj'):

\[ f(\left\Vert x\right\Vert_2) \overset{(jj')}{=} \int_0^\infty\!e^{-s\left \Vert x\right\Vert_2^2}\,d\mu(s) =\int_0^\infty\! \left(\int_{\mathbb{R}^d}\!e^{i\left<x,y\right>}\, \frac{e^{-\frac{\left\Vert y\right\Vert_2^2}{4s}}}{(4s\pi)^{d/2}}\,dy \right)d\mu(s) \]

and by the Fubini theorem we get (j') with \( {\mu_d} \) a Gaussian scale mixture:

\[ f(\left\Vert x\right\Vert_2) =\int_{\mathbb{R}^d}\!e^{i\left<x,y\right>} \left(\int_0^\infty\!\, \frac{e^{-\frac{\left\Vert y\right\Vert_2^2}{4s}}} {(4\pi s)^{d/2}}\,d\mu(s)\right)dy =: \int_{\mathbb{R}^d}\!e^{i\left<x,y\right>}\,d\mu_d(y). \]

Let us prove now that (j') implies (jj'). Let \( {S_n:=\frac{Z_1^2+\cdots+Z_n^2}{n}} \) where \( {Z_1,\ldots,Z_n} \) are i.i.d. Gaussian random variables on \( {\mathbb{R}} \) with zero mean and variance \( {t^2} \). Then, denoting \( {X:=\frac{1}{\sqrt{n}}(Z_1,\ldots,Z_n)} \) and using (j') and the Fubini theorem,

\[ \mathbb{E}(f(\sqrt{S_n})) =\mathbb{E}(f(\left\Vert X\right\Vert_2)) \overset{(j')}{=} \mathbb{E}\left(\int_{\mathbb{R}^n}\!e^{i\left<X,y\right>}\,d\mu_n(y)\right) =\int_{\mathbb{R}^n}\!\mathbb{E}\left(e^{i\left<X,y\right>}\right)\,d\mu_n(y). \]

Now, since \( {Z_1,\ldots,Z_n} \) are i.i.d. Gaussians of mean zero and variance \( {t^2} \),

\[ \mathbb{E}(f(\sqrt{S_n})) =\int_{\mathbb{R}^n}\!e^{-\frac{t^2}{2n}\left\Vert y\right\Vert_2^2}\,d\mu_n(y). \]

One may assume by scaling that \( {f(0)=1} \) and thus that \( {\mu_n} \) is a probability measure. If \( {(Y_1,\ldots,Y_n)} \) is a random vector of law \( {\mu_n} \), it follows that

\[ \mathbb{E}(f(\sqrt{S_n})) =\mathbb{E}\left(\exp\left(-\frac{t^2}{2}\frac{Y_1^2+\cdots+Y_n^2}{n}\right)\right). \]

The behavior of the left hand side is simple: by a weak law of large numbers, \( {\mathbb{E}(f(\sqrt{S_n}))\rightarrow f(t)} \) as \( {n\rightarrow\infty} \). On the other hand, for the right hand side, we note first that \( {\mu_n} \) is the first \( {n} \) marginal of \( {\mu_{m}} \) for every \( {m\geq n} \) (compatibility), and that \( {\mu_n} \) is exchangeable, and therefore, by the de Finetti theorem, there exists a \( {\sigma} \)-field \( {\mathcal{F}} \) such that \( {(Y_n)_{n\geq1}} \) are i.i.d. conditional on \( {\mathcal{F}} \). Conditional on \( {\mathcal{F}} \), by the Kolmogorov strong law of large numbers, \( {\frac{Y_1^2+\cdots+Y_n^2}{n}\rightarrow M} \) almost surely as \( {n\rightarrow\infty} \) where \( {M} \) is a \( {\mathcal{F}} \)-measurable random variable taking values in \( {[0,\infty]} \). Therefore by the dominated convergence theorem we obtain

\[ f(t)=\mathbb{E}(e^{-\frac{t^2}{2}M}\mathbf{1}_{\{M<\infty\}}). \]

Taking \( {t=0} \) gives \( {\mathbb{P}(M<\infty)=1} \) and (jj') follows: \( {\mu} \) is the law of \( {\frac{1}{2}M} \). This ends our probabilistic proof of the Schoenberg theorem.

Boundedness and Gaussian mixtures. In the Schoenberg theorem, the boundedness of \( {f} \) corresponds to the finiteness of \( {\mu} \). It is worthwhile to notice that the derivation of (j) from (jj') via (j') is indeed robust to unboundedness: if \( {f} \) is a mixture of Gaussian with a mixing Borel measure \( {\mu} \) which is not finite, then (j) still holds true. This observation is useful since in practice, many concrete kernels are not bounded around \( {0} \) or \( {+\infty} \). For example, one may think about the power kernel, which can be represented as Gaussian mixtures:

\[ \forall \beta>0, t>0,\quad \frac{1}{t^\beta} =\int_0^\infty\!\frac{x^{\beta/2-1}}{\Gamma(\beta/2)}e^{-xt^2}\,dx. \]

A careful analysis of the proof of the derivation of (jj') from (j') reveals that the argument may work beyond the bounded case, provided that \( {f} \) does not blow too fast neat \( {0} \) and \( {\infty} \).

For the logarithmic kernel, the Gaussian mixture has a constant negative shift:

\[ \forall t>0,\quad -\log(t)=\int_0^\infty\!\frac{e^{-xt^2}-e^{-x}}{2x}\,dx. \]

This is enough to get a definite positiveness restricted to the condition that \( {\sum_kc_k=0} \). As one can understand in the sequel, this restricted definite positiveness is actually enough to get the convexity (and even the strict convexity) of the integral functional associated to the logarithmic kernel.

Bernstein theorem on completely monotone functions. It states that for every continuous function \( {g:\mathbb{R}_+\rightarrow\mathbb{R}_+} \) in \( {\mathcal{C}^\infty((0,\infty))} \) the following are equivalent:

(k) \( {g} \) is completely monotone:
\[ \forall k\geq0, \forall t>0, \quad (-1)^kg^{(k)}(t)\geq0; \]
(i.e. \( {g\geq0} \), \( {g'\leq0} \), \( {g''\geq0} \), etc, in particular \( {g} \) is bounded and convex);
(kk) \( {g} \) is the Laplace transform of a finite Borel measure \( {\mu} \) on \( {\mathbb{R}_+} \):
\[ \forall t\in\mathbb{R}_+, \quad g(t)=\int_0^\infty\!e^{-tx}\,d\mu(x). \]

This theorem is also known as the (Hausdorff-)Bernstein(-Widder) theorem and can be seen as a characterization of the Laplace transforms of finite Borel measures on \( {\mathbb{R}_+} \). By combining the Schoenberg theorem with the Bernstein theorem, we obtain that for every continuous function \( {g:\mathbb{R}_+\rightarrow\mathbb{R}_+} \) in \( {\mathcal{C}^\infty((0,\infty))} \), the following are equivalent:

(l) \( {x\in\mathbb{R}^d\rightarrow g(\left\Vert x\right\Vert_2^2)} \) is positive definite for every \( {d\geq1} \):
\[ \forall d\geq1, \forall n\geq1, \forall x_1,\ldots,x_n\in\mathbb{R}^d, \forall c\in\mathbb{C}^n, \quad \sum_{i=1}^n\sum_{j=1}^nc_ig(\left\Vert x_i-x_j\right\Vert_2^2)\bar{c}_j\geq0; \]
(ll) \( {g} \) is completely monotone:
\[ \forall k\geq0, \forall t>0, \quad (-1)^kg^{(k)}(t)\geq0; \]

This has the advantage to furnish immediate examples:

\( {g(t)=\alpha} \), \( {\alpha\geq0} \) (constant kernel);
\( {g(t)=e^{-\alpha t}} \), \( {\alpha\geq0} \) (Gaussian kernel!);
\( {g(t)=\frac{1}{(\alpha+t)^\beta}} \), \( {\alpha>0} \), \( {\beta\geq0} \).

Application to the convexity of energy functionals. Let \( {K:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}} \) be a continuous bounded symmetric kernel, and let \( {\mathcal{P}_d} \) be the set of probability measures on \( {\mathbb{R}^d} \). We define the functional \( {I_K:\mathcal{P}_d\rightarrow\mathbb{R}} \) by

\[ I_K(\mu):=\iint_{\mathbb{R}^d\times\mathbb{R}^d}\!K(x,y)\,d\mu(x)\,d\mu(y). \]

If \( {K} \) is positive semidefinite then \( {I_K} \) is convex since for every \( {\mu,\nu\in\mathcal{P}_d} \) and \( {t\in[0,1]} \),

\[ \frac{tI_K(\mu)+(1-t)I_K(\nu) -I_K(t\mu+(1-t)\nu)}{t(1-t)} =\iint_{\mathbb{R}^d\times\mathbb{R}^d}\!K(x,y)d(\mu-\nu)(x)d(\mu-\nu)(y) \geq0. \]

More generally, if \( {c+K} \) is positive definite for some arbitrary real constant \( {c\in\mathbb{R}} \) then \( {I_K} \) is convex. The constant \( {c} \) is killed by the signed measure \( {\mu-\nu} \) of zero mass. In particular, if \( {g:\mathbb{R}_+\rightarrow\mathbb{R}_+} \) is completely monotone and if \( {c\in\mathbb{R}} \) is an arbitrary constant and if \( {K(x,y)=c+g(\left\Vert x-y\right\Vert_2^2)} \) then, thanks to the Bernstein and Schoenberg theorems, \( {I_K} \) is convex for every \( {d\geq1} \).

Another common way to show the convexity of the functional \( {I_K} \) is to represent the kernel \( {K} \) as a conic mixture of positive definite kernels, say

\[ K(x,y)=\int_{T}\!w(t)\,k_t(x,y)\,d\eta(t) \]

where \( {w\geq0} \) is a non-negative weight on some space \( {T} \) (typically \( {\mathbb{R}_+} \)) and where \( {\eta} \) is a Borel measure on \( {T} \), and where \( {k_t} \) is positive definite for every \( {t\in T} \). This is particularly convenient when \( {k_t} \) is a Gaussian kernel, since it allows then even the study of the strict convexity of \( {I_K} \) by expressing the Gaussian kernel as the Fourier transform of a Gaussian kernel, as in the proof of the first part of the Bochner theorem.

Mercer theorem and Hilbert-Schmidt trace class operators. It states in its simplest form that if \( {K:[0,1]\times[0,1]\rightarrow\mathbb{R}} \) is a continuous symmetric (\( {K(u,v)=K(v,u)} \) for all \( {u,v\in[0,1]} \)) positive definite kernel then the bounded linear operator \( {T:L^2([0,1])\rightarrow L^2([0,1])} \) defined by

\[ T(g)(u):=\int\!K(u,v)g(v)\,dv \]

is compact and \( {T(g)\geq0} \) for every \( {g\in L^2([0,1])} \). Moreover, if \( {e_1,e_2,\ldots} \) is an orthonormal basis of \( {L^2([0,1])} \) formed by eigenvectors of \( {T} \) associated to eigenvalues \( {\lambda_1\geq0,\lambda_2\geq0,\ldots} \) then \( {K(u,v)=\sum_{k=1}^\infty\lambda_k e_k(u)e_k(v)} \) for every \( {u,v\in[0,1]} \), and the convergence is absolute and uniform. Note that \( {T} \) is a trace class operator: \( {\mathrm{Tr}(T)=\int_0^1\!K(u,u)\,du=\sum_{k=1}^\infty\lambda_k} \).

Notes and further reading. Beyond the Euclidean space, the theory of positive definite functions can be developed on abstract structures, see for instance the book Harmonic analysis on semigroups by Berg, Christensen, and Ressel. Salomon Bochner (1999 - 1982) proved his theorem in a book in 1932. One may deduce (ii) from (i) by using the Stone-Weierstrass theorem, see for instance theorem 33.3 in the book Abstract Harmonic Analysis II by Hewitt and Ross. Isaac Jacob Schoenberg (1903 - 1990) published his theorem in 1938 in the Annals of Mathematics. The simple probabilistic proof of the Schoenberg theorem that we gave above is extracted from a short note by Davar Khoshnevisan and was reinvented many times by several authors. Sergei Natarovich Bernstein (1880 - 1968) proved his theorem in a paper published in 1928 in Acta Mathematica, see also theorem 12a in the book The Laplace Transform by Widder. For a simple proof, see for instance section 2 in a paper by Korenblum et al. James Mercer published his theorem in 1907 in the Philosophical Transactions of the Royal Society. Last but not least, one can take a look at the nice book entitled Fourier analysis in convex geometry by Koldobsky, which contains interesting material around the Bochner and the Schoenberg theorems together with functional analysis and convexity.

Some other posts: