Libres pensées d'un mathématicien ordinaire – LPMO

Photo of Alston Scott Householder (1904 – 1993). An early adopter of ReLU as an activation function for neural networks. — Alston Scott Householder (1904 – 1993). An early adopter of ReLU as an activation function for neural networks.

From a functional point of view, convexity is a variational way to go beyond linearity, via the envelope property. This elementary post is devoted to another representation of convex functions, as a superposition of elementary convex nonlinearities. This representation is related to the positivity of the second derivative in a weak sense.

Integral representation of convex functions as mixtures of ReLU. If $f:\mathbb{R}\to\mathbb{R}$ is convex then there exists a positive Borel measure $\mu$ such that for all $a\in\mathbb{R}$ and $x\geq a$, \[ f(x)=f(a)+f'_-(a)(x-a)+\int_a^\infty(x-t)_+\mathrm{d}\mu(t). \] This integral formula expresses the convex function $f$ as a mixture or conic combination of an affine function and a mixture of shifts of the elementary nonlinear convex function $x\mapsto x_+$, known as the Rectified Linear Unit (ReLU). The positive Borel measure $\mu$ is the second derivative $f''$ of $f$ in the sense of Schwartz distributions.

Proof. Let us consider first the case where $f:\mathbb{R}\to\mathbb{R}$ is $\mathcal{C}^2$ but not necessarily convex. The Taylor formula with Laplace integral reminder writes then, for all $a\leq x$, \[ f(x)=f(a)+f'(a)(x-a)+\int_a^xf''(t)(x-t)\mathrm{d}t. \] This is simply a consequence of the fundamental theorem of calculus and integration by parts. Now, since $(x-t)\mathbf{1}_{[a,x]}(t)=(x-t)_+\mathbf{1}_{[a,+\infty)}(t)$ this rewrites as \[ f(x)=f(a)+f'(a)(x-a)+\int_a^\infty(x-t)_+f''(t)\mathrm{d}t. \] When $f$ is convex, this integral formula expresses $f$ as a mixture or conic combination of an affine function and a mixture of shifts of the ReLU function $x\mapsto x_+$.

Suppose now that $f:\mathbb{R}\to\mathbb{R}$ is convex but not necessarily $\mathcal{C}^2$. It is then natural to use an extended fundamental theorem of calculus (FTC) and integration by parts (IBP). Recall that there exists a countable subset $N\subset\mathbb{R}$ such that $f$ is differentiable on $\mathbb{R}\setminus N$ and $f'(x)=f'_-(x)=f'_+(x)$ for all $x\in\mathbb{R}\setminus N$. The FTC states that for all $a < b$, \[ f(b)-f(a)=\int_a^b\varphi(x)\mathrm{d}x, \] where $\varphi(x)\in f'(x)=[f'_-(x),f'_+(x)]$ is chosen arbitrarily, for all $x\in\mathbb{R}$. See [NP, Proposition 1.6.1] or [KS, Problem 6.21]. Since $f'_-$ is a non-decreasing function, it defines a Stieltjes positive Borel measure, denoted $\mu$, that coincides with the second derivative $f''$ in the sense of Schwartz distributions. We have $\mu([a,b))=f'_-(b)-f'_-(a)$, $a < b$. The integration by parts (IBP) reads \[ \int g\mathrm{d}\mu =-\int g'(t)f'_-(t)\mathrm{d}t \] for all $g:I\to\mathbb{R}$ which is continuous and piecewise $\mathcal{C}^1$ with compact support.

We would like to use it with $g(t)=(x-t)_+$, which is not compactly supported. Also, we introduce the following tent function, for any fixed $a\leq x$ and $\varepsilon > 0$ : \begin{align*} g_\varepsilon(t) &=(x-a)\frac{t-a+\varepsilon}{\varepsilon}\mathbf{1}_{[a-\varepsilon,a]}(t)+(x-t)\mathbf{1}_{[a,x]}(t)\\ &=(x-a)\frac{t-a+\varepsilon}{\varepsilon}\mathbf{1}_{[a-\varepsilon,a]}(t)+(x-t)_+\mathbf{1}_{[a,+\infty)}(t). \end{align*} It is continuous, compactly supported, piecewise linear, and, for $t\not\in\{a-\varepsilon,x,a\}$, \[ g_\varepsilon'(t)=(x-a)\mathbf{1}_{(a-\varepsilon,a)}(t)-\mathbf{1}_{(a,x)}(t). \] Now, using integration by parts (IBP) and the fundamental theorem of calculus (FTC), \begin{align*} (x-a)\int_{a-\varepsilon}^a\frac{t-a+\varepsilon}{\varepsilon}\mathrm{d}\mu(t)+\int_a^\infty(x-t)_+\mathrm{d}\mu(t) &\overset{\mathrm{IBP}}{=}-\int_{\mathbb{R}}g_\varepsilon'(t)f'_-(t)\mathrm{d}t\\ &=-\frac{x-a}{\varepsilon}\int_{a-\varepsilon}^af'_-(t)\mathrm{d}t+\int_{a}^xf'_-(t)\mathrm{d}t\\ &\overset{\mathrm{FTC}}{=}-\frac{x-a}{\varepsilon}\int_{a-\varepsilon}^af'_-(t)\mathrm{d}t+f(x)-f(a). \end{align*} Finally, it remains to note that \[ \lim_{\varepsilon\to0}\int_{a-\varepsilon}^a\frac{t-a+\varepsilon}{\varepsilon}\mathrm{d}\mu(t)=0 \quad\text{and}\quad \lim_{\varepsilon\to0} \frac{1}{\varepsilon}\int_{a-\varepsilon}^af'_-(t)\mathrm{d}t =f'_-(a). \] This reasoning is a variant of [KS, Proof of Theorem 7.1(v) pages 223-224].

Alternative. Following [NP, Theorem 1.6.3], it is alternatively possible to show, still by using IBP and FTC, that for all $a < b$ and $x\in[a,b]$, \[ f(x)=\int_a^bg(x,y)\mathrm{d}\mu(y) +\frac{b-x}{b-a}f(a) +\frac{x-a}{b-a}f(b), \] where this time $g$ is the Green function of the interval $[a,b]$ given by \[ g(x,y)= \frac{(x-a)(y-b)}{b-a}\mathbf{1}_{a\leq x\leq y\leq b} +\frac{(x-b)(y-a)}{b-a}\mathbf{1}_{a\leq y\leq x\leq b}. \] The function $g$ is continuous, symmetric, $\leq0$ on $[a,b]\times[a,b]$, convex in both variables, vanishes at the boundary, and $\partial_1g(x_+,x)-\partial_1g(x_-,x)=1$.

Extension. If a function $f:\mathbb{R}\to\mathbb{R}$ is continuous, and if $f'$ and $f''$ exist and are continuous except on a finite set, with left and right limits at all these points, then $f$ is the difference of two convex functions, see for instance [KS, Problem 24]. This observation allows to extend the integral representation immediately.

Desintegration. By taking the derivative in the sense of Schwartz distribution in the integral representation formula $f(x)=f(a)+f'_-(x-a)+\int_a^\infty(x-t)_+\mathrm{d}\mu(t)$, we get $f''(x)=\int_a^\infty\delta_t(x)\mathrm{d}\mu(t)$ on $[a,+\infty)$, which gives $f''=\mu$ on $[a,+\infty)$ since \[ \int_a^\infty\delta_t\mathrm{d}\mu(t)=\mathbf{1}_{[a,+\infty)}\mu. \] This last formula is intuitive but strange. Actually, by disintegration of measure, we know that for all positive Borel measures $\mu$ and $\nu$ on $\mathbb{R}$, there exists a measurable map $x\mapsto\mu_x$ taking values in positive Borel measures on $\mathbb{R}$, such that \[ \mu=\int\mu_x\mathrm{d}\nu(x) \] in the sense that $\int f\mathrm{d}\mu=\int(\int f(y)\mathrm{d}\mu_x(y))\mathrm{d}\nu(x)$ for every bounded measurable $f:\mathbb{R}\to\mathbb{R}$. In the special case where $\nu_x=\delta_x$ for all $x$, and $\nu=\mu$, this reads \[ \mu=\int\delta_x\mathrm{d}\mu(x). \] In other words a measure is always a mixture of Dirac masses by itself. When $\mu$ is a probability measure, this corresponds to conditioning for the couple $(X,X)$ with $X\sim\mu$.

Further comments. This integral representation of convex functions is useful in many contexts. It can be used for instance to extend the Itô formula of stochastic calculus from $\mathcal{C}^2$ functions to convex functions, see [KS, Theorem 7.1(v)].

Actually it is a generalization of an even simpler observation for piecewise linear convex functions. More precisely, a theorem due to Tiberiu Popoviciu (1906-1975), see [P], states that if $f:[a,b]\to\mathbb{R}$ is piecewise linear and convex then it is the conic combination of an affine function and shifts of ReLU, namely we have a representation of the form \[ f(x)=\alpha x+\beta+\sum_{k=1}^nc_k(x-x_k)_+ \] for some $\alpha,\beta\in\mathbb{R}$, $c_1\geq0,\ldots,c_n\geq0$, $x_1,\ldots,x_n\in\mathbb{R}$. See for instance [NP, Theorem 1.5.7]. Since $x_+=\frac{1}{2}(x+|x|)$, we have equivalently a representation of the form \[ f(x)=\widetilde\alpha x+\beta+\frac{1}{2}\sum_{k=1}^nc_k|x-x_k| \quad\text{where}\quad \widetilde\alpha=\alpha+\frac{1}{2}\sum_{k=1}^nc_k. \] It is natural to ask if for a multivariate or higher dimensional convex function, a similar representation always holds by replacing the absolute value with the norm. The answer in negative, see [NP, 4.1 Exercise 3] for a counter example.

Further reading.

[NP] Constantin P. Niculescu and Lars-Erik Persson
Convex functions and their applications - A contemporary approach
Springer (2006, First Edition)
Theorem 1.5.7, Proposition 1.6.1, Theorem 1.6.3
[KS] Ioannis Karatzas and Steven E. Shreve
Brownian Motion and Stochastic Calculus
Springer (1991, Second Edition)
Problem 6.21, Proof of Theorem 7.1(v) pages 223-224
[P] Tiberiu Popoviciu
Sur certaines inégalités qui caractérisent les fonctions convexes
An. Sti. Univ. Al. I. Cuza Iasi, N. Ser., Sect. Ia 11B, 155-164 (1965)

Acknowledgments. The idea of writing this post came after teaching the de La Vallée Poussin criterion of uniform integrability during the first year of master course on stochastic processes at École normale supérieure. The reference [KS] and the link with the Itô formula for convex functions were pointed out by Laëtitia Comminges.

Libres pensées d'un mathématicien ordinaire Posts

Integral representation of convexity by ReLU mixtures