The aim of this short post is to explain why the maximum entropy principle could be better seen as a minimum relative entropy principle, in other words an entropic projection.

Relative entropy. Let $\lambda$ be a reference measure on some measurable space $E$. The relative entropy with respect to $\lambda$ is defined for every measure $\mu$ on $E$ with density $\mathrm{d}\mu/\mathrm{d}\lambda$ by $$\mathrm{H}(\mu\mid\lambda):=\int\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\log\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\mathrm{d}\lambda.$$ If the integral is not well defined, we could simply set $\mathrm{H}(\mu\mid\lambda):=+\infty$.

• An important case is when $\lambda$ is a probability measure. In this case $\mathrm{H}$ becomes the Kullback-Leibler divergence, and the Jensen inequality for the strictly convex function $u\mapsto u\log(u)$ gives, for every probability measure $\mu$, $\mathrm{H}(\mu\mid\lambda)\geq0$ with equality if and only if $\mu=\lambda$.
• Another important case is when $\lambda$ is the Lebesgue measure on $\mathbb{R}^n$ or the counting measure on a discrete set. In this case $$-\mathrm{H}(\mu\mid\lambda)$$ is the Boltzmann-Shannon entropy of $\mu$. Beware that when $E=\mathbb{R}^n$, this entropy takes its values in the whole $(-\infty,+\infty)$ since for every scale factor $\sigma>0$, denoting by $\mu_\sigma$ the push forward of $\mu$ by the dilation $x\mapsto\sigma x$, we have $$\mathrm{H}(\mu_\sigma\mid\lambda)=\mathrm{H}(\mu\mid\lambda)-n\log \sigma.$$
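
The two cases above can be illustrated on a finite set. The following is a minimal sketch, assuming that measures are represented as plain lists of nonnegative weights; the function name `relative_entropy` and the sample weights are hypothetical choices made for this illustration.

```python
import math

def relative_entropy(mu, lam):
    """H(mu | lam) = sum of mu(x) * log(mu(x) / lam(x)) on a finite set.

    Convention: 0 log 0 = 0, and the value is +inf when mu charges
    a point that lam does not (no density with respect to lam).
    """
    total = 0.0
    for m, l in zip(mu, lam):
        if m == 0.0:
            continue
        if l == 0.0:
            return math.inf
        total += m * math.log(m / l)
    return total

mu = [0.5, 0.25, 0.25]  # hypothetical probability measure on three points

# Reference = probability measure: Kullback-Leibler divergence, nonnegative.
uniform = [1 / 3, 1 / 3, 1 / 3]
kl = relative_entropy(mu, uniform)

# Reference = counting measure: minus H is the Shannon entropy of mu.
counting = [1.0, 1.0, 1.0]
shannon = -relative_entropy(mu, counting)

print(kl, shannon)
```

On a set with three points the two quantities are related by `kl = log(3) - shannon`, which is the discrete analogue of changing the reference measure by a constant factor.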

Boltzmann-Gibbs probability measures. Such a probability measure $\mu_{V,\beta}$ takes the form $$\mathrm{d}\mu_{V,\beta}:=\frac{\mathrm{e}^{-\beta V}}{Z_{V,\beta}}\mathrm{d}\lambda$$ where $V:E\to(-\infty,+\infty]$, $\beta\in[0,+\infty)$, and $$Z_{V,\beta}:=\int\mathrm{e}^{-\beta V}\mathrm{d}\lambda<\infty$$ is the normalizing factor. The larger $\beta$ is, the more $\mu_{V,\beta}$ puts its probability mass on the regions where $V$ is low. The corresponding asymptotic analysis, known as the Laplace method, states that as $\beta\to\infty$ the probability measure $\mu_{V,\beta}$ concentrates on the minimizers of $V$.
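
The concentration phenomenon is easy to observe on a finite set. Here is a small sketch, assuming that $\lambda$ is the counting measure and that $V$ takes finite values; the helper `gibbs` and the energy values are hypothetical and only serve the illustration.

```python
import math

def gibbs(V, beta):
    """Weights exp(-beta * V) / Z with respect to the counting measure.

    V is a list of finite energies. Subtracting min(V) avoids overflow
    and cancels out in the ratio, so the measure is unchanged.
    """
    vmin = min(V)
    w = [math.exp(-beta * (v - vmin)) for v in V]
    Z = sum(w)
    return [x / Z for x in w]

V = [0.0, 1.0, 2.0, 0.0]  # hypothetical energy with two minimizers

for beta in (0.0, 1.0, 10.0, 100.0):
    print(beta, [round(p, 4) for p in gibbs(V, beta)])
```

At $\beta=0$ the measure is uniform, and as $\beta$ grows the mass splits evenly between the two minimizers of $V$.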

The mean of $V$, or $V$-moment of $\mu_{V,\beta}$, is given for $\beta>0$ by
$$\int V\mathrm{d}\mu_{V,\beta} =-\frac{1}{\beta}\mathrm{H}(\mu_{V,\beta}\mid\lambda)-\frac{1}{\beta}\log Z_{V,\beta}.$$
In thermodynamics $-\frac{1}{\beta}\log Z_{V,\beta}$ appears as a Helmholtz free energy since it is equal to $\int V\mathrm{d}\mu_{V,\beta}$ (mean energy) minus $\frac{1}{\beta}\times\bigl(-\mathrm{H}(\mu_{V,\beta}\mid\lambda)\bigr)$ (temperature times entropy).
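
For completeness, here is a short derivation of the formula for the $V$-moment, using only the definitions above and assuming $\beta>0$ and that the integrals are finite. Since $$\log\frac{\mathrm{d}\mu_{V,\beta}}{\mathrm{d}\lambda}=-\beta V-\log Z_{V,\beta},$$ integrating with respect to the probability measure $\mu_{V,\beta}$ gives $$\mathrm{H}(\mu_{V,\beta}\mid\lambda)=\int\log\frac{\mathrm{d}\mu_{V,\beta}}{\mathrm{d}\lambda}\mathrm{d}\mu_{V,\beta}=-\beta\int V\mathrm{d}\mu_{V,\beta}-\log Z_{V,\beta},$$ and it remains to divide by $\beta$ and rearrange.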

When $\beta$ ranges from $-\infty$ to $\infty$, the $V$-moment of $\mu_{V,\beta}$ ranges from $\sup V$ down to $\inf V$, and $$\partial_\beta\int V\mathrm{d}\mu_{V,\beta}=\Bigl(\int V\mathrm{d}\mu_{V,\beta}\Bigr)^2-\int V^2\mathrm{d}\mu_{V,\beta}\leq0,$$ which is minus the variance of $V$ under $\mu_{V,\beta}$. If $\lambda(E)<\infty$ then $\mu_{V,0}=\frac{1}{\lambda(E)}\lambda$ and its $V$-moment is $\frac{1}{\lambda(E)}\int V\mathrm{d}\lambda$.
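
The monotonicity of the $V$-moment in $\beta$ can be checked numerically. The sketch below assumes the counting measure on a three-point set; the helpers `gibbs` and `moment`, the energy values, and the step `h` are hypothetical choices. It compares a centered finite difference with minus the variance of $V$.

```python
import math

def gibbs(V, beta):
    """Weights exp(-beta * V) / Z with respect to the counting measure."""
    vmin = min(V)  # shift for numerical stability, cancels in the ratio
    w = [math.exp(-beta * (v - vmin)) for v in V]
    Z = sum(w)
    return [x / Z for x in w]

def moment(V, beta, k=1):
    """Integral of V**k with respect to the Gibbs measure."""
    return sum(p * v**k for p, v in zip(gibbs(V, beta), V))

V = [0.0, 1.0, 3.0]  # hypothetical energy levels
beta, h = 0.7, 1e-5

# Centered finite difference of beta -> mean of V.
finite_diff = (moment(V, beta + h) - moment(V, beta - h)) / (2 * h)

# (mean of V)^2 - mean of V^2, that is, minus the variance.
minus_variance = moment(V, beta) ** 2 - moment(V, beta, 2)

print(finite_diff, minus_variance)
print(moment(V, -50.0), moment(V, 50.0))  # close to max V and min V
```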

Variational principle. Let $\beta\geq0$ be such that $Z_{V,\beta}<\infty$ and $c:=\int V\mathrm{d}\mu_{V,\beta}<\infty$. Then, among all the probability measures $\mu$ on $E$ with the same $V$-moment as $\mu_{V,\beta}$, the relative entropy $\mathrm{H}(\mu\mid\lambda)$ is minimized by the Boltzmann-Gibbs measure $\mu_{V,\beta}$. In other words, $$\min_{\int V\mathrm{d}\mu=c}\mathrm{H}(\mu\mid\lambda)=\mathrm{H}(\mu_{V,\beta}\mid\lambda).$$

Indeed we have \begin{align*}\mathrm{H}(\mu\mid\lambda)-\mathrm{H}(\mu_{V,\beta}\mid\lambda)&=\int\log\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\mathrm{d}\mu-\int\log\frac{\mathrm{d}\mu_{V,\beta}}{\mathrm{d}\lambda}\mathrm{d}\mu_{V,\beta}\\&=\int\log\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\mathrm{d}\mu+\int(\log(Z_{V,\beta})+\beta V)\mathrm{d}\mu_{V,\beta}\\&=\int\log\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\mathrm{d}\mu+\int(\log(Z_{V,\beta})+\beta V)\mathrm{d}\mu\\&=\int\log\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\mathrm{d}\mu-\int\log\frac{\mathrm{d}\mu_{V,\beta}}{\mathrm{d}\lambda}\mathrm{d}\mu\\&=\mathrm{H}(\mu\mid\mu_{V,\beta})\geq0\end{align*} with equality if and only if $\mu=\mu_{V,\beta}$. The crucial point, used in the third equality, is that $\mu$ and $\mu_{V,\beta}$ give the same integral to test functions of the form $a+bV$ where $a,b$ are arbitrary real constants, by assumption.
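
The identity behind this proof can be checked on a finite set. The following is a minimal numerical sketch, assuming that $\lambda$ is the counting measure on four points; the helpers, the energy values, and the perturbation direction `d` are hypothetical. The direction `d` is chosen orthogonal to both the constants and $V$, so that the perturbed measure keeps total mass $1$ and the same $V$-moment.

```python
import math

def gibbs(V, beta):
    """Weights exp(-beta * V) / Z with respect to the counting measure."""
    vmin = min(V)  # shift for numerical stability, cancels in the ratio
    w = [math.exp(-beta * (v - vmin)) for v in V]
    Z = sum(w)
    return [x / Z for x in w]

def relative_entropy(mu, ref):
    """H(mu | ref) on a finite set, assuming ref > 0 wherever mu > 0."""
    return sum(m * math.log(m / r) for m, r in zip(mu, ref) if m > 0.0)

V = [0.0, 1.0, 2.0, 3.0]      # hypothetical energy levels
lam = [1.0, 1.0, 1.0, 1.0]    # counting measure
g = gibbs(V, 1.0)             # Boltzmann-Gibbs measure with beta = 1

# Perturbation with sum(d) = 0 and sum(d * V) = 0.
d = [1.0, -2.0, 1.0, 0.0]
eps = 0.05
mu = [gi + eps * di for gi, di in zip(g, d)]

gap = relative_entropy(mu, lam) - relative_entropy(g, lam)
kl = relative_entropy(mu, g)

print(gap, kl)  # the two numbers agree up to rounding, and are positive
```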

• When $\lambda$ is the Lebesgue measure on $\mathbb{R}^n$ or the counting measure on a discrete set, we recover the usual maximum Boltzmann-Shannon entropy principle $$\max_{\int V\mathrm{d}\mu=c}-\mathrm{H}(\mu\mid\lambda)=-\mathrm{H}(\mu_{V,\beta}\mid\lambda).$$ In particular, Gaussians maximize the Boltzmann-Shannon entropy under a variance constraint (take for $V$ a quadratic form), while the uniform measures maximize the Boltzmann-Shannon entropy under a support constraint (take $V$ constant on a set of finite measure for $\lambda$, and infinity elsewhere). Maximum entropy is minimum relative entropy with respect to the Lebesgue or counting measure, a way to find, among the probability measures with a moment constraint, the closest to the Lebesgue or counting measure.
• When $\lambda$ is a probability measure, then we recover the fact that the Boltzmann-Gibbs measures realize the projection or least Kullback-Leibler divergence of $\lambda$ on the set of probability measures with a given $V$-moment. This is the Csiszár $\mathrm{I}$-projection.
• There are other interesting applications, for instance when $\lambda$ is a Poisson point process.
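
As a concrete illustration of the Gaussian case, one can compare the Boltzmann-Shannon entropies of three centered probability densities on $\mathbb{R}$ with variance $1$. The sketch below uses the standard closed-form entropies of the Gaussian, Laplace, and uniform distributions; the choice of these three competitors is only an example.

```python
import math

# Standard Gaussian N(0, 1): entropy (1/2) log(2 pi e).
gaussian = 0.5 * math.log(2 * math.pi * math.e)

# Laplace with scale b has variance 2 b^2 and entropy 1 + log(2 b).
b = 1 / math.sqrt(2)  # variance 1
laplace = 1.0 + math.log(2 * b)

# Uniform on [-a, a] has variance a^2 / 3 and entropy log(2 a).
a = math.sqrt(3)  # variance 1
uniform = math.log(2 * a)

print(gaussian, laplace, uniform)
```

The Gaussian value is the largest of the three, in agreement with the variational principle applied to $V(x)=x^2$.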

Note. The concept of maximum entropy was studied notably by

and by Edwin Thompson Jaynes (1922 – 1998) in connection with thermodynamics, statistical physics, statistical mechanics, information theory, and Bayesian statistics. The concept of I-projection or minimum relative entropy was studied notably by Imre Csiszár (1938 – ).
