
A few words about entropy

Nicolas Léonard Sadi Carnot (1796 - 1832), a romantic figure behind entropy

Why and how does entropy emerge in basic mathematics? This tiny post aims to provide some answers. We already tried something in this spirit in a previous post, almost ten years ago.

Combinatorics. Asymptotic analysis of the multinomial coefficient $\binom{n}{n_1,\dots,n_r}:=\frac{n!}{n_1!\cdots n_r!}$ :
$$\frac{1}{n}\log\binom{n}{n_1,\dots,n_r}\xrightarrow[\substack{\nu_i:=\frac{n_i}{n}\to p_i\\ n=n_1+\cdots+n_r\to\infty}]{}S(p):=-\sum_{i=1}^r p_i\log(p_i).$$ Recall that if $A=\{a_1,\dots,a_r\}$ is a finite set of cardinality $r$ and $n=n_1+\cdots+n_r$ then
$$\mathrm{Card}\Bigl\{(x_1,\dots,x_n)\in A^n:\forall 1\leq i\leq r,\ \sum_{k=1}^n\mathbf{1}_{x_k=a_i}=n_i\Bigr\}=\binom{n}{n_1,\dots,n_r}.$$ The multinomial coefficient can be interpreted as the number of microstates $(x_1,\dots,x_n)$ compatible with the macrostate $(n_1,\dots,n_r)$, while the quantity $S(p)$ appears as a normalized asymptotic measure of additive degrees of freedom or disorder. This is already in the work of Ludwig Eduard Boltzmann (1844 - 1906) in kinetic gas theory, at the origins of statistical physics. The quantity $S(p)$ is also the one used by Claude Elwood Shannon (1916 - 2001) in information and communication theory as the average length of optimal lossless coding. It is characterized by the following three natural axioms or properties, denoting $S^{(n)}$ to keep track of $n$ :

  • for all $n\geq1$, $p\mapsto S^{(n)}(p)$ is continuous;
  • for all $n\geq1$, $S^{(n)}(\tfrac{1}{n},\dots,\tfrac{1}{n})<S^{(n+1)}(\tfrac{1}{n+1},\dots,\tfrac{1}{n+1})$;
  • for all $n=n_1+\cdots+n_r\geq1$, $S^{(n)}(\tfrac{1}{n},\dots,\tfrac{1}{n})=S^{(r)}(\tfrac{n_1}{n},\dots,\tfrac{n_r}{n})+\sum_{i=1}^r\frac{n_i}{n}S^{(n_i)}(\tfrac{1}{n_i},\dots,\tfrac{1}{n_i})$.
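The asymptotics $\frac{1}{n}\log\binom{n}{n_1,\dots,n_r}\to S(p)$ above can be checked numerically. Here is a minimal Python sketch using the log-Gamma function; the helper names are ours, not part of the text.

```python
# Numerical check of (1/n) log C(n; n_1,...,n_r) -> S(p), via log-Gamma.
from math import lgamma, log

def log_multinomial(counts):
    """log of n!/(n_1! ... n_r!) computed with log-Gamma."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)

def entropy(p):
    """Boltzmann-Shannon entropy S(p) = -sum p_i log p_i (natural log)."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]
for n in (10**2, 10**4, 10**6):
    counts = [round(n * pi) for pi in p]   # n_i close to n p_i
    counts[0] += n - sum(counts)           # make the counts sum to n exactly
    print(n, log_multinomial(counts) / n, entropy(p))
```

The normalized logarithm approaches $S(p)\approx1.0297$ as $n$ grows, with the usual logarithmic correction at moderate $n$.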

Probability. If $X_1,\dots,X_n$ are independent and identically distributed random variables of law $\mu$ on a finite set or alphabet $A=\{a_1,\dots,a_r\}$, then for all $x_1,\dots,x_n\in A$,
$$\mathbb{P}((X_1,\dots,X_n)=(x_1,\dots,x_n))=\prod_{i=1}^r\mu_i^{\sum_{k=1}^n\mathbf{1}_{x_k=a_i}}=\prod_{i=1}^r\mu_i^{n\nu_i}=\mathrm{e}^{n\sum_{i=1}^r\nu_i\log\mu_i}=\mathrm{e}^{-n(S(\nu)+H(\nu\mid\mu))},$$ where $\nu_i:=\frac{1}{n}\sum_{k=1}^n\mathbf{1}_{x_k=a_i}$ is the empirical frequency of $a_i$ in the sample. This is a remarkable identity, where $S(\nu)$ is the Boltzmann-Shannon entropy considered before, and where $H(\nu\mid\mu)$ is a new quantity known as the Kullback-Leibler divergence or relative entropy :
$$S(\nu):=-\sum_{i=1}^r\nu_i\log\nu_i=-\int f(x)\log f(x)\,\mathrm{d}x$$ where $f$ is the density of $\nu$ with respect to the counting measure $\mathrm{d}x$, and
$$H(\nu\mid\mu):=\sum_{i=1}^r\nu_i\log\frac{\nu_i}{\mu_i}=\sum_{i=1}^r\frac{\nu_i}{\mu_i}\log\Bigl(\frac{\nu_i}{\mu_i}\Bigr)\mu_i=\int\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\log\Bigl(\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\Bigr)\mathrm{d}\mu.$$ This comes from information theory, after Solomon Kullback (1907 - 1994) and Richard Leibler (1914 - 2003). Here $S(\nu)$ measures the combinatorics on $x_1,\dots,x_n$ at prescribed frequencies $\nu$, while $H(\nu\mid\mu)$ measures the cost or energy of deviation from the actual distribution $\mu$. This is a Boltzmann-Gibbsification of the probability $\mathbb{P}((X_1,\dots,X_n)=(x_1,\dots,x_n))$, see below, leading via the Laplace method to the large deviations principle of Ivan Nikolaevich Sanov (1919 - 1968). The Jensen inequality for the strictly convex function $u\mapsto u\log(u)$ gives
$$H(\nu\mid\mu)\geq0\quad\text{with equality iff}\quad\nu=\mu.$$
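Here is a small Python sanity check of the identity $\mathbb{P}((X_1,\dots,X_n)=(x_1,\dots,x_n))=\mathrm{e}^{-n(S(\nu)+H(\nu\mid\mu))}$ on a random sample; the alphabet, law, and sample size are arbitrary illustrative choices.

```python
# Check that log prod_i mu_{x_i} = -n (S(nu) + H(nu | mu)) for a random sample.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.2])                  # law on A = {a_1, a_2, a_3}
n = 50
x = rng.choice(len(mu), size=n, p=mu)           # i.i.d. sample of law mu
nu = np.bincount(x, minlength=len(mu)) / n      # empirical frequencies nu_i

log_prob = np.log(mu[x]).sum()                  # log of prod_i mu_{x_i}
S = -np.sum(nu[nu > 0] * np.log(nu[nu > 0]))    # Boltzmann-Shannon entropy S(nu)
H = np.sum(nu[nu > 0] * np.log(nu[nu > 0] / mu[nu > 0]))  # relative entropy H(nu | mu)

print(log_prob, -n * (S + H))                   # the two numbers coincide
```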

Statistics. If $Y_1,\dots,Y_n$ are independent and identically distributed random variables of law $\mu(\theta)$ in a parametric family parametrized by $\theta$, on a finite set $A$, then, following Ronald Aylmer Fisher (1890 - 1962), the likelihood of the data $(x_1,\dots,x_n)\in A^n$ is
$$\ell_{x_1,\dots,x_n}(\theta):=\mathbb{P}(Y_1=x_1,\dots,Y_n=x_n)=\prod_{i=1}^n\mu(\theta)_{x_i}.$$ It can also be seen as the likelihood of $\theta$ with respect to $x_1,\dots,x_n$. This dual point of view leads to the following : if $X_1,\dots,X_n$ is an observed sample of $\mu(\theta_\star)$ with $\theta_\star$ unknown, then the maximum likelihood estimator of $\theta_\star$ is
$$\widehat{\theta}_n:=\arg\max_{\theta\in\Theta}\ell_{X_1,\dots,X_n}(\theta)=\arg\max_{\theta\in\Theta}\Bigl(\frac{1}{n}\log\ell_{X_1,\dots,X_n}(\theta)\Bigr).$$ The asymptotic analysis via the law of large numbers reveals entropy as an asymptotic contrast :
$$\frac{1}{n}\log\ell_{X_1,\dots,X_n}(\theta)=\frac{1}{n}\sum_{i=1}^n\log\mu(\theta)_{X_i}\xrightarrow[n\to\infty]{\mathrm{a.s.}}\sum_{k=1}^r\mu(\theta_\star)_k\log\mu(\theta)_k=\underbrace{-S(\mu(\theta_\star))}_{\mathrm{const}}-H(\mu(\theta_\star)\mid\mu(\theta)).$$
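As an illustration, here is a Python sketch for a Bernoulli model $\mu(\theta)=(1-\theta,\theta)$, a choice of ours for concreteness: the normalized log-likelihood at each $\theta$ is close to $-S(\mu(\theta_\star))-H(\mu(\theta_\star)\mid\mu(\theta))$, which is maximal at $\theta=\theta_\star$.

```python
# Normalized log-likelihood vs its almost sure limit, Bernoulli model mu(theta) = (1 - theta, theta).
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3
n = 100_000
x = rng.binomial(1, theta_star, size=n)          # observed sample of law mu(theta_star)

def log_likelihood(theta):
    return np.where(x == 1, np.log(theta), np.log(1 - theta)).sum() / n

def limit(theta):
    p = np.array([1 - theta_star, theta_star])   # true law mu(theta_star)
    q = np.array([1 - theta, theta])             # candidate law mu(theta)
    S = -np.sum(p * np.log(p))                   # entropy of the true law
    H = np.sum(p * np.log(p / q))                # relative entropy H(mu(theta_star) | mu(theta))
    return -S - H

for theta in (0.1, 0.3, 0.5):
    print(theta, log_likelihood(theta), limit(theta))
```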

Analysis. The entropy appears naturally as a derivative of the $\mathrm{L}^p$ norm of $f\geq0$ as follows :
$$\partial_p\|f\|_p^p=\partial_p\int f^p\,\mathrm{d}\mu=\partial_p\int\mathrm{e}^{p\log(f)}\,\mathrm{d}\mu=\int f^p\log(f)\,\mathrm{d}\mu=\frac{1}{p}\int f^p\log(f^p)\,\mathrm{d}\mu.$$ This is at the heart of the Leonard Gross (1931 - ) theorem relating the hypercontractivity of Markov semigroups to the logarithmic Sobolev inequality for the invariant measure. This can also be used to extract from the William Henry Young (1863 - 1942) convolution inequalities certain entropic uncertainty principles.
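A quick finite-difference check of $\partial_p\|f\|_p^p=\int f^p\log(f)\,\mathrm{d}\mu$ on a discrete measure, as a Python sketch; the measure and the function are arbitrary, and this is not tied to the hypercontractivity results mentioned above.

```python
# Finite-difference check of d/dp ||f||_p^p = int f^p log(f) dmu on a discrete measure.
import numpy as np

rng = np.random.default_rng(2)
w = rng.random(5); w /= w.sum()                  # a probability measure mu on 5 points
f = rng.random(5) + 0.5                          # a positive function f

def norm_p_to_the_p(p):
    return np.sum(w * f**p)                      # ||f||_p^p = int f^p dmu

p, eps = 3.0, 1e-6
numerical = (norm_p_to_the_p(p + eps) - norm_p_to_the_p(p - eps)) / (2 * eps)
exact = np.sum(w * f**p * np.log(f))             # int f^p log(f) dmu
print(numerical, exact)                          # agree up to O(eps^2)
```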

Boltzmann-Gibbs measures, variational characterizations, and Helmholtz free energy. We take $V:A\to\mathbb{R}$, interpreted as an energy. Maximizing $\mu\mapsto S(\mu)$ under the constraint of average energy $\int V\mathrm{d}\mu=v$ gives the maximizer
$$\mu_\beta:=\frac{1}{Z_\beta}\mathrm{e}^{-\beta V}\mathrm{d}x\quad\text{where}\quad Z_\beta:=\int\mathrm{e}^{-\beta V}\mathrm{d}x.$$ We use integrals instead of sums to lighten notation. The notation $\mathrm{d}x$ stands for the counting measure on $A$. The parameter $\beta>0$, interpreted as an inverse temperature, is dictated by $v$. Such a probability distribution $\mu_\beta$ is known as a Boltzmann-Gibbs distribution, after Ludwig Eduard Boltzmann (1844 - 1906) and Josiah Willard Gibbs (1839 - 1903). We have a variational characterization as a maximum entropy at fixed average energy :
$$\int V\mathrm{d}\mu=\int V\mathrm{d}\mu_\beta\ \Longrightarrow\ S(\mu_\beta)-S(\mu)=H(\mu\mid\mu_\beta).$$ There is a dual point of view in which, instead of fixing the average energy, we fix the inverse temperature $\beta$ and we introduce the Hermann von Helmholtz (1821 - 1894) free energy
$$F(\mu):=\int V\mathrm{d}\mu-\frac{1}{\beta}S(\mu).$$ This can be seen as a Joseph-Louis Lagrange (1736 - 1813) point of view in which the constraint is added to the functional. We have
$$F(\mu_\beta)=-\frac{1}{\beta}\log(Z_\beta)\quad\text{since}\quad S(\mu_\beta)=\beta\int V\mathrm{d}\mu_\beta+\log Z_\beta.$$ We then have a new variational characterization as a minimum free energy at fixed temperature :
$$F(\mu)-F(\mu_\beta)=\frac{1}{\beta}H(\mu\mid\mu_\beta).$$ This explains why $H$ is often called a free energy.
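Here is a Python sketch, on a small finite set with an arbitrary energy $V$ of our choosing, checking both $F(\mu_\beta)=-\frac{1}{\beta}\log(Z_\beta)$ and $F(\mu)-F(\mu_\beta)=\frac{1}{\beta}H(\mu\mid\mu_\beta)$.

```python
# Boltzmann-Gibbs measure on a finite set: free-energy identities.
import numpy as np

rng = np.random.default_rng(3)
V = rng.random(6)                                # an energy V on a 6-point set
beta = 2.0

Z = np.exp(-beta * V).sum()                      # partition function Z_beta (counting measure)
mu_beta = np.exp(-beta * V) / Z                  # Boltzmann-Gibbs measure mu_beta

def free_energy(mu):
    S = -np.sum(mu * np.log(mu))                 # entropy S(mu)
    return np.sum(V * mu) - S / beta             # F(mu) = int V dmu - (1/beta) S(mu)

mu = rng.random(6); mu /= mu.sum()               # an arbitrary probability measure
H = np.sum(mu * np.log(mu / mu_beta))            # relative entropy H(mu | mu_beta)

print(free_energy(mu) - free_energy(mu_beta), H / beta)   # identical
print(free_energy(mu_beta), -np.log(Z) / beta)            # F(mu_beta) = -(1/beta) log Z_beta
```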

Legendre transform. The relative entropy $\nu\mapsto H(\nu\mid\mu)$ is the Legendre transform of the log-Laplace transform, in the sense that
$$\sup_g\Bigl\{\int g\,\mathrm{d}\nu-\log\int\mathrm{e}^g\,\mathrm{d}\mu\Bigr\}=H(\nu\mid\mu).$$ Indeed, for all $h$ such that $\int\mathrm{e}^h\mathrm{d}\mu=1$, by the Jensen inequality, with $f:=\frac{\mathrm{d}\nu}{\mathrm{d}\mu}$,
$$\int h\,\mathrm{d}\nu=\int f\log(f)\,\mathrm{d}\mu+\int\log\Bigl(\frac{\mathrm{e}^h}{f}\Bigr)f\,\mathrm{d}\mu\leq\int f\log(f)\,\mathrm{d}\mu+\log\int\mathrm{e}^h\,\mathrm{d}\mu=\int f\log(f)\,\mathrm{d}\mu=H(\nu\mid\mu),$$ and equality is achieved for $h=\log f$. It remains to reparametrize with $h=g-\log\int\mathrm{e}^g\mathrm{d}\mu$. Conversely, the Legendre transform of the relative entropy is the log-Laplace transform :
$$\sup_\nu\Bigl\{\int g\,\mathrm{d}\nu-H(\nu\mid\mu)\Bigr\}=\log\int\mathrm{e}^g\,\mathrm{d}\mu.$$ This is an instance of convex duality for the convex functional $\nu\mapsto H(\nu\mid\mu)$.
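Numerically, the supremum in $\sup_\nu\{\int g\,\mathrm{d}\nu-H(\nu\mid\mu)\}=\log\int\mathrm{e}^g\,\mathrm{d}\mu$ is attained at the exponentially tilted measure $\nu\propto\mathrm{e}^g\mu$. Here is a Python sketch checking this on a small finite set; all the concrete choices are ours.

```python
# Convex duality check: sup_nu { int g dnu - H(nu | mu) } = log int e^g dmu.
import numpy as np

rng = np.random.default_rng(4)
mu = rng.random(5); mu /= mu.sum()               # a reference probability measure
g = rng.standard_normal(5)                       # an arbitrary function g

def objective(nu):
    return np.sum(g * nu) - np.sum(nu * np.log(nu / mu))   # int g dnu - H(nu | mu)

log_laplace = np.log(np.sum(np.exp(g) * mu))     # log int e^g dmu
nu_star = np.exp(g) * mu / np.sum(np.exp(g) * mu)  # the tilted (optimal) measure

print(objective(nu_star), log_laplace)           # equal at the optimum
for _ in range(3):
    nu = rng.random(5); nu /= nu.sum()           # random candidate measures
    print(objective(nu) <= log_laplace + 1e-12)  # always True
```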

Same story for $-S$, which is convex as a function of the Lebesgue density of its argument.

Heat equation. The heat equation $\partial_tf_t=\Delta f_t$ is the gradient flow of the entropy :
$$\partial_t\int f_t\log(f_t)\,\mathrm{d}x=-\int\frac{|\nabla f_t|^2}{f_t}\,\mathrm{d}x,$$ where we used an integration by parts; the right-hand side is minus the Fisher information. In other words, the entropy is a Lyapunov function for the heat equation, seen as an infinite dimensional ODE.
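Here is a rough Python sketch on a periodic one-dimensional grid with an explicit Euler scheme (all discretization choices are ours): along the flow, $\int f_t\log(f_t)\,\mathrm{d}x$ decreases, and its time derivative is close to minus the discrete Fisher information.

```python
# Heat equation on a periodic grid: entropy decay and Fisher information.
import numpy as np

m, L = 200, 1.0
dx = L / m
x = np.arange(m) * dx
f = 1.0 + 0.5 * np.sin(2 * np.pi * x)            # a positive initial density
dt = 0.2 * dx**2                                 # stability condition for explicit Euler

def laplacian(f):
    return (np.roll(f, 1) - 2 * f + np.roll(f, -1)) / dx**2

for step in range(5):
    ent = np.sum(f * np.log(f)) * dx             # int f log f dx
    grad = (np.roll(f, -1) - np.roll(f, 1)) / (2 * dx)
    fisher = np.sum(grad**2 / f) * dx            # int |grad f|^2 / f dx
    f_next = f + dt * laplacian(f)               # one explicit Euler step of df/dt = Laplacian f
    ent_next = np.sum(f_next * np.log(f_next)) * dx
    print((ent_next - ent) / dt, -fisher)        # discrete time derivative vs minus Fisher information
    f = f_next
```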

Further reading.
