
Why and how does entropy emerge in basic mathematics? This tiny post aims to provide some answers. We already tried something in this spirit in a previous post, almost ten years ago.
Combinatorics. Asymptotic analysis of the multinomial coefficient $\binom{n}{n_1,\dots,n_r}:=\frac{n!}{n_1!\cdots n_r!}$:
$$\frac{1}{n}\log\binom{n}{n_1,\dots,n_r}\xrightarrow[\substack{\nu_i:=\frac{n_i}{n}\to p_i\\ n=n_1+\cdots+n_r\to\infty}]{}S(p):=-\sum_{i=1}^r p_i\log(p_i).$$ Recall that if $A=\{a_1,\dots,a_r\}$ is a finite set of cardinality $r$ and $n=n_1+\cdots+n_r$, then
$$\mathrm{Card}\Bigl\{(x_1,\dots,x_n)\in A^n:\forall\,1\leq i\leq r,\ \sum_{k=1}^n\mathbf{1}_{x_k=a_i}=n_i\Bigr\}=\binom{n}{n_1,\dots,n_r}.$$ The multinomial coefficient can be interpreted as the number of microstates $(x_1,\dots,x_n)$ compatible with the macrostate $(n_1,\dots,n_r)$, while the quantity $S(p)$ appears as a normalized asymptotic measure of additive degrees of freedom or disorder. This is already in the work of Ludwig Eduard Boltzmann (1844 -- 1906) in kinetic gas theory, at the origins of statistical physics. The quantity $S(p)$ is also the one used by Claude Elwood Shannon (1916 -- 2001) in information and communication theory as the average length of an optimal lossless coding. It is characterized by the following three natural axioms or properties, denoting $S^{(n)}$ to keep track of $n$ (a numerical illustration of the asymptotics follows the axioms):
- for all $n\geq 1$, $p\mapsto S^{(n)}(p)$ is continuous;
- for all $n\geq 1$, $S^{(n)}\bigl(\tfrac{1}{n},\dots,\tfrac{1}{n}\bigr)<S^{(n+1)}\bigl(\tfrac{1}{n+1},\dots,\tfrac{1}{n+1}\bigr)$;
- for all $n=n_1+\cdots+n_r\geq 1$, $S^{(n)}\bigl(\tfrac{1}{n},\dots,\tfrac{1}{n}\bigr)=S^{(r)}\bigl(\tfrac{n_1}{n},\dots,\tfrac{n_r}{n}\bigr)+\sum_{i=1}^r\tfrac{n_i}{n}S^{(n_i)}\bigl(\tfrac{1}{n_i},\dots,\tfrac{1}{n_i}\bigr)$.
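As a small numerical illustration of the asymptotics above (a sketch not present in the original post, assuming NumPy and SciPy are available), one can compare $\frac{1}{n}\log\binom{n}{n_1,\dots,n_r}$ with $S(p)$ for $n_i\approx np_i$:

```python
# Sketch: the normalized log of the multinomial coefficient approaches S(p)
# as n grows, when the counts n_i are taken close to n p_i.
import numpy as np
from scipy.special import gammaln

def log_multinomial(counts):
    """log of n!/(n_1! ... n_r!) via log-Gamma, to avoid overflow."""
    counts = np.asarray(counts)
    return gammaln(counts.sum() + 1) - gammaln(counts + 1).sum()

def entropy(p):
    """Boltzmann-Shannon entropy S(p) = -sum p_i log p_i (natural log)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

p = np.array([0.5, 0.3, 0.2])
for n in (10**2, 10**4, 10**6):
    counts = np.round(n * p).astype(int)   # n_i close to n p_i
    print(n, log_multinomial(counts) / counts.sum(), entropy(counts / counts.sum()))
# The last two columns get closer as n grows, as predicted by Stirling's formula.
```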
Probability. If $X_1,\dots,X_n$ are independent and identically distributed random variables of law $\mu$ on a finite set or alphabet $A=\{a_1,\dots,a_r\}$, then for all $x_1,\dots,x_n\in A$,
$$\mathbb{P}((X_1,\dots,X_n)=(x_1,\dots,x_n))=\prod_{i=1}^r\mu_i^{\sum_{k=1}^n\mathbf{1}_{x_k=a_i}}=\prod_{i=1}^r\mu_i^{n\nu_i}=e^{n\sum_{i=1}^r\nu_i\log\mu_i}=e^{-n(S(\nu)+H(\nu\mid\mu))},$$ where $\nu$ denotes the empirical distribution of $x_1,\dots,x_n$, a remarkable identity in which $S(\nu)$ is the Boltzmann-Shannon entropy considered before, and $H(\nu\mid\mu)$ is a new quantity known as the Kullback-Leibler divergence or relative entropy:
$$S(\nu):=-\sum_{i=1}^r\nu_i\log\nu_i=-\int f(x)\log f(x)\,dx$$ where $f$ is the density of $\nu$ with respect to the counting measure $dx$, and
$$H(\nu\mid\mu):=\sum_{i=1}^r\nu_i\log\frac{\nu_i}{\mu_i}=\sum_{i=1}^r\frac{\nu_i}{\mu_i}\log\Bigl(\frac{\nu_i}{\mu_i}\Bigr)\mu_i=\int\frac{d\nu}{d\mu}\log\frac{d\nu}{d\mu}\,d\mu.$$ This comes from information theory, after Solomon Kullback (1907 -- 1994) and Richard Leibler (1914 -- 2003). Here $S(\nu)$ measures the combinatorics on $x_1,\dots,x_n$ at prescribed frequencies $\nu$, while $H(\nu\mid\mu)$ measures the cost or energy of deviation from the actual distribution $\mu$. This is a Boltzmann--Gibbsfication of the probability $\mathbb{P}((X_1,\dots,X_n)=(x_1,\dots,x_n))$, see below, leading via the Laplace method to the large deviations principle of Ivan Nikolaevich Sanov (1919 -- 1968). The Jensen inequality for the strictly convex function $u\mapsto u\log(u)$ gives
$$H(\nu\mid\mu)\geq 0\quad\text{with equality if and only if}\quad\nu=\mu.$$
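Here is a minimal sketch (my own illustration, assuming NumPy) checking the identity $\mathbb{P}((X_1,\dots,X_n)=(x_1,\dots,x_n))=e^{-n(S(\nu)+H(\nu\mid\mu))}$ on a simulated sample, with $\nu$ the empirical frequencies of the sample:

```python
# Sketch: verify P((X_1,...,X_n)=(x_1,...,x_n)) = exp(-n (S(nu) + H(nu|mu)))
# on a random sample, where nu is the empirical distribution of the sample.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.3])              # law of each X_k on A = {a_1, a_2, a_3}
n = 50
x = rng.choice(len(mu), size=n, p=mu)       # a sample x_1, ..., x_n
nu = np.bincount(x, minlength=len(mu)) / n  # empirical frequencies nu_i

S = -np.sum(nu[nu > 0] * np.log(nu[nu > 0]))              # Boltzmann-Shannon entropy S(nu)
H = np.sum(nu[nu > 0] * np.log(nu[nu > 0] / mu[nu > 0]))  # relative entropy H(nu | mu)

log_prob = np.sum(np.log(mu[x]))            # log P((X_1,...,X_n) = (x_1,...,x_n))
print(log_prob, -n * (S + H))               # the two numbers coincide up to rounding
```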
Statistics. If $Y_1,\dots,Y_n$ are independent and identically distributed random variables of law $\mu(\theta)$ in a parametric family parametrized by $\theta$, on a finite set $A$, then, following Ronald Aylmer Fisher (1890 -- 1962), the likelihood of the data $(x_1,\dots,x_n)\in A^n$ is
$$\ell_{x_1,\dots,x_n}(\theta):=\mathbb{P}(Y_1=x_1,\dots,Y_n=x_n)=\prod_{i=1}^n\mu(\theta)_{x_i}.$$ It can also be seen as the likelihood of $\theta$ with respect to $x_1,\dots,x_n$. This dual point of view leads to the following: if $X_1,\dots,X_n$ is an observed sample of $\mu(\theta^*)$ with $\theta^*$ unknown, then the maximum likelihood estimator of $\theta^*$ is
$$\widehat{\theta}_n:=\arg\max_{\theta\in\Theta}\ell_{X_1,\dots,X_n}(\theta)=\arg\max_{\theta\in\Theta}\Bigl(\frac{1}{n}\log\ell_{X_1,\dots,X_n}(\theta)\Bigr).$$ The asymptotic analysis via the law of large numbers reveals the entropy as an asymptotic contrast:
$$\frac{1}{n}\log\ell_{X_1,\dots,X_n}(\theta)=\frac{1}{n}\sum_{i=1}^n\log\mu(\theta)_{X_i}\xrightarrow[n\to\infty]{\text{a.s.}}\sum_{k=1}^r\mu(\theta^*)_k\log\mu(\theta)_k=-\underbrace{S(\mu(\theta^*))}_{\text{const.}}-H(\mu(\theta^*)\mid\mu(\theta)).$$
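A quick simulation (a sketch with a hypothetical toy parametric family, assuming NumPy) makes the limit visible: the normalized log-likelihood at $\theta$ approaches $-S(\mu(\theta^*))-H(\mu(\theta^*)\mid\mu(\theta))$ as $n$ grows:

```python
# Sketch: law of large numbers for the normalized log-likelihood, on a toy
# (hypothetical) parametric family of laws on a 3-letter alphabet.
import numpy as np

def mu(theta):
    """Toy family: theta on the first letter, the rest split evenly."""
    return np.array([theta, (1 - theta) / 2, (1 - theta) / 2])

rng = np.random.default_rng(1)
theta_star, theta = 0.4, 0.6
n = 10**6
X = rng.choice(3, size=n, p=mu(theta_star))   # observed sample of mu(theta*)

loglik = np.mean(np.log(mu(theta)[X]))        # (1/n) log l_{X_1,...,X_n}(theta)

p_star, p = mu(theta_star), mu(theta)
S = -np.sum(p_star * np.log(p_star))          # S(mu(theta*))
H = np.sum(p_star * np.log(p_star / p))       # H(mu(theta*) | mu(theta))
print(loglik, -S - H)                         # close for large n
```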
Analysis. The entropy appears naturally as a derivative in $p$ of the $L^p$ norm of $f\geq 0$, as follows:
$$\partial_p\|f\|_p^p=\partial_p\int f^p\,d\mu=\partial_p\int e^{p\log(f)}\,d\mu=\int f^p\log(f)\,d\mu=\frac{1}{p}\int f^p\log(f^p)\,d\mu.$$ This is at the heart of the theorem of Leonard Gross (1931 -- ) relating the hypercontractivity of Markov semigroups to the logarithmic Sobolev inequality for the invariant measure. This can also be used to extract from the convolution inequalities of William Henry Young (1863 -- 1942) certain entropic uncertainty principles.
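A finite-difference check of this identity (a sketch, assuming NumPy; the finite set, the weights, and the value of $p$ are arbitrary choices):

```python
# Sketch: compare a finite-difference approximation of d/dp ||f||_p^p
# with the exact expression int f^p log(f) dmu, on a 10-point set.
import numpy as np

rng = np.random.default_rng(2)
f = rng.uniform(0.5, 2.0, size=10)       # a positive function on a 10-point set
mu = np.full(10, 1.0)                    # counting measure: weight 1 at each point

def norm_p_p(p):
    return np.sum(f**p * mu)             # ||f||_p^p = int f^p dmu

p, eps = 3.0, 1e-6
derivative_fd = (norm_p_p(p + eps) - norm_p_p(p - eps)) / (2 * eps)
derivative_exact = np.sum(f**p * np.log(f) * mu)
print(derivative_fd, derivative_exact)   # the two values agree up to O(eps^2)
```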
Boltzmann-Gibbs measures, variational characterizations, and Helmholtz free energy. We take $V:A\to\mathbb{R}$, interpreted as an energy. Maximizing $\mu\mapsto S(\mu)$ under the average energy constraint $\int V\,d\mu=v$ gives the maximizer $$\mu_\beta:=\frac{1}{Z_\beta}e^{-\beta V}dx\quad\text{where}\quad Z_\beta:=\int e^{-\beta V}dx.$$ We use integrals instead of sums to lighten the notation; the notation $dx$ stands for the counting measure on $A$. The parameter $\beta>0$, interpreted as an inverse temperature, is dictated by $v$. Such a probability distribution $\mu_\beta$ is known as a Boltzmann-Gibbs distribution, after Ludwig Eduard Boltzmann (1844 -- 1906) and Josiah Willard Gibbs (1839 -- 1903). We have a variational characterization as a maximum entropy at fixed average energy:
$$\int V\,d\mu=\int V\,d\mu_\beta\ \Rightarrow\ S(\mu_\beta)-S(\mu)=H(\mu\mid\mu_\beta).$$ There is a dual point of view in which, instead of fixing the average energy, we fix the inverse temperature $\beta$ and introduce the Hermann von Helmholtz (1821 -- 1894) free energy
$$F(\mu):=\int V\,d\mu-\frac{1}{\beta}S(\mu).$$ This can be seen as a Joseph-Louis Lagrange (1736 -- 1813) point of view, in which the constraint is added to the functional. We have
$$F(\mu_\beta)=-\frac{1}{\beta}\log(Z_\beta)\quad\text{since}\quad S(\mu_\beta)=\beta\int V\,d\mu_\beta+\log Z_\beta.$$ We then have a new variational characterization, as a minimum free energy at fixed temperature:
$$F(\mu)-F(\mu_\beta)=\frac{1}{\beta}H(\mu\mid\mu_\beta).$$ This explains why $H$ is often called a free energy.
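The two identities above can be verified numerically on a small finite set; here is a sketch (my own illustration, assuming NumPy; the energy $V$ and the test measure are arbitrary):

```python
# Sketch: on a finite set, check F(mu) - F(mu_beta) = (1/beta) H(mu | mu_beta)
# and F(mu_beta) = -(1/beta) log(Z_beta) for a Boltzmann-Gibbs measure.
import numpy as np

rng = np.random.default_rng(3)
V = rng.uniform(0.0, 1.0, size=5)            # an energy V : A -> R on a 5-point set
beta = 2.0

Z = np.sum(np.exp(-beta * V))                # partition function Z_beta
mu_beta = np.exp(-beta * V) / Z              # Boltzmann-Gibbs measure

def free_energy(m):
    S = -np.sum(m[m > 0] * np.log(m[m > 0])) # entropy S(m)
    return np.sum(V * m) - S / beta          # F(m) = int V dm - S(m)/beta

m = rng.dirichlet(np.ones(5))                # an arbitrary probability measure on A
H = np.sum(m * np.log(m / mu_beta))          # relative entropy H(m | mu_beta)
print(free_energy(m) - free_energy(mu_beta), H / beta)   # both sides coincide
print(free_energy(mu_beta), -np.log(Z) / beta)           # F(mu_beta) = -(1/beta) log Z_beta
```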
Legendre transform. The relative entropy $\nu\mapsto H(\nu\mid\mu)$ is the Legendre transform of the log-Laplace transform, in the sense that
$$\sup_g\Bigl\{\int g\,d\nu-\log\int e^g\,d\mu\Bigr\}=H(\nu\mid\mu).$$ Indeed, for all $h$ such that $\int e^h\,d\mu=1$, by the Jensen inequality, with $f:=\frac{d\nu}{d\mu}$,
$$\int h\,d\nu=\int f\log(f)\,d\mu+\int\log\Bigl(\frac{e^h}{f}\Bigr)f\,d\mu\leq\int f\log(f)\,d\mu+\log\int e^h\,d\mu=\int f\log(f)\,d\mu=H(\nu\mid\mu),$$ and equality is achieved for $h=\log f$. It remains to reparametrize with $h=g-\log\int e^g\,d\mu$. Conversely, the Legendre transform of the relative entropy is the log-Laplace transform:
$$\sup_\nu\Bigl\{\int g\,d\nu-H(\nu\mid\mu)\Bigr\}=\log\int e^g\,d\mu.$$ This is an instance of convex duality for the convex functional $\nu\mapsto H(\nu\mid\mu)$.
Same story for $-S$, which is convex as a function of the Lebesgue density of its argument.
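A quick numerical check of this duality (a sketch, assuming NumPy): the functional $g\mapsto\int g\,d\nu-\log\int e^g\,d\mu$ attains the value $H(\nu\mid\mu)$ at $g=\log\frac{d\nu}{d\mu}$ and stays below it elsewhere:

```python
# Sketch: the supremum in the Legendre duality is reached at g = log(dnu/dmu)
# (up to additive constants) and equals the relative entropy H(nu | mu).
import numpy as np

rng = np.random.default_rng(4)
mu = rng.dirichlet(np.ones(4))      # a reference law on a 4-point set
nu = rng.dirichlet(np.ones(4))      # another law on the same set

def functional(g):
    return np.sum(g * nu) - np.log(np.sum(np.exp(g) * mu))

H = np.sum(nu * np.log(nu / mu))    # relative entropy H(nu | mu)
print(functional(np.log(nu / mu)), H)   # equality at the optimizer
for _ in range(3):                  # any other g gives a smaller value
    g = rng.normal(size=4)
    print(functional(g) <= H + 1e-12)
```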
Heat equation. The heat equation $\partial_t f_t=\Delta f_t$ is the gradient flow of the entropy: $$\partial_t\int f_t\log(f_t)\,dx=-\int\frac{\|\nabla f_t\|^2}{f_t}\,dx,$$ where we used an integration by parts; the integral on the right-hand side is the Fisher information. In other words, the entropy is a Lyapunov function for the heat equation, seen as an infinite dimensional ODE.
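A discretized illustration (a sketch, assuming NumPy; the grid, time step, and initial density are arbitrary choices): on the unit circle, an explicit Euler scheme for the heat equation shows the quantity $\int f_t\log(f_t)\,dx$ decreasing at a rate close to minus the Fisher information:

```python
# Sketch: heat equation on the discretized unit circle (periodic boundary).
# The time-derivative of int f log(f) dx is close to minus the Fisher information.
import numpy as np

N, dx, dt = 200, 1.0 / 200, 1e-6            # dt is below the stability threshold dx^2/2
x = np.arange(N) * dx
f = 1.0 + 0.5 * np.sin(2 * np.pi * x)       # a positive initial density

def laplacian(u):
    return (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2

def entropy(u):
    return np.sum(u * np.log(u)) * dx       # int u log(u) dx

def fisher(u):
    grad = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    return np.sum(grad**2 / u) * dx         # int |grad u|^2 / u dx

for step in range(5):
    dS_dt = (entropy(f + dt * laplacian(f)) - entropy(f)) / dt
    print(dS_dt, -fisher(f))                # the two quantities are close
    for _ in range(1000):                   # run the explicit Euler scheme forward
        f = f + dt * laplacian(f)
```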
Further reading.
- Boltzmann-Gibbs entropic variational principle, on this blog (2022)
- Entropy ubiquity, on this blog (2015)
- Bosons and fermions, on this blog (2012)