
This tiny post is devoted to a cute reverse Pinsker inequality, extracted from a recent work of my colleague Justin Salez. This reverse Pinsker inequality is at the heart of his investigation of the cutoff phenomenon, in total variation distance, for Markov chains, which we do not discuss here.
Pinsker inequality. If $\mu$ and $\nu$ are probability measures on the same probability space, it reads
$$2\|\mu-\nu\|_{\mathrm{TV}}^2\leq\mathrm{Entropy}(\nu\mid\mu)$$
where
$$\mathrm{Entropy}(\nu\mid\mu):=\int\frac{d\nu}{d\mu}\log\frac{d\nu}{d\mu}\,d\mu$$
is the Kullback-Leibler divergence or relative entropy of $\nu$ with respect to $\mu$, and where
$$\|\mu-\nu\|_{\mathrm{TV}}:=\sup_A|\mu(A)-\nu(A)|$$
is the total variation distance between $\mu$ and $\nu$. By convention $\mathrm{Entropy}(\nu\mid\mu)=+\infty$ if $\frac{d\nu}{d\mu}$ does not exist, and in this case the inequality is useless. The constant $2$ is optimal, as one can check on Bernoulli distributions. The inequality was first proved by Mark S. Pinsker (1925 – 2003) with another constant, and later on and independently by Solomon Kullback (1907 – 1994) and Imre Csiszár (1938 – ). Let us now give a quick proof, due to Johannes H. B. Kemperman (1924 – 2011). Using $\|\mathbf{1}_A-\mathbf{1}_{A^c}\|_\infty\leq1$, the $\mathrm{L}^1$-$\mathrm{L}^\infty$ duality, the convexity inequality
$$3(u-1)^2\leq(2u+4)(u\log u-u+1),\quad u\geq0,$$
and the Cauchy-Schwarz inequality, we get, with $\alpha:=\alpha_{\nu\mid\mu}:=\frac{d\nu}{d\mu}$,
$$2\|\mu-\nu\|_{\mathrm{TV}}\leq\sup_{\|f\|_\infty\leq1}\int f\,d(\mu-\nu)=\int|\alpha-1|\,d\mu\leq\int\sqrt{\frac{2\alpha+4}{3}}\sqrt{\alpha\log\alpha-\alpha+1}\,d\mu\leq\sqrt{\int\frac{2\alpha+4}{3}\,d\mu}\sqrt{\int\alpha\log\alpha\,d\mu}=\sqrt{2\int\alpha\log\alpha\,d\mu}=\sqrt{2\,\mathrm{Entropy}(\nu\mid\mu)},$$
where we used $\int\alpha\,d\mu=\int d\mu=1$. Note that $\|\cdot\|_{\mathrm{TV}}$ takes its values in $[0,1]$ while $\mathrm{Entropy}$ takes its values in $[0,+\infty]$. Indeed if $d\nu=\frac{\mathbf{1}_A}{\mu(A)}d\mu$, then $\|\mu-\nu\|_{\mathrm{TV}}=1-\mu(A)$ while $\mathrm{Entropy}(\nu\mid\mu)=-\log\mu(A)\to+\infty$ as $\mu(A)\to0$. Hence we cannot expect a direct reverse Pinsker inequality.
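As a quick numerical illustration, not part of Kemperman's argument and with arbitrary helper names, here is a Python check of the Pinsker inequality and of the sharpness of the constant $2$ on Bernoulli distributions:

```python
import numpy as np

def tv(p, q):
    # Total variation distance between Bernoulli(p) and Bernoulli(q),
    # with the convention sup_A |mu(A) - nu(A)|.
    return abs(p - q)

def kl(q, p):
    # Relative entropy Entropy(nu|mu) for nu = Bernoulli(q) and mu = Bernoulli(p).
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

for t in [0.1, 0.01, 0.001]:
    p, q = 0.5, 0.5 + t
    lhs, rhs = 2 * tv(p, q) ** 2, kl(q, p)
    print(f"t={t}: 2*TV^2={lhs:.3e} <= KL={rhs:.3e}, ratio={lhs / rhs:.4f}")
# The ratio tends to 1 as t -> 0: the constant 2 cannot be improved.
```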
Varentropy. The relative entropy can be seen as the mean of a logarithm:
$$\mathrm{Entropy}(\nu\mid\mu)=\int\beta_{\nu\mid\mu}\,d\nu\quad\text{where}\quad\beta_{\nu\mid\mu}:=\log\frac{d\nu}{d\mu}.$$
The variance of $\beta_{\nu\mid\mu}$, or varentropy, is
$$\mathrm{Varentropy}(\nu\mid\mu):=\mathrm{Var}_\nu(\beta_{\nu\mid\mu})=\int(\beta_{\nu\mid\mu}-\mathrm{Entropy}(\nu\mid\mu))^2\,d\nu=\int\beta_{\nu\mid\mu}^2\,d\nu-\mathrm{Entropy}(\nu\mid\mu)^2.$$
If $d\nu=\frac{\mathbf{1}_A}{\mu(A)}d\mu$ then $\beta_{\nu\mid\mu}$ is constant on $A=\mathrm{supp}(\nu)$ and $\mathrm{Varentropy}(\nu\mid\mu)=0$.
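To make these definitions concrete, here is a small Python sketch, with made-up finite distributions and a hypothetical helper name, computing $\beta_{\nu\mid\mu}$, the relative entropy and the varentropy, and checking that the varentropy vanishes when $\nu$ is a normalized restriction of $\mu$:

```python
import numpy as np

def entropy_and_varentropy(nu, mu):
    # nu, mu: probability vectors on a finite set, nu absolutely continuous w.r.t. mu.
    support = nu > 0
    beta = np.log(nu[support] / mu[support])          # beta_{nu|mu} = log(dnu/dmu) on supp(nu)
    ent = np.sum(nu[support] * beta)                  # Entropy(nu|mu): mean of beta under nu
    varent = np.sum(nu[support] * (beta - ent) ** 2)  # Varentropy(nu|mu): variance of beta under nu
    return ent, varent

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.5, 0.3])
print(entropy_and_varentropy(nu, mu))

# If nu is mu conditioned on A = {0, 1}, then beta is constant on A:
# the relative entropy is -log(mu(A)) and the varentropy vanishes.
A = np.array([1.0, 1.0, 0.0])
nu_A = A * mu / np.sum(A * mu)
print(entropy_and_varentropy(nu_A, mu))  # approximately (-log(0.8), 0.0)
```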
A reverse Pinsker inequality. It reads
$$\mathrm{Entropy}(\nu\mid\mu)\leq\frac{1+\sqrt{\mathrm{Varentropy}(\nu\mid\mu)}}{1-\|\mu-\nu\|_{\mathrm{TV}}}.$$
In particular, for all $\varepsilon\in(0,1)$,
$$\|\mu-\nu\|_{\mathrm{TV}}\leq\varepsilon\implies\mathrm{Entropy}(\nu\mid\mu)\leq\frac{1+\sqrt{\mathrm{Varentropy}(\nu\mid\mu)}}{1-\varepsilon}.$$
Let us give a proof. Set $\varepsilon:=\|\mu-\nu\|_{\mathrm{TV}}$ and consider the event
$$A:=\Big\{\frac{d\nu}{d\mu}\geq e^\theta\Big\}\quad\text{where}\quad\theta:=\mathrm{Entropy}(\nu\mid\mu)-\frac{\sqrt{\mathrm{Varentropy}(\nu\mid\mu)}}{1-\varepsilon}.$$
The Chebyshev inequality gives
$$\nu(A^c)=\nu(\beta_{\nu\mid\mu}<\theta)=\nu\Big(\beta_{\nu\mid\mu}-\mathrm{Entropy}(\nu\mid\mu)<-\frac{\sqrt{\mathrm{Varentropy}(\nu\mid\mu)}}{1-\varepsilon}\Big)\leq(1-\varepsilon)^2,$$
hence $\nu(A)\geq1-(1-\varepsilon)^2$. On the other hand, the definition of $A$ gives
$$\mu(A)\leq e^{-\theta}\nu(A).$$
These two inequalities together give
$$\nu(A)-\mu(A)\geq(1-(1-\varepsilon)^2)(1-e^{-\theta}),$$
where we may assume that $\theta\geq0$, since otherwise the bound $\theta\leq1/(1-\varepsilon)$ obtained below is trivial. Now since $\|\mu-\nu\|_{\mathrm{TV}}=\varepsilon$, we get $\varepsilon\geq(1-(1-\varepsilon)^2)(1-e^{-\theta})$, in other words, since $1-(1-\varepsilon)^2=\varepsilon(2-\varepsilon)$,
$$\theta\leq\log\Big(1+\frac{1}{1-\varepsilon}\Big).$$
Finally, the concavity inequality $\log(1+u)\leq u$ gives $\theta\leq1/(1-\varepsilon)$, that is, $\mathrm{Entropy}(\nu\mid\mu)\leq\frac{1+\sqrt{\mathrm{Varentropy}(\nu\mid\mu)}}{1-\varepsilon}$, as expected.
If $d\nu=\frac{\mathbf{1}_A}{\mu(A)}d\mu$ then the reverse Pinsker inequality above reads $-\mu(A)\log\mu(A)\leq1$, which indeed holds since $-u\log u\leq e^{-1}$ for all $u\in(0,1]$.
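Here is a quick numerical check of the reverse Pinsker inequality on randomly generated finite distributions, again a sketch and not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    # Random fully supported distributions mu and nu on a 10-point space.
    mu = rng.dirichlet(np.ones(10))
    nu = rng.dirichlet(np.ones(10))
    beta = np.log(nu / mu)                   # beta_{nu|mu}
    ent = np.sum(nu * beta)                  # Entropy(nu|mu)
    varent = np.sum(nu * (beta - ent) ** 2)  # Varentropy(nu|mu)
    tv = 0.5 * np.sum(np.abs(mu - nu))       # sup_A |mu(A) - nu(A)|
    bound = (1 + np.sqrt(varent)) / (1 - tv)
    print(f"Entropy = {ent:.3f} <= (1 + sqrt(Varentropy)) / (1 - TV) = {bound:.3f}")
```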
Fisher information and Poincaré inequality. The logarithmic Fisher information is defined by
$$\mathrm{Fisher}(\nu\mid\mu):=\int|\nabla\beta_{\nu\mid\mu}|^2\,d\nu=\int\frac{|\nabla\frac{d\nu}{d\mu}|^2}{\frac{d\nu}{d\mu}}\,d\mu.$$
Moreover, denoting by $c_{\mathrm{PI}}(\nu)$ the Poincaré constant of $\nu$, we have
$$\mathrm{Varentropy}(\nu\mid\mu)=\mathrm{Var}_\nu(\beta_{\nu\mid\mu})\leq c_{\mathrm{PI}}(\nu)\int|\nabla\beta_{\nu\mid\mu}|^2\,d\nu=c_{\mathrm{PI}}(\nu)\,\mathrm{Fisher}(\nu\mid\mu).$$
Combined with the reverse Pinsker inequality above, this gives
$$\mathrm{Entropy}(\nu\mid\mu)\leq\frac{1+\sqrt{c_{\mathrm{PI}}(\nu)\,\mathrm{Fisher}(\nu\mid\mu)}}{1-\|\mu-\nu\|_{\mathrm{TV}}}.$$
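As an illustration, via a classical Gaussian computation which is not taken from the references below, take $\mu=\mathcal{N}(0,1)$ and $\nu=\mathcal{N}(m,1)$ on $\mathbb{R}$. Then $\beta_{\nu\mid\mu}(x)=mx-\frac{m^2}{2}$, hence
$$\mathrm{Entropy}(\nu\mid\mu)=\frac{m^2}{2},\quad\mathrm{Varentropy}(\nu\mid\mu)=m^2,\quad\mathrm{Fisher}(\nu\mid\mu)=m^2,$$
and since $c_{\mathrm{PI}}(\nu)=1$, the bound $\mathrm{Varentropy}(\nu\mid\mu)\leq c_{\mathrm{PI}}(\nu)\,\mathrm{Fisher}(\nu\mid\mu)$ is in this case an equality.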
Note. There are plenty of alternative or reverse Pinsker inequalities. For instance the Bretagnolle-Huber inequality, used for hypothesis testing in Statistics, reads
$$\|\mu-\nu\|_{\mathrm{TV}}^2\leq1-e^{-\mathrm{Entropy}(\nu\mid\mu)},$$
and has the advantage over the Pinsker inequality of being informative even when $\mathrm{Entropy}(\nu\mid\mu)\geq2$.
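For a concrete comparison, with an arbitrary numerical example, here is a case where the Pinsker bound is uninformative while the Bretagnolle-Huber bound is not:

```python
import numpy as np

# mu = Bernoulli(0.01) and nu = Bernoulli(0.9): the relative entropy exceeds 2.
p, q = 0.01, 0.9
kl = q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))
tv = abs(p - q)
pinsker_bound = np.sqrt(kl / 2)        # from 2 * TV^2 <= Entropy(nu|mu)
bh_bound = np.sqrt(1 - np.exp(-kl))    # Bretagnolle-Huber
print(f"KL = {kl:.2f}, TV = {tv:.2f}, Pinsker bound = {pinsker_bound:.2f}, BH bound = {bh_bound:.2f}")
# KL is about 3.8 > 2: the Pinsker bound on TV exceeds 1, hence says nothing,
# while the Bretagnolle-Huber bound stays (slightly) below 1.
```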
Further reading.
- Johannes H. B. Kemperman
On the optimum rate of transmitting information (in particular pages 2174-2175)
Annals of Mathematical Statistics, 40:2156–2177 (1969)
- Justin Salez
Cutoff for non-negatively curved Markov chains (in particular Lemma 8)
To appear in Journal of the European Mathematical Society (2023)
See also arXiv:2102.05597
