About the Hellinger distance

This tiny post is devoted to the Hellinger distance and affinity.

Hellinger. Let \( {\mu} \) and \( {\nu} \) be probability measures with respective densities \( {f} \) and \( {g} \) with respect to the Lebesgue measure \( {\lambda} \) on \( {\mathbb{R}^d} \). Their Hellinger distance is

\[ \mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}. \]

This is well defined since \( {\sqrt{f}} \) and \( {\sqrt{g}} \) belong to \( {\mathrm{L}^2(\lambda)} \). The Hellinger affinity is

\[ \mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2\mathrm{A}(\mu,\nu). \]

This gives \( {\mathrm{H}(\mu,\nu)^2\in[0,2]} \), \( {\mathrm{A}(\mu,\nu)\in[0,1]} \), and the tensor product formula

\[ \mathrm{H}(\mu^{\otimes n},\nu^{\otimes n})^2 =2-2\mathrm{A}(\mu^{\otimes n},\nu^{\otimes n}) =2-2\mathrm{A}(\mu,\nu)^n =2-2\left(1-\frac{\mathrm{H}(\mu,\nu)^2}{2}\right)^n. \]

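Note that \( {\mathrm{H}(\mu,\nu)^2=2} \) iff \( {fg=0} \) \( {\lambda} \)-almost everywhere, in other words iff \( {\mu} \) and \( {\nu} \) are mutually singular (informally, have disjoint supports).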

Note that if \( {\mu\neq\nu} \) then \( {\mathrm{A}(\mu,\nu)<1} \), and the tensor product formula gives \( {\lim_{n\rightarrow\infty}\mathrm{H}(\mu^{\otimes n},\nu^{\otimes n})^2=2} \), a high dimensional phenomenon.

We could also take the following polarized definition

\[ \mathrm{Hellinger}^2(\mu,\nu)=2-2\int\sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}\nu}}\mathrm{d}\nu=2-2\int\sqrt{\frac{\mathrm{d}\nu}{\mathrm{d}\mu}}\mathrm{d}\mu \]

which reveals that the definition does not depend on the choice of the reference measure. This also shows that the squared Hellinger distance is a \( {\Phi} \)-entropy, namely

\[ \mathrm{Hellinger}^2(\mu,\nu) =\int\Phi(f)\mathrm{d}\mu-\Phi\Bigr(\int f\mathrm{d}\mu\Bigr) \]

where \( {\Phi:=-2\sqrt{\bullet}} \) and \( {f:=\mathrm{d}\nu/\mathrm{d}\mu} \).

The notions of Hellinger distance and affinity pass to discrete distributions by replacing the Lebesgue measure \( {\lambda} \) by the counting measure. The Hellinger distance is a special case of the \( {\mathrm{L}^p} \) version \( {\Vert f^{1/p}-g^{1/p}\Vert_{\mathrm{L}^p(\lambda)}} \) available for arbitrary \( {p\geq1} \). This is useful in asymptotic statistics, and we refer to the textbooks listed below.
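In the discrete setting the identities above are easy to check numerically. Here is a minimal Python sketch, in which the function names `affinity`, `hellinger_sq`, `tensor` and the two probability vectors are illustrative choices and not part of any library. It checks the identity between the squared distance and the affinity, and the tensor product formula on a small example.

```python
import math

def affinity(p, q):
    """Hellinger affinity sum_i sqrt(p_i q_i) of two discrete laws."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def tensor(p, n):
    """Probability vector of the n-fold product law of p."""
    out = [1.0]
    for _ in range(n):
        out = [w * a for w in out for a in p]
    return out

# Two arbitrary laws on a three point space (illustrative values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# Identity H^2 = 2 - 2A.
assert math.isclose(hellinger_sq(p, q), 2 - 2 * affinity(p, q))

# Tensor product formula: the affinity of n-fold products is A^n.
n = 4
assert math.isclose(affinity(tensor(p, n), tensor(q, n)), affinity(p, q) ** n)

# The squared distance between n-fold products increases towards 2.
for k in (1, 10, 100):
    print(k, 2 - 2 * affinity(p, q) ** k)
```

The product law is built explicitly only to test the formula; in practice one would use the closed form, since the product vector has length \( {3^n} \) here.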

Relation to total variation distance. The Hellinger distance is equivalent topologically and close metrically to the total variation distance, in the sense that

\[ \mathrm{H}^2(\mu,\nu) \leq2\left\Vert\mu-\nu\right\Vert_{\mathrm{TV}} \leq\mathrm{H}(\mu,\nu)\sqrt{4-\mathrm{H}(\mu,\nu)^2} \leq2\mathrm{H}(\mu,\nu) \]

where

\[ \left\Vert\mu-\nu\right\Vert_{\mathrm{TV}} =\sup_A|\mu(A)-\nu(A)| =\frac{1}{2}\int|f-g|\mathrm{d}\lambda. \]

Indeed, the first inequality comes from the following elementary observation

\[ (\sqrt{a}-\sqrt{b})^2 =a+b-2\sqrt{ab} \leq a+b-2(a\wedge b) =|a-b|, \]

valid for all \( {a,b\geq0} \), while the second inequality comes from

\[ |a-b|=|\sqrt{a}^2-\sqrt{b}^2|=|\sqrt{a}-\sqrt{b}|(\sqrt{a}+\sqrt{b}) \]

which gives, thanks to the Cauchy-Schwarz inequality,

\[ \int|f-g|\mathrm{d}\lambda \leq\mathrm{H}(\mu,\nu)\sqrt{\int(\sqrt{f}+\sqrt{g})^2\mathrm{d}\lambda} =\mathrm{H}(\mu,\nu)\sqrt{2+2\mathrm{A}(\mu,\nu)}. \]
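The chain of inequalities between the Hellinger and total variation distances can be sanity checked on random discrete laws. The following Python sketch is only an illustration: the helper names (`hellinger`, `total_variation`, `random_law`, `chain_holds`), the seed and the tolerance `eps` are assumptions made for this example.

```python
import math
import random

def hellinger(p, q):
    """Hellinger distance between two discrete laws."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def total_variation(p, q):
    """Total variation distance, equal to half the l^1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def random_law(k, rng):
    """Random probability vector of length k (normalized uniform weights)."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def chain_holds(p, q, eps=1e-12):
    """Check H^2 <= 2 TV <= H sqrt(4 - H^2) <= 2 H up to rounding errors."""
    h, tv = hellinger(p, q), total_variation(p, q)
    upper = h * math.sqrt(max(4 - h ** 2, 0.0))
    return h ** 2 <= 2 * tv + eps and 2 * tv <= upper + eps and upper <= 2 * h + eps

rng = random.Random(0)
assert all(chain_holds(random_law(5, rng), random_law(5, rng)) for _ in range(1000))
```

Such a test does not prove anything, but it catches mistakes in constants, for instance a wrong normalization of the total variation distance.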

Gaussian explicit formula. The Hellinger distance (or affinity) between two Gaussian distributions can be computed explicitly, just like the square Wasserstein distance and the Kullback-Leibler divergence or relative entropy. Namely

\[ \mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigr(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr), \]

equal to \( {1} \) iff \( {(m_1,\sigma_1)=(m_2,\sigma_2)} \). By using the tensor product formula, we have also

\[ \mathrm{A}(\mathcal{N}(m_1,\sigma_1^2)^{\otimes n},\mathcal{N}(m_2,\sigma_2^2)^{\otimes n}) =\Bigr(2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}\Bigr)^{n/2} \exp\Bigr(-n\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr). \]
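The one dimensional Gaussian formula can be compared with a direct numerical integration of \( {\sqrt{fg}} \). The sketch below uses a plain midpoint rule on a truncated interval; the function names, the truncation at twelve standard deviations and the number of steps are assumptions of this example and not a recommended quadrature.

```python
import math

def gaussian_pdf(x, m, s):
    """Density of N(m, s^2) at x."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def affinity_closed_form(m1, s1, m2, s2):
    """Closed form of the Hellinger affinity between N(m1, s1^2) and N(m2, s2^2)."""
    v = s1 ** 2 + s2 ** 2
    return math.sqrt(2 * s1 * s2 / v) * math.exp(-((m1 - m2) ** 2) / (4 * v))

def affinity_numeric(m1, s1, m2, s2, steps=20_000):
    """Midpoint rule for the integral of sqrt(f g) on a truncated interval."""
    lo = min(m1 - 12 * s1, m2 - 12 * s2)
    hi = max(m1 + 12 * s1, m2 + 12 * s2)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += math.sqrt(gaussian_pdf(x, m1, s1) * gaussian_pdf(x, m2, s2))
    return total * h

a = affinity_closed_form(0.0, 1.0, 1.0, 2.0)
assert math.isclose(a, affinity_numeric(0.0, 1.0, 1.0, 2.0), abs_tol=1e-6)

# Affinity between n-fold products, from the tensor product formula.
for n in (1, 10, 100):
    print(n, a ** n)
```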

Here is a general "matrix" formula for Gaussians on \( {\mathbb{R}^d} \), \( {d\geq1} \), with \( {\Delta m=m_2-m_1} \),

\[ \mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigr(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr), \]

see for instance Pardo's book, page 51, for a computation.
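As a consistency check, when both covariance matrices are diagonal the matrix formula must factorize into a product of one dimensional affinities, by the tensor product property. Here is a small Python sketch for \( {d=2} \), with the determinant and the inverse written by hand to avoid any dependency; the names `det2`, `affinity_1d` and `affinity_2d` are hypothetical helpers introduced for this illustration.

```python
import math

def det2(S):
    """Determinant of a 2x2 matrix given as nested lists."""
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

def affinity_1d(m1, s1, m2, s2):
    """One dimensional formula, with standard deviations s1 and s2."""
    v = s1 ** 2 + s2 ** 2
    return math.sqrt(2 * s1 * s2 / v) * math.exp(-((m1 - m2) ** 2) / (4 * v))

def affinity_2d(m1, S1, m2, S2):
    """Matrix formula specialized to d = 2."""
    T = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]  # S1 + S2
    half = [[T[i][j] / 2 for j in range(2)] for i in range(2)]       # (S1 + S2) / 2
    d = (m2[0] - m1[0], m2[1] - m1[1])
    # Quadratic form <d, T^{-1} d>, using the explicit inverse of a 2x2 matrix.
    quad = (T[1][1] * d[0] ** 2 - (T[0][1] + T[1][0]) * d[0] * d[1]
            + T[0][0] * d[1] ** 2) / det2(T)
    # det(S1 S2) = det(S1) det(S2).
    prefactor = (det2(S1) * det2(S2)) ** 0.25 / math.sqrt(det2(half))
    return prefactor * math.exp(-quad / 4)

# Diagonal covariances: variances (1, 4) and (4, 9), i.e. standard
# deviations (1, 2) and (2, 3) coordinate by coordinate.
S1 = [[1.0, 0.0], [0.0, 4.0]]
S2 = [[4.0, 0.0], [0.0, 9.0]]
lhs = affinity_2d((0.0, 0.0), S1, (1.0, -1.0), S2)
rhs = affinity_1d(0.0, 1.0, 1.0, 2.0) * affinity_1d(0.0, 2.0, -1.0, 3.0)
assert math.isclose(lhs, rhs)
```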

The Hellinger affinity is also known as the Bhattacharyya coefficient, and enters the definition of the Bhattacharyya distance \( {(\mu,\nu)\mapsto-\log\mathrm{A}(\mu,\nu)} \).

Application to long time behavior of Ornstein-Uhlenbeck. Let \( {{(B_t)}_{t\geq0}} \) be an \( {n} \)-dimensional standard Brownian motion and let \( {{(X^x_t)}_{t\geq0}} \) be the Ornstein-Uhlenbeck process solution of the stochastic differential equation

\[ X_0=x,\quad \mathrm{d}X^x_t=\sqrt{2}\mathrm{d}B_t-X^x_t\mathrm{d}t \]

where \( {x\in\mathbb{R}^n} \). By plugging this equation into the identity \( {\mathrm{d}(\mathrm{e}^tX^x_t)=\mathrm{e}^t\mathrm{d}X^x_t+\mathrm{e}^tX^x_t\mathrm{d}t} \) we get the Mehler formula (the variance comes from the Wiener integral)

\[ X^x_t=x\mathrm{e}^{-t}+\sqrt{2}\int_0^t\mathrm{e}^{s-t}\mathrm{d}B_s \sim \mathcal{N}(x\mathrm{e}^{-t},(1-\mathrm{e}^{-2t})I_n) \underset{t\rightarrow\infty}{\longrightarrow} \mathcal{N}(0,I_n). \]

It follows in particular that for all \( {x,y\in\mathbb{R}^n} \) and \( {t>0} \)

\[ \frac{1}{2}\mathrm{H}^2(\mathrm{Law}(X^x_t),\mathrm{Law}(X^y_t)) =1-\exp\Bigr(-\frac{|x-y|^2\mathrm{e}^{-2t}}{8(1-\mathrm{e}^{-2t})}\Bigr). \]

Moreover, denoting \( {\mu_t=\mathrm{Law}(X^x_t)} \) and \( {\mu_\infty=\mathcal{N}(0,I_n)} \), it follows that

\[ \mathrm{H}(\mu_t,\mu_\infty)^2 =2-2\Bigr(2\frac{\sqrt{1-\mathrm{e}^{-2t}}}{2-\mathrm{e}^{-2t}}\Bigr)^{n/2} \exp\Bigr(-\frac{|x|^2\mathrm{e}^{-2t}}{4(2-\mathrm{e}^{-2t})}\Bigr). \]

This quantity tends to \( {0} \) as \( {t\rightarrow\infty} \). If \( {|x|^2=x_1^2+\cdots+x_n^2\sim cn} \) then, when \( {n} \) is large, the transition from values close to \( {2} \) to values close to \( {0} \) takes place near the critical time \( {t=\frac{1}{2}\log(n)} \), for which \( {\mathrm{e}^{-2t}=1/n} \). More information about cutoff phenomena for Ornstein-Uhlenbeck processes and diffusions is available in the papers listed below.
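The transition near \( {t=\frac{1}{2}\log(n)} \) can be observed numerically. The Python sketch below evaluates \( {\mathrm{H}(\mu_t,\mu_\infty)^2} \) by plugging \( {\Sigma_1=(1-\mathrm{e}^{-2t})I_n} \), \( {\Sigma_2=I_n} \), \( {m_1=x\mathrm{e}^{-t}} \), \( {m_2=0} \) into the matrix formula for Gaussians, so that the determinant prefactor carries the power \( {n/2} \). The function name `hellinger_sq_ou`, the choice \( {|x|^2=cn} \) with \( {c=1} \), and the time shifts are assumptions made for the illustration.

```python
import math

def hellinger_sq_ou(t, n, x_norm_sq):
    """Squared Hellinger distance between N(x e^{-t}, (1 - e^{-2t}) I_n) and N(0, I_n).

    Only |x|^2 matters, passed as x_norm_sq. Requires t > 0.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    e = math.exp(-2 * t)
    # log of (2 sqrt(1 - e) / (2 - e))^{n/2}, written with log1p for accuracy.
    log_prefactor = (n / 2) * (0.5 * math.log1p(-e) - math.log1p(-e / 2))
    log_exponential = -x_norm_sq * e / (4 * (2 - e))
    return 2 - 2 * math.exp(log_prefactor + log_exponential)

c = 1.0
for n in (10 ** 3, 10 ** 6, 10 ** 9):
    t_c = 0.5 * math.log(n)  # critical time, where e^{-2t} = 1/n
    row = [round(hellinger_sq_ou(t_c + s, n, c * n), 4) for s in (-2, -1, 0, 1, 2)]
    print(n, row)
```

For each \( {n} \) the printed row goes from values near \( {2} \) to values near \( {0} \) as the shift around the critical time moves from \( {-2} \) to \( {2} \), while the critical time itself grows like \( {\frac{1}{2}\log(n)} \).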

Further reading

David Pollard
A user's guide to measure theoretic probability
Cambridge University Press (2002)
Béatrice Lachaud
Cut-off and hitting times of a sample of Ornstein-Uhlenbeck processes and its average
Journal of Applied Probability 42(4) 1069-1080 (2005)
Laurent Saloff-Coste
Precise estimates on the rate at which certain diffusions tend to equilibrium
Mathematische Zeitschrift 217(4) 641-677 (1994)
Aad van der Vaart
Asymptotic statistics
Cambridge University Press (1998)
Ildar Abdullovich Ibragimov and Rafail Zalmanovich Khasminskii
Statistical estimation: asymptotic theory
Springer (1981)
Leandro Pardo
Statistical Inference Based on Divergence Measures
Chapman & Hall (2006)
Djalil Chafaï and Florent Malrieu
Recueil de modèles aléatoires
Springer (2015)
