This tiny post is devoted to the Hellinger distance and affinity.

Hellinger. Let ${\mu}$ and ${\nu}$ be probability measures with respective densities ${f}$ and ${g}$ with respect to the Lebesgue measure ${\lambda}$ on ${\mathbb{R}^d}$. Their Hellinger distance is

$\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}.$

This is well defined since ${\sqrt{f}}$ and ${\sqrt{g}}$ belong to ${\mathrm{L}^2(\lambda)}$. The Hellinger affinity is

$\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2A(\mu,\nu).$

This gives ${H(\mu,\nu)^2\in[0,2]}$, ${A(\mu,\nu)\in[0,1]}$, and the tensor product formula

$\mathrm{H}(\mu^{\otimes n},\nu^{\otimes n})^2 =2-2A(\mu^{\otimes n},\nu^{\otimes n}) =2-2A(\mu,\nu)^n =2-2\left(1-\frac{\mathrm{H}(\mu,\nu)^2}{2}\right)^n.$

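Note that ${\mathrm{H}(\mu,\nu)^2=2}$ iff ${\mathrm{A}(\mu,\nu)=0}$, in other words iff ${fg=0}$ almost everywhere, which means that ${\mu}$ and ${\nu}$ are mutually singular (informally, they have disjoint supports).

Note that if ${\mu\neq\nu}$ then ${\mathrm{A}(\mu,\nu)<1}$ and therefore ${\lim_{n\rightarrow\infty}\mathrm{H}(\mu^{\otimes n},\nu^{\otimes n})^2=2}$, a high dimensional phenomenon.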

The notions of Hellinger distance and affinity pass to discrete distributions by replacing the Lebesgue measure ${\lambda}$ by the counting measure. The Hellinger distance above is a special case of the ${\mathrm{L}^p}$ version ${\Vert f^{1/p}-g^{1/p}\Vert_{\mathrm{L}^p(\lambda)}}$ available for arbitrary ${p\geq1}$. This is useful in asymptotic statistics, and we refer to the textbooks listed below.
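As a quick sanity check of the identity ${\mathrm{H}^2=2-2\mathrm{A}}$ and of the tensor product formula, here is a minimal sketch for discrete distributions (counting measure). The function names and the two example distributions are illustrative choices, not taken from any library, and the product measure is enumerated by brute force, which is only feasible for small ${n}$.

```python
import math
from itertools import product


def affinity(p, q):
    """Hellinger affinity A = sum_i sqrt(p_i * q_i) for discrete distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def hellinger_sq(p, q):
    """Squared Hellinger distance H^2 = sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def tensor_power(p, n):
    """Weights of the n-fold product measure, enumerated by brute force."""
    return [math.prod(w) for w in product(p, repeat=n)]


# Two arbitrary distributions on a three-point space (illustrative values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
n = 4

a = affinity(p, q)
h2_direct = hellinger_sq(p, q)
h2_product = hellinger_sq(tensor_power(p, n), tensor_power(q, n))
h2_formula = 2 - 2 * a ** n

print(h2_direct, 2 - 2 * a)     # the two values should agree
print(h2_product, h2_formula)   # the two values should agree
```

Increasing `n` in this sketch shows the squared distance of the product measures creeping up towards ${2}$.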

Relation to total variation distance. The Hellinger distance is equivalent topologically and close metrically to the total variation distance, in the sense that

$\mathrm{H}^2(\mu,\nu) \leq\left\Vert\mu-\nu\right\Vert_{\mathrm{TV}} \leq\mathrm{H}(\mu,\nu)\sqrt{4-\mathrm{H}(\mu,\nu)^2} \leq2\mathrm{H}(\mu,\nu)$

where

$\left\Vert\mu-\nu\right\Vert_{\mathrm{TV}} =2\sup_A|\mu(A)-\nu(A)| =\int|f-g|\mathrm{d}\lambda.$

Indeed, the first inequality comes from the following elementary observation

$(\sqrt{a}-\sqrt{b})^2 =a+b-2\sqrt{ab} \leq a+b-2(a\wedge b) =|a-b|,$

valid for all ${a,b\geq0}$, while the second inequality comes from

$|a-b|=|\sqrt{a}^2-\sqrt{b}^2|=|\sqrt{a}-\sqrt{b}|(\sqrt{a}+\sqrt{b})$

which gives, thanks to the Cauchy-Schwarz inequality,

$\int|f-g|\mathrm{d}\lambda \leq\mathrm{H}(\mu,\nu)\sqrt{\int(\sqrt{f}+\sqrt{g})^2\mathrm{d}\lambda} =\mathrm{H}(\mu,\nu)\sqrt{2+2A(\mu,\nu)}.$
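The chain of inequalities can be probed numerically. The following minimal sketch draws random discrete distributions (counting measure in place of ${\lambda}$), takes the total variation to be ${\int|f-g|\mathrm{d}\lambda}$ as in the last display, and counts violations of the three inequalities. The helper names, the seed, and the sample sizes are arbitrary illustrative choices.

```python
import math
import random


def l1_distance(p, q):
    """Discrete analogue of the integral of |f - g|."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hellinger(p, q):
    """Hellinger distance H = (sum_i (sqrt(p_i) - sqrt(q_i))^2)^(1/2)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def random_distribution(k, rng):
    """Random probability vector of length k (normalized uniform weights)."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]


rng = random.Random(0)   # fixed seed, for reproducibility only
tol = 1e-12              # slack for floating point rounding
violations = 0
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    h = hellinger(p, q)
    tv = l1_distance(p, q)
    upper = h * math.sqrt(4 - h * h)
    # H^2 <= TV <= H * sqrt(4 - H^2) <= 2 H
    if not (h * h <= tv + tol and tv <= upper + tol and upper <= 2 * h + tol):
        violations += 1

print("violations:", violations)
```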

Gaussian explicit formula. The Hellinger distance (or affinity) between two Gaussian distributions can be computed explicitly, just like the squared Wasserstein distance and the Kullback-Leibler divergence (relative entropy). Namely

$\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigr(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr),$

equal to ${1}$ iff ${(m_1,\sigma_1)=(m_2,\sigma_2)}$. By using the tensor product formula, we have also

$\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2)^{\otimes n},\mathcal{N}(m_2,\sigma_2^2)^{\otimes n}) =\Bigr(2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}\Bigr)^{n/2} \exp\Bigr(-n\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr).$

Here is a general “matrix” formula for Gaussians on ${\mathbb{R}^d}$, ${d\geq1}$, with ${\Delta m=m_2-m_1}$,

$\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigr(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr).$

The Hellinger affinity is also known as the Bhattacharyya coefficient, and enters the definition of the Bhattacharyya distance ${(\mu,\nu)\mapsto-\log\mathrm{A}(\mu,\nu)}$.
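The one-dimensional Gaussian formula can be compared with a direct numerical evaluation of ${\int\sqrt{fg}\,\mathrm{d}\lambda}$. The sketch below uses a plain midpoint rule on a truncated interval. The truncation at twelve standard deviations, the number of steps, and the function names are ad hoc assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, m, s):
    """Density of N(m, s^2) at x, with s the standard deviation."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))


def affinity_closed_form(m1, s1, m2, s2):
    """Closed-form Hellinger affinity between N(m1, s1^2) and N(m2, s2^2)."""
    v = s1 * s1 + s2 * s2
    return math.sqrt(2 * s1 * s2 / v) * math.exp(-((m1 - m2) ** 2) / (4 * v))


def affinity_numeric(m1, s1, m2, s2, steps=20_000):
    """Midpoint-rule approximation of the integral of sqrt(f * g)."""
    lo = min(m1, m2) - 12 * max(s1, s2)   # tails beyond this are negligible
    hi = max(m1, m2) + 12 * max(s1, s2)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += math.sqrt(gaussian_pdf(x, m1, s1) * gaussian_pdf(x, m2, s2))
    return total * dx


# Illustrative parameters: N(0, 1) against N(1.5, 4).
exact = affinity_closed_form(0.0, 1.0, 1.5, 2.0)
approx = affinity_numeric(0.0, 1.0, 1.5, 2.0)
print(exact, approx)            # the two values should agree closely
print(-math.log(exact))         # corresponding Bhattacharyya distance
```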

Application to long time behavior of Ornstein-Uhlenbeck. Let ${{(B_t)}_{t\geq0}}$ be an ${n}$-dimensional standard Brownian motion and let ${{(X^x_t)}_{t\geq0}}$ be the Ornstein-Uhlenbeck process solution of the stochastic differential equation

$X^x_0=x,\quad \mathrm{d}X^x_t=\sqrt{2}\mathrm{d}B_t-X^x_t\mathrm{d}t$

where ${x\in\mathbb{R}^n}$. By plugging this equation into the identity ${\mathrm{d}(\mathrm{e}^tX^x_t)=\mathrm{e}^t\mathrm{d}X^x_t+\mathrm{e}^tX^x_t\mathrm{d}t}$ we get the Mehler formula (the variance comes from the Wiener integral)

$X^x_t=x\mathrm{e}^{-t}+\sqrt{2}\int_0^t\mathrm{e}^{s-t}\mathrm{d}B_s \sim \mathcal{N}(x\mathrm{e}^{-t},(1-\mathrm{e}^{-2t})I_n) \underset{t\rightarrow\infty}{\longrightarrow} \mathcal{N}(0,I_n).$

It follows in particular, from the Gaussian formula with equal covariances ${(1-\mathrm{e}^{-2t})I_n}$, that for all ${x,y\in\mathbb{R}^n}$ and ${t>0}$

$\frac{1}{2}\mathrm{H}^2(\mathrm{Law}(X^x_t),\mathrm{Law}(X^y_t)) =1-\exp\Bigr(-\frac{|x-y|^2\mathrm{e}^{-2t}}{8(1-\mathrm{e}^{-2t})}\Bigr).$

Moreover, denoting ${\mu_t=\mathrm{Law}(X^x_t)}$ and ${\mu_\infty=\mathcal{N}(0,I_n)}$, it follows that

$\mathrm{H}(\mu_t,\mu_\infty)^2 =2-2\Bigr(2\frac{\sqrt{1-\mathrm{e}^{-2t}}}{2-\mathrm{e}^{-2t}}\Bigr)^{n/2} \exp\Bigr(-\frac{|x|^2\mathrm{e}^{-2t}}{4(2-\mathrm{e}^{-2t})}\Bigr).$

This quantity tends to ${0}$ as ${t\rightarrow\infty}$. If ${|x|^2=x_1^2+\cdots+x_n^2\sim cn}$ then, when ${n}$ is large, the transition from values near ${2}$ to values near ${0}$ occurs around the critical time ${t=\frac{1}{2}\log(n)}$, for which ${\mathrm{e}^{-2t}=1/n}$. More information about cutoff phenomena for Ornstein-Uhlenbeck and diffusions is available in the papers below.
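To visualize the abrupt transition, here is a minimal sketch that evaluates ${\mathrm{H}(\mu_t,\mu_\infty)^2}$ from the matrix formula above with ${\Sigma_1=(1-\mathrm{e}^{-2t})I_n}$, ${\Sigma_2=I_n}$ and ${|x|^2=cn}$, at times proportional to ${\frac{1}{2}\log(n)}$. The function name, the choice ${c=1}$, and the grid of dimensions and time ratios are illustrative assumptions. The affinity is computed through its logarithm to avoid underflow of the ${n/2}$ power.

```python
import math


def hellinger_sq_to_equilibrium(t, n, x_norm_sq):
    """H^2 between N(x e^{-t}, (1 - e^{-2t}) I_n) and N(0, I_n), for t > 0."""
    e = math.exp(-2.0 * t)
    # Determinant factor of the matrix formula: (2 sqrt(1-e) / (2-e))^(n/2).
    log_prefactor = 0.5 * n * math.log(2.0 * math.sqrt(1.0 - e) / (2.0 - e))
    # Mean factor: exp(-|x|^2 e^{-2t} / (4 (2 - e^{-2t}))).
    log_affinity = log_prefactor - x_norm_sq * e / (4.0 * (2.0 - e))
    return 2.0 - 2.0 * math.exp(log_affinity)


c = 1.0  # illustrative constant in |x|^2 = c n
for n in (10**2, 10**4, 10**6):
    t_crit = 0.5 * math.log(n)
    row = [hellinger_sq_to_equilibrium(theta * t_crit, n, c * n)
           for theta in (0.5, 1.0, 1.5)]
    print(n, ["%.4f" % h2 for h2 in row])

# Values used to illustrate the two regimes in the largest dimension.
n_big = 10**6
before = hellinger_sq_to_equilibrium(0.5 * 0.5 * math.log(n_big), n_big, c * n_big)
after = hellinger_sq_to_equilibrium(1.5 * 0.5 * math.log(n_big), n_big, c * n_big)
```

As ${n}$ grows, the value at half the critical time approaches ${2}$ while the value at one and a half times the critical time approaches ${0}$, which is the cutoff picture described above.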