Press "Enter" to skip to content

Month: February 2021

Fisher information between two Gaussians

Ronald A. Fisher (1890 – 1962) in 1951.

Fisher information. The Fisher information or divergence of a positive Borel measure \( {\nu} \) with respect to another one \( {\mu} \) on the same space is

\[ \mathrm{Fisher}(\nu\mid\mu) =\int\left|\nabla\log\textstyle\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\right|^2\mathrm{d}\nu =\int\frac{|\nabla\frac{\mathrm{d}\nu}{\mathrm{d}\mu}|^2}{\frac{\mathrm{d}\nu}{\mathrm{d}\mu}}\mathrm{d}\mu =4\int\left|\nabla\sqrt{\textstyle\frac{\mathrm{d}\nu}{\mathrm{d}\mu}}\right|^2\mathrm{d}\mu \in[0,+\infty] \]

if \( {\nu} \) is absolutely continuous with respect to \( {\mu} \), and \( {\mathrm{Fisher}(\nu\mid\mu)=+\infty} \) otherwise.
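For instance, if \( {\nu} \) and \( {\mu} \) have smooth positive densities \( {f} \) and \( {g} \) with respect to the Lebesgue measure on \( {\mathbb{R}^n} \), then \( {\nabla\log\frac{\mathrm{d}\nu}{\mathrm{d}\mu}=\nabla\log f-\nabla\log g} \) and

\[ \mathrm{Fisher}(\nu\mid\mu) =\int\left|\nabla\log f-\nabla\log g\right|^2f\,\mathrm{d}x. \]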

It plays a role in the analysis and geometry of statistics, information theory, partial differential equations, and Markov diffusion processes. It is named after Ronald Aylmer Fisher (1890 – 1962), a British scientist whose name is also attached to many other objects and concepts.

However, he should not be confused with other scientists who bear the same family name.

Let us denote \( {|x|=\sqrt{x_1^2+\cdots+x_n^2}} \) and \( {x\cdot y=x_1y_1+\cdots+x_ny_n} \) for all \( {x,y\in\mathbb{R}^n} \).

Explicit formula for Gaussians. For all \( {n\geq1} \), all vectors \( {m_1,m_2\in\mathbb{R}^n} \), and all \( {n\times n} \) covariance matrices \( {\Sigma_1} \) and \( {\Sigma_2} \), we have

\[ \mathrm{Fisher}(\mathcal{N}(m_1,\Sigma_1)\mid\mathcal{N}(m_2,\Sigma_2)) =|\Sigma_2^{-1}(m_1-m_2)|^2+\mathrm{Tr}(\Sigma_2^{-2}\Sigma_1-2\Sigma_2^{-1}+\Sigma_1^{-1}). \]
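Here is a minimal numerical sanity check of this formula, a sketch assuming NumPy is available (the dimension, seed, and sample size below are arbitrary choices): it compares the closed form with a Monte Carlo estimate of \( {\int|\nabla\log\frac{\mathrm{d}\nu}{\mathrm{d}\mu}|^2\mathrm{d}\nu} \).

import numpy as np

rng = np.random.default_rng(0)
n = 3
m1, m2 = rng.normal(size=n), rng.normal(size=n)
A1, A2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S1, S2 = A1 @ A1.T + n * np.eye(n), A2 @ A2.T + n * np.eye(n)  # covariance matrices
S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)

# Closed form: |S2^{-1}(m1 - m2)|^2 + Tr(S2^{-2} S1 - 2 S2^{-1} + S1^{-1})
closed = np.sum((S2i @ (m1 - m2)) ** 2) + np.trace(S2i @ S2i @ S1 - 2 * S2i + S1i)

# Monte Carlo: E|grad log(dnu/dmu)(X)|^2 with X ~ N(m1, S1), where
# grad log(dnu/dmu)(x) = S2^{-1}(x - m2) - S1^{-1}(x - m1)
X = rng.multivariate_normal(m1, S1, size=200_000)
G = (X - m2) @ S2i - (X - m1) @ S1i  # S1i and S2i are symmetric
mc = np.mean(np.sum(G ** 2, axis=1))

print(closed, mc)  # the two values agree up to Monte Carlo error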

When \( {\Sigma_1} \) and \( {\Sigma_2} \) commute, this reduces to the following, closer to the univariate case,

\[ \mathrm{Fisher}(\mathcal{N}(m_1,\Sigma_1)\mid\mathcal{N}(m_2,\Sigma_2)) =|\Sigma_2^{-1}(m_1-m_2)|^2+\mathrm{Tr}(\Sigma_2^{-2}(\Sigma_2-\Sigma_1)^2\Sigma_1^{-1}). \]

In the univariate case, this reads, for all \( {m_1,m_2\in\mathbb{R}} \) and \( {\sigma_1^2,\sigma_2^2\in(0,\infty)} \),

\[ \mathrm{Fisher}(\mathcal{N}(m_1,\sigma_1^2)\mid\mathcal{N}(m_2,\sigma_2^2)) =\frac{(m_1-m_2)^2}{\sigma_2^4}+\frac{(\sigma_2^2-\sigma_1^2)^2}{\sigma_1^2\sigma_2^4}. \]
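For instance, equal variances isolate the location term while equal means isolate the scale term:

\[ \mathrm{Fisher}(\mathcal{N}(m_1,\sigma^2)\mid\mathcal{N}(m_2,\sigma^2)) =\frac{(m_1-m_2)^2}{\sigma^4} \quad\mbox{and}\quad \mathrm{Fisher}(\mathcal{N}(m,\sigma_1^2)\mid\mathcal{N}(m,\sigma_2^2)) =\frac{(\sigma_2^2-\sigma_1^2)^2}{\sigma_1^2\sigma_2^4}, \]

and \( {\nu=\mu} \) gives \( {0} \), as it should. Note also the lack of symmetry in \( {\sigma_1} \) and \( {\sigma_2} \): the Fisher divergence is not symmetric in its arguments.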

A proof. If \( {X\sim\mathcal{N}(m,\Sigma)} \) then, for all \( {1\leq i,j\leq n} \),

\[ \mathbb{E}(X_iX_j)=\Sigma_{ij}+m_im_j, \]

hence, for all \( {n\times n} \) symmetric matrices \( {A} \) and \( {B} \),

\[ \begin{array}{rcl} \mathbb{E}(AX\cdot BX) &=&\mathbb{E}\sum_{i,j,k=1}^nA_{ij}X_jB_{ik}X_k\\ &=&\sum_{i,j,k=1}^nA_{ij}B_{ik}\mathbb{E}(X_jX_k)\\ &=&\sum_{i,j,k=1}^nA_{ij}B_{ik}(\Sigma_{jk}+m_jm_k)\\ &=&\mathrm{Trace}(A\Sigma B)+Am\cdot Bm, \end{array} \]

and thus for all \( {n} \)-dimensional vectors \( {a} \) and \( {b} \),

\[ \begin{array}{rcl} \mathbb{E}(A(X-a)\cdot B(X-b)) &=&\mathbb{E}(AX\cdot BX)+A(m-a)\cdot B(m-b)-Am\cdot Bm\\ &=&\mathrm{Trace}(A\Sigma B)+A(m-a)\cdot B(m-b). \end{array} \]

Now, using the notation \( {q_i(x)=\Sigma_i^{-1}(x-m_i)\cdot(x-m_i)} \), \( {|\Sigma_i|=\det(\Sigma_i)} \), and \( {\Gamma_i=\mathcal{N}(m_i,\Sigma_i)} \),

\[ \begin{array}{rcl} \mathrm{Fisher}(\Gamma_1\mid\Gamma_2) &=&\displaystyle4\frac{\sqrt{|\Sigma_2|}}{\sqrt{|\Sigma_1|}}\int\Bigr|\nabla\mathrm{e}^{-\frac{q_1(x)}{4}+\frac{q_2(x)}{4}}\Bigr|^2\frac{\mathrm{e}^{-\frac{q_2(x)}{2}}}{\sqrt{(2\pi)^n|\Sigma_2|}}\mathrm{d}x\\ &=&\displaystyle\int|\Sigma_2^{-1}(x-m_2)-\Sigma_1^{-1}(x-m_1)|^2\frac{\mathrm{e}^{-\frac{q_1(x)}{2}}}{\sqrt{(2\pi)^n|\Sigma_1|}}\mathrm{d}x\\ &=&\displaystyle\int(|\Sigma_2^{-1}(x-m_2)|^2\\ &&\qquad-2\Sigma_2^{-1}(x-m_2)\cdot\Sigma_1^{-1}(x-m_1)\\ &&\qquad+|\Sigma_1^{-1}(x-m_1)|^2)\frac{\mathrm{e}^{-\frac{q_1(x)}{2}}}{\sqrt{(2\pi)^n|\Sigma_1|}}\mathrm{d}x\\ &=&\mathrm{Trace}(\Sigma_2^{-1}\Sigma_1\Sigma_2^{-1})+|\Sigma_2^{-1}(m_1-m_2)|^2-2\mathrm{Trace}(\Sigma_2^{-1})+\mathrm{Trace}(\Sigma_1^{-1})\\ &=&\mathrm{Trace}(\Sigma_2^{-2}\Sigma_1-2\Sigma_2^{-1}+\Sigma_1^{-1})+|\Sigma_2^{-1}(m_1-m_2)|^2, \end{array} \]

where the fourth equality applies the identity above three times with \( {X\sim\mathcal{N}(m_1,\Sigma_1)} \), namely with \( {A=B=\Sigma_2^{-1}} \), with \( {A=\Sigma_2^{-1}} \) and \( {B=\Sigma_1^{-1}} \), and with \( {A=B=\Sigma_1^{-1}} \).

The formula when \( {\Sigma_1\Sigma_2=\Sigma_2\Sigma_1} \) follows immediately.
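Indeed, when \( {\Sigma_1\Sigma_2=\Sigma_2\Sigma_1} \),

\[ \Sigma_2^{-2}(\Sigma_2-\Sigma_1)^2\Sigma_1^{-1} =\Sigma_2^{-2}(\Sigma_2^2-2\Sigma_2\Sigma_1+\Sigma_1^2)\Sigma_1^{-1} =\Sigma_1^{-1}-2\Sigma_2^{-1}+\Sigma_2^{-2}\Sigma_1, \]

and the trace terms in the two expressions coincide.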

Other distances. Recall that the Hellinger distance between probability measures \( {\mu} \) and \( {\nu} \) with densities \( {f_\mu} \) and \( {f_\nu} \) with respect to the same reference measure \( {\lambda} \) is

\[ \mathrm{Hellinger}(\mu,\nu) =\Bigr(\int(\sqrt{f_\mu}-\sqrt{f_\nu})^2\mathrm{d}\lambda\Bigr)^{1/2} =\Bigr(2-2\int\sqrt{f_\mu f_\nu}\mathrm{d}\lambda\Bigr)^{1/2} \in[0,\sqrt{2}]. \]

This quantity does not depend on the choice of \( {\lambda} \).

The \( {\chi^2} \) divergence (sometimes inappropriately called a distance) is defined by

\[ \chi^2(\nu\mid\mu) =\mathrm{Var}_\mu\Bigr(\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\Bigr) =\Bigr\|\frac{\mathrm{d}\nu}{\mathrm{d}\mu}-1\Bigr\|_{\mathrm{L}^2(\mu)}^2 =\Bigr\|\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\Bigr\|_{\mathrm{L}^2(\mu)}^2-1 \in[0,+\infty]. \]

The Kullback–Leibler divergence or relative entropy is defined by

\[ \mathrm{Kullback}(\nu\mid\mu) =\int\log{\textstyle\frac{\mathrm{d}\nu}{\mathrm{d}\mu}}\mathrm{d}\nu =\int{\textstyle\frac{\mathrm{d}\nu}{\mathrm{d}\mu}} \log{\textstyle\frac{\mathrm{d}\nu}{\mathrm{d}\mu}}\mathrm{d}\mu \in[0,+\infty] \]

if \( {\nu} \) is absolutely continuous with respect to \( {\mu} \), and \( {\mathrm{Kullback}(\nu\mid\mu)=+\infty} \) otherwise.

The Wasserstein–Kantorovich–Monge transportation distance of order \( {2} \) and with respect to the underlying Euclidean distance is defined for all probability measures \( {\mu} \) and \( {\nu} \) on \( {\mathbb{R}^n} \) by

\[ \mathrm{Wasserstein}(\mu,\nu)=\Bigr(\inf_{(X,Y)}\mathbb{E}(\left|X-Y\right|^2)\Bigr)^{1/2} \in[0,+\infty] \]

where the inf runs over all couples \( {(X,Y)} \) with \( {X\sim\mu} \) and \( {Y\sim\nu} \).

Now, for all \( {n\geq1} \), \( {m_1,m_2\in\mathbb{R}^n} \), and all \( {n\times n} \) covariance matrices \( {\Sigma_1,\Sigma_2} \), denoting

\[ \Gamma_1=\mathcal{N}(m_1,\Sigma_1) \quad\mbox{and}\quad \Gamma_2=\mathcal{N}(m_2,\Sigma_2), \]

we have, with \( {m=m_1-m_2} \),

\[ \begin{array}{rcl} \mathrm{Hellinger}^2(\Gamma_1,\Gamma_2) &=&2-2\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}}\exp\Bigr(-\frac{1}{4}(\Sigma_1+\Sigma_2)^{-1}m\cdot m\Bigr)\\ \chi^2(\Gamma_1\mid\Gamma_2) &=&\sqrt{\frac{|\Sigma_2|}{|2\Sigma_1-\Sigma_1^2\Sigma_2^{-1}|}}\exp\Bigr(\frac{1}{2}\Sigma_2^{-1}(\mathrm{I}_n+(2\Sigma_1^{-1}-\Sigma_2^{-1})^{-1}\Sigma_2^{-1})m\cdot m\Bigr)-1\\ 2\mathrm{Kullback}(\Gamma_1\mid\Gamma_2) &=&\Sigma_2^{-1}m\cdot m+\mathrm{Tr}(\Sigma_2^{-1}\Sigma_1-\mathrm{I}_n)+\log\det(\Sigma_2\Sigma_1^{-1})\\ \mathrm{Fisher}(\Gamma_1\mid\Gamma_2) &=&|\Sigma_2^{-1}m|^2+\mathrm{Tr}(\Sigma_2^{-2}\Sigma_1-2\Sigma_2^{-1}+\Sigma_1^{-1})\\ \mathrm{Wasserstein}^2(\Gamma_1,\Gamma_2) &=&|m|^2+\mathrm{Tr}\Bigr(\Sigma_1+\Sigma_2-2\sqrt{\sqrt{\Sigma_1}\Sigma_2\sqrt{\Sigma_1}}\Bigr), \end{array} \]

and if \( {\Sigma_1} \) and \( {\Sigma_2} \) commute, \( {\Sigma_1\Sigma_2=\Sigma_2\Sigma_1} \), then we find the simpler formulas

\[ \begin{array}{rcl} \mathrm{Fisher}(\Gamma_1\mid\Gamma_2) &=&|\Sigma_2^{-1}(m_1-m_2)|^2+\mathrm{Tr}(\Sigma_2^{-2}(\Sigma_2-\Sigma_1)^2\Sigma_1^{-1})\\ \mathrm{Wasserstein}^2(\Gamma_1,\Gamma_2) &=&|m_1-m_2|^2+\mathrm{Tr}((\sqrt{\Sigma_1}-\sqrt{\Sigma_2})^2). \end{array} \]
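For a quick numerical illustration, here is a sketch, assuming NumPy and SciPy are available (the dimension and seed are arbitrary choices), checking that for diagonal, hence commuting, covariance matrices the simplified Fisher and Wasserstein formulas agree with the general ones.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
n = 4
m1, m2 = rng.normal(size=n), rng.normal(size=n)
m = m1 - m2
S1 = np.diag(rng.uniform(0.5, 2.0, size=n))  # diagonal covariance matrices commute
S2 = np.diag(rng.uniform(0.5, 2.0, size=n))
S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)

# Fisher: general formula versus commuting simplification
fisher_gen = np.sum((S2i @ m) ** 2) + np.trace(S2i @ S2i @ S1 - 2 * S2i + S1i)
fisher_com = np.sum((S2i @ m) ** 2) + np.trace(S2i @ S2i @ (S2 - S1) @ (S2 - S1) @ S1i)

# Squared Wasserstein: general formula versus commuting simplification
r1, r2 = np.real(sqrtm(S1)), np.real(sqrtm(S2))
wass_gen = np.sum(m ** 2) + np.trace(S1 + S2 - 2 * np.real(sqrtm(r1 @ S2 @ r1)))
wass_com = np.sum(m ** 2) + np.trace((r1 - r2) @ (r1 - r2))

print(fisher_gen, fisher_com)  # equal up to rounding errors
print(wass_gen, wass_com)      # equal up to rounding errors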

Fisher as an infinitesimal Kullback. The Boltzmann–Shannon entropy is in a sense the opposite of the Kullback divergence with respect to the Lebesgue measure \( {\lambda} \), namely

\[ \mathrm{Entropy}(\mu) =-\int\frac{\mathrm{d}\mu}{\mathrm{d}\lambda} \log\frac{\mathrm{d}\mu}{\mathrm{d}\lambda}\mathrm{d}\lambda =-\mathrm{Kullback}(\mu\mid\lambda). \]

It was discovered by Nicolaas Govert de Bruijn (1918 – 2012) that the Fisher information appears as the differential version of the entropy under Gaussian noise. More precisely, the de Bruijn identity states that if \( {X} \) is a random vector of \( {\mathbb{R}^n} \) with finite entropy and if \( {Z\sim\mathcal{N}(0,I_n)} \) is independent of \( {X} \) then

\[ \frac{\mathrm{d}}{\mathrm{d}t}\Bigr\vert_{t=0} \mathrm{Kullback}(\mathrm{Law}(X+\sqrt{2t}Z)\mid\lambda) =-\mathrm{Fisher}(\mathrm{Law}(X)\mid\lambda). \]

In other words, if \( {\mu_t} \) is the law at time \( {t} \) of \( {X+\sqrt{2}B_t} \), where \( {{(B_t)}_{t\geq0}} \) is a standard \( {n} \)-dimensional Brownian motion independent of the initial condition \( {X} \), then

\[ \frac{\mathrm{d}}{\mathrm{d}t}\Bigr\vert_{t=0} \mathrm{Kullback}(\mu_t\mid\lambda) =-\mathrm{Fisher}(\mu_0\mid\lambda). \]
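For instance, if \( {X\sim\mathcal{N}(0,\sigma^2I_n)} \) then \( {\mu_t=\mathcal{N}(0,(\sigma^2+2t)I_n)} \), and the identity can be checked by hand:

\[ \mathrm{Kullback}(\mu_t\mid\lambda) =-\frac{n}{2}\log(2\pi\mathrm{e}(\sigma^2+2t)) \quad\mbox{and}\quad \mathrm{Fisher}(\mu_t\mid\lambda) =\frac{n}{\sigma^2+2t}, \]

so that \( {\frac{\mathrm{d}}{\mathrm{d}t}\mathrm{Kullback}(\mu_t\mid\lambda)=-\frac{n}{\sigma^2+2t}=-\mathrm{Fisher}(\mu_t\mid\lambda)} \) for all \( {t\geq0} \).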

The Lebesgue measure is the invariant (and reversible) measure of Brownian motion. More generally, let us consider the stochastic differential equation

\[ \mathrm{d}X_t=\sqrt{2}\mathrm{d}B_t-\nabla V(X_t)\mathrm{d}t \]

on \( {\mathbb{R}^n} \), where \( {V:\mathbb{R}^n\rightarrow\mathbb{R}} \) is \( {\mathcal{C}^2} \) and where \( {{(B_t)}_{t\geq0}} \) is a standard Brownian motion. If we assume that \( {V-\frac{\rho}{2}\left|\cdot\right|^2} \) is convex for some \( {\rho\in\mathbb{R}} \) then this equation admits a solution \( {{(X_t)}_{t\geq0}} \) known as the overdamped Langevin process, which is a Markov diffusion process. If we further assume that \( {\mathrm{e}^{-V}} \) is integrable with respect to the Lebesgue measure, then the probability measure \( {\mu} \) with density proportional to \( {\mathrm{e}^{-V}} \) is invariant and reversible. Now, denoting \( {\mu_t=\mathrm{Law}(X_t)} \), the analogue of the de Bruijn identity reads, for all \( {t\geq0} \),

\[ \frac{\mathrm{d}}{\mathrm{d}t} \mathrm{Kullback}(\mu_t\mid\mu) =-\mathrm{Fisher}(\mu_t\mid\mu) \]

but this requires that \( {\mu_0} \) is chosen in such a way that \( {t\mapsto\mathrm{Kullback}(\mu_t\mid\mu)} \) is well defined and differentiable. This condition is easily checked in the example of the Ornstein–Uhlenbeck process, which corresponds to \( {V=\frac{1}{2}\left|\cdot\right|^2} \) and for which \( {\mu=\mathcal{N}(0,I_n)} \).
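As a concrete illustration, here is a minimal Euler–Maruyama discretization sketch of this diffusion, assuming NumPy is available; the function name, step size, and horizon are illustrative choices, and the sketch is instantiated on the Ornstein–Uhlenbeck case \( {V=\frac{1}{2}\left|\cdot\right|^2} \) studied next.

import numpy as np

def langevin_em(grad_V, x0, dt=1e-3, n_steps=10_000, rng=None):
    # Euler-Maruyama scheme for dX_t = sqrt(2) dB_t - grad V(X_t) dt
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        x = x - grad_V(x) * dt + np.sqrt(2 * dt) * rng.normal(size=x.shape)
        path.append(x.copy())
    return np.array(path)

# Ornstein-Uhlenbeck case: V(x) = |x|^2 / 2, grad V(x) = x, invariant law N(0, I_n)
path = langevin_em(lambda x: x, x0=np.full(3, 5.0), rng=np.random.default_rng(2))
print(path[-1])  # for large times the marginal law is close to N(0, I_n)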

Ornstein–Uhlenbeck. If \( {{(X_t^x)}_{t\geq0}} \) is an \( {n} \)-dimensional Ornstein–Uhlenbeck process solution of the stochastic differential equation

\[ X_0^x=x\in\mathbb{R}^n, \quad\mathrm{d}X^x_t=\sqrt{2}\mathrm{d}B_t-X^x_t\mathrm{d}t \]

where \( {{(B_t)}_{t\geq0}} \) is a standard \( {n} \)-dimensional Brownian motion, then the invariant law is \( {\gamma=\mathcal{N}(0,I_n)} \) and the Mehler formula reads

\[ X^x_t=x\mathrm{e}^{-t}+\sqrt{2}\int_0^t\mathrm{e}^{s-t}\mathrm{d}B_s\sim\mathcal{N}(x\mathrm{e}^{-t},(1-\mathrm{e}^{-2t})I_n), \]

and the explicit formula for the Fisher information for Gaussians gives

\[ \mathrm{Fisher}(\mathrm{Law}(X^x_t)\mid\gamma) =\mathrm{Fisher}(\mathcal{N}(x\mathrm{e}^{-t},(1-\mathrm{e}^{-2t})I_n)\mid\gamma) =|x|^2\mathrm{e}^{-2t}+n\frac{\mathrm{e}^{-4t}}{1-\mathrm{e}^{-2t}}. \]
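Note that the covariance in the Mehler formula comes from the Itô isometry, namely \( {2\int_0^t\mathrm{e}^{2(s-t)}\mathrm{d}s=1-\mathrm{e}^{-2t}} \). Moreover, the right-hand side above blows up as \( {t\rightarrow0^+} \), in accordance with the convention \( {\mathrm{Fisher}(\delta_x\mid\gamma)=+\infty} \) since \( {\delta_x} \) is not absolutely continuous with respect to \( {\gamma} \), while it tends to \( {0} \) exponentially fast as \( {t\rightarrow\infty} \), in accordance with the convergence of the process to its equilibrium \( {\gamma} \).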

Log-Sobolev inequality. The optimal log-Sobolev inequality for \( {\mu=\mathcal{N}(0,\mathrm{I}_n)} \) reads

\[ \mathrm{Kullback}(\nu\mid\mu) \leq\frac{1}{2}\mathrm{Fisher}(\nu\mid\mu) \]

for all probability measures \( {\nu} \) on \( {\mathbb{R}^n} \), and equality is achieved when \( {\log\frac{\mathrm{d}\nu}{\mathrm{d}\mu}} \) is affine, namely when \( {\nu=\mathcal{N}(m,\mathrm{I}_n)} \) for some \( {m\in\mathbb{R}^n} \). By using the Gaussian formulas above for Kullback and Fisher, this log-Sobolev inequality boils down, when \( {\nu=\mathcal{N}(m,\Sigma_1)} \), to

\[ \log\det(\Sigma_1^{-1})\leq\mathrm{Tr}(\Sigma_1^{-1}-\mathrm{I}_n). \]

Taking \( {K=\Sigma_1^{-1}} \) shows that this is a matrix version of \( {\log(x)\leq x-1} \), nothing else.
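Indeed, if \( {\lambda_1,\ldots,\lambda_n>0} \) are the eigenvalues of the symmetric positive definite matrix \( {K=\Sigma_1^{-1}} \), then

\[ \log\det(K)=\sum_{i=1}^n\log(\lambda_i) \leq\sum_{i=1}^n(\lambda_i-1) =\mathrm{Tr}(K-\mathrm{I}_n). \]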

