Let ${{(X_t)}_{t\geq0}}$ be a Markov process on a state space ${E}$. Let us define

$P_t(x,\cdot)=\mathrm{Law}(X_t\mid X_0=x),\quad x\in E, t\geq0.$

It follows that if ${X_0\sim\mu}$ then ${X_t\sim\mu P_t}$ where

$\mu P_t = \int P_t(x,\cdot)\mathrm{d}\mu(x).$

In this post, a divergence between probability measures ${\mu}$ and ${\nu}$ on ${E}$ is a quantitative way to measure the discrepancy between ${\mu}$ and ${\nu}$. A divergence can be a genuine distance, such as the total variation or the Wasserstein distance. We study nice examples later on.

Suppose that for some divergence between probability measures and for some quantity ${\varphi_t(x,y)}$ which depends on ${x,y,t}$ we have, typically by using a coupling,

$\mathrm{div}(P_t(x,\cdot),P_t(y,\cdot)) \leq\varphi_t(x,y).$

In this post, we explain how to deduce that for all probability measures ${\mu,\nu}$,

$\mathrm{div}(\mu P_t,\nu P_t) \leq\int\varphi_t(x,y)\mathrm{d}\mu(x)\mathrm{d}\nu(y).$

The initial inequality corresponds to taking ${\mu=\delta_x}$ and ${\nu=\delta_y}$. All this is about notions of couplings, divergences, and functional inequalities.

Coupling. Let ${\mathcal{P}(E)}$ be the set of probability measures on ${E}$. If ${\mu}$ and ${\nu}$ are in ${\mathcal{P}(E)}$, then a coupling of ${\mu}$ and ${\nu}$ is an element ${\pi}$ of ${\mathcal{P}(E\times E)}$ with marginal distributions ${\mu}$ and ${\nu}$. The set of couplings is convex, and is not empty since it contains the product measure ${\mu\otimes\nu}$.
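On a finite state space, a coupling is just a nonnegative matrix with prescribed row and column sums, and the product coupling is always available. A minimal numerical sketch (the vectors below are toy data):

```python
import numpy as np

# Toy probability vectors on a 3-point state space E = {0, 1, 2}.
mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.5, 0.1, 0.4])

# The product coupling mu ⊗ nu: a valid coupling, always available.
pi = np.outer(mu, nu)

# A coupling of mu and nu is any matrix pi >= 0 whose row sums give mu
# and whose column sums give nu.
row_marginal = pi.sum(axis=1)
col_marginal = pi.sum(axis=0)
```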

Supremum divergence. Let ${\mathcal{F}}$ be a class of bounded functions ${E\rightarrow[0,+\infty)}$. For all ${\mu,\nu\in\mathcal{P}(E)}$, we define the quantity

$\mathrm{div}_{\mathcal{F}}(\mu,\nu) =\sup_{f\in\mathcal{F}}\int f\mathrm{d}(\mu-\nu)\in(-\infty,+\infty].$

This is not necessarily a distance. We give nice examples later.
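On a finite space, ${\mathrm{div}_{\mathcal{F}}}$ is a finite maximum over the class. A toy sketch (the one-function class below is an arbitrary choice, only meant to show that ${\mathrm{div}_{\mathcal{F}}}$ can be negative and asymmetric, hence not a distance):

```python
import numpy as np

mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.5, 0.1, 0.4])

# A class F of bounded functions E -> [0, +inf), encoded as value vectors
# on E = {0, 1, 2}; this tiny class is arbitrary, for illustration only.
F = [np.array([1.0, 0.0, 0.0])]

def div_F(a, b):
    """sup over f in F of integral of f d(a - b)."""
    return max(f @ (a - b) for f in F)

# With such a poor class, div_F can be negative and asymmetric.
d1 = div_F(mu, nu)   # = 1 * (0.2 - 0.5) = -0.3
d2 = div_F(nu, mu)   # = +0.3
```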

Inequality. Let ${P:E\rightarrow\mathcal{P}(E)}$ be a Markov kernel. Recall that for all ${\mu\in\mathcal{P}(E)}$, ${\mu P\in\mathcal{P}(E)}$ is defined by ${\mu P=\int P(x,\cdot)\mathrm{d}\mu(x)}$. Then, for all ${\mu}$ and ${\nu}$ in ${\mathcal{P}(E)}$,

$\mathrm{div}_{\mathcal{F}}(\mu P,\nu P) \leq \inf_\pi\int \mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot))\mathrm{d}\pi(x,y)$

where the infimum runs over all couplings of ${\mu}$ and ${\nu}$. Taking ${\pi=\mu\otimes\nu}$ gives

$\mathrm{div}_{\mathcal{F}}(\mu P,\nu P) \leq \int \mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot))\mathrm{d}\mu(x)\mathrm{d}\nu(y),$

and in particular

$\mathrm{div}_{\mathcal{F}}(\mu P,\nu P) \leq \sup_{x,y}\mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot)).$
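These inequalities can be checked numerically on a finite state space. A sketch with a random kernel and an arbitrary small class ${\mathcal{F}}$ (all the data here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# A random Markov kernel on n states: row x is the measure P(x, .).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
mu = rng.random(n); mu /= mu.sum()
nu = rng.random(n); nu /= nu.sum()

# An arbitrary class F of bounded nonnegative functions (value vectors).
F = [rng.random(n) for _ in range(5)]
div_F = lambda a, b: max(f @ (a - b) for f in F)

lhs = div_F(mu @ P, nu @ P)                    # div_F(mu P, nu P)
mid = sum(mu[x] * nu[y] * div_F(P[x], P[y])
          for x in range(n) for y in range(n))  # product-coupling bound
sup = max(div_F(P[x], P[y]) for x in range(n) for y in range(n))
```

One observes `lhs <= mid <= sup`, in agreement with the chain of inequalities above.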

A proof. The idea is to introduce a coupling and then to proceed by conditioning or disintegration. Namely, if ${\pi}$ is a coupling of ${\mu}$ and ${\nu}$, for instance ${\mu\otimes\nu}$, then

$\int f\mathrm{d}(\mu P-\nu P) =\int\Bigl(\int f\mathrm{d}(P(x,\cdot)-P(y,\cdot))\Bigr)\mathrm{d}\pi(x,y).$

As a consequence,

$\sup_{f\in\mathcal{F}}\int f\mathrm{d}(\mu P-\nu P) \leq \int\Bigl(\sup_{f\in\mathcal{F}}\int f\mathrm{d}(P(x,\cdot)-P(y,\cdot))\Bigr)\mathrm{d}\pi(x,y).$

This gives the desired inequality.

Infimum divergence. For a given map ${c:E\times E\rightarrow[0,+\infty]}$ that we call a cost, we define, for all ${\mu}$ and ${\nu}$ in ${\mathcal{P}(E)}$,

$\mathrm{div}_c(\mu,\nu)=\inf_\pi\int c(x,y)\mathrm{d}\pi(x,y)\in[0,+\infty]$

where the infimum runs over all couplings of ${\mu}$ and ${\nu}$. This is also known as the transportation or coupling distance, even if it is not necessarily a distance. We give nice examples later on.
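For the discrete cost ${c(x,y)=\mathbf{1}_{x\neq y}}$ on a finite space, an optimal coupling can be written down explicitly: keep the overlap ${\min(\mu,\nu)}$ on the diagonal and couple the leftover masses arbitrarily. A sketch with toy vectors (the closed formula used as a check is classical):

```python
import numpy as np

mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.5, 0.1, 0.4])

# Keep the overlap min(mu, nu) on the diagonal, then couple the leftover
# masses (which live on disjoint sets of states) in an arbitrary way.
diag = np.minimum(mu, nu)
r_mu, r_nu = mu - diag, nu - diag
pi = np.diag(diag)
if r_mu.sum() > 0:
    pi = pi + np.outer(r_mu, r_nu) / r_mu.sum()

# Cost of this coupling for c(x,y) = 1_{x != y}: the off-diagonal mass.
cost = pi.sum() - np.trace(pi)
# It matches the closed formula div_c(mu, nu) = 1 - sum_i min(mu_i, nu_i).
closed = 1.0 - diag.sum()
```

Since `r_mu` and `r_nu` have disjoint supports, the outer-product term puts no mass on the diagonal, so this coupling attains the infimum.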

Inequality. For every Markov kernel ${P:E\rightarrow\mathcal{P}(E)}$ and all ${\mu}$ and ${\nu}$ in ${\mathcal{P}(E)}$,

$\mathrm{div}_c(\mu P,\nu P) \leq\inf_\pi\int \mathrm{div}_c(P(x,\cdot),P(y,\cdot))\mathrm{d}\pi(x,y),$

where the infimum runs over all couplings of ${\mu}$ and ${\nu}$. Taking ${\pi=\mu\otimes\nu}$ gives

$\mathrm{div}_c(\mu P,\nu P) \leq\int \mathrm{div}_c(P(x,\cdot),P(y,\cdot))\mathrm{d}\mu(x)\mathrm{d}\nu(y),$

and in particular

$\mathrm{div}_c(\mu P,\nu P) \leq\sup_{x,y} \mathrm{div}_c(P(x,\cdot),P(y,\cdot)).$
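For the discrete cost, ${\mathrm{div}_c}$ is the total variation distance ${1-\sum_i\min(\mu_i,\nu_i)}$, so the inequalities can be tested directly on a random finite kernel (a numerical sketch with arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# A random Markov kernel on n states: row x is the measure P(x, .).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
mu = rng.random(n); mu /= mu.sum()
nu = rng.random(n); nu /= nu.sum()

# For c(x,y) = 1_{x != y}, div_c is total variation: 1 - sum_i min.
tv = lambda a, b: 1.0 - np.minimum(a, b).sum()

lhs = tv(mu @ P, nu @ P)
mid = sum(mu[x] * nu[y] * tv(P[x], P[y]) for x in range(n) for y in range(n))
sup = max(tv(P[x], P[y]) for x in range(n) for y in range(n))
```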

A proof. Let ${\pi_{x,y}}$ be a coupling of ${P(x,\cdot)}$ and ${P(y,\cdot)}$. Then ${\int\pi_{x,y}(\cdot,\cdot)\mathrm{d}\mu(x)\mathrm{d}\nu(y)}$ is a coupling of ${\mu P}$ and ${\nu P}$. Indeed, for instance for the first marginal, we have

$\begin{array}{rcl} \int_{y'}\int_{x,y}\pi_{x,y}(\cdot,\mathrm{d}y')\mathrm{d}\mu(x)\mathrm{d}\nu(y) &=&\int_{x,y}\Bigl(\int_{y'}\pi_{x,y}(\cdot,\mathrm{d}y')\Bigr)\mathrm{d}\mu(x)\mathrm{d}\nu(y)\\ &=&\int_{x,y}P(x,\cdot)\mathrm{d}\mu(x)\mathrm{d}\nu(y)\\ &=&\mu P. \end{array}$

Now, for all ${\varepsilon>0}$ there exists a coupling ${\pi_{x,y}}$ of ${P(x,\cdot)}$ and ${P(y,\cdot)}$ such that

$\begin{array}{rcl} \int c(x',y')\mathrm{d}\pi_{x,y}(x',y')-\varepsilon &\leq&\inf_{\pi'}\int c(x',y')\mathrm{d}\pi'(x',y')\\ &=&\mathrm{div}_c(P(x,\cdot),P(y,\cdot)), \end{array}$

where the infimum runs over all couplings ${\pi'}$ of ${P(x,\cdot)}$ and ${P(y,\cdot)}$,

and thus

$\mathrm{div}_c(\mu P,\nu P) -\varepsilon \leq\int \mathrm{div}_c(P(x,\cdot),P(y,\cdot))\mathrm{d}\mu(x)\mathrm{d}\nu(y).$

This gives the desired inequality.

Playing with Markov kernels. Let us consider the identity Markov kernel defined by

$P(x,\cdot)=\delta_x\quad\mbox{for all}\quad x\in E.$

Then ${\mu P=\mu}$ for all ${\mu\in\mathcal{P}(E)}$, hence the name. Next, since ${\mathrm{div}_c(\delta_x,\delta_y)=c(x,y)}$, the inequality above for the infimum divergence gives in this case the tautology ${\mathrm{div}_c(\mu,\nu)=\mathrm{div}_c(\mu,\nu)}$. In contrast, the inequality for the supremum divergence gives

$\mathrm{div}_{\mathcal{F}}(\mu,\nu) \leq \inf_\pi\int c(x,y)\mathrm{d}\pi(x,y) =\mathrm{div}_c(\mu,\nu)$

where the infimum runs over all couplings of ${\mu}$ and ${\nu}$ and where the cost is

$c(x,y) =\mathrm{div}_{\mathcal{F}}(\delta_x,\delta_y) =\sup_{f\in\mathcal{F}}(f(x)-f(y)).$
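This comparison ${\mathrm{div}_{\mathcal{F}}\leq\mathrm{div}_c}$ can be tested numerically, using the product coupling to bound the infimum from above (the class and vectors below are toy assumptions):

```python
import numpy as np

mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.5, 0.1, 0.4])

# A toy class F of bounded nonnegative functions on E = {0, 1, 2}.
F = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 1.0])]

div_F = max(f @ (mu - nu) for f in F)

# Induced cost c(x,y) = sup_f (f(x) - f(y)); note c vanishes on the diagonal.
c = np.array([[max(f[x] - f[y] for f in F) for y in range(3)]
              for x in range(3)])
# Product-coupling bound: an upper bound on div_c(mu, nu).
bound = sum(mu[x] * nu[y] * c[x, y] for x in range(3) for y in range(3))
```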

Kantorovich-Rubinstein duality. When the cost ${(x,y)\mapsto c(x,y)}$ is a distance making ${E}$ a metric space, this duality theorem states that

$\mathrm{div}_c =\mathrm{div}_{\mathcal{F}}$

where ${\mathcal{F}}$ is the class of functions ${f:E\rightarrow\mathbb{R}}$ such that

$\left\Vert f\right\Vert_{\mathrm{Lip}} =\sup_{x\neq y}\frac{|f(x)-f(y)|}{c(x,y)} \leq1.$

In the case of the discrete distance ${c(x,y)=\mathbf{1}_{x\neq y}}$, this identity becomes

$\inf_{\substack{(X,Y)\\X\sim\mu\\Y\sim\nu}}\mathbb{P}(X\neq Y) =\sup_{\substack{f:E\rightarrow\mathbb{R}\\\left\Vert f\right\Vert_\infty\leq 1/2}}\int f\mathrm{d}(\mu-\nu)$

and this matches the total variation distance

$\left\Vert \mu-\nu\right\Vert_{\mathrm{TV}} =\sup_{B\subset E}|\mu(B)-\nu(B)|$

(all right, ${\geq}$ is immediate, while ${\leq}$ requires approximation/structure on ${E}$).
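For discrete measures, all three expressions of the total variation can be computed and compared. A small sketch with toy vectors (the maximizers used below are the classical ones):

```python
import numpy as np

mu = np.array([0.1, 0.6, 0.3])
nu = np.array([0.4, 0.2, 0.4])

# Coupling side: the best P(X = Y) is sum_i min(mu_i, nu_i).
tv_coupling = 1.0 - np.minimum(mu, nu).sum()
# Dual side: the sup over ||f||_inf <= 1/2 is attained at f = sign(mu - nu)/2.
f = 0.5 * np.sign(mu - nu)
tv_dual = f @ (mu - nu)
# Set side: sup_B |mu(B) - nu(B)| is attained at B = {mu > nu}.
B = mu > nu
tv_sets = abs(mu[B].sum() - nu[B].sum())
```

The three quantities coincide, as the identity above predicts.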

Bounded-Lipschitz or Fortet-Mourier distance. Still when ${E}$ is a metric space, this distance is ${\mathrm{div}_{\mathcal{F}}}$ where ${\mathcal{F}}$ is the class of ${f:E\rightarrow\mathbb{R}}$ such that

$\left\Vert f\right\Vert_{\mathrm{Lip}}\leq1\quad\mbox{(implies continuity)}\quad \mbox{and}\quad\left\Vert f\right\Vert_\infty\leq1.$

(Monge-Kantorovich-)Wasserstein distances. When ${E}$ is a metric space equipped with a distance ${d}$, and when ${p\in[1,\infty)}$, the ${W_p}$ distance is defined by

$W_p(\mu,\nu)=\mathrm{div}_c(\mu,\nu)^{1/p} \quad\mbox{with}\quad c(x,y)=d(x,y)^p.$

It is finite when ${\mu}$ and ${\nu}$ have a finite ${p}$-th order moment in the sense that for some (and thus any) ${x\in E}$ we have ${\int d(x,y)^p\mathrm{d}\mu(y)<\infty}$ and ${\int d(x,y)^p\mathrm{d}\nu(y)<\infty}$. On this subset of ${\mathcal{P}(E)}$, ${W_p}$ indeed turns out to be a true distance.

In the case ${p=1}$, the Kantorovich-Rubinstein duality can be used for ${W_1=\mathrm{div}_c}$ with ${c(x,y)=d(x,y)}$ since it is a distance on ${E}$, giving ${W_1=\mathrm{div}_{\mathcal{F}}}$ where ${\mathcal{F}}$ is the class of bounded (this condition can be relaxed) and Lipschitz functions ${f:E\rightarrow\mathbb{R}}$ with ${\left\Vert f\right\Vert_{\mathrm{Lip}}\leq1}$.
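On the real line, ${W_1}$ has the closed form ${\int|F_\mu(t)-F_\nu(t)|\,\mathrm{d}t}$ in terms of cumulative distribution functions, which makes the Lipschitz dual easy to test on discrete measures. A sketch with toy data (sorted atoms assumed):

```python
import numpy as np

# Atoms at sorted positions xs, with weights mu and nu.
xs = np.array([0.0, 1.0, 2.0])
mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.5, 0.1, 0.4])

# W_1(mu, nu) = int |F_mu(t) - F_nu(t)| dt  (CDF formula on R).
F_mu, F_nu = np.cumsum(mu), np.cumsum(nu)
w1 = np.sum(np.abs(F_mu - F_nu)[:-1] * np.diff(xs))

# Kantorovich-Rubinstein: any 1-Lipschitz f gives a lower bound; here
# F_mu - F_nu has constant sign, so f(x) = x already attains W_1.
dual = xs @ (mu - nu)
```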

When ${p\neq 1}$, the cost is no longer a distance, but we still have the variational formula

$W_p(\mu,\nu)=\sup\left(\int f\mathrm{d}\mu-\int g\mathrm{d}\nu\right)^{1/p}$

where the supremum runs over all bounded and Lipschitz ${f,g:E\rightarrow\mathbb{R}}$ such that ${f(x)-g(y)\leq d(x,y)^p}$. In other words

$W_p(\mu,\nu)=\sup\left(\int Q(f)\mathrm{d}\mu-\int f\mathrm{d}\nu\right)^{1/p}$

where the supremum runs over bounded Lipschitz ${f:E\rightarrow\mathbb{R}}$ and where ${Q(f)}$ is the infimum convolution of ${f}$ with the cost ${d(\cdot,\cdot)^p}$, defined by

$Q(f)(x)=\inf_{y\in E}\Bigl(f(y)+d(x,y)^p\Bigr).$
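The infimum convolution can be approximated on a grid. A sketch with ${p=2}$, ${d(x,y)=|x-y|}$, and an arbitrary test function (all choices here are illustrative):

```python
import numpy as np

p = 2
ys = np.linspace(-3.0, 3.0, 601)      # grid on [-3, 3] with step 0.01
f = np.abs(ys)                        # an arbitrary 1-Lipschitz test function

def Q(x):
    """Grid approximation of Q(f)(x) = inf_y ( f(y) + |x - y|^p )."""
    return np.min(f + np.abs(x - ys) ** p)

# Taking y = x in the infimum shows Q(f) <= f pointwise.
q0 = Q(0.0)     # = 0, attained at y = 0
q2 = Q(2.0)     # for f = |.| and p = 2, the exact value is 1.75, at y = 1.5
```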

Note that ${W_p}$ defines the same topology as the Zolotarev distance ${\mathrm{div}_{\mathcal{F}}^{1/p}}$ where ${\mathcal{F}}$ is the class of functions with growth at most like ${d^p(x,\cdot)}$ for some arbitrary ${x}$. They coincide when ${p=1}$ and differ metrically when ${p\neq1}$.

Trend to equilibrium. In the study of the trend to equilibrium / long-time behavior of the Markov process ${X}$, we typically have ${\lim_{t\rightarrow\infty}\varphi_t(x,y)=0}$ for all ${x,y}$. Moreover, if ${\nu}$ is invariant, meaning that ${\nu P_t=\nu}$ for all ${t}$, then

$\mathrm{div}(\mu P_t,\nu)\leq\int\varphi_t(x,y)\mathrm{d}\mu(x)\mathrm{d}\nu(y) \underset{t\rightarrow\infty}{\longrightarrow}0$

provided that ${\sup_t\varphi_t}$ is ${\mu\otimes\nu}$-integrable (by dominated convergence).
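A minimal illustration on the two-state chain, where everything is explicit and the total variation decays geometrically (the parameters below are arbitrary):

```python
import numpy as np

# Two-state kernel P = [[1-a, a], [b, 1-b]] with invariant law (b, a)/(a+b).
a, b = 0.3, 0.2
P = np.array([[1 - a, a], [b, 1 - b]])
inv = np.array([b, a]) / (a + b)

tv = lambda u, v: 1.0 - np.minimum(u, v).sum()   # total variation distance

mu = np.array([1.0, 0.0])     # start at the first state
dists = []
m = mu.copy()
for t in range(10):
    dists.append(tv(m, inv))
    m = m @ P

# Exact decay: tv(mu P^t, inv) = |1 - a - b|^t * tv(mu, inv) = 0.6 * 0.5^t.
```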