Press "Enter" to skip to content

Coupling, divergences, and Markov kernels

Coupling (British TV series)

Let \( {{(X_t)}_{t\geq0}} \) be a Markov process on a state space \( {E} \). Let us define

\[ P_t(x,\cdot)=\mathrm{Law}(X_t\mid X_0=x),\quad x\in E, t\geq0. \]

It follows that if \( {X_0\sim\mu} \) then \( {X_t\sim\mu P_t} \) where

\[ \mu P_t = \int P_t(x,\cdot)\mathrm{d}\mu(x). \]

In this post, a divergence between probability measures \( {\mu} \) and \( {\nu} \) on \( {E} \) is a quantitative way to measure the difference between \( {\mu} \) and \( {\nu} \). A divergence can be a distance, such as the total variation distance or a Wasserstein distance. We study nice examples later on.

Suppose that for some divergence between probability measures and for some quantity \( {\varphi_t(x,y)} \) which depends on \( {x,y,t} \) we have, typically by using a coupling,

\[ \mathrm{div}(P_t(x,\cdot),P_t(y,\cdot)) \leq\varphi_t(x,y). \]

In this post, we explain how to deduce that for all probability measures \( {\mu,\nu} \),

\[ \mathrm{div}(\mu P_t,\nu P_t) \leq\int\varphi_t(x,y)\mathrm{d}\mu(x)\mathrm{d}\nu(y). \]

The initial inequality corresponds to taking \( {\mu=\delta_x} \) and \( {\nu=\delta_y} \). All this is about notions of couplings, divergences, and functional inequalities.

Coupling. Let \( {\mathcal{P}(E)} \) be the set of probability measures on \( {E} \). If \( {\mu} \) and \( {\nu} \) are in \( {\mathcal{P}(E)} \), then a coupling of \( {\mu} \) and \( {\nu} \) is an element \( {\pi} \) of \( {\mathcal{P}(E\times E)} \) with marginal distributions \( {\mu} \) and \( {\nu} \). The set of couplings is convex, and is not empty since it contains the product measure \( {\mu\otimes\nu} \).
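For a concrete feel on a finite state space, here is a minimal sketch (NumPy assumed, measures encoded as probability vectors, none of this is from the post) building the product coupling \( {\mu\otimes\nu} \) and checking its marginals:

```python
import numpy as np

# Two probability vectors on a finite state space E = {0, 1, 2}
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.6, 0.1, 0.3])

# The product coupling mu otimes nu: pi[x, y] = mu[x] * nu[y]
pi = np.outer(mu, nu)

# A coupling is a law on E x E whose marginals are mu and nu
assert np.allclose(pi.sum(axis=1), mu)   # first marginal
assert np.allclose(pi.sum(axis=0), nu)   # second marginal
assert np.isclose(pi.sum(), 1.0)
```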

Supremum divergence. Let \( {\mathcal{F}} \) be a class of bounded measurable functions \( {f:E\rightarrow\mathbb{R}} \). For all \( {\mu,\nu\in\mathcal{P}(E)} \), we define the quantity

\[ \mathrm{div}_{\mathcal{F}}(\mu,\nu) =\sup_{f\in\mathcal{F}}\int f\mathrm{d}(\mu-\nu)\in(-\infty,+\infty]. \]

This is not necessarily a distance. We give nice examples later.

Inequality. Let \( {P:E\rightarrow\mathcal{P}(E)} \) be a Markov kernel. Recall that for all \( {\mu\in\mathcal{P}(E)} \), \( {\mu P\in\mathcal{P}(E)} \) is defined by \( {\mu P=\int P(x,\cdot)\mathrm{d}\mu(x)} \). Then, for all \( {\mu} \) and \( {\nu} \) in \( {\mathcal{P}(E)} \),

\[ \mathrm{div}_{\mathcal{F}}(\mu P,\nu P) \leq \inf_\pi\int \mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot))\mathrm{d}\pi(x,y) \]

where the infimum runs over all couplings of \( {\mu} \) and \( {\nu} \). Taking \( {\pi=\mu\otimes\nu} \) gives

\[ \mathrm{div}_{\mathcal{F}}(\mu P,\nu P) \leq \int \mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot))\mathrm{d}\mu(x)\mathrm{d}\nu(y), \]

and in particular

\[ \mathrm{div}_{\mathcal{F}}(\mu P,\nu P) \leq \sup_{x,y}\mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot)). \]

A proof. The idea is to introduce a coupling and then to proceed by conditioning or disintegration. Namely, if \( {\pi} \) is a coupling of \( {\mu} \) and \( {\nu} \), for instance \( {\mu\otimes\nu} \), then

\[ \int f\mathrm{d}(\mu P-\nu P) =\int\Bigl(\int f\mathrm{d}(P(x,\cdot)-P(y,\cdot))\Bigr)\mathrm{d}\pi(x,y). \]

As a consequence,

\[ \sup_{f\in\mathcal{F}}\int f\mathrm{d}(\mu P-\nu P) \leq \int\Bigl(\sup_{f\in\mathcal{F}}\int f\mathrm{d}(P(x,\cdot)-P(y,\cdot))\Bigr)\mathrm{d}\pi(x,y). \]

This gives the desired inequality.
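As a sanity check of this inequality (a sketch, not from the post: finite state space, total variation playing the role of \( {\mathrm{div}_{\mathcal{F}}} \) with \( {\mathcal{F}=\{f:\left\Vert f\right\Vert_\infty\leq1/2\}} \), kernels encoded as row-stochastic matrices, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    """Total variation distance, i.e. div_F for F = {f : ||f||_inf <= 1/2}."""
    return 0.5 * np.abs(p - q).sum()

n = 4
P = rng.random((n, n))              # a random Markov kernel: rows are P(x, .)
P /= P.sum(axis=1, keepdims=True)

mu = rng.random(n); mu /= mu.sum()  # two initial distributions
nu = rng.random(n); nu /= nu.sum()

lhs = tv(mu @ P, nu @ P)            # on a finite space, mu P is the row vector mu @ P
mid = sum(mu[x] * nu[y] * tv(P[x], P[y])          # integral against mu otimes nu
          for x in range(n) for y in range(n))
sup = max(tv(P[x], P[y]) for x in range(n) for y in range(n))

print(lhs, mid, sup)
assert lhs <= mid + 1e-12 and mid <= sup + 1e-12
```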

Infimum divergence. For a given map \( {c:E\times E\rightarrow[0,+\infty]} \) that we call a cost, we define, for all \( {\mu} \) and \( {\nu} \) in \( {\mathcal{P}(E)} \),

\[ \mathrm{div}_c(\mu,\nu)=\inf_\pi\int c(x,y)\mathrm{d}\pi(x,y)\in[0,+\infty] \]

where the infimum runs over all couplings of \( {\mu} \) and \( {\nu} \). This is also known as the transportation or coupling distance, even if it is not necessarily a distance. We give nice examples later on.
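On a finite state space, \( {\mathrm{div}_c} \) is a finite-dimensional linear program over couplings. A minimal sketch (SciPy assumed; the helper `div_c` and the example data are ours, not from the post):

```python
import numpy as np
from scipy.optimize import linprog

def div_c(mu, nu, c):
    """Infimum divergence (optimal transport cost) between discrete probability
    vectors mu (length n) and nu (length m) for a cost matrix c, computed as a
    linear program over couplings pi."""
    n, m = len(mu), len(nu)
    # Equality constraints: row sums of pi equal mu, column sums equal nu
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0     # sum_j pi[i, j] = mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0              # sum_i pi[i, j] = nu[j]
    b_eq = np.concatenate([mu, nu])
    res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Example: with the discrete cost 1_{x != y}, the value is the total
# variation distance 0.5 * sum |mu - nu|
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.6, 0.1, 0.3])
c = 1.0 - np.eye(3)
print(div_c(mu, nu, c), 0.5 * np.abs(mu - nu).sum())   # both ~0.4
```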

Inequality. For every Markov kernel \( {P:E\rightarrow\mathcal{P}(E)} \) and all \( {\mu} \) and \( {\nu} \) in \( {\mathcal{P}(E)} \),

\[ \mathrm{div}_c(\mu P,\nu P) \leq\inf_\pi\int \mathrm{div}_c(P(x,\cdot),P(y,\cdot))\mathrm{d}\pi(x,y), \]

where the infimum runs over all couplings of \( {\mu} \) and \( {\nu} \). Taking \( {\pi=\mu\otimes\nu} \) gives

\[ \mathrm{div}_c(\mu P,\nu P) \leq\int \mathrm{div}_c(P(x,\cdot),P(y,\cdot))\mathrm{d}\mu(x)\mathrm{d}\nu(y), \]

and in particular

\[ \mathrm{div}_c(\mu P,\nu P) \leq\sup_{x,y} \mathrm{div}_c(P(x,\cdot),P(y,\cdot)). \]

A proof. Let \( {\pi_{x,y}} \) be a coupling of \( {P(x,\cdot)} \) and \( {P(y,\cdot)} \). Then \( {\int\pi_{x,y}(\cdot,\cdot)\mathrm{d}\mu(x)\mathrm{d}\nu(y)} \) is a coupling of \( {\mu P} \) and \( {\nu P} \). Indeed, for instance for the first marginal, we have

\[ \begin{array}{rcl} \int_{y'}\int_{x,y}\pi_{x,y}(\cdot,\mathrm{d}y')\mathrm{d}\mu(x)\mathrm{d}\nu(y) &=&\int_{x,y}\int_{y'}\pi_{x,y}(\cdot,\mathrm{d}y')\mathrm{d}\mu(x)\mathrm{d}\nu(y)\\ &=&\int_{x,y}P(x,\cdot)\mathrm{d}\mu(x)\mathrm{d}\nu(y)\\ &=&\mu P. \end{array} \]

Now, for all \( {\varepsilon>0} \) there exists a coupling \( {\pi_{x,y}} \) of \( {P(x,\cdot)} \) and \( {P(y,\cdot)} \) such that

\[ \begin{array}{rcl} \int_{x',y'} c(x',y')\mathrm{d}\pi_{x,y}(x',y')-\varepsilon &\leq&\inf_{\pi}\int c(x',y')\mathrm{d}\pi(x',y')\\ &=&\mathrm{div}_c(P(x,\cdot),P(y,\cdot)), \end{array} \]

and thus

\[ \mathrm{div}_c(\mu P,\nu P) -\varepsilon \leq\int \mathrm{div}_c(P(x,\cdot),P(y,\cdot))\mathrm{d}\mu(x)\mathrm{d}\nu(y). \]

This gives the desired inequality for \( {\pi=\mu\otimes\nu} \); replacing \( {\mu\otimes\nu} \) by an arbitrary coupling \( {\pi} \) of \( {\mu} \) and \( {\nu} \) in the argument gives the version with the infimum.

Playing with Markov kernels. Let us consider the identity Markov kernel defined by

\[ P(x,\cdot)=\delta_x\quad\mbox{for all}\quad x\in E. \]

Then \( {\mu P=\mu} \) for all \( {\mu\in\mathcal{P}(E)} \), hence the name. Next, since \( {\mathrm{div}_c(\delta_x,\delta_y)=c(x,y)} \), the inequality above for the infimum divergence gives in this case the tautological bound \( {\mathrm{div}_c(\mu,\nu)\leq\mathrm{div}_c(\mu,\nu)} \). In contrast, the inequality for the supremum divergence gives

\[ \mathrm{div}_{\mathcal{F}}(\mu,\nu) \leq \inf_\pi\int c(x,y)\mathrm{d}\pi(x,y) =\mathrm{div}_c(\mu,\nu) \]

where the infimum runs over all couplings of \( {\mu} \) and \( {\nu} \) and where the cost is

\[ c(x,y) =\mathrm{div}_{\mathcal{F}}(\delta_x,\delta_y) =\sup_{f\in\mathcal{F}}(f(x)-f(y)). \]

Kantorovich-Rubinstein duality. When the cost \( {(x,y)\mapsto c(x,y)} \) is a distance making \( {E} \) a metric space, this duality theorem states that

\[ \mathrm{div}_c =\mathrm{div}_{\mathcal{F}} \]

where \( {\mathcal{F}} \) is the class of functions \( {f:E\rightarrow\mathbb{R}} \) such that

\[ \left\Vert f\right\Vert_{\mathrm{Lip}} =\sup_{x\neq y}\frac{|f(x)-f(y)|}{c(x,y)} \leq1. \]

In the case of the discrete distance \( {c(x,y)=\mathbf{1}_{x\neq y}} \), this identity becomes

\[ \inf_{\substack{(X,Y)\\X\sim\mu\\Y\sim\nu}}\mathbb{P}(X\neq Y) =\sup_{\substack{f:E\rightarrow\mathbb{R}\\\left\Vert f\right\Vert_\infty\leq 1/2}}\int f\mathrm{d}(\mu-\nu) \]

and this matches the total variation distance

\[ \left\Vert \mu-\nu\right\Vert_{\mathrm{TV}} =\sup_{B\subset E}|\mu(B)-\nu(B)| \]

(here \( {\geq} \) is immediate, while \( {\leq} \) requires approximation or structure on \( {E} \)).
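On a finite state space this identity can be checked directly; here is a small sketch (NumPy assumed) exhibiting the maximal coupling on one side and an extremal \( {f} \) on the other:

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.6, 0.1, 0.3])

tv = 0.5 * np.abs(mu - nu).sum()

# Coupling side: the maximal coupling keeps X = Y with probability
# sum_i min(mu_i, nu_i), so the infimum of P(X != Y) is 1 minus that sum.
coupling_side = 1.0 - np.minimum(mu, nu).sum()

# Dual side: f = (1/2) * sign(mu - nu) saturates the supremum over ||f||_inf <= 1/2.
f = 0.5 * np.sign(mu - nu)
dual_side = np.dot(f, mu - nu)

print(tv, coupling_side, dual_side)   # all equal: 0.4
```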

Bounded-Lipschitz or Fortet-Mourier distance. Still when \( {E} \) is a metric space, this distance corresponds to \( {\mathrm{div}_{\mathcal{F}}} \) when \( {\mathcal{F}} \) is the class of \( {f:E\rightarrow\mathbb{R}} \) such that

\[ \left\Vert f\right\Vert_{\mathrm{Lip}}\leq1\quad\mbox{(implies continuity)}\quad \mbox{and}\quad\left\Vert f\right\Vert_\infty\leq1. \]

(Monge-Kantorovich-)Wasserstein distances. When \( {E} \) is a metric space equipped with a distance \( {d} \), and when \( {p\in[1,\infty)} \), the \( {W_p} \) distance is defined by

\[ W_p(\mu,\nu)=\mathrm{div}_c(\mu,\nu)^{1/p} \quad\mbox{with}\quad c(x,y)=d(x,y)^p. \]

It is finite when \( {\mu} \) and \( {\nu} \) have finite \( {p} \)-th order moment in the sense that for some (and thus any) \( {x\in E} \) we have \( {\int d(x,y)^p\mathrm{d}\mu(y)<\infty} \) and \( {\int d(x,y)^p\mathrm{d}\nu(y)<\infty} \). On this subset of \( {\mathcal{P}(E)} \), \( {W_p} \) turns out indeed to be a true distance.

In the case \( {p=1} \), the Kantorovich-Rubinstein duality can be used for \( {W_1=\mathrm{div}_c} \) with \( {c(x,y)=d(x,y)} \) since it is a distance on \( {E} \), giving \( {W_1=\mathrm{div}_{\mathcal{F}}} \) where \( {\mathcal{F}} \) is the class of bounded (this condition can be relaxed) and Lipschitz functions \( {f:E\rightarrow\mathbb{R}} \) with \( {\left\Vert f\right\Vert_{\mathrm{Lip}}\leq1} \).
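A small numerical check on the real line (SciPy assumed; `scipy.stats.wasserstein_distance` computes the one-dimensional \( {W_1} \)): by this duality, any \( {1} \)-Lipschitz test function gives a lower bound on \( {W_1} \).

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two discrete probability measures on the real line
x = np.array([0.0, 1.0, 3.0]); mu = np.array([0.2, 0.5, 0.3])
y = np.array([0.5, 2.0, 3.0]); nu = np.array([0.6, 0.1, 0.3])

# W_1 as an optimal transport cost (SciPy solves the one-dimensional case)
w1 = wasserstein_distance(x, y, u_weights=mu, v_weights=nu)

# Kantorovich-Rubinstein lower bound: f(t) = min(t, 1) is 1-Lipschitz,
# hence int f d(mu - nu) <= W_1
f = lambda t: np.minimum(t, 1.0)
lower = np.dot(f(x), mu) - np.dot(f(y), nu)

print(w1, lower)           # ~0.4 and 0.1
assert lower <= w1 + 1e-12
```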

When \( {p\neq 1} \), the cost is no longer a distance, but we still have the variational formula

\[ W_p(\mu,\nu)=\sup\left(\int f\mathrm{d}\mu-\int g\mathrm{d}\nu\right)^{1/p} \]

where the supremum runs over all bounded and Lipschitz \( {f,g:E\rightarrow\mathbb{R}} \) such that \( {f(x)-g(y)\leq d(x,y)^p} \) for all \( {x,y\in E} \). In other words

\[ W_p(\mu,\nu)=\sup\left(\int Q(f)\mathrm{d}\mu-\int f\mathrm{d}\nu\right)^{1/p} \]

where the supremum runs over bounded Lipschitz \( {f:E\rightarrow\mathbb{R}} \) and where \( {Q(f)} \) is the infimum convolution of \( {f} \) with the cost \( {d(\cdot,\cdot)^p} \), defined by

\[ Q(f)(x)=\inf_{y\in E}\Bigl(f(y)+d(x,y)^p\Bigr). \]
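As an illustration (a sketch under assumptions not in the post: real line, empirical measures of equal size, \( {p=2} \), NumPy assumed; it uses the fact that the monotone coupling is optimal in dimension one), the bound \( {\int Q(f)\mathrm{d}\mu-\int f\mathrm{d}\nu\leq W_p(\mu,\nu)^p} \) can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 2

# Empirical measures mu and nu on the real line, same sample size
x = np.sort(rng.normal(0.0, 1.0, 200))
y = np.sort(rng.normal(1.0, 2.0, 200))

# On R the monotone (quantile) coupling is optimal for the cost d^p,
# so W_p^p is the average of |x_(i) - y_(i)|^p over matched order statistics.
wpp = np.mean(np.abs(x - y) ** p)

# A bounded 1-Lipschitz test function and its infimum convolution Q(f);
# restricting the infimum to the support of nu keeps the bound valid,
# since any coupling sends the mass of nu only to these points.
f = np.tanh
Qf = lambda s: np.min(f(y) + np.abs(s - y) ** p)

lhs = np.mean([Qf(s) for s in x]) - np.mean(f(y))
print(lhs, wpp)            # lhs <= wpp = W_p(mu, nu)^p
assert lhs <= wpp + 1e-12
```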

Note that \( {W_p} \) defines the same topology as the Zolotarev distance \( {\mathrm{div}_{\mathcal{F}}^{1/p}} \) where \( {\mathcal{F}} \) is the class of functions with growth at most like \( {d^p(x,\cdot)} \) for some arbitrary \( {x} \). They coincide when \( {p=1} \) and differ metrically when \( {p\neq1} \).

Trend to equilibrium. In the study of the trend to equilibrium / long-time behavior of the Markov process \( {X} \), we typically have \( {\lim_{t\rightarrow\infty}\varphi_t(x,y)=0} \) for all \( {x,y} \). Also, if \( {\nu} \) is invariant, meaning that \( {\nu P_t=\nu} \) for all \( {t} \), then

\[ \mathrm{div}(\mu P_t,\nu)\leq\int\varphi_t(x,y)\mathrm{d}\mu(x)\mathrm{d}\nu(y) \underset{t\rightarrow\infty}{\longrightarrow}0 \]

provided that \( {\sup_t\varphi_t} \) is \( {\mu\otimes\nu} \)-integrable (dominated convergence).
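A toy illustration of this mechanism (a sketch, not from the post: finite state space, \( {\mathrm{div}} \) the total variation distance, and \( {\varphi_t(x,y)=\beta^t\mathbf{1}_{x\neq y}} \) where \( {\beta=\sup_{x,y}\left\Vert P(x,\cdot)-P(y,\cdot)\right\Vert_{\mathrm{TV}}} \) is the Dobrushin coefficient of the one-step kernel; NumPy assumed):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between probability vectors p and q."""
    return 0.5 * np.abs(p - q).sum()

# An ergodic kernel on {0, 1, 2} (rows are P(x, .)) and its invariant law nu
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
w, v = np.linalg.eig(P.T)
nu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
nu /= nu.sum()

# Dobrushin coefficient beta = sup_{x,y} TV(P(x,.), P(y,.)), here < 1
beta = max(tv(P[x], P[y]) for x in range(3) for y in range(3))

mu = np.array([1.0, 0.0, 0.0])           # start from a Dirac mass at 0
for t in range(1, 6):
    mu = mu @ P                           # mu P_t on a finite space
    print(t, tv(mu, nu), beta ** t)       # div(mu P_t, nu) <= phi_t -> 0
    assert tv(mu, nu) <= beta ** t + 1e-12
```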

