
Let (Xt)t≥0 be a Markov process on a state space E. Let us define
Pt(x,⋅)=Law(Xt∣X0=x),x∈E,t≥0.
It follows that if X0∼μ then Xt∼μPt where
μPt=∫Pt(x,⋅)dμ(x).
In this post a divergence between probability measures μ and ν on E is a quantitative way to measure the difference between μ and ν. A divergence can be a distance such as the total variation or the Wasserstein distance. We study nice examples later on.
Suppose that for some divergence between probability measures and for some quantity φt(x,y) which depends on x,y,t we have, typically by using a coupling,
div(Pt(x,⋅),Pt(y,⋅))≤φt(x,y).
In this post, we explain how to deduce that for all probability measures μ,ν,
div(μPt,νPt)≤∫φt(x,y)dμ(x)dν(y).
The initial inequality corresponds to taking μ=δx and ν=δy. All this is about notions of couplings, divergences, and functional inequalities.
Coupling. Let P(E) be the set of probability measures on E. If μ and ν are in P(E), then a coupling of μ and ν is an element π of P(E×E) with marginal distributions μ and ν. The set of couplings is convex, and is not empty since it contains the product measure μ⊗ν.
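On a finite state space these definitions can be checked by hand. Here is a minimal sketch (the two measures below are arbitrary examples) that builds the product coupling μ⊗ν and verifies that its marginals are μ and ν:

```python
mu = [0.2, 0.5, 0.3]   # a probability measure on E = {0, 1, 2}
nu = [0.6, 0.1, 0.3]   # another probability measure on E

# Product coupling: pi[i][j] = mu[i] * nu[j].
pi = [[mu[i] * nu[j] for j in range(3)] for i in range(3)]

# First marginal: row sums recover mu.
first = [sum(pi[i][j] for j in range(3)) for i in range(3)]
# Second marginal: column sums recover nu.
second = [sum(pi[i][j] for i in range(3)) for j in range(3)]

assert all(abs(first[i] - mu[i]) < 1e-9 for i in range(3))
assert all(abs(second[j] - nu[j]) < 1e-9 for j in range(3))
```

Any nonnegative table with these row and column sums is a coupling; the product coupling is simply the one that makes the two coordinates independent.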
Supremum divergence. Let F be a class of bounded functions E→[0,+∞). For all μ,ν∈P(E), we define the quantity
divF(μ,ν)=supf∈F∫fd(μ−ν)∈(−∞,+∞].
This is not necessarily a distance. We give nice examples later.
Inequality. Let P:E→P(E) be a Markov kernel. Recall that for all μ∈P(E), μP∈P(E) is defined by μP=∫P(x,⋅)dμ(x). Then, for all μ and ν in P(E),
divF(μP,νP)≤infπ∫divF(P(x,⋅),P(y,⋅))dπ(x,y)
where the infimum runs over all couplings of μ and ν. Taking π=μ⊗ν gives
divF(μP,νP)≤∫divF(P(x,⋅),P(y,⋅))dμ(x)dν(y),
and in particular
divF(μP,νP)≤supx,ydivF(P(x,⋅),P(y,⋅)).
A proof. The idea is to introduce a coupling and then to proceed by conditioning or disintegration. Namely, if π is a coupling of μ and ν, for instance μ⊗ν, then
∫fd(μP−νP)=∫(∫fd(P(x,⋅)−P(y,⋅)))dπ(x,y).
As a consequence,
supf∈F∫fd(μP−νP)≤∫(supf∈F∫fd(P(x,⋅)−P(y,⋅)))dπ(x,y).
This gives the desired inequality.
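These inequalities can be checked numerically on a finite state space. A minimal sketch, with E={0,1}, an arbitrary example kernel, and a small hand-picked class F of bounded nonnegative functions:

```python
P = [[0.9, 0.1], [0.2, 0.8]]              # Markov kernel: row x is P(x, .)
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # a small class of functions E -> [0, 1]

def div_F(a, b):
    """sup over f in F of the integral of f d(a - b)."""
    return max(sum(f[i] * (a[i] - b[i]) for i in range(2)) for f in F)

def push(mu, P):
    """mu P = integral of P(x, .) dmu(x)."""
    return [sum(mu[x] * P[x][j] for x in range(2)) for j in range(2)]

mu, nu = [0.7, 0.3], [0.1, 0.9]
lhs = div_F(push(mu, P), push(nu, P))
# Upper bound via the product coupling, then the crude sup bound.
integral = sum(mu[x] * nu[y] * div_F(P[x], P[y]) for x in range(2) for y in range(2))
sup = max(div_F(P[x], P[y]) for x in range(2) for y in range(2))
assert lhs <= integral + 1e-9 <= sup + 1e-9
```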
Infimum divergence. For a given map c:E×E→[0,+∞] that we call a cost, we define, for all μ and ν in P(E),
divc(μ,ν)=infπ∫c(x,y)dπ(x,y)∈[0,+∞]
where the infimum runs over all couplings of μ and ν. This is also known as the transportation or coupling distance, even if it is not necessarily a distance. We give nice examples later on.
Inequality. For every Markov kernel P:E→P(E) and all μ and ν in P(E),
divc(μP,νP)≤infπ∫divc(P(x,⋅),P(y,⋅))dπ(x,y),
where the infimum runs over all couplings of μ and ν. Taking π=μ⊗ν gives
divc(μP,νP)≤∫divc(P(x,⋅),P(y,⋅))dμ(x)dν(y),
and in particular
divc(μP,νP)≤supx,ydivc(P(x,⋅),P(y,⋅)).
A proof. Let πx,y be a coupling of P(x,⋅) and P(y,⋅). Then ∫πx,y(⋅,⋅)dμ(x)dν(y) is a coupling of μP and νP. Indeed, for instance for the first marginal, we have
∫y′∫x,yπx,y(⋅,dy′)dμ(x)dν(y)=∫x,y∫y′πx,y(⋅,dy′)dμ(x)dν(y)=∫x,yP(x,⋅)dμ(x)dν(y)=μP.
Now, for all ε>0 there exists a coupling πx,y of P(x,⋅) and P(y,⋅) such that
∫x′,y′c(x′,y′)dπx,y(x′,y′)−ε≤infπ∫c(x′,y′)dπ(x′,y′)=divc(P(x,⋅),P(y,⋅)),
and thus
divc(μP,νP)−ε≤∫divc(P(x,⋅),P(y,⋅))dμ(x)dν(y).
Since ε>0 is arbitrary, this gives the desired inequality.
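Here is a numerical sanity check of the last inequality for the discrete cost c(x,y)=1x≠y, using the standard closed form divc(a,b)=1−∑i min(ai,bi) on a finite space. The kernel is an arbitrary example; the initial measures are Diracs, so this is exactly the "initial inequality" μ=δx, ν=δy, and the bound happens to be tight here:

```python
def div_c(a, b):
    """Optimal coupling cost for c(x,y) = 1_{x != y}: 1 - sum of pointwise minima."""
    return 1.0 - sum(min(ai, bi) for ai, bi in zip(a, b))

def push(mu, P):
    n = len(P)
    return [sum(mu[x] * P[x][j] for x in range(n)) for j in range(n)]

P = [[0.5, 0.5], [0.1, 0.9]]   # arbitrary example kernel
mu, nu = [1.0, 0.0], [0.0, 1.0]  # Dirac masses at 0 and 1

lhs = div_c(push(mu, P), push(nu, P))
sup = max(div_c(P[x], P[y]) for x in range(2) for y in range(2))
assert lhs <= sup + 1e-9       # the contraction inequality, with equality here
```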
Playing with Markov kernels. Let us consider the identity Markov kernel defined by
P(x,⋅)=δx for all x∈E.
Then μP=μ for all μ∈P(E), hence the name. Next, since divc(δx,δy)=c(x,y), the inequality above for the infimum divergence gives in this case the tautology divc(μ,ν)=divc(μ,ν). In contrast, the inequality for the supremum divergence gives
divF(μ,ν)≤infπ∫c(x,y)dπ(x,y)=divc(μ,ν)
where the infimum runs over all couplings of μ and ν and where the cost is
c(x,y)=divF(δx,δy)=supf∈F(f(x)−f(y)).
Kantorovich-Rubinstein duality. When the cost (x,y)↦c(x,y) is a distance making E a metric space, this duality theorem states that
divc=divF
where F is the class of functions f:E→R such that
‖f‖Lip=supx≠y|f(x)−f(y)|/c(x,y)≤1.
In the case of the discrete distance c(x,y)=1x≠y, this identity becomes
inf{P(X≠Y):X∼μ, Y∼ν}=sup{∫fd(μ−ν):f:E→R, ‖f‖∞≤1/2}
and this matches the total variation distance
‖μ−ν‖TV=supB⊂E|μ(B)−ν(B)|
(the direction ≥ is immediate, while ≤ requires an approximation argument or some structure on E).
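For finitely many atoms, the optimal coupling probability, the total variation distance, and the dual supremum can all be computed in closed form; a minimal sketch with arbitrary example measures:

```python
# Optimal coupling for the cost 1_{x != y}: put mass min(mu_i, nu_i) on the
# diagonal, so inf P(X != Y) = 1 - sum_i min(mu_i, nu_i) = ||mu - nu||_TV.
mu = [0.2, 0.5, 0.3]
nu = [0.6, 0.1, 0.3]

coupling_side = 1.0 - sum(min(m, n) for m, n in zip(mu, nu))
tv = 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))
# Dual witness: f = +1/2 where mu > nu and -1/2 elsewhere attains the sup.
f = [0.5 if m > n else -0.5 for m, n in zip(mu, nu)]
dual_side = sum(fi * (m - n) for fi, m, n in zip(f, mu, nu))

assert abs(coupling_side - tv) < 1e-9
assert abs(dual_side - tv) < 1e-9
```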
Bounded-Lipschitz or Fortet-Mourier distance. Still when E is a metric space, it corresponds to divF when F is the class of f:E→R such that
‖f‖Lip≤1 (which implies continuity) and ‖f‖∞≤1.
(Monge-Kantorovich-)Wasserstein distances. When E is a metric space equipped with a distance d, and when p∈[1,∞), the Wp distance is defined by
Wp(μ,ν)=divc(μ,ν)1/pwithc(x,y)=d(x,y)p.
It is finite when μ and ν have a finite p-th order moment, in the sense that for some (and thus any) x∈E we have ∫d(x,y)pdμ(y)<∞ and ∫d(x,y)pdν(y)<∞. On this subset of P(E), Wp turns out to be a true distance.
In the case p=1, the Kantorovich-Rubinstein duality can be used for W1=divc with c(x,y)=d(x,y) since it is a distance on E, giving W1=divF where F is the class of bounded (this condition can be relaxed) and Lipschitz functions f:E→R with ‖f‖Lip≤1.
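On the real line with d(x,y)=|x−y|, W1 has the well-known closed form W1(μ,ν)=∫|Fμ(t)−Fν(t)|dt in terms of cumulative distribution functions. A minimal sketch for finitely supported measures (example atoms arbitrary), together with a 1-Lipschitz witness for the Kantorovich-Rubinstein lower bound:

```python
def w1(mu, nu):
    """W1 between finitely supported measures on R, via the CDF formula.
    mu, nu: dicts mapping atom -> mass (masses sum to 1)."""
    pts = sorted(set(mu) | set(nu))
    total, F_mu, F_nu = 0.0, 0.0, 0.0
    for a, b in zip(pts, pts[1:]):       # integrate |F_mu - F_nu| on [a, b)
        F_mu += mu.get(a, 0.0)
        F_nu += nu.get(a, 0.0)
        total += abs(F_mu - F_nu) * (b - a)
    return total

mu = {0.0: 0.5, 1.0: 0.5}
nu = {2.0: 0.5, 3.0: 0.5}
assert abs(w1(mu, nu) - 2.0) < 1e-9
# Any 1-Lipschitz f gives a lower bound; f(x) = -x attains it here.
dual = sum(-x * m for x, m in mu.items()) - sum(-x * m for x, m in nu.items())
assert abs(dual - 2.0) < 1e-9
```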
When p≠1, the cost is no longer a distance, but we still have the variational formula
Wp(μ,ν)=sup(∫fdμ−∫gdν)1/p
where the supremum runs over all bounded and Lipschitz f,g:E→R such that f(x)−g(y)≤d(x,y)p. In other words
Wp(μ,ν)=sup(∫Q(f)dμ−∫fdν)1/p
where the supremum runs over bounded Lipschitz f:E→R and where Q(f) is the infimum convolution of f with dp defined by
Q(f)(x)=infy∈E(f(y)+d(x,y)p).
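The infimum convolution is straightforward to compute on a finite set. A minimal sketch (the set E, the function f, and the choice p=2 are arbitrary examples) that also checks the two defining properties Q(f)≤f and Q(f)(x)−f(y)≤d(x,y)p:

```python
# Inf-convolution Q(f)(x) = min over y of f(y) + d(x,y)^p on a finite subset of R.
p = 2
E = [0.0, 1.0, 2.0, 3.0]
f = {0.0: 1.0, 1.0: 0.0, 2.0: 2.0, 3.0: 5.0}   # an arbitrary example function

def Q(f, x, p):
    return min(f[y] + abs(x - y) ** p for y in f)

Qf = {x: Q(f, x, p) for x in E}
# Q(f) <= f (take y = x), and Q(f)(x) - f(y) <= d(x,y)^p for all x, y.
assert all(Qf[x] <= f[x] + 1e-9 for x in E)
assert all(Qf[x] - f[y] <= abs(x - y) ** p + 1e-9 for x in E for y in f)
```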
Note that Wp defines the same topology as the Zolotarev distance divF1/p where F is the class of functions with growth at most like d(x,⋅)p for some arbitrary x. They coincide when p=1 and differ metrically when p≠1.
Trend to equilibrium. In the study of the trend to equilibrium / long-time behavior of the Markov process X, we typically have limt→∞φt(x,y)=0 for all x,y. Also, if ν is invariant, meaning that νPt=ν for all t, then
div(μPt,ν)≤∫φt(x,y)dμ(x)dν(y)⟶t→∞0
provided that suptφt is μ⊗ν integrable (dominated convergence).
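A minimal discrete-time illustration with a two-state chain (an arbitrary example kernel, whose invariant measure works out to ν=(2/3,1/3)): the total variation distance between μPn and ν decreases monotonically and vanishes in the limit.

```python
P = [[0.9, 0.1], [0.2, 0.8]]     # example kernel; invariant measure nu = (2/3, 1/3)
nu = [2.0 / 3.0, 1.0 / 3.0]      # check: nu P = nu
mu = [1.0, 0.0]                  # start from a Dirac mass at 0

def push(mu, P):
    return [sum(mu[x] * P[x][j] for x in range(2)) for j in range(2)]

def tv(a, b):
    return 0.5 * sum(abs(ai - bi) for ai, bi in zip(a, b))

assert all(abs(push(nu, P)[j] - nu[j]) < 1e-9 for j in range(2))  # invariance

gaps = []
for n in range(50):
    gaps.append(tv(mu, nu))
    mu = push(mu, P)

assert all(g2 <= g1 + 1e-9 for g1, g2 in zip(gaps, gaps[1:]))  # monotone decay
assert tv(mu, nu) < 1e-7                                       # essentially at equilibrium
```

For this kernel the gap contracts by the factor 0.7 at each step, which is the discrete-time analogue of a bound φt(x,y) tending to 0.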
Further reading.
- Rachev - Probability metrics and the stability of stochastic models. (1991)
- Rachev and Rüschendorf - Mass transportation problems I and II. (1998)
- Villani - Topics in optimal transportation. (2003)
- Villani - Optimal transport. Old and new. (2009)
- Santambrogio - Optimal transport for applied mathematicians. Calculus of variations, PDEs, and modeling. (2015)
- Gozlan and Léonard - Transport inequalities. A survey. (2010)
- Lindvall - Lectures on the coupling method. Revised and corrected edition. (2002)