
Coupling, divergences, and Markov kernels

Coupling (British TV series)

Let $(X_t)_{t\geq0}$ be a Markov process on a state space $E$. Let us define

$$P_t(x,\cdot)=\mathrm{Law}(X_t\mid X_0=x),\quad x\in E,\ t\geq0.$$

It follows that if $X_0\sim\mu$ then $X_t\sim\mu P_t$ where

$$\mu P_t=\int P_t(x,\cdot)\,\mathrm{d}\mu(x).$$
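For intuition, here is a minimal numerical sketch (not from the original post), assuming a finite state space and a single discrete-time step: the kernel is then a row-stochastic matrix, $P(x,\cdot)$ is its row $x$, and $\mu\mapsto\mu P$ is a vector-matrix product.

```python
# Minimal sketch on a finite state space E = {0, 1, 2}, for one
# discrete-time step of a chain: P[x, y] = P(x, {y}).
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.5, 0.5]])

mu = np.array([1.0, 0.0, 0.0])  # mu = delta_0
print(mu @ P)                   # law of X_1 given X_0 = 0, i.e. P(0, .)
```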

In this post a divergence between probability measures $\mu$ and $\nu$ on $E$ is a quantitative way to measure the difference between $\mu$ and $\nu$. A divergence can be a distance such as the total variation or the Wasserstein distance. We study nice examples later on.

Suppose that for some divergence between probability measures and for some quantity $\varphi_t(x,y)$ which depends on $x,y,t$ we have, typically by using a coupling,

$$\mathrm{div}(P_t(x,\cdot),P_t(y,\cdot))\leq\varphi_t(x,y).$$

In this post, we explain how to deduce that for all probability measures $\mu,\nu$,

$$\mathrm{div}(\mu P_t,\nu P_t)\leq\iint\varphi_t(x,y)\,\mathrm{d}\mu(x)\mathrm{d}\nu(y).$$

The initial inequality corresponds to taking $\mu=\delta_x$ and $\nu=\delta_y$. All this is about notions of couplings, divergences, and functional inequalities.

Coupling. Let $\mathcal{P}(E)$ be the set of probability measures on $E$. If $\mu$ and $\nu$ are in $\mathcal{P}(E)$, then a coupling of $\mu$ and $\nu$ is an element $\pi$ of $\mathcal{P}(E\times E)$ with marginal distributions $\mu$ and $\nu$. The set of couplings is convex, and is not empty since it contains the product measure $\mu\otimes\nu$.
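On a finite space, a coupling is just a nonnegative matrix with prescribed row and column sums; here is a quick sketch (with made-up numbers) of the product coupling $\mu\otimes\nu$.

```python
# The product coupling pi = mu (x) nu as a matrix pi[i, j] = mu[i] * nu[j];
# row sums recover mu, column sums recover nu.
import numpy as np

mu = np.array([0.2, 0.8])
nu = np.array([0.5, 0.3, 0.2])
pi = np.outer(mu, nu)
print(pi.sum(axis=1))  # first marginal: mu
print(pi.sum(axis=0))  # second marginal: nu
```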

Supremum divergence. Let $\mathcal{F}$ be a class of bounded functions $E\to[0,+\infty)$. For all $\mu,\nu\in\mathcal{P}(E)$, we define the quantity

$$\mathrm{div}_{\mathcal{F}}(\mu,\nu)=\sup_{f\in\mathcal{F}}\int f\,\mathrm{d}(\mu-\nu)\in(-\infty,+\infty].$$

This is not necessarily a distance. We give nice examples later.
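On a finite space with a finite class $\mathcal{F}$, this supremum is just a maximum of inner products; a toy sketch (with an ad hoc class of indicator functions, chosen only for illustration):

```python
# div_F on a 3-point space: each f in F is a vector of values in [0, 1],
# and int f d(mu - nu) is the inner product f . (mu - nu).
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
F = [np.array(f) for f in ([1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1])]
print(max(f @ (mu - nu) for f in F))  # div_F(mu, nu) = 0.3 here
```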

Inequality. Let $P:E\to\mathcal{P}(E)$ be a Markov kernel. Recall that for all $\mu\in\mathcal{P}(E)$, $\mu P\in\mathcal{P}(E)$ is defined by $\mu P=\int P(x,\cdot)\,\mathrm{d}\mu(x)$. Then, for all $\mu$ and $\nu$ in $\mathcal{P}(E)$,

$$\mathrm{div}_{\mathcal{F}}(\mu P,\nu P)\leq\inf_{\pi}\iint\mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot))\,\mathrm{d}\pi(x,y)$$

where the infimum runs over all couplings of $\mu$ and $\nu$. Taking $\pi=\mu\otimes\nu$ gives

$$\mathrm{div}_{\mathcal{F}}(\mu P,\nu P)\leq\iint\mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot))\,\mathrm{d}\mu(x)\mathrm{d}\nu(y),$$

and in particular

$$\mathrm{div}_{\mathcal{F}}(\mu P,\nu P)\leq\sup_{x,y}\mathrm{div}_{\mathcal{F}}(P(x,\cdot),P(y,\cdot)).$$
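As a sanity check (a sketch, not from the post), one can verify the last bound numerically when $\mathrm{div}_{\mathcal{F}}$ is the total variation distance on a finite space, for which $\mathrm{TV}(a,b)=\frac12\|a-b\|_1$:

```python
# Check: TV(mu P, nu P) <= max_{x,y} TV(P(x,.), P(y,.)) on a 3-point space.
import numpy as np

def tv(a, b):
    return 0.5 * np.abs(a - b).sum()

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.5, 0.5]])
mu = np.array([1.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 1.0])
lhs = tv(mu @ P, nu @ P)
rhs = max(tv(P[x], P[y]) for x in range(3) for y in range(3))
print(lhs, rhs, lhs <= rhs)  # 0.5 0.5 True
```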

A proof. The idea is to introduce a coupling and then to proceed by conditioning or disintegration. Namely, if $\pi$ is a coupling of $\mu$ and $\nu$, for instance $\mu\otimes\nu$, then

$$\int f\,\mathrm{d}(\mu P-\nu P)=\iint\Bigl(\int f\,\mathrm{d}(P(x,\cdot)-P(y,\cdot))\Bigr)\mathrm{d}\pi(x,y).$$

As a consequence,

$$\sup_{f\in\mathcal{F}}\int f\,\mathrm{d}(\mu P-\nu P)\leq\iint\Bigl(\sup_{f\in\mathcal{F}}\int f\,\mathrm{d}(P(x,\cdot)-P(y,\cdot))\Bigr)\mathrm{d}\pi(x,y).$$

This gives the desired inequality.

Infimum divergence. For a given map $c:E\times E\to[0,+\infty]$ that we call a cost, we define, for all $\mu$ and $\nu$ in $\mathcal{P}(E)$,

$$\mathrm{div}_c(\mu,\nu)=\inf_{\pi}\iint c(x,y)\,\mathrm{d}\pi(x,y)\in[0,+\infty]$$

where the infimum runs over all couplings of $\mu$ and $\nu$. This is also known as the transportation or coupling distance, even if it is not necessarily a distance. We give nice examples later on.
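On a finite space, $\mathrm{div}_c$ is a finite-dimensional linear program over couplings; here is a sketch using scipy, with the discrete-distance cost chosen so that the optimal value is the total variation distance:

```python
# div_c(mu, nu) = min <c, pi> over couplings pi, solved with linprog.
import numpy as np
from scipy.optimize import linprog

mu = np.array([0.5, 0.5])
nu = np.array([0.25, 0.75])
c = np.array([[0.0, 1.0],
              [1.0, 0.0]])  # discrete distance c(x, y) = 1_{x != y}

n, m = c.shape
A_eq = np.zeros((n + m, n * m))  # pi flattened row-major: pi[i, j] -> i*m + j
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0  # row sums equal mu
for j in range(m):
    A_eq[n + j, j::m] = 1.0           # column sums equal nu
b_eq = np.concatenate([mu, nu])
res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)  # 0.25, which is TV(mu, nu) for this cost
```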

Inequality. For every Markov kernel $P:E\to\mathcal{P}(E)$ and all $\mu$ and $\nu$ in $\mathcal{P}(E)$,

$$\mathrm{div}_c(\mu P,\nu P)\leq\inf_{\pi}\iint\mathrm{div}_c(P(x,\cdot),P(y,\cdot))\,\mathrm{d}\pi(x,y),$$

where the infimum runs over all couplings of $\mu$ and $\nu$. Taking $\pi=\mu\otimes\nu$ gives

$$\mathrm{div}_c(\mu P,\nu P)\leq\iint\mathrm{div}_c(P(x,\cdot),P(y,\cdot))\,\mathrm{d}\mu(x)\mathrm{d}\nu(y),$$

and in particular

$$\mathrm{div}_c(\mu P,\nu P)\leq\sup_{x,y}\mathrm{div}_c(P(x,\cdot),P(y,\cdot)).$$
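Here is a numerical check of the product-coupling bound (a sketch with made-up numbers), again with the discrete cost $c(x,y)=\mathbf{1}_{x\neq y}$, for which $\mathrm{div}_c$ is exactly the total variation distance:

```python
# Check: TV(mu P, nu P) <= sum_{x,y} mu[x] nu[y] TV(P(x,.), P(y,.)).
import numpy as np

def tv(a, b):  # div_c for the discrete cost
    return 0.5 * np.abs(a - b).sum()

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.1, 0.9]])
mu = np.array([0.7, 0.3, 0.0])
nu = np.array([0.0, 0.3, 0.7])
lhs = tv(mu @ P, nu @ P)
rhs = sum(mu[x] * nu[y] * tv(P[x], P[y]) for x in range(3) for y in range(3))
print(lhs, rhs, lhs <= rhs)  # 0.63 0.735 True
```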

A proof. Let $\pi_{x,y}$ be a coupling of $P(x,\cdot)$ and $P(y,\cdot)$. Then $\iint\pi_{x,y}(\cdot,\cdot)\,\mathrm{d}\mu(x)\mathrm{d}\nu(y)$ is a coupling of $\mu P$ and $\nu P$. Indeed, for instance for the first marginal, we have

$$\int_{y'}\iint_{x,y}\pi_{x,y}(\cdot,\mathrm{d}y')\,\mathrm{d}\mu(x)\mathrm{d}\nu(y)=\iint_{x,y}\int_{y'}\pi_{x,y}(\cdot,\mathrm{d}y')\,\mathrm{d}\mu(x)\mathrm{d}\nu(y)=\iint_{x,y}P(x,\cdot)\,\mathrm{d}\mu(x)\mathrm{d}\nu(y)=\mu P.$$

Now, for all $\varepsilon>0$ there exists a coupling $\pi_{x,y}$ of $P(x,\cdot)$ and $P(y,\cdot)$ such that

$$\iint c(x',y')\,\mathrm{d}\pi_{x,y}(x',y')-\varepsilon\leq\inf_{\pi}\iint c(x',y')\,\mathrm{d}\pi(x',y')=\mathrm{div}_c(P(x,\cdot),P(y,\cdot)),$$

and thus

$$\mathrm{div}_c(\mu P,\nu P)-\varepsilon\leq\iint\mathrm{div}_c(P(x,\cdot),P(y,\cdot))\,\mathrm{d}\mu(x)\mathrm{d}\nu(y).$$

This gives the desired inequality.

Playing with Markov kernels. Let us consider the identity Markov kernel defined by

$$P(x,\cdot)=\delta_x\quad\text{for all }x\in E.$$

Then $\mu P=\mu$ for all $\mu\in\mathcal{P}(E)$, hence the name. Next, since $\mathrm{div}_c(\delta_x,\delta_y)=c(x,y)$, the inequality above for the infimum divergence gives in this case the tautology $\mathrm{div}_c(\mu,\nu)=\mathrm{div}_c(\mu,\nu)$. In contrast, the inequality for the supremum divergence gives

$$\mathrm{div}_{\mathcal{F}}(\mu,\nu)\leq\inf_{\pi}\iint c(x,y)\,\mathrm{d}\pi(x,y)=\mathrm{div}_c(\mu,\nu)$$

where the infimum runs over all couplings of $\mu$ and $\nu$ and where the cost is

$$c(x,y)=\mathrm{div}_{\mathcal{F}}(\delta_x,\delta_y)=\sup_{f\in\mathcal{F}}(f(x)-f(y)).$$

Kantorovich-Rubinstein duality. When the cost $(x,y)\mapsto c(x,y)$ is a distance making $E$ a metric space, this duality theorem states that

$$\mathrm{div}_c=\mathrm{div}_{\mathcal{F}}$$

where $\mathcal{F}$ is the class of functions $f:E\to\mathbb{R}$ such that

$$\|f\|_{\mathrm{Lip}}=\sup_{x\neq y}\frac{|f(x)-f(y)|}{c(x,y)}\leq1.$$

In the case of the discrete distance $c(x,y)=\mathbf{1}_{x\neq y}$, this identity becomes

$$\inf_{\substack{(X,Y)\\X\sim\mu,\,Y\sim\nu}}\mathbb{P}(X\neq Y)=\sup_{f:E\to\mathbb{R},\,\|f\|_\infty\leq1/2}\int f\,\mathrm{d}(\mu-\nu)$$

and this matches the total variation distance

$$\|\mu-\nu\|_{\mathrm{TV}}=\sup_{B\subset E}|\mu(B)-\nu(B)|$$

(the inequality $\geq$ is immediate, while $\leq$ requires approximation/structure on $E$).
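On a finite space the optimal coupling in this identity is explicit: keep the common mass $\min(\mu,\nu)$ on the diagonal. A sketch:

```python
# Maximal coupling: P(X != Y) = 1 - sum_x min(mu[x], nu[x]) = TV(mu, nu).
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
overlap = np.minimum(mu, nu)        # mass kept on the diagonal
print(1 - overlap.sum())            # P(X != Y) under the maximal coupling: 0.3
print(0.5 * np.abs(mu - nu).sum())  # TV(mu, nu): same value
```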

Bounded-Lipschitz or Fortet-Mourier distance. Still when $E$ is a metric space, it corresponds to $\mathrm{div}_{\mathcal{F}}$ when $\mathcal{F}$ is the class of $f:E\to\mathbb{R}$ such that

$$\|f\|_{\mathrm{Lip}}\leq1\ \text{(implies continuity)}\quad\text{and}\quad\|f\|_\infty\leq1.$$

(Monge-Kantorovich-)Wasserstein distances. When $E$ is a metric space equipped with a distance $d$, and when $p\in[1,\infty)$, the $W_p$ distance is defined by

$$W_p(\mu,\nu)=\mathrm{div}_c(\mu,\nu)^{1/p}\quad\text{with}\quad c(x,y)=d(x,y)^p.$$

It is finite when $\mu$ and $\nu$ have a finite $p$-th order moment, in the sense that for some (and thus any) $x\in E$ we have $\int d(x,y)^p\,\mathrm{d}\mu(y)<\infty$ and $\int d(x,y)^p\,\mathrm{d}\nu(y)<\infty$. On this subset of $\mathcal{P}(E)$, $W_p$ turns out indeed to be a true distance.

In the case $p=1$, the Kantorovich-Rubinstein duality can be used for $W_1=\mathrm{div}_c$ with $c(x,y)=d(x,y)$ since it is a distance on $E$, giving $W_1=\mathrm{div}_{\mathcal{F}}$ where $\mathcal{F}$ is the class of bounded (this condition can be relaxed) Lipschitz functions $f:E\to\mathbb{R}$ with $\|f\|_{\mathrm{Lip}}\leq1$.
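On the real line the optimal coupling for $W_1$ is the quantile coupling, and scipy implements this one-dimensional case; a quick sketch on Gaussian samples (the exact value is the size of the shift, here $2$):

```python
# Empirical W1 between N(0, 1) and N(2, 1): close to 2.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)  # sample from mu = N(0, 1)
y = rng.normal(2.0, 1.0, 10_000)  # sample from nu = N(2, 1)
print(wasserstein_distance(x, y))
```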

When $p\neq1$, the cost is no longer a distance, but we still have the variational formula

$$W_p(\mu,\nu)=\sup\Bigl(\int f\,\mathrm{d}\mu-\int g\,\mathrm{d}\nu\Bigr)^{1/p}$$

where the supremum runs over all bounded and Lipschitz $f,g:E\to\mathbb{R}$ such that $f(x)-g(y)\leq d(x,y)^p$. In other words,

$$W_p(\mu,\nu)=\sup\Bigl(\int Q(f)\,\mathrm{d}\mu-\int f\,\mathrm{d}\nu\Bigr)^{1/p}$$

where the supremum runs over bounded Lipschitz $f:E\to\mathbb{R}$ and where $Q(f)$ is the infimum convolution of $f$ with $d^p$ defined by

$$Q(f)(x)=\inf_{y\in E}\bigl(f(y)+d(x,y)^p\bigr).$$
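The infimum convolution is easy to tabulate on a grid; a sketch on $E=[-2,2]$ with $p=2$ and $f(y)=y^2$, for which $Q(f)(x)=x^2/2$ exactly (the infimum is attained at $y=x/2$):

```python
# Q(f)(x) = min_y (f(y) + |x - y|^p) computed on a grid.
import numpy as np

p = 2
xs = np.linspace(-2.0, 2.0, 401)  # grid for E = [-2, 2]
f = xs**2                         # Lipschitz on this bounded interval
Qf = np.min(f[None, :] + np.abs(xs[:, None] - xs[None, :])**p, axis=1)
print(Qf[0], xs[0]**2 / 2)        # ~2.0 vs 2.0: Q(f)(x) = x^2/2 here
```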

Note that $W_p$ defines the same topology as the Zolotarev distance $\mathrm{div}_{\mathcal{F}}^{1/p}$ where $\mathcal{F}$ is the class of functions with growth at most like $d(x,\cdot)^p$ for some arbitrary $x$. They coincide when $p=1$ and differ metrically when $p\neq1$.

Trend to equilibrium. In the study of the trend to equilibrium / long time behavior of the Markov process $X$, we typically have $\lim_{t\to\infty}\varphi_t(x,y)=0$ for all $x,y$. Also, if $\nu$ is invariant, meaning that $\nu P_t=\nu$ for all $t$, then

$$\mathrm{div}(\mu P_t,\nu)\leq\iint\varphi_t(x,y)\,\mathrm{d}\mu(x)\mathrm{d}\nu(y)\xrightarrow[t\to\infty]{}0$$

provided that $\sup_t\varphi_t$ is $\mu\otimes\nu$ integrable (dominated convergence).
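For a concrete instance (a sketch, assuming a two-state discrete-time chain rather than the continuous-time setting of the post), the total variation distance to the invariant law decays geometrically, with $\varphi_n\propto|1-a-b|^n$:

```python
# Two-state chain: TV(mu P^n, nu) = TV(mu, nu) * (1 - a - b)^n exactly.
import numpy as np

a, b = 0.3, 0.2
P = np.array([[1 - a, a],
              [b, 1 - b]])
nu = np.array([b, a]) / (a + b)  # invariant law: nu P = nu
mu = np.array([1.0, 0.0])        # start at delta_0
tv0 = 0.5 * np.abs(mu - nu).sum()
for n in range(5):
    tv_n = 0.5 * np.abs(mu - nu).sum()
    print(n, tv_n, tv0 * (1 - a - b) ** n)  # the two columns agree
    mu = mu @ P
```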

