Processing math: 100%
Press "Enter" to skip to content

Back to basics - Total variation distance

Today, part of my teaching concerned basic properties of the total variation on discrete spaces. Let E be a possibly infinite countable set equipped with its discrete topology and σ-field. Let P be the set of probability measures on E. We equip P with the total variation distance defined for every μ,νP by

dTV(μ,ν)=supAE|μ(A)ν(A)|.

It takes its values in [0,1] and 1 is achieved when μ and ν have disjoint supports.

Theorem 1 (Alternative expressions) For every μ,νP we have

dTV(μ,ν)=12supf:E[1,1]|fdμfdν|=12xE|μ(x)ν(x)|

Moreover, the supremum in the original definition of dTV is achieved for the set

A={xE:μ(x)ν(x)}

while the supremum in the functional expression of dTV is achieved for

f=1A1Ac.

Proof: The second equality follows from the inequality

|fdμfdν|xE|f(x)||μ(x)ν(x)|supxE|f(x)|xE|μ(x)ν(x)|

which is saturated for f=1A1Ac. For the first equality, we write

|μ(A)ν(A)|=12|fAdμfAdν|

where f=1A1Ac, which gives

|μ(A)ν(A)|supf:E[1,1]|fdμfdν|=12xE|μ(x)ν(x)|

which is saturated for A=A since

2|μ(A)ν(A)|=xA|μ(x)ν(x)|+xAc|μ(x)ν(x)|.

◻

Theorem 2 (Equivalent criteria for convergence in law) Let (Xn)nN be a sequence of random variables on E and let μn be the law of Xn. For any μ in P, the following properties are equivalent:

  1. (Xn)nN converges in law to μ
  2. limnfdμn=fdμ for all bounded f:ER
  3. limnμn(x)=μ(x) for all xE
  4. limndTV(μn,μ)=0

Proof: To deduce 1. from 4. it suffices to use the functional variational expression of dTV. The properties 1. and 2. are equivalent since every f:ER is continuous for the discrete topology. To deduce 3. from 2. one can take f=1{x}. To deduce 4. from 3. we start by observing that for an arbitrary AE,

xE|μn(x)μ(x)|=xA|μn(x)μ(x)|+xAc|μn(x)μ(x)|.

Thanks to 3. for every ε>0 there exists an integer N=N(A,ε) such that the first term of the right hand side is bounded above by card(A)ε for all nN. For the second term of the right hand side, we write

xAc|μn(x)μ(x)|xAcμn(x)+xAcμ(x).

Since we have

xAcμn(x)=xAμ(x)xAμn(x)+xAcμ(x)

we obtain

xAc|μn(x)μ(x)|xA|μn(x)μ(x)|+2xAcμ(x).

Since νP, for any ε>0, we can select A finite such that μ(Ac)ε.◻

Theorem 3 (Yet another expression and the extremal case) For every μ,νP we have

dTV(μ,ν)=1xE(μ(x)ν(x)).

In particular, dTV(μ,ν)=1 if and only if μ and ν have disjoint supports.

Proof: It suffices to write

xE(μ(x)ν(x))=12xE(μ(x)+ν(x)|μ(x)ν(x)|)=1dTV(μ,ν).

◻

Theorem 4 (Coupling) For every μ,νP we have

dTV(μ,ν)=inf(X,Y)P(XY)

where the infimum runs over all the couples of random variables on E×E with marginal laws μ and ν. Moreover, there exists a couple of this type for which the equality is achieved.

Proof: Let (X,Y) be a couple of random variables on E×E with marginal laws μ nd ν. Since P(X=x,Y=x)μ(x)ν(x) for every xE we have

1dTV(μ,ν)=xE(μ(x)ν(x))xEP(X=x,Y=x)=P(X=Y).

Therefore, it suffices now to construct a couple (X,Y) for which the equality is achieved. Define

p=1dTV(μ,ν)[0,1].

Case where p=0. Then dTV(μ,ν)=1 and the supports of μ et ν are disjoints. This gives xEμ(x)ν(x)=0. We consider in this case (X,Y) where Xμ and Yν are independent:

P(X=Y)=xEμ(x)ν(x)=0.

Case where p=1. Then dTV(μ,ν)=0 and thus μ=ν. We can take (X,X) where Xμ.

Case where 0<p<1. Let (U,V,W) be a triple of random variables with laws

p1(μν),(1p)1(μ(μν)),(1p)1(ν(μν))

(recall that p=xE(μ(x)ν(x))). Let B be a Bernoulli random variable, independent of (U,V,W), such that P(B=1)=1P(B=0)=p. If we define (X,Y)=(U,U) if B=1 while (X,Y)=(V,W) if B=0, then Xμ and Yν, and since the laws of V and W have disjoint supports, we have P(V=W)=0, and thus

P(X=Y)=P(B=1)=p.

◻

Since P(XY)=E(d(X,Y)) for the atomic distance d(x,y)=1xy we have

dTV(μ,ν)=minπΠ(μ,ν)E×Ed(x,y)dπ(x,y)

where Π(μ,ν) is the convex set of probability measures on E×E with marginal laws μ and ν. From this point of view, dTV is the Wasserstein W1 distance on P associated to the atomic distance on E.

One Comment

  1. Luc 2015-11-22

    Bonjour,

    For general polish space E, can we construct explicitly an optimal coupling as you did in the discrete setting ? Any reference/idea would be enough.

    Thanks for your time.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Syntax · Style · .