Today, part of my teaching concerned basic properties of the total variation on discrete spaces. Let E be a possibly infinite countable set equipped with its discrete topology and σ-field. Let P be the set of probability measures on E. We equip P with the total variation distance defined for every μ,ν∈P by
dTV(μ,ν)=supA⊂E|μ(A)−ν(A)|.
It takes its values in [0,1] and 1 is achieved when μ and ν have disjoint supports.
Theorem 1 (Alternative expressions) For every μ,ν∈P we have
dTV(μ,ν)=12supf:E→[−1,1]|∫fdμ−∫fdν|=12∑x∈E|μ(x)−ν(x)|
Moreover, the supremum in the original definition of dTV is achieved for the set
A∗={x∈E:μ(x)≥ν(x)}
while the supremum in the functional expression of dTV is achieved for
f=1A∗−1Ac∗.
Proof: The second equality follows from the inequality
|∫fdμ−∫fdν|≤∑x∈E|f(x)||μ(x)−ν(x)|≤supx∈E|f(x)|∑x∈E|μ(x)−ν(x)|
which is saturated for f=1A∗−1Ac∗. For the first equality, we write
|μ(A)−ν(A)|=12|∫fAdμ−∫fAdν|
where f=1A−1Ac, which gives
|μ(A)−ν(A)|≤supf:E→[−1,1]|∫fdμ−∫fdν|=12∑x∈E|μ(x)−ν(x)|
which is saturated for A=A∗ since
2|μ(A∗)−ν(A∗)|=∑x∈A∗|μ(x)−ν(x)|+∑x∈Ac∗|μ(x)−ν(x)|.
Theorem 2 (Equivalent criteria for convergence in law) Let (Xn)n∈N be a sequence of random variables on E and let μn be the law of Xn. For any μ in P, the following properties are equivalent:
- (Xn)n∈N converges in law to μ
- limn→∞∫fdμn=∫fdμ for all bounded f:E→R
- limn→∞μn(x)=μ(x) for all x∈E
- limn→∞dTV(μn,μ)=0
Proof: To deduce 1. from 4. it suffices to use the functional variational expression of dTV. The properties 1. and 2. are equivalent since every f:E→R is continuous for the discrete topology. To deduce 3. from 2. one can take f=1{x}. To deduce 4. from 3. we start by observing that for an arbitrary A⊂E,
∑x∈E|μn(x)−μ(x)|=∑x∈A|μn(x)−μ(x)|+∑x∈Ac|μn(x)−μ(x)|.
Thanks to 3. for every ε′>0 there exists an integer N=N(A,ε′) such that the first term of the right hand side is bounded above by card(A)ε′ for all n≥N. For the second term of the right hand side, we write
∑x∈Ac|μn(x)−μ(x)|≤∑x∈Acμn(x)+∑x∈Acμ(x).
Since we have
∑x∈Acμn(x)=∑x∈Aμ(x)−∑x∈Aμn(x)+∑x∈Acμ(x)
we obtain
∑x∈Ac|μn(x)−μ(x)|≤∑x∈A|μn(x)−μ(x)|+2∑x∈Acμ(x).
Since ν∈P, for any ε″>0, we can select A finite such that μ(Ac)≤ε″.
Theorem 3 (Yet another expression and the extremal case) For every μ,ν∈P we have
dTV(μ,ν)=1−∑x∈E(μ(x)∧ν(x)).
In particular, dTV(μ,ν)=1 if and only if μ and ν have disjoint supports.
Proof: It suffices to write
∑x∈E(μ(x)∧ν(x))=12∑x∈E(μ(x)+ν(x)−|μ(x)−ν(x)|)=1−dTV(μ,ν).
Theorem 4 (Coupling) For every μ,ν∈P we have
dTV(μ,ν)=inf(X,Y)P(X≠Y)
where the infimum runs over all the couples of random variables on E×E with marginal laws μ and ν. Moreover, there exists a couple of this type for which the equality is achieved.
Proof: Let (X,Y) be a couple of random variables on E×E with marginal laws μ nd ν. Since P(X=x,Y=x)≤μ(x)∧ν(x) for every x∈E we have
1−dTV(μ,ν)=∑x∈E(μ(x)∧ν(x))≥∑x∈EP(X=x,Y=x)=P(X=Y).
Therefore, it suffices now to construct a couple (X,Y) for which the equality is achieved. Define
p=1−dTV(μ,ν)∈[0,1].
Case where p=0. Then dTV(μ,ν)=1 and the supports of μ et ν are disjoints. This gives ∑x∈Eμ(x)ν(x)=0. We consider in this case (X,Y) where X∼μ and Y∼ν are independent:
P(X=Y)=∑x∈Eμ(x)ν(x)=0.
Case where p=1. Then dTV(μ,ν)=0 and thus μ=ν. We can take (X,X) where X∼μ.
Case where 0<p<1. Let (U,V,W) be a triple of random variables with laws
p−1(μ∧ν),(1−p)−1(μ−(μ∧ν)),(1−p)−1(ν−(μ∧ν))
(recall that p=∑x∈E(μ(x)∧ν(x))). Let B be a Bernoulli random variable, independent of (U,V,W), such that P(B=1)=1−P(B=0)=p. If we define (X,Y)=(U,U) if B=1 while (X,Y)=(V,W) if B=0, then X∼μ and Y∼ν, and since the laws of V and W have disjoint supports, we have P(V=W)=0, and thus
P(X=Y)=P(B=1)=p.
Since P(X≠Y)=E(d(X,Y)) for the atomic distance d(x,y)=1x≠y we have
dTV(μ,ν)=minπ∈Π(μ,ν)∫E×Ed(x,y)dπ(x,y)
where Π(μ,ν) is the convex set of probability measures on E×E with marginal laws μ and ν. From this point of view, dTV is the Wasserstein W1 distance on P associated to the atomic distance on E.
Bonjour,
For general polish space E, can we construct explicitly an optimal coupling as you did in the discrete setting ? Any reference/idea would be enough.
Thanks for your time.