Press "Enter" to skip to content

Back to basics: total variation distance

Today, part of my teaching concerned basic properties of the total variation on discrete spaces. Let \( {E} \) be a possibly infinite countable set equipped with its discrete topology and \( {\sigma} \)-field. Let \( {\mathcal{P}} \) be the set of probability measures on \( {E} \). We equip \( {\mathcal{P}} \) with the total variation distance defined for every \( {\mu,\nu\in\mathcal{P}} \) by

\[ d_{TV}(\mu,\nu)=\sup_{A\subset{E}}|\mu(A)-\nu(A)|. \]

It takes its values in \( {[0,1]} \) and \( {1} \) is achieved when \( {\mu} \) and \( {\nu} \) have disjoint supports.

Theorem 1 (Alternative expressions) For every \( {\mu,\nu\in\mathcal{P}} \) we have

\[ d_{TV}(\mu,\nu) =\frac{1}{2}\sup_{f:E\rightarrow[-1,1]}\left|\int\!f\,d\mu-\int\!f\,d\nu\right| =\frac{1}{2}\sum_{x\in E}|\mu(x)-\nu(x)| \]

Moreover, the supremum in the original definition of \( {d_{TV}} \) is achieved for the set

\[ A_*=\{x\in E:\mu(x)\geq\nu(x)\} \]

while the supremum in the functional expression of \( {d_{TV}} \) is achieved for

\[ f=\mathbf{1}_{A_*}-\mathbf{1}_{A_*^c}. \]

Proof: The second equality follows from the inequality

\[ \left|\int\!f\,d\mu-\int\!f\,d\nu\right| \leq \sum_{x\in E}|f(x)||\mu(x)-\nu(x)| \leq \sup_{x\in E}|f(x)|\sum_{x\in E}|\mu(x)-\nu(x)| \]

which is saturated for \( {f=\mathbf{1}_{A_*}-\mathbf{1}_{A_*^c}} \). For the first equality, we write

\[ |\mu(A)-\nu(A)| =\frac{1}{2}\left|\int\!f_A\,d\mu-\int\,f_A\,d\nu\right| \]

where \( {f=\mathbf{1}_A-\mathbf{1}_{A^c}} \), which gives

\[ |\mu(A)-\nu(A)| \leq \sup_{f:E\rightarrow[-1,1]}\left|\int\!f\,d\mu-\int\!f\,d\nu\right| = \frac{1}{2}\sum_{x\in E}|\mu(x)-\nu(x)| \]

which is saturated for \( {A=A_*} \) since

\[ 2|\mu(A_*)-\nu(A_*)|=\sum_{x\in A_*}|\mu(x)-\nu(x)|+\sum_{x\in A_*^c}|\mu(x)-\nu(x)|. \]

$latex \Box$

Theorem 2 (Equivalent criteria for convergence in law) Let \( {(X_n)_{n\in\mathbb{N}}} \) be a sequence of random variables on \( {E} \) and let \( {\mu_n} \) be the law of \( {X_n} \). For any \( {\mu} \) in \( {\mathcal{P}} \), the following properties are equivalent:

  1. \( {(X_n)_{n\in\mathbb{N}}} \) converges in law to \( {\mu} \)
  2. \( {\lim_{n\rightarrow\infty}\int\!f\,d\mu_n=\int\!f\,d\mu} \) for all bounded \( {f:E\rightarrow\mathbb{R}} \)
  3. \( {\lim_{n\rightarrow\infty}\mu_n(x)=\mu(x)} \) for all \( {x\in E} \)
  4. \( {\lim_{n\rightarrow\infty}d_{TV}(\mu_n,\mu)=0} \)

Proof: To deduce 1. from 4. it suffices to use the functional variational expression of \( {d_{TV}} \). The properties 1. and 2. are equivalent since every \( {f:E\rightarrow\mathbb{R}} \) is continuous for the discrete topology. To deduce 3. from 2. one can take \( {f=\mathbf{1}_{\{x\}}} \). To deduce 4. from 3. we start by observing that for an arbitrary \( {A\subset E} \),

\[ \sum_{x\in E}|\mu_n(x)-\mu(x)| =\sum_{x\in A}|\mu_n(x)-\mu(x)| +\sum_{x\in A^c}|\mu_n(x)-\mu(x)|. \]

Thanks to 3. for every \( {\varepsilon’>0} \) there exists an integer \( {N=N(A,\varepsilon’)} \) such that the first term of the right hand side is bounded above by \( {\mathrm{card}(A)\varepsilon’} \) for all \( {n\geq N} \). For the second term of the right hand side, we write

\[ \sum_{x\in A^c}|\mu_n(x)-\mu(x)| \leq \sum_{x\in A^c}\mu_n(x)+\sum_{x\in A^c}\mu(x). \]

Since we have

\[ \sum_{x\in A^c}\mu_n(x) =\sum_{x\in A}\mu(x)-\sum_{x\in A}\mu_n(x)+\sum_{x\in A^c}\mu(x) \]

we obtain

\[ \sum_{x\in A^c}|\mu_n(x)-\mu(x)| \leq \sum_{x\in A}|\mu_n(x)-\mu(x)|+2\sum_{x\in A^c}\mu(x). \]

Since \( {\nu\in\mathcal{P}} \), for any \( {\varepsilon”>0} \), we can select \( {A} \) finite such that \( {\mu(A^c)\leq\varepsilon”} \). $latex \Box$

Theorem 3 (Yet another expression and the extremal case) For every \( {\mu,\nu\in\mathcal{P}} \) we have

\[ d_{TV}(\mu,\nu)=1-\sum_{x\in E}(\mu(x)\wedge\nu(x)). \]

In particular, \( {d_{TV}(\mu,\nu)=1} \) if and only if \( {\mu} \) and \( {\nu} \) have disjoint supports.

Proof: It suffices to write

\[ \sum_{x\in E}(\mu(x)\wedge\nu(x)) =\frac{1}{2}\sum_{x\in E}(\mu(x)+\nu(x)-|\mu(x)-\nu(x)|)=1-d_{TV}(\mu,\nu). \]

$latex \Box$

Theorem 4 (Coupling) For every \( {\mu,\nu\in\mathcal{P}} \) we have

\[ d_{TV}(\mu,\nu) =\inf_{(X,Y)}\mathbb{P}(X\neq Y) \]

where the infimum runs over all the couples of random variables on \( {E\times E} \) with marginal laws \( {\mu} \) and \( {\nu} \). Moreover, there exists a couple of this type for which the equality is achieved.

Proof: Let \( {(X,Y)} \) be a couple of random variables on \( {E\times E} \) with marginal laws \( {\mu} \) nd \( {\nu} \). Since \( {\mathbb{P}(X=x,Y=x)\leq \mu(x)\wedge\nu(x)} \) for every \( {x\in E} \) we have

\[ 1-d_{TV}(\mu,\nu)=\sum_{x\in E}(\mu(x)\wedge\nu(x)) \geq \sum_{x\in E}\mathbb{P}(X=x,Y=x) =\mathbb{P}(X=Y). \]

Therefore, it suffices now to construct a couple \( {(X,Y)} \) for which the equality is achieved. Define

\[ p=1-d_{TV}(\mu,\nu)\in[0,1]. \]

Case where \( {p=0} \). Then \( {d_{TV}(\mu,\nu)=1} \) and the supports of \( {\mu} \) et \( {\nu} \) are disjoints. This gives \( {\sum_{x\in E}\mu(x)\nu(x)=0} \). We consider in this case \( {(X,Y)} \) where \( {X\sim\mu} \) and \( {Y\sim\nu} \) are independent:

\[ \mathbb{P}(X=Y)=\sum_{x\in E}\mu(x)\nu(x)=0. \]

Case where \( {p=1} \). Then \( {d_{TV}(\mu,\nu)=0} \) and thus \( {\mu=\nu} \). We can take \( {(X,X)} \) where \( {X\sim\mu} \).

Case where \( {0<p<1} \). Let \( {(U,V,W)} \) be a triple of random variables with laws

\[ p^{-1}(\mu\wedge\nu),\quad (1-p)^{-1}(\mu-(\mu\wedge\nu)),\quad (1-p)^{-1}(\nu-(\mu\wedge\nu)) \]

(recall that \( {p=\sum_{x\in E}(\mu(x)\wedge\nu(x))} \)). Let \( {B} \) be a Bernoulli random variable, independent of \( {(U,V,W)} \), such that \( {\mathbb{P}(B=1)=1-\mathbb{P}(B=0)=p} \). If we define \( {(X,Y)=(U,U)} \) if \( {B=1} \) while \( {(X,Y)=(V,W)} \) if \( {B=0} \), then \( {X\sim\mu} \) and \( {Y\sim\nu} \), and since the laws of \( {V} \) and \( {W} \) have disjoint supports, we have \( {\mathbb{P}(V=W)=0} \), and thus

\[ \mathbb{P}(X=Y)=\mathbb{P}(B=1)=p. \]

$latex \Box$

Since \( {\mathbb{P}(X\neq Y)=\mathbb{E}(d(X,Y))} \) for the atomic distance \( {d(x,y)=1_{x\neq y}} \) we have

\[ d_{TV}(\mu,\nu)=\min_{\pi\in\Pi(\mu,\nu)}\int_{E\times E}\!d(x,y)\,d\pi(x,y) \]

where \( {\Pi(\mu,\nu)} \) is the convex set of probability measures on \( {E\times E} \) with marginal laws \( {\mu} \) and \( {\nu} \). From this point of view, \( {d_{TV}} \) is the Wasserstein \( {W_1} \) distance on \( {\mathcal{P}} \) associated to the atomic distance on \( {E} \).

One Comment

  1. Luc 2015-11-22

    Bonjour,

    For general polish space $E$, can we construct explicitly an optimal coupling as you did in the discrete setting ? Any reference/idea would be enough.

    Thanks for your time.

Leave a Reply

Your email address will not be published. Required fields are marked *