
Month: October 2010

Back to basics: coupon collector

Let \( {X_1,X_2,\ldots} \) be i.i.d. random variables uniformly distributed on \( {\{1,\ldots,r\}} \). The classical coupon collector problem consists in looking at the random time

\[ T=\min\{n\geq1:\{X_1,\ldots,X_n\}=\{1,\ldots,r\}\}. \]

Since we have \( {T=G_1+\cdots+G_r} \) where \( {G_1,\ldots,G_r} \) are independent geometric random variables on \( {\{1,2,\ldots\}} \) with \( {\mathbb{E}(G_i)=r/(r-i+1)} \), we obtain

\[ \mathbb{E}(T)=r\sum_{i=1}^r\frac{1}{i}=r(\log(r)+\gamma+o_{r\rightarrow\infty}(1)). \]
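As a sanity check, here is a minimal Monte Carlo sketch in Python (assuming NumPy; the helper name coupon_time is chosen here for illustration) comparing the empirical mean of \( {T} \) with \( {r\sum_{i=1}^r1/i} \) and with the asymptotic value \( {r(\log(r)+\gamma)} \):

```python
import numpy as np

def coupon_time(r, rng):
    """Number of uniform draws from {1,...,r} needed to see every coupon."""
    seen, n = set(), 0
    while len(seen) < r:
        seen.add(rng.integers(r))
        n += 1
    return n

rng = np.random.default_rng(0)
r = 50
samples = [coupon_time(r, rng) for _ in range(5000)]
harmonic = sum(1.0 / i for i in range(1, r + 1))
print("empirical mean   :", np.mean(samples))
print("r * H_r          :", r * harmonic)
print("r(log(r) + gamma):", r * (np.log(r) + np.euler_gamma))
```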

One may compute similarly the variance and obtain a deviation bound around the mean via Chebyshev’s inequality (exercise!). A simple reasoning shows that for any \( {n\geq r} \),

\[ \mathbb{P}(T=n)=\frac{r!}{r^n}S(n-1,r-1) \]

where \( {S(n-1,r-1)} \) is the Stirling number of the second kind (the number of ways to partition \( {n-1} \) objects into \( {r-1} \) nonempty blocks). It is difficult to extract the tail behavior of \( {T} \) from this formula.

Another possibility is to write, for every \( {n\geq r} \),

\[ \mathbb{P}(T>n) = \mathbb{P}(E_{n,1}\cup\cdots\cup E_{n,r}) \quad\text{where}\quad E_{n,i} = \{X_1\neq i,\ldots,X_n\neq i\}. \]

Note that if \( {1\leq i_1,\ldots,i_k\leq r} \) are distinct, then with \( {R:=\{1,\ldots,r\}\setminus\{i_1,\ldots,i_k\}} \),

\[ \mathbb{P}(E_{n,i_1}\cap\cdots\cap E_{n,i_k}) =\mathbb{P}(X_1\in R)\cdots\mathbb{P}(X_n\in R) =\left(\frac{r-k}{r}\right)^n=\left(1-\frac{k}{r}\right)^n. \]

Now, by the inclusion-exclusion principle (also known as the Poincaré sieve), for every \( {n\geq r } \),

\[ \mathbb{P}(T>n) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k}\left(1-\frac{k}{r}\right)^n. \]

Unfortunately, the signs alternate, which makes this formula delicate to use directly. On the other hand, if we write the union bound

\[ \mathbb{P}(T>n)\leq \sum_{i=1}^r \mathbb{P}(E_{n,i}) \]

and if we note that

\[ \mathbb{P}(E_{n,i})=\left(1-\frac{1}{r}\right)^n\leq e^{-n/r} \]

then, taking \( {n=\lceil t r\log(r)\rceil} \), we obtain that for all real \( {t>0} \),

\[ \mathbb{P}(T> \lceil tr\log(r)\rceil) \leq r^{-t+1}. \]
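To see how sharp this crude bound is, one can evaluate the exact alternating sum numerically; here is a small sketch in plain Python (the function name tail_exact is just a label chosen here) comparing \( {\mathbb{P}(T>\lceil tr\log(r)\rceil)} \) with \( {r^{-t+1}} \):

```python
from math import comb, ceil, log

def tail_exact(n, r):
    """P(T > n) from the inclusion-exclusion formula above."""
    return sum((-1) ** (k - 1) * comb(r, k) * (1 - k / r) ** n for k in range(1, r + 1))

r, t = 20, 2.0
n = ceil(t * r * log(r))
print("exact P(T > n):", tail_exact(n, r))   # alternating sum, fine for moderate r
print("union bound   :", r ** (1 - t))       # the crude bound r^(1-t)
```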

It turns out that the asymptotic fluctuations of \( {T} \) are Gumbel.

Theorem 1 (Gumbel fluctuations) We have

\[ \frac{T-r\log(r)}{r} \underset{r\rightarrow\infty}{\overset{\mathrm{law}}{\longrightarrow}} G \]

where \( {G} \) is the Gumbel law on \( {\mathbb{R}} \) with density \( {t\mapsto e^{-e^{-t}-t}} \).

Proof: Let us fix \( {t\in\mathbb{R}} \) and let us show that

\[ \lim_{r\rightarrow\infty}\mathbb{P}(T> r\log(r)+tr)=S(t)=1-e^{-e^{-t}}. \]

Fix \( {t\in\mathbb{R}} \) and take \( {r} \) large enough such that \( {r\log(r)+tr>r} \). Set \( {n_{t,r}=r\log(r)+tr} \) if \( {r\log(r)+tr\in\mathbb{N}} \) and \( {n_{t,r}=\lceil r\log(r)+tr\rceil} \) if not. We already know that

\[ \mathbb{P}(T> r\log(r)+tr) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}}. \]

Now, since \( {\binom{r}{k}\leq \frac{r^k}{k!}} \) and \( {1-u\leq e^{-u}} \) for all \( {u\geq0} \), we have, for every fixed \( {k\geq1} \),

\[ \binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \leq \frac{r^k}{k!}\,e^{-kn_{t,r}/r} \leq \frac{e^{-kt}}{k!} \quad\text{and}\quad \binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \underset{r\rightarrow\infty}{\longrightarrow} \frac{e^{-kt}}{k!}. \]

By dominated convergence, we obtain finally

\[ \lim_{r\rightarrow\infty} \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}} =\sum_{k=1}^\infty(-1)^{k-1}\frac{e^{-kt}}{k!}=S(t). \]

\( \Box \)
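One can also visualize the theorem by simulation. Here is a hedged sketch (Python with NumPy, reusing a coupon_time sampler as in the sketch above) comparing the empirical tail of \( {(T-r\log(r))/r} \) with the Gumbel survival function \( {S(t)=1-e^{-e^{-t}}} \):

```python
import numpy as np

def coupon_time(r, rng):
    seen, n = set(), 0
    while len(seen) < r:
        seen.add(rng.integers(r))
        n += 1
    return n

rng = np.random.default_rng(1)
r = 200
T = np.array([coupon_time(r, rng) for _ in range(4000)])
Z = (T - r * np.log(r)) / r                    # rescaled fluctuations

for t in [-1.0, 0.0, 1.0, 2.0]:
    empirical = np.mean(Z > t)
    gumbel = 1 - np.exp(-np.exp(-t))           # S(t) = 1 - exp(-exp(-t))
    print(f"t = {t:+.1f}   empirical tail {empirical:.3f}   Gumbel tail {gumbel:.3f}")
```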

Actually, the coupon collector is a special instance of a more general model. Namely, if \( {(X_n)_{n\geq0}} \) is a sequence of random variables, not necessarily independent or of the same law, taking their values in a finite set \( {E} \), then we define the cover time by

\[ T=\inf\{n\geq0:\{X_0,\ldots,X_n\}=E\}. \]

The cover time of finite Markov chains and of random walks on finite graphs has been studied extensively. For recent results, see e.g. Ding, Lee, Peres (2010).


Moments of solitude

A few months ago, a second-year talk ended, quite rightly, with the convergence of a random variable to the normal law. When the jury asked in which sense the convergence held, the answer was "from left to right". It is true that in mathematics, as in politics, convergences from right to left are rare. More recently, in a first-year course, after detailing the proof of

\[\lim_{x\to0}\frac{\sin(x)}{x}=1\]

with an elementary geometric argument, someone sitting in the front row, very attentive, asked "can't we just cancel the \(x\) in this fraction?". How sad to have to explain that this boldness was in vain (even though it would make sense on a Taylor expansion…). These ordinary mishaps make one reflect on one's own weaknesses, and on the relativity of matters of the mind. One thinks with a smile of Pierre Desproges's humor: "A psychotic is someone who firmly believes that 2 and 2 make 5, and who is perfectly satisfied with it. A neurotic is someone who knows perfectly well that 2 and 2 make 4, and it makes him sick!".


Back to basics: total variation distance

Today, part of my teaching concerned basic properties of the total variation on discrete spaces. Let \( {E} \) be a possibly infinite countable set equipped with its discrete topology and \( {\sigma} \)-field. Let \( {\mathcal{P}} \) be the set of probability measures on \( {E} \). We equip \( {\mathcal{P}} \) with the total variation distance defined for every \( {\mu,\nu\in\mathcal{P}} \) by

\[ d_{TV}(\mu,\nu)=\sup_{A\subset{E}}|\mu(A)-\nu(A)|. \]

It takes its values in \( {[0,1]} \) and \( {1} \) is achieved when \( {\mu} \) and \( {\nu} \) have disjoint supports.

Theorem 1 (Alternative expressions) For every \( {\mu,\nu\in\mathcal{P}} \) we have

\[ d_{TV}(\mu,\nu) =\frac{1}{2}\sup_{f:E\rightarrow[-1,1]}\left|\int\!f\,d\mu-\int\!f\,d\nu\right| =\frac{1}{2}\sum_{x\in E}|\mu(x)-\nu(x)| \]

Moreover, the supremum in the original definition of \( {d_{TV}} \) is achieved for the set

\[ A_*=\{x\in E:\mu(x)\geq\nu(x)\} \]

while the supremum in the functional expression of \( {d_{TV}} \) is achieved for

\[ f=\mathbf{1}_{A_*}-\mathbf{1}_{A_*^c}. \]

Proof: The second equality follows from the inequality

\[ \left|\int\!f\,d\mu-\int\!f\,d\nu\right| \leq \sum_{x\in E}|f(x)||\mu(x)-\nu(x)| \leq \sup_{x\in E}|f(x)|\sum_{x\in E}|\mu(x)-\nu(x)| \]

which is saturated for \( {f=\mathbf{1}_{A_*}-\mathbf{1}_{A_*^c}} \). For the first equality, we write

\[ |\mu(A)-\nu(A)| =\frac{1}{2}\left|\int\!f_A\,d\mu-\int\!f_A\,d\nu\right| \]

where \( {f_A=\mathbf{1}_A-\mathbf{1}_{A^c}} \), which gives

\[ |\mu(A)-\nu(A)| \leq \frac{1}{2}\sup_{f:E\rightarrow[-1,1]}\left|\int\!f\,d\mu-\int\!f\,d\nu\right| = \frac{1}{2}\sum_{x\in E}|\mu(x)-\nu(x)| \]

which is saturated for \( {A=A_*} \) since

\[ 2|\mu(A_*)-\nu(A_*)|=\sum_{x\in A_*}|\mu(x)-\nu(x)|+\sum_{x\in A_*^c}|\mu(x)-\nu(x)|. \]

\( \Box \)
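For concreteness, here is a small numerical sketch (Python with NumPy, on a toy finite set) checking that the three expressions of Theorem 1 coincide, using the extremal set \( {A_*} \) and the extremal function \( {\mathbf{1}_{A_*}-\mathbf{1}_{A_*^c}} \):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = rng.random(6); mu /= mu.sum()     # a probability vector on E = {0,...,5}
nu = rng.random(6); nu /= nu.sum()

d_half_l1 = 0.5 * np.abs(mu - nu).sum()          # half L1 norm

A_star = mu >= nu                                # extremal set A*
d_sets = abs(mu[A_star].sum() - nu[A_star].sum())

f = np.where(A_star, 1.0, -1.0)                  # extremal function 1_{A*} - 1_{A*^c}
d_fun = 0.5 * abs((f * mu).sum() - (f * nu).sum())

print(d_half_l1, d_sets, d_fun)                  # the three values coincide
```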

Theorem 2 (Equivalent criteria for convergence in law) Let \( {(X_n)_{n\in\mathbb{N}}} \) be a sequence of random variables on \( {E} \) and let \( {\mu_n} \) be the law of \( {X_n} \). For any \( {\mu} \) in \( {\mathcal{P}} \), the following properties are equivalent:

  1. \( {(X_n)_{n\in\mathbb{N}}} \) converges in law to \( {\mu} \)
  2. \( {\lim_{n\rightarrow\infty}\int\!f\,d\mu_n=\int\!f\,d\mu} \) for all bounded \( {f:E\rightarrow\mathbb{R}} \)
  3. \( {\lim_{n\rightarrow\infty}\mu_n(x)=\mu(x)} \) for all \( {x\in E} \)
  4. \( {\lim_{n\rightarrow\infty}d_{TV}(\mu_n,\mu)=0} \)

Proof: To deduce 1. from 4. it suffices to use the functional variational expression of \( {d_{TV}} \). The properties 1. and 2. are equivalent since every \( {f:E\rightarrow\mathbb{R}} \) is continuous for the discrete topology. To deduce 3. from 2. one can take \( {f=\mathbf{1}_{\{x\}}} \). To deduce 4. from 3. we start by observing that for an arbitrary \( {A\subset E} \),

\[ \sum_{x\in E}|\mu_n(x)-\mu(x)| =\sum_{x\in A}|\mu_n(x)-\mu(x)| +\sum_{x\in A^c}|\mu_n(x)-\mu(x)|. \]

Thanks to 3. for every \( {\varepsilon'>0} \) there exists an integer \( {N=N(A,\varepsilon')} \) such that the first term of the right hand side is bounded above by \( {\mathrm{card}(A)\varepsilon'} \) for all \( {n\geq N} \). For the second term of the right hand side, we write

\[ \sum_{x\in A^c}|\mu_n(x)-\mu(x)| \leq \sum_{x\in A^c}\mu_n(x)+\sum_{x\in A^c}\mu(x). \]

Since we have

\[ \sum_{x\in A^c}\mu_n(x) =\sum_{x\in A}\mu(x)-\sum_{x\in A}\mu_n(x)+\sum_{x\in A^c}\mu(x) \]

we obtain

\[ \sum_{x\in A^c}|\mu_n(x)-\mu(x)| \leq \sum_{x\in A}|\mu_n(x)-\mu(x)|+2\sum_{x\in A^c}\mu(x). \]

Since \( {\mu\in\mathcal{P}} \), for any \( {\varepsilon''>0} \), we can select \( {A} \) finite such that \( {\mu(A^c)\leq\varepsilon''} \). Combining these bounds with this choice of \( {A} \), then letting \( {n\rightarrow\infty} \) and \( {\varepsilon'\rightarrow0} \), we obtain \( {\limsup_{n\rightarrow\infty}\sum_{x\in E}|\mu_n(x)-\mu(x)|\leq2\varepsilon''} \), which gives 4. since \( {\varepsilon''} \) is arbitrary. \( \Box \)

Theorem 3 (Yet another expression and the extremal case) For every \( {\mu,\nu\in\mathcal{P}} \) we have

\[ d_{TV}(\mu,\nu)=1-\sum_{x\in E}(\mu(x)\wedge\nu(x)). \]

In particular, \( {d_{TV}(\mu,\nu)=1} \) if and only if \( {\mu} \) and \( {\nu} \) have disjoint supports.

Proof: It suffices to write

\[ \sum_{x\in E}(\mu(x)\wedge\nu(x)) =\frac{1}{2}\sum_{x\in E}(\mu(x)+\nu(x)-|\mu(x)-\nu(x)|)=1-d_{TV}(\mu,\nu). \]

\( \Box \)

Theorem 4 (Coupling) For every \( {\mu,\nu\in\mathcal{P}} \) we have

\[ d_{TV}(\mu,\nu) =\inf_{(X,Y)}\mathbb{P}(X\neq Y) \]

where the infimum runs over all couples \( {(X,Y)} \) of random variables on \( {E\times E} \) with marginal laws \( {\mu} \) and \( {\nu} \). Moreover, there exists a couple of this type for which the equality is achieved.

Proof: Let \( {(X,Y)} \) be a couple of random variables on \( {E\times E} \) with marginal laws \( {\mu} \) and \( {\nu} \). Since \( {\mathbb{P}(X=x,Y=x)\leq \mu(x)\wedge\nu(x)} \) for every \( {x\in E} \) we have

\[ 1-d_{TV}(\mu,\nu)=\sum_{x\in E}(\mu(x)\wedge\nu(x)) \geq \sum_{x\in E}\mathbb{P}(X=x,Y=x) =\mathbb{P}(X=Y). \]

Therefore, it suffices now to construct a couple \( {(X,Y)} \) for which the equality is achieved. Define

\[ p=1-d_{TV}(\mu,\nu)\in[0,1]. \]

Case where \( {p=0} \). Then \( {d_{TV}(\mu,\nu)=1} \) and the supports of \( {\mu} \) and \( {\nu} \) are disjoint. This gives \( {\sum_{x\in E}\mu(x)\nu(x)=0} \). We consider in this case \( {(X,Y)} \) where \( {X\sim\mu} \) and \( {Y\sim\nu} \) are independent:

\[ \mathbb{P}(X=Y)=\sum_{x\in E}\mu(x)\nu(x)=0. \]

Case where \( {p=1} \). Then \( {d_{TV}(\mu,\nu)=0} \) and thus \( {\mu=\nu} \). We can take \( {(X,X)} \) where \( {X\sim\mu} \).

Case where \( {0<p<1} \). Let \( {(U,V,W)} \) be a triple of random variables with laws

\[ p^{-1}(\mu\wedge\nu),\quad (1-p)^{-1}(\mu-(\mu\wedge\nu)),\quad (1-p)^{-1}(\nu-(\mu\wedge\nu)) \]

(recall that \( {p=\sum_{x\in E}(\mu(x)\wedge\nu(x))} \)). Let \( {B} \) be a Bernoulli random variable, independent of \( {(U,V,W)} \), such that \( {\mathbb{P}(B=1)=1-\mathbb{P}(B=0)=p} \). If we define \( {(X,Y)=(U,U)} \) if \( {B=1} \) while \( {(X,Y)=(V,W)} \) if \( {B=0} \), then \( {X\sim\mu} \) and \( {Y\sim\nu} \), and since the laws of \( {V} \) and \( {W} \) have disjoint supports, we have \( {\mathbb{P}(V=W)=0} \), and thus

\[ \mathbb{P}(X=Y)=\mathbb{P}(B=1)=p. \]

\( \Box \)
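The construction in the proof is easy to implement. Here is a sketch (Python with NumPy, assuming the generic case \( {0<p<1} \), which holds almost surely for the random vectors drawn below) sampling the coupling and checking empirically that \( {\mathbb{P}(X\neq Y)=d_{TV}(\mu,\nu)} \):

```python
import numpy as np

rng = np.random.default_rng(3)
E = np.arange(6)
mu = rng.random(6); mu /= mu.sum()
nu = rng.random(6); nu /= nu.sum()

m = np.minimum(mu, nu)
p = m.sum()                                   # p = 1 - d_TV(mu, nu), here 0 < p < 1

def sample_coupling(size):
    """Sample (X, Y) with X ~ mu and Y ~ nu following the proof: mix U with the pair (V, W)."""
    B = rng.random(size) < p
    U = rng.choice(E, size=size, p=m / p)                 # law p^{-1} (mu ∧ nu)
    V = rng.choice(E, size=size, p=(mu - m) / (1 - p))    # law (1-p)^{-1} (mu - mu ∧ nu)
    W = rng.choice(E, size=size, p=(nu - m) / (1 - p))    # law (1-p)^{-1} (nu - mu ∧ nu)
    return np.where(B, U, V), np.where(B, U, W)

X, Y = sample_coupling(200_000)
print("empirical P(X != Y):", np.mean(X != Y))
print("d_TV(mu, nu)       :", 0.5 * np.abs(mu - nu).sum())
```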

Since \( {\mathbb{P}(X\neq Y)=\mathbb{E}(d(X,Y))} \) for the atomic distance \( {d(x,y)=1_{x\neq y}} \) we have

\[ d_{TV}(\mu,\nu)=\min_{\pi\in\Pi(\mu,\nu)}\int_{E\times E}\!d(x,y)\,d\pi(x,y) \]

where \( {\Pi(\mu,\nu)} \) is the convex set of probability measures on \( {E\times E} \) with marginal laws \( {\mu} \) and \( {\nu} \). From this point of view, \( {d_{TV}} \) is the Wasserstein \( {W_1} \) distance on \( {\mathcal{P}} \) associated to the atomic distance on \( {E} \).
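This optimal transport reading can also be checked directly on a small space by solving the linear program over couplings with the atomic cost. A sketch assuming SciPy (scipy.optimize.linprog), where the coupling is stored as a flattened \( {k\times k} \) matrix:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
k = 5
mu = rng.random(k); mu /= mu.sum()
nu = rng.random(k); nu /= nu.sum()

cost = (1 - np.eye(k)).ravel()                # atomic cost c(x, y) = 1_{x != y}, flattened

# marginal constraints on the coupling pi, stored row-major as pi[x * k + y]
A_eq = np.zeros((2 * k, k * k))
for x in range(k):
    A_eq[x, x * k:(x + 1) * k] = 1.0          # sum_y pi(x, y) = mu(x)
for y in range(k):
    A_eq[k + y, y::k] = 1.0                   # sum_x pi(x, y) = nu(y)
b_eq = np.concatenate([mu, nu])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("min over couplings of E[d(X,Y)]:", res.fun)
print("d_TV(mu, nu)                   :", 0.5 * np.abs(mu - nu).sum())
```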


Few words on Poisson approximation

Lucien Le Cam

This post concerns a nice elementary inequality regarding Poisson approximation, due to Lucien Le Cam (1924 – 2000). The total variation distance between two probability measures \( {\mu} \) and \( {\nu} \) on \( {\mathbb{N}} \) is given by

\[ d_{TV}(\mu,\nu) =\sup_{A\subset\mathbb{N}}|\mu(A)-\nu(A)| =\frac{1}{2}\sum_{k\in\mathbb{N}}|\mu_k-\nu_k|. \]

The weak convergence of probability measures on \( {\mathbb{N}} \) is equivalent to the convergence in total variation. On uncountable spaces, convergence in total variation is in general stronger. If \( {\mu_1,\ldots,\mu_n} \) and \( {\nu_1,\ldots,\nu_n} \) are probability measures on \( {\mathbb{N}} \) then by induction we get the sub-additive bound

\[ d_{TV}(\mu_1*\cdots*\mu_n,\nu_1*\cdots*\nu_n) \leq d_{TV}(\mu_1,\nu_1)+\cdots+d_{TV}(\mu_n,\nu_n). \]

Le Cam's inequality bounds the total variation distance between the law of a sum of independent Bernoulli random variables and the Poisson law of the same mean. In what follows, \( {\beta_p} \) is the Bernoulli distribution \( {p\delta_1+(1-p)\delta_0} \) with \( {0\leq p\leq 1} \), and \( {\pi_{\lambda}} \) is the Poisson law of mean \( {\lambda} \). The elementary bound \( {d_{TV}(\beta_p,\pi_p)=1-(1-p)-pe^{-p}=p(1-e^{-p})\leq p^2} \), used with the sub-additive bound above, leads to Le Cam's inequality:

Theorem 1 (Le Cam’s inequality)

\[ d_{TV}(\beta_{p_1}*\cdots*\beta_{p_n},\pi_{p_1+\cdots+p_n}) \leq p_1^2+\cdots+p_n^2. \]
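In the binomial case \( {p_1=\cdots=p_n=p} \) the bound can be checked numerically; here is a sketch assuming SciPy for the binomial and Poisson probability mass functions (the sum is truncated at \( {n} \), where the Poisson tail is negligible for the parameters below):

```python
import numpy as np
from scipy.stats import binom, poisson

def dtv_binom_poisson(n, p):
    """Half L1 distance between Binomial(n, p) and Poisson(np), summed over {0,...,n}."""
    ks = np.arange(n + 1)
    return 0.5 * np.abs(binom.pmf(ks, n, p) - poisson.pmf(ks, n * p)).sum()

for n, p in [(10, 0.1), (100, 0.05), (1000, 0.01)]:
    print(n, p, dtv_binom_poisson(n, p), "<=", n * p ** 2)   # Le Cam bound n p^2
```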

Le Cam’s inequality generalizes the Poisson approximation of the binomial distribution since if \( {p_1=\cdots=p_n=p} \) then \( {\beta_{p_1}*\cdots*\beta_{p_n}=\beta_p^{*n}=\mathrm{Binom}(n,p)} \). More generally, using Le Cam’s inequality with the \( {L^2-L^1-L^\infty} \) bound \( {p_1^2+\cdots+p_n^2\leq (p_1+\cdots+p_n)\max_{1\leq i\leq n}p_i} \) immediately yields the so-called law of small numbers: if we have

  • convergence of the mean: \( {\lambda:=\sum_{n=1}^\infty p_n<\infty} \)
  • dispersion of the bits: \( {\lim_{n\rightarrow\infty}\max_{1\leq i\leq n}p_i=0} \)

then \( {\beta_{p_1}*\cdots*\beta_{p_n}} \) converges weakly as \( {n\rightarrow\infty} \) to \( {\pi_\lambda} \). The convergence also holds in the Wasserstein distance \( {W_1} \), i.e. weak convergence plus convergence of the first moment. In the same spirit, we have Chen’s inequality:

Theorem 2 (Chen’s inequality)

\[ d_{TV}(\beta_{p_1}*\cdots*\beta_{p_n},\pi_{p_1+\cdots+p_n}) \leq (1-e^{-(p_1+\cdots+p_n)})\frac{p_1^2+\cdots+p_n^2}{p_1+\cdots+p_n}. \]

The inequality above can be found in the works of Barbour and Eagleson, based on the work of Louis Chen, a former student of Charles Stein. The proof relies on Poisson integration by parts, following the method developed by Stein for the Gaussian approximation in the central limit theorem. For more details, one may take a look at the paper by Steele and the book by Barbour, Holst, and Janson. You may also read the recent article by Chen and Roellin. Chen’s inequality is stronger than Le Cam’s inequality since \( {1-e^{-u} \leq u} \) for all \( {u\geq0} \).

Chen’s inequality is useful even in situations where the mean \( {p_1+\cdots+p_n} \) of \( {\beta_{p_1}*\cdots*\beta_{p_n}} \) does not converge as \( {n\rightarrow\infty} \). Namely, if for instance \( {p_n=1/n} \) for every \( {n\geq1} \) then the right hand side in Chen’s inequality is \( {O(1/\log(n))} \) and thus \( {\beta_{p_1}*\cdots*\beta_{p_n}} \) gets closer and closer to a Poisson law of mean \( {O(\log(n))} \) as \( {n} \) increases. This is helpful for instance for the study of various aspects of large random graphs, and for the derivation of the asymptotic normality of the number of occupied tables in the Chinese restaurant process, a famous model of random partitions.
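For this harmonic example, the law of \( {\beta_{p_1}*\cdots*\beta_{p_n}} \) can be computed by direct convolution. Here is a sketch (Python with NumPy and SciPy; the Poisson tail beyond \( {n} \) is negligible for the truncation used) comparing the true total variation distance with the Le Cam and Chen bounds:

```python
import numpy as np
from scipy.stats import poisson

def bernoulli_convolution(ps):
    """Law of the sum of independent Bernoulli(p_i), computed by direct convolution."""
    law = np.array([1.0])
    for p in ps:
        law = np.convolve(law, [1.0 - p, p])
    return law

n = 200
ps = [1.0 / i for i in range(1, n + 1)]
law = bernoulli_convolution(ps)                  # exact law of beta_{p_1} * ... * beta_{p_n}
lam = sum(ps)                                    # harmonic number H_n ~ log(n)
ks = np.arange(law.size)                         # support {0, ..., n}

dtv = 0.5 * np.abs(law - poisson.pmf(ks, lam)).sum()
lecam = sum(p ** 2 for p in ps)                         # Le Cam bound: sum p_i^2, does not vanish
chen = (1 - np.exp(-lam)) * lecam / lam                 # Chen bound: O(1 / log(n))
print(f"d_TV = {dtv:.4f}   Le Cam = {lecam:.4f}   Chen = {chen:.4f}")
```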

Various versions of Chen’s inequality are available, where the total variation is replaced by the relative entropy or by a Wasserstein distance. The presence of a ratio of norms in the right hand side of Chen’s inequality is not surprising: such ratios quantify sparsity or localization, as for instance in Berry-Esseen theorems.
