Press "Enter" to skip to content


Back to basics – Coupon collector

Let \( {X_1,X_2,\ldots} \) be i.i.d. random variables uniformly distributed on \( {\{1,\ldots,r\}} \). The classical coupon collector problem consists in looking at the random time

\[ T=\min\{n\geq1:\{X_1,\ldots,X_n\}=\{1,\ldots,r\}\}. \]

Since \( {T=G_1+\cdots+G_r} \), where \( {G_i} \) is the number of draws needed to collect the \( {i} \)-th distinct coupon and \( {G_1,\ldots,G_r} \) are independent geometric random variables on \( {\{1,2,\ldots\}} \) with respective success probabilities \( {(r-i+1)/r} \), so that \( {\mathbb{E}(G_i)=r/(r-i+1)} \), we obtain

\[ \mathbb{E}(T)=r\sum_{i=1}^r\frac{1}{i}=r(\log(r)+\gamma+o_{r\rightarrow\infty}(1)). \]
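As a quick numerical sanity check of this formula, here is a minimal Monte Carlo sketch in Python (the helper name collect, the value r = 100 and the number of samples are arbitrary choices of the sketch, not part of the argument):

import math
import random

def collect(r):
    """Sample the coupon collector time T for r equally likely coupons."""
    seen, n = set(), 0
    while len(seen) < r:
        seen.add(random.randrange(r))
        n += 1
    return n

r, samples = 100, 20000
empirical_mean = sum(collect(r) for _ in range(samples)) / samples
exact_mean = r * sum(1.0 / i for i in range(1, r + 1))      # r * H_r
asymptotic = r * (math.log(r) + 0.5772156649)               # r * (log(r) + gamma)
print(empirical_mean, exact_mean, asymptotic)               # the three values should be close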

One may similarly compute the variance and obtain a deviation bound around the mean via Chebyshev’s inequality (exercise!). A simple counting argument shows that for every \( {n\geq r} \),

\[ \mathbb{P}(T=n)=\frac{r!}{r^n}S(n-1,r-1) \]

where \( {S(n-1,r-1)} \) is the Stirling number of the second kind (the number of ways to partition \( {n-1} \) objects into \( {r-1} \) non-empty blocks). It is however difficult to get the tail behavior of \( {T} \) from this formula.
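Although this formula is not convenient for tail estimates, it can be evaluated numerically; here is a minimal Python sketch computing the law of \( {T} \) exactly via the classical recurrence \( {S(n,k)=kS(n-1,k)+S(n-1,k-1)} \) (the function names, the value \( {r=6} \) and the truncation at \( {n=400} \) are arbitrary choices):

from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def pmf(r, n):
    """P(T = n) = r! * S(n-1, r-1) / r^n."""
    return factorial(r) * stirling2(n - 1, r - 1) / r ** n

r = 6                                               # the classical dice example
print(sum(pmf(r, n) for n in range(r, 400)))        # ~ 1 (the mass beyond n = 400 is negligible)
print(sum(n * pmf(r, n) for n in range(r, 400)))    # ~ r * H_r = 14.7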

Another possibility is to write, for every \( {n\geq r} \),

\[ \mathbb{P}(T>n) = \mathbb{P}(E_{n,1}\cup\cdots\cup E_{n,r}) \quad\text{where}\quad E_{n,i} = \{X_1\neq i,\ldots,X_n\neq i\}. \]

Note that if \( {1\leq i_1,\ldots,i_k\leq r} \) are distinct, then, with \( {R:=\{1,\ldots,r\}\setminus\{i_1,\ldots,i_k\}} \),

\[ \mathbb{P}(E_{n,i_1}\cap\cdots\cap E_{n,i_k}) =\mathbb{P}(X_1\in R)\cdots\mathbb{P}(X_n\in R) =\left(\frac{r-k}{r}\right)^n=\left(1-\frac{k}{r}\right)^n. \]

Now, by the inclusion-exclusion principle (also known as the Poincaré sieve), for every \( {n\geq r } \),

\[ \mathbb{P}(T>n) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k}\left(1-\frac{k}{r}\right)^n. \]
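As a numerical sanity check of this identity, here is a minimal Python sketch comparing it with a Monte Carlo estimate (the values r = 10, n = 30, the sample size and the function names are arbitrary choices):

import random
from math import comb

def tail(r, n):
    """P(T > n) via inclusion-exclusion: sum_k (-1)^(k-1) C(r,k) (1 - k/r)^n."""
    return sum((-1) ** (k - 1) * comb(r, k) * (1 - k / r) ** n
               for k in range(1, r + 1))

def tail_mc(r, n, samples=100_000):
    """Monte Carlo estimate of P(T > n): the first n draws miss at least one coupon."""
    misses = sum(len({random.randrange(r) for _ in range(n)}) < r
                 for _ in range(samples))
    return misses / samples

r, n = 10, 30
print(tail(r, n), tail_mc(r, n))    # the two values should agree closely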

Unfortunately, the signs alternate, which makes this formula delicate to bound directly. However, if we use the crude union bound

\[ \mathbb{P}(T>n)\leq \sum_{i=1}^r \mathbb{P}(E_{n,i}) \]

and if we note that

\[ \mathbb{P}(E_{n,i})=\left(1-\frac{1}{r}\right)^n\leq e^{-n/r} \]

then, taking \( {n=\lceil tr\log(r)\rceil} \) with \( {t>0} \) real, we obtain

\[ \mathbb{P}(T> \lceil tr\log(r)\rceil) \leq r^{-t+1}. \]
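Here is a quick numerical illustration of this chain of bounds (the values r = 100 and t = 2 are arbitrary choices):

from math import ceil, exp, log

r, t = 100, 2.0
n = ceil(t * r * log(r))
print(r * (1 - 1 / r) ** n)   # union bound on P(T > n)
print(r * exp(-n / r))        # after using 1 - 1/r <= exp(-1/r)
print(r ** (1 - t))           # after using n >= t * r * log(r)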

It turns out that the fluctuations of \( {T} \) around \( {r\log(r)} \) are asymptotically Gumbel.

Theorem 1 (Gumbel fluctuations). We have

\[ \frac{T-r\log(r)}{r} \underset{r\rightarrow\infty}{\overset{\mathrm{law}}{\longrightarrow}} G \]

where \( {G} \) is the Gumbel law on \( {\mathbb{R}} \) with density \( {t\mapsto e^{-e^{-t}-t}} \).

Proof: Let us fix \( {t\in\mathbb{R}} \) and let us show that

\[ \lim_{r\rightarrow\infty}\mathbb{P}(T> r\log(r)+tr)=S(t):=1-e^{-e^{-t}}. \]

Since \( {t\mapsto e^{-e^{-t}}} \) is the cumulative distribution function of the Gumbel law, this is exactly the desired convergence in law. Take \( {r} \) large enough so that \( {r\log(r)+tr>r} \), and set \( {n_{t,r}=\lfloor r\log(r)+tr\rfloor} \). Since \( {T} \) takes integer values, \( {\mathbb{P}(T>r\log(r)+tr)=\mathbb{P}(T>n_{t,r})} \), and we already know that

\[ \mathbb{P}(T> r\log(r)+tr) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}}. \]

Now, since \( {\binom{r}{k}\leq \frac{r^k}{k!}} \) and \( {1-u\leq e^{-u}} \) for all \( {u\geq0} \), and since \( {n_{t,r}\geq r\log(r)+tr-1} \), we have, for every \( {1\leq k\leq r} \),

\[ \binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \leq \frac{r^k}{k!}e^{-kn_{t,r}/r} \leq e^{k/r}\frac{e^{-kt}}{k!} \leq e\,\frac{e^{-kt}}{k!}, \]

while, for every fixed \( {k\geq1} \),

\[ \binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \underset{r\rightarrow\infty}{\longrightarrow} \frac{e^{-kt}}{k!}. \]

By dominated convergence (summation over \( {k} \) against the counting measure), we finally obtain

\[ \lim_{r\rightarrow\infty} \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}} =\sum_{k=1}^\infty(-1)^{k-1}\frac{e^{-kt}}{k!}=S(t). \]

\( {\Box} \)
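Here is a minimal simulation sketch of Theorem 1, comparing the empirical distribution function of \( {(T-r\log(r))/r} \) with the Gumbel distribution function \( {t\mapsto e^{-e^{-t}}} \) (the value \( {r=200} \), the sample size and the grid of points \( {t} \) are arbitrary choices):

import math
import random

def collect(r):
    """Sample the coupon collector time T for r equally likely coupons."""
    seen, n = set(), 0
    while len(seen) < r:
        seen.add(random.randrange(r))
        n += 1
    return n

r, samples = 200, 5000
fluct = [(collect(r) - r * math.log(r)) / r for _ in range(samples)]
for t in (-1.0, 0.0, 1.0, 2.0):
    empirical = sum(x <= t for x in fluct) / samples   # empirical P((T - r log r)/r <= t)
    print(t, empirical, math.exp(-math.exp(-t)))       # vs. the Gumbel CDF exp(-exp(-t))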

The coupon collector problem is actually a special instance of a more general model: if \( {(X_n)_{n\geq0}} \) is a sequence of random variables, not necessarily independent nor identically distributed, taking values in a finite set \( {E} \), then the cover time is defined by

\[ T=\inf\{n\geq0:\{X_0,\ldots,X_n\}=E\}. \]

The cover time of finite Markov chains and of random walks on finite graphs has been extensively studied. For recent results, see e.g. Ding, Lee, and Peres (2010).
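As an illustration of this more general setting, here is a minimal sketch simulating the cover time of the simple random walk on the cycle \( {\mathbb{Z}/r\mathbb{Z}} \), whose expected cover time is known to be \( {r(r-1)/2} \) (the choice of the cycle, the value \( {r=50} \) and the sample size are arbitrary, and are not taken from the reference above):

import random

def cover_time_cycle(r):
    """Number of steps for a simple random walk on Z/rZ to visit every state."""
    position, seen, n = 0, {0}, 0
    while len(seen) < r:
        position = (position + random.choice((-1, 1))) % r
        seen.add(position)
        n += 1
    return n

r, samples = 50, 2000
mean_cover = sum(cover_time_cycle(r) for _ in range(samples)) / samples
print(mean_cover, r * (r - 1) / 2)   # empirical mean vs. the exact value r(r-1)/2 for the cycle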
