Let ${X_1,X_2,\ldots}$ be i.i.d. random variables uniformly distributed on ${\{1,\ldots,r\}}$. The classical coupon collector problem consists in studying the random time

$T=\min\{n\geq1:\{X_1,\ldots,X_n\}=\{1,\ldots,r\}\}.$

We have ${T=G_1+\cdots+G_r}$, where ${G_i}$ is the waiting time to collect the ${i}$-th distinct value: the ${G_1,\ldots,G_r}$ are independent geometric random variables on ${\{1,2,\ldots\}}$ with success probability ${(r-i+1)/r}$, so that ${\mathbb{E}(G_i)=r/(r-i+1)}$, and we obtain

$\mathbb{E}(T)=r\sum_{i=1}^r\frac{1}{i}=r(\log(r)+\gamma+o_{r\rightarrow\infty}(1)).$
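Both the mean above and the variance of the exercise below can be checked by a quick simulation (a Python sketch; the choice ${r=100}$ and the sample size are arbitrary, and the closed-form variance used in the code follows from the same decomposition of ${T}$ into independent geometrics, since ${\mathrm{Var}(G_i)=(1-p_i)/p_i^2}$ with ${p_i=(r-i+1)/r}$):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n_samples = 100, 20_000

# T = G_1 + ... + G_r with independent geometric G_i on {1, 2, ...}
# and success probability p_i = (r - i + 1)/r, so sampling the G_i
# directly is exact in distribution.
p = np.arange(r, 0, -1) / r
T = rng.geometric(p, size=(n_samples, r)).sum(axis=1)

j = np.arange(1, r + 1)
mean_exact = r * (1.0 / j).sum()                         # r * H_r
var_exact = r**2 * (1.0 / j**2).sum() - r * (1.0 / j).sum()

print(T.mean(), mean_exact)
print(T.var(), var_exact)
```

Here `var_exact` is ${\sum_i(1-p_i)/p_i^2}$ rewritten with the change of index ${j=r-i+1}$; sampling ${T}$ as a sum of geometrics avoids drawing the coupons one by one.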

One may similarly compute the variance and obtain a deviation bound around the mean via Chebyshev’s inequality (exercise!). A simple argument shows that for any ${n\geq r}$,

$\mathbb{P}(T=n)=\frac{r!}{r^n}S(n-1,r-1)$

where ${S(n-1,r-1)}$ is a Stirling number of the second kind (the number of ways to partition ${n-1}$ objects into ${r-1}$ non-empty blocks). It is difficult to extract the tail behavior of ${T}$ from this formula.
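For small parameters, the formula can nevertheless be verified exactly by brute-force enumeration (a Python sketch; the helper `stirling2` implements the standard recursion, and the values ${r=3}$, ${n=6}$ are arbitrary):

```python
from fractions import Fraction
from itertools import product
from math import factorial

def stirling2(n, k):
    # Stirling numbers of the second kind via the recursion
    # S(n, k) = k * S(n-1, k) + S(n-1, k-1).
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

r, n = 3, 6
full = set(range(1, r + 1))

# Brute force: count the sequences of length n over {1,...,r} whose
# first time of full coverage is exactly n.
count = sum(1 for seq in product(range(1, r + 1), repeat=n)
            if set(seq) == full and set(seq[:-1]) != full)
brute = Fraction(count, r ** n)

# The formula P(T = n) = r! S(n-1, r-1) / r^n.
formula = Fraction(factorial(r) * stirling2(n - 1, r - 1), r ** n)
print(brute == formula)  # prints True
```

Exact rational arithmetic via `Fraction` rules out any floating-point doubt in the comparison.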

Another possibility is to write, for every ${n\geq r}$,

$\mathbb{P}(T>n) = \mathbb{P}(E_{n,1}\cup\cdots\cup E_{n,r}) \quad\text{where}\quad E_{n,i} = \{X_1\neq i,\ldots,X_n\neq i\}.$

Note that if ${1\leq i_1,\ldots,i_k\leq r}$ are distinct, then with ${R:=\{1,\ldots,r\}\setminus\{i_1,\ldots,i_k\}}$,

$\mathbb{P}(E_{n,i_1}\cap\cdots\cap E_{n,i_k}) =\mathbb{P}(X_1\in R)\cdots\mathbb{P}(X_n\in R) =\left(\frac{r-k}{r}\right)^n=\left(1-\frac{k}{r}\right)^n.$

Now, by the inclusion-exclusion principle (also known as the Poincaré sieve), for every ${n\geq r }$,

$\mathbb{P}(T>n) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k}\left(1-\frac{k}{r}\right)^n.$
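For small values this alternating formula can also be checked against brute-force enumeration (a Python sketch; the choice ${r=3}$, ${n=6}$ is arbitrary):

```python
from fractions import Fraction
from itertools import product
from math import comb

r, n = 3, 6
full = set(range(1, r + 1))

# Brute-force P(T > n): the proportion of sequences of length n
# over {1,...,r} that fail to cover {1,...,r}.
miss = sum(1 for seq in product(range(1, r + 1), repeat=n)
           if set(seq) != full)
brute = Fraction(miss, r ** n)

# The inclusion-exclusion formula, with (1 - k/r)^n = (r-k)^n / r^n.
formula = sum(Fraction((-1) ** (k - 1) * comb(r, k) * (r - k) ** n, r ** n)
              for k in range(1, r + 1))
print(brute == formula)  # prints True
```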

Unfortunately, the signs alternate, which makes this formula hard to exploit directly. However, if we write

$\mathbb{P}(T>n)\leq \sum_{i=1}^r \mathbb{P}(E_{n,i})$

and if we note that

$\mathbb{P}(E_{n,i})=\left(1-\frac{1}{r}\right)^n\leq e^{-n/r}$

then, taking ${n=\lceil t r\log(r)\rceil}$, we obtain that for all real ${t>0}$,

$\mathbb{P}(T> \lceil tr\log(r)\rceil) \leq r^{-t+1}.$
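Using the exact inclusion-exclusion formula above, this tail bound can be checked numerically (a Python sketch; the values of ${r}$ and ${t}$ are arbitrary):

```python
from math import ceil, comb, log

def tail(r, n):
    # Exact P(T > n) from the inclusion-exclusion formula.
    return sum((-1) ** (k - 1) * comb(r, k) * (1 - k / r) ** n
               for k in range(1, r + 1))

r = 50
for t in (1.5, 2.0, 3.0):
    n = ceil(t * r * log(r))
    assert tail(r, n) <= r ** (1 - t)
print("bound verified")
```

The bound is only informative for ${t>1}$, since for ${t\leq1}$ the right-hand side is at least ${1}$.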

It turns out that the asymptotic fluctuations of ${T}$ are Gumbel.

Theorem 1 (Gumbel fluctuations) We have

$\frac{T-r\log(r)}{r} \underset{r\rightarrow\infty}{\overset{\mathrm{law}}{\longrightarrow}} G$

where ${G}$ is the Gumbel law on ${\mathbb{R}}$ with density ${t\mapsto e^{-e^{-t}-t}}$.

Proof: Since the Gumbel law ${G}$ has survival function ${t\mapsto1-e^{-e^{-t}}}$, it suffices to fix ${t\in\mathbb{R}}$ and to show that

$\lim_{r\rightarrow\infty}\mathbb{P}(T> r\log(r)+tr)=S(t)=1-e^{-e^{-t}}.$

Take ${r}$ large enough that ${r\log(r)+tr>r}$, and set ${n_{t,r}=r\log(r)+tr}$ if ${r\log(r)+tr\in\mathbb{N}}$ and ${n_{t,r}=\lceil r\log(r)+tr\rceil}$ otherwise. We already know that

$\mathbb{P}(T> r\log(r)+tr) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}}.$

Now, since ${\binom{r}{k}\leq\frac{r^k}{k!}}$ and ${1-u\leq e^{-u}}$ for all ${u\geq0}$, we have, for every fixed ${k\geq1}$,

$\binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \leq\frac{r^k}{k!}e^{-kn_{t,r}/r} \leq\frac{e^{-kt}}{k!}, \quad\text{while}\quad \binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \underset{r\rightarrow\infty}{\longrightarrow}\frac{e^{-kt}}{k!}.$

By dominated convergence, we finally obtain

$\lim_{r\rightarrow\infty} \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}} =\sum_{k=1}^\infty(-1)^{k-1}\frac{e^{-kt}}{k!}=S(t).$

$\Box$
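The theorem can be illustrated by a simulation (a Python sketch; the parameters are arbitrary, and ${T}$ is sampled exactly as a sum of independent geometrics as before):

```python
import numpy as np

rng = np.random.default_rng(1)
r, n_samples = 500, 20_000

# Sample T exactly via its decomposition into independent geometric
# variables with success probabilities (r - i + 1)/r, i = 1, ..., r.
p = np.arange(r, 0, -1) / r
T = rng.geometric(p, size=(n_samples, r)).sum(axis=1)
fluct = (T - r * np.log(r)) / r

# The Gumbel law has mean gamma (Euler's constant), variance pi^2/6,
# and survival function t -> 1 - exp(-exp(-t)), here checked at t = 0.
print(fluct.mean(), np.euler_gamma)
print(fluct.var(), np.pi ** 2 / 6)
print((fluct > 0).mean(), 1 - np.exp(-1))
```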

Actually, the coupon collector problem is a special instance of a more general model. Namely, if ${(X_n)_{n\geq0}}$ is a sequence of random variables, not necessarily independent or identically distributed, taking values in a finite set ${E}$, then the cover time is defined by

$T=\inf\{n\geq0:\{X_0,\ldots,X_n\}=E\}.$

The cover time of finite Markov chains or random walks on finite graphs has been extensively studied. For recent results, see e.g. Ding, Lee, Peres (2010).
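For illustration only (a minimal Python sketch, not taken from the reference above): the mean cover time of simple random walk on the cycle ${\mathbb{Z}/m\mathbb{Z}}$ is known to be ${m(m-1)/2}$, which a naive simulation recovers.

```python
import random

def cover_time_cycle(m, rng):
    # Simple random walk on the cycle Z/mZ started at 0; returns the
    # first time at which all m vertices have been visited.
    pos, seen, t = 0, {0}, 0
    while len(seen) < m:
        pos = (pos + rng.choice((-1, 1))) % m
        t += 1
        seen.add(pos)
    return t

rng = random.Random(0)
m, n_samples = 20, 2000
est = sum(cover_time_cycle(m, rng) for _ in range(n_samples)) / n_samples
print(est, m * (m - 1) / 2)  # empirical mean vs the exact value 190
```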

No no, your write-up is actually quite vivid and full of life! I really like this coupon collector model. Replacing the i.i.d. structure of the sequence $X_n$ by a finite Markov chain yields the cover time. I should write a post about that some day…