Let \( {X_1,X_2,\ldots} \) be i.i.d. random variables uniformly distributed on \( {\{1,\ldots,r\}} \). The classical coupon collector problem consists in looking at the random time
\[ T=\min\{n\geq1:\{X_1,\ldots,X_n\}=\{1,\ldots,r\}\}. \]
Since we have \( {T=G_1+\cdots+G_r} \) where \( {G_1,\ldots,G_r} \) are independent geometric random variables on \( {\{1,2,\ldots\}} \) with \( {\mathbb{E}(G_i)=r/(r-i+1)} \), we obtain
\[ \mathbb{E}(T)=r\sum_{i=1}^r\frac{1}{i}=r(\log(r)+\gamma+o_{r\rightarrow\infty}(1)). \]
One may compute similarly the variance and obtain a deviation bound around the mean via Chebyshev’s inequality (exercise!). A simple reasoning shows that for any \( {n\geq r} \),
\[ \mathbb{P}(T=n)=\frac{r!}{r^n}S(n-1,r-1) \]
where \( {S(n-1,r-1)} \) is the Stirling number of the second kind (the number of ways to partition \( {n-1} \) objects into \( {r-1} \) non empty blocs). It is difficult to get the tail of \( {T} \) from this formula.
Another possibility is to write, for every \( {n\geq r} \),
\[ \mathbb{P}(T>n) = \mathbb{P}(E_{n,1}\cup\cdots\cup E_{n,r}) \quad\text{where}\quad E_{n,i} = \{X_1\neq i,\ldots,X_n\neq i\}. \]
Note that if \( {1\leq i_1,\ldots,i_k\leq r} \) are distincs, then with \( {R:=\{1,\ldots,r\}\setminus\{i_1,\ldots,i_k\}} \),
\[ \mathbb{P}(E_{n,i_1}\cap\cdots\cap E_{n,i_k}) =\mathbb{P}(X_1\in R)\cdots\mathbb{P}(X_n\in R) =\left(\frac{r-k}{r}\right)^n=\left(1-\frac{k}{r}\right)^n. \]
Now, by the inclusion-exclusion principle (also known as the Poincaré sieve), for every \( {n\geq r } \),
\[ \mathbb{P}(T>n) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k}\left(1-\frac{k}{r}\right)^n. \]
Unfortunately, the signs are alternating. Actually, if we write
\[ \mathbb{P}(T>n)\leq \sum_{i=1}^r \mathbb{P}(E_{n,i}) \]
and if we note that
\[ \mathbb{P}(E_{n,i})=\left(1-\frac{1}{r}\right)^n\leq e^{-n/r} \]
then, taking \( {n=\lceil t r\log(r)\rceil} \), we obtain that for all real \( {t>0} \),
\[ \mathbb{P}(T> \lceil tr\log(r)\rceil) \leq r^{-t+1}. \]
It turns out that the asymptotic fluctuations of \( {T} \) are Gumbel.
Theorem 1 (Gumbel fluctuations) We have
\[ \frac{T-r\log(r)}{r} \underset{r\rightarrow\infty}{\overset{\mathrm{law}}{\longrightarrow}} G \]
where \( {G} \) is the Gumbel law on \( {\mathbb{R}} \) with density \( {t\mapsto e^{-e^{-t}-t}} \).
Proof: Let us fix \( {t\in\mathbb{R}} \) and let us show that
\[ \lim_{r\rightarrow\infty}\mathbb{P}(T> r\log(r)+tr)=S(t)=1-e^{-e^{-t}}. \]
Fix \( {t\in\mathbb{R}} \) and take \( {r} \) large enough such that \( {r\log(r)+tr>r} \). Set \( {n_{t,r}=r\log(r)+tr} \) if \( {r\log(r)+tr\in\mathbb{N}} \) and \( {n_{t,r}=\lceil r\log(r)+tr\rceil} \) if not. We already know that
\[ \mathbb{P}(T> r\log(r)+tr) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}}. \]
Now, since \( {\binom{r}{k}\leq r^kk!} \) and \( {(1-u)\leq e^{-u}} \) for all \( {u\geq0} \), we have
\[ \binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \overset{\leq}{\underset{r\rightarrow\infty}{\longrightarrow}} \frac{e^{-kt}}{k!}. \]
By dominated convergence, we obtain finally
\[ \lim_{r\rightarrow\infty} \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}} =\sum_{k=1}^\infty(-1)^{k-1}\frac{e^{-kt}}{k!}=S(t). \]
$\Box$
- The classical coupon collector problem is considered in e.g. Feller (volume I)
- Another alternating signs formula for the tail can be found in e.g. Fulman (2009)
- The proof of the Gumbel fluctuation is taken from Motwani and Raghavan (section 3.2)
- The non uniform coupon collector is nicely studied by e.g. Holst (1986)
Actually, the coupon collector is a special instance of a more general model. Namely, if \( {(X_n)_{n\geq0}} \) is a sequence of random variables, not necessarily independent or of the same law, taking their values in a finite set \( {E} \), then we define the cover time by
\[ T=\inf\{n\geq0:\{X_0,\ldots,X_n\}=E\}. \]
The cover time of finite Markov chains or random walks on finite graphs was extensively studied. For recent results, see e.g. Ding, Lee, Peres (2010).
2 Comments