Let ${X_1,X_2,\ldots}$ be i.i.d. random variables uniformly distributed on ${\{1,\ldots,r\}}$. The classical coupon collector problem consists in studying the random time

$T=\min\{n\geq1:\{X_1,\ldots,X_n\}=\{1,\ldots,r\}\}.$

We have ${T=G_1+\cdots+G_r}$, where ${G_i}$ is the waiting time to collect the ${i}$-th distinct value: the ${G_1,\ldots,G_r}$ are independent geometric random variables on ${\{1,2,\ldots\}}$ with success probability ${(r-i+1)/r}$, so that ${\mathbb{E}(G_i)=r/(r-i+1)}$, and we obtain

$\mathbb{E}(T)=r\sum_{i=1}^r\frac{1}{i}=r(\log(r)+\gamma+o_{r\rightarrow\infty}(1)).$
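Both the mean above and the variance of the exercise below can be checked by a quick simulation (a Python sketch; the choice ${r=100}$ and the sample size are arbitrary, and the closed-form variance used in the code follows from the same decomposition of ${T}$ into independent geometrics, since ${\mathrm{Var}(G_i)=(1-p_i)/p_i^2}$ with ${p_i=(r-i+1)/r}$):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n_samples = 100, 20_000

# T = G_1 + ... + G_r with independent geometric G_i on {1, 2, ...}
# and success probability p_i = (r - i + 1)/r, so sampling the G_i
# directly is exact in distribution.
p = np.arange(r, 0, -1) / r
T = rng.geometric(p, size=(n_samples, r)).sum(axis=1)

j = np.arange(1, r + 1)
mean_exact = r * (1.0 / j).sum()                         # r * H_r
var_exact = r**2 * (1.0 / j**2).sum() - r * (1.0 / j).sum()

print(T.mean(), mean_exact)
print(T.var(), var_exact)
```

Here `var_exact` is ${\sum_i(1-p_i)/p_i^2}$ rewritten with the change of index ${j=r-i+1}$; sampling ${T}$ as a sum of geometrics avoids drawing the coupons one by one.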

One may similarly compute the variance and obtain a deviation bound around the mean via Chebyshev’s inequality (exercise!). A simple argument shows that for any ${n\geq r}$,

$\mathbb{P}(T=n)=\frac{r!}{r^n}S(n-1,r-1)$

where ${S(n-1,r-1)}$ is a Stirling number of the second kind (the number of ways to partition ${n-1}$ objects into ${r-1}$ non-empty blocks). It is difficult to extract the tail behavior of ${T}$ from this formula.
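For small parameters, the formula can nevertheless be verified exactly by brute-force enumeration (a Python sketch; the helper `stirling2` implements the standard recursion, and the values ${r=3}$, ${n=6}$ are arbitrary):

```python
from fractions import Fraction
from itertools import product
from math import factorial

def stirling2(n, k):
    # Stirling numbers of the second kind via the recursion
    # S(n, k) = k * S(n-1, k) + S(n-1, k-1).
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

r, n = 3, 6
full = set(range(1, r + 1))

# Brute force: count the sequences of length n over {1,...,r} whose
# first time of full coverage is exactly n.
count = sum(1 for seq in product(range(1, r + 1), repeat=n)
            if set(seq) == full and set(seq[:-1]) != full)
brute = Fraction(count, r ** n)

# The formula P(T = n) = r! S(n-1, r-1) / r^n.
formula = Fraction(factorial(r) * stirling2(n - 1, r - 1), r ** n)
print(brute == formula)  # prints True
```

Exact rational arithmetic via `Fraction` rules out any floating-point doubt in the comparison.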

Another possibility is to write, for every ${n\geq r}$,

$\mathbb{P}(T>n) = \mathbb{P}(E_{n,1}\cup\cdots\cup E_{n,r}) \quad\text{where}\quad E_{n,i} = \{X_1\neq i,\ldots,X_n\neq i\}.$

Note that if ${1\leq i_1,\ldots,i_k\leq r}$ are distinct, then with ${R:=\{1,\ldots,r\}\setminus\{i_1,\ldots,i_k\}}$,

$\mathbb{P}(E_{n,i_1}\cap\cdots\cap E_{n,i_k}) =\mathbb{P}(X_1\in R)\cdots\mathbb{P}(X_n\in R) =\left(\frac{r-k}{r}\right)^n=\left(1-\frac{k}{r}\right)^n.$

Now, by the inclusion-exclusion principle (also known as the Poincaré sieve), for every ${n\geq r }$,

$\mathbb{P}(T>n) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k}\left(1-\frac{k}{r}\right)^n.$
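For small values this alternating formula can also be checked against brute-force enumeration (a Python sketch; the choice ${r=3}$, ${n=6}$ is arbitrary):

```python
from fractions import Fraction
from itertools import product
from math import comb

r, n = 3, 6
full = set(range(1, r + 1))

# Brute-force P(T > n): the proportion of sequences of length n
# over {1,...,r} that fail to cover {1,...,r}.
miss = sum(1 for seq in product(range(1, r + 1), repeat=n)
           if set(seq) != full)
brute = Fraction(miss, r ** n)

# The inclusion-exclusion formula, with (1 - k/r)^n = (r-k)^n / r^n.
formula = sum(Fraction((-1) ** (k - 1) * comb(r, k) * (r - k) ** n, r ** n)
              for k in range(1, r + 1))
print(brute == formula)  # prints True
```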

Unfortunately, the signs alternate, which makes this formula hard to exploit directly. However, if we write

$\mathbb{P}(T>n)\leq \sum_{i=1}^r \mathbb{P}(E_{n,i})$

and if we note that

$\mathbb{P}(E_{n,i})=\left(1-\frac{1}{r}\right)^n\leq e^{-n/r}$

then, taking ${n=\lceil t r\log(r)\rceil}$, we obtain that for all real ${t>0}$,

$\mathbb{P}(T> \lceil tr\log(r)\rceil) \leq r^{-t+1}.$
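Using the exact inclusion-exclusion formula above, this tail bound can be checked numerically (a Python sketch; the values of ${r}$ and ${t}$ are arbitrary):

```python
from math import ceil, comb, log

def tail(r, n):
    # Exact P(T > n) from the inclusion-exclusion formula.
    return sum((-1) ** (k - 1) * comb(r, k) * (1 - k / r) ** n
               for k in range(1, r + 1))

r = 50
for t in (1.5, 2.0, 3.0):
    n = ceil(t * r * log(r))
    assert tail(r, n) <= r ** (1 - t)
print("bound verified")
```

The bound is only informative for ${t>1}$, since for ${t\leq1}$ the right-hand side is at least ${1}$.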

It turns out that the asymptotic fluctuations of ${T}$ are Gumbel.

Theorem 1 (Gumbel fluctuations) We have

$\frac{T-r\log(r)}{r} \underset{r\rightarrow\infty}{\overset{\mathrm{law}}{\longrightarrow}} G$

where ${G}$ is the Gumbel law on ${\mathbb{R}}$ with density ${t\mapsto e^{-e^{-t}-t}}$.

Proof: Since the Gumbel law ${G}$ has survival function ${t\mapsto1-e^{-e^{-t}}}$, it suffices to fix ${t\in\mathbb{R}}$ and to show that

$\lim_{r\rightarrow\infty}\mathbb{P}(T> r\log(r)+tr)=S(t)=1-e^{-e^{-t}}.$

Take ${r}$ large enough that ${r\log(r)+tr>r}$, and set ${n_{t,r}=r\log(r)+tr}$ if ${r\log(r)+tr\in\mathbb{N}}$ and ${n_{t,r}=\lceil r\log(r)+tr\rceil}$ otherwise. We already know that

$\mathbb{P}(T> r\log(r)+tr) = \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}}.$

Now, since ${\binom{r}{k}\leq\frac{r^k}{k!}}$ and ${1-u\leq e^{-u}}$ for all ${u\geq0}$, we have, for every fixed ${k\geq1}$,

$\binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \leq\frac{r^k}{k!}e^{-kn_{t,r}/r} \leq\frac{e^{-kt}}{k!}, \quad\text{while}\quad \binom{r}{k}\left(1-\frac{k}{r}\right)^{n_{t,r}} \underset{r\rightarrow\infty}{\longrightarrow}\frac{e^{-kt}}{k!}.$

By dominated convergence, we finally obtain

$\lim_{r\rightarrow\infty} \sum_{k=1}^r(-1)^{k-1}\binom{r}{k} \left(1-\frac{k}{r}\right)^{n_{t,r}} =\sum_{k=1}^\infty(-1)^{k-1}\frac{e^{-kt}}{k!}=S(t).$

$\Box$
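The theorem can be illustrated by a simulation (a Python sketch; the parameters are arbitrary, and ${T}$ is sampled exactly as a sum of independent geometrics as before):

```python
import numpy as np

rng = np.random.default_rng(1)
r, n_samples = 500, 20_000

# Sample T exactly via its decomposition into independent geometric
# variables with success probabilities (r - i + 1)/r, i = 1, ..., r.
p = np.arange(r, 0, -1) / r
T = rng.geometric(p, size=(n_samples, r)).sum(axis=1)
fluct = (T - r * np.log(r)) / r

# The Gumbel law has mean gamma (Euler's constant), variance pi^2/6,
# and survival function t -> 1 - exp(-exp(-t)), here checked at t = 0.
print(fluct.mean(), np.euler_gamma)
print(fluct.var(), np.pi ** 2 / 6)
print((fluct > 0).mean(), 1 - np.exp(-1))
```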

Actually, the coupon collector problem is a special instance of a more general model. Namely, if ${(X_n)_{n\geq0}}$ is a sequence of random variables, not necessarily independent or identically distributed, taking values in a finite set ${E}$, then the cover time is defined by

$T=\inf\{n\geq0:\{X_0,\ldots,X_n\}=E\}.$

The cover time of finite Markov chains or random walks on finite graphs has been extensively studied. For recent results, see e.g. Ding, Lee, Peres (2010).
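For illustration only (a minimal Python sketch, not taken from the reference above): the mean cover time of simple random walk on the cycle ${\mathbb{Z}/m\mathbb{Z}}$ is known to be ${m(m-1)/2}$, which a naive simulation recovers.

```python
import random

def cover_time_cycle(m, rng):
    # Simple random walk on the cycle Z/mZ started at 0; returns the
    # first time at which all m vertices have been visited.
    pos, seen, t = 0, {0}, 0
    while len(seen) < m:
        pos = (pos + rng.choice((-1, 1))) % m
        t += 1
        seen.add(pos)
    return t

rng = random.Random(0)
m, n_samples = 20, 2000
est = sum(cover_time_cycle(m, rng) for _ in range(n_samples)) / n_samples
print(est, m * (m - 1) / 2)  # empirical mean vs the exact value 190
```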

No no, your write-up is actually quite vivid and full of life! I really like this coupon collector model. Replacing the i.i.d. structure of the sequence $X_n$ by a finite Markov chain yields the cover time. I should write a post about that some day…