Suppose that we would like to describe mathematically the convergence of a sequence ${(X_n)}_n$ of random variables towards a limiting random variable $X_\infty$, as $n\to\infty$. We first have to select a notion of convergence. If we decide to use almost sure convergence, we need to define all the $X_n$’s as well as the limit $X_\infty$ on a common probability space in order to give a meaning to $$\mathbb{P}(\lim_{n\to\infty}X_n=X_\infty)=1.$$ This means that we need to couple the random variables. If we decide to use convergence in probability or in $L^p$, we have to define, for all $n$, both $X_n$ and $X_\infty$ on the same probability space in order to give a meaning to $\mathbb{P}(|X_n-X_\infty|>\varepsilon)$ and $\mathbb{E}(|X_n-X_\infty|^p)$ respectively, and therefore we again end up defining all the $X_n$’s as well as $X_\infty$ on a common probability space. However, if we decide to use convergence in law (i.e. in distribution), then we do not need to define the random variables on a common probability space at all.
In the special case where $X_\infty$ is deterministic, convergence in probability or in $L^p$ no longer imposes defining the random variables on the same probability space. However, almost sure convergence still requires a common probability space. Moreover, if we impose that the almost sure convergence holds regardless of the way we define the random variables on the same probability space (i.e. for arbitrary couplings), then we end up with the important notion of complete convergence, which is equivalent, thanks to the Borel–Cantelli lemmas, to a summable convergence in probability. Note that when the limit is deterministic, we also know that convergence in law is equivalent to convergence in probability. Moreover, we know in general from the Borel–Cantelli lemma that a summable convergence in probability implies almost sure convergence. Furthermore, convergence in probability easily becomes summable under moment conditions.
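To make the Borel–Cantelli step explicit, here is a short sketch in our own notation, writing $m$ for the deterministic limit. If for every $\varepsilon>0$ the series $\sum_{n\geq1}\mathbb{P}(|X_n-m|>\varepsilon)$ converges, then the first Borel–Cantelli lemma gives $$\mathbb{P}\Bigl(\limsup_{n\to\infty}\{|X_n-m|>\varepsilon\}\Bigr)=0,$$ and letting $\varepsilon\downarrow0$ along a countable sequence yields $X_n\to m$ almost surely, whatever the joint law of ${(X_n)}_n$ may be. Conversely, if the almost sure convergence holds for every coupling, then it holds in particular for an independent coupling, and the second Borel–Cantelli lemma forces the summability above. As for moment conditions: if $\mathbb{E}(|X_n-m|^p)\leq Cn^{-\alpha}$ for some constants $p>0$, $\alpha>1$, $C<\infty$, then Markov’s inequality gives $\mathbb{P}(|X_n-m|>\varepsilon)\leq C\varepsilon^{-p}n^{-\alpha}$, which is summable in $n$.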
Following Hsu & Robbins, if we consider $X_n=\frac{1}{n}(Z_1+\cdots+Z_n)$ where $Z_1,\ldots,Z_n$ are independent copies of some $Z$ of mean $m$, then the sequence ${(X_n)}_n$ converges completely towards $m$ as soon as $Z$ has a finite second moment, and by a result of Erdős this condition is also necessary. This sheds an interesting light on the law of large numbers for triangular arrays.
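For the record, and in the notation above (so that $Z$ is integrable with mean $m$), the Hsu–Robbins–Erdős theorem can be stated as $$\forall\varepsilon>0,\quad\sum_{n\geq1}\mathbb{P}(|X_n-m|>\varepsilon)<\infty\quad\Longleftrightarrow\quad\mathbb{E}(Z^2)<\infty,$$ the sufficiency being due to Hsu and Robbins (1947) and the necessity to Erdős (1949). The link with triangular arrays comes from the independent coupling: if each row of the array is made of fresh independent copies, then almost sure convergence of the row averages is, by the Borel–Cantelli lemmas, exactly complete convergence.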
Some people refuse to consider almost sure convergence as a true mode of convergence, in the sense that it is not associated with a metric, contrary to the other modes of convergence. In some sense, it appears as a critical notion in the law of large numbers, when we lower the concentration, typically via integrability (moment conditions). Of course there are plenty of concrete situations, for instance with martingales, in which the coupling is in fact imposed and for which the almost sure convergence towards a non-constant random variable holds very naturally. Famous examples include Pólya urns and Galton–Watson branching processes. The Marchenko–Pastur theorem in random matrix theory provides an example of a natural coupling with a deterministic limiting object, for which the convergence is complete via concentration of measure, provided that the ingredients have enough finite moments.
Note. The idea of writing this tiny post came from a discussion with my friend Adrien Hardy.
Georges de la Tour, Les joueurs de dés (The Dice Players), circa 1640.
Recently, during a coffee break, a discussion emerged about the presence of probability and statistics in top journals such as Annals of Mathematics, Acta Mathematica, Inventiones Mathematicae, or Journal of the AMS. Well, the question is of some interest from the point of view of the sociology and history of science. Let us use the Primary and Secondary Mathematics Subject Classification (MSC) codes of each article in order to detect Probability (60x) or Statistics (62x). Here is the data from MathSciNet/zbMATH:
Annals of Mathematics published 4464 papers in total from 1938 to 2019. Among them, 76 (1.7%) have Primary MSC 60x and 112 (2.5%) have Primary or Secondary MSC 60x, while only 2 have Primary or Secondary MSC 62x.
Acta Mathematica published 1297 papers in total from 1938 to 2017. Among them, 44 (3.4%) have Primary MSC 60x and 63 (4.9%) have Primary or Secondary MSC 60x, while only 4 have Primary or Secondary MSC 62x.
Inventiones Mathematicae published 4311 papers in total from 1966 to 2019. Among them, 52 (1.2%) have Primary MSC 60x and 95 (2.2%) have Primary or Secondary MSC 60x, while only 2 have Primary or Secondary MSC 62x.
Journal of the AMS published 963 papers in total from 1988 to 2019. Among them, 28 (2.9%) have Primary MSC 60x and 49 (5.1%) have Primary or Secondary MSC 60x, while only 5 have Primary or Secondary MSC 62x.
The presence of probability is low, while that of statistics is microscopic. A scandal.
AO(P|S). Annals of Probability (AOP) and Annals of Statistics (AOS) were founded only in 1973.
1938. Annals of Mathematics is historically American whereas Acta Mathematica is European. They started in 1892 and 1882 respectively. According to MathSciNet, it seems that the first article classified 60x in these journals was published in 1938. The MSC itself was introduced at the end of the thirties, and many articles published before 1940 are still unclassified in MathSciNet at the time of writing. Note that N. Wiener published in the twenties while A. N. Kolmogorov published in the thirties.
Why. The phenomenon probably has multiple explanations. Among them we could mention, for instance, the possible effects of utilitarianism and anti-utilitarianism in the mathematical elite, in particular during the fifties and sixties, and the possible overweight of some kind of “snobbish pure mathematics or mathematicians” on the boards of top journals. We could also see AOP and AOS as some sort of mathematical ghettos and think about self-censorship. We could moreover think about generational effects. Finally, we have to keep in mind that some probability papers were published without any primary or secondary 60x code, such as for instance this one or that one.
Here is some additional data provided by MathSciNet for Annals of Mathematics:
Graphics for Annals of Mathematics, Acta Mathematica, Inventiones Mathematicae, and Journal of the AMS.
JMPA. We could think that a journal such as the Journal de mathématiques pures et appliquées, founded in 1872, is at the same time relatively prestigious, generalist, and more open to applied mathematics in general and to probability and statistics in particular. Here is the data for all MSC codes, taken from MathSciNet. We see an obvious overweight of partial differential equations. Meanwhile, the situation of probability is better than in the previous journals, while the presence of statistics is still microscopic.
CPAM. Finally, here is the same data for Communications on Pure and Applied Mathematics. This journal, established in 1948, is truly open to applied mathematics in general and to probability theory in particular. However, the presence of statistics is still extremely low.
| MSC | Description | Count |
|---|---|---|
| 35 | Partial differential equations | 898 |
| 76 | Fluid mechanics | 234 |
| 58 | Global analysis, analysis on manifolds | 182 |
| 60 | Probability theory and stochastic processes | 177 |
| 53 | Differential geometry | 97 |
| 65 | Numerical analysis | 92 |
| 82 | Statistical mechanics, structure of matter | 92 |
| 34 | Ordinary differential equations | 85 |
| 47 | Operator theory | 65 |
| 49 | Calculus of variations and optimal control; optimization | 64 |
| 37 | Dynamical systems and ergodic theory | 58 |
| 78 | Optics, electromagnetic theory | 58 |
| 20 | Group theory and generalizations | 49 |
| 46 | Functional analysis | 48 |
| 81 | Quantum theory | 43 |
| 10 | Number theory | 39 |
| 73 | Mechanics of solids | 37 |
| 30 | Functions of a complex variable | 29 |
| 36 | Other | 25 |
| 32 | Several complex variables and analytic spaces | 24 |
| 57 | Manifolds and cell complexes | 24 |
| 11 | Number theory | 23 |
| 74 | Mechanics of deformable solids | 23 |
| 94 | Information and communication, circuits | 22 |
| 42 | Harmonic analysis on Euclidean spaces | 20 |
| 03 | Mathematical logic and foundations | 15 |
| 31 | Potential theory | 15 |
| 45 | Integral equations | 15 |
| 62 | Statistics | 15 |
| 14 | Algebraic geometry | 14 |
| 55 | Algebraic topology | 14 |
| 70 | Mechanics of particles and systems | 14 |
| 83 | Relativity and gravitational theory | 14 |
| 92 | Biology and other natural sciences | 14 |
| 01 | History and biography | 13 |
| 15 | Linear and multilinear algebra; matrix theory | 13 |
| 52 | Convex and discrete geometry | 13 |
| 44 | Integral transforms, operational calculus | 12 |
| 22 | Topological groups, Lie groups | 11 |
| 26 | Real functions | 9 |
| 80 | Classical thermodynamics, heat transfer | 9 |
| 85 | Astronomy and astrophysics | 8 |
| 86 | Geophysics | 8 |
| 05 | Combinatorics | 7 |
| 43 | Abstract harmonic analysis | 7 |
| 12 | Field theory and polynomials | 6 |
| 28 | Measure and integration | 6 |
| 90 | Operations research, mathematical programming | 6 |
| 00 | General | 5 |
| 39 | Difference and functional equations | 5 |
| 68 | Computer science | 5 |
| 93 | Systems theory; control | 5 |
| 41 | Approximations and expansions | 4 |
| 91 | Game theory, economics, social and behavioral sciences | |
If you say “Let $X$ be a random variable, bla bla bla, and let $Y$ be another random variable independent of $X$…”, then you might be in trouble, because $X$ is defined on some uncontrolled and implicit probability space $(\Omega,\mathcal{A},\mathbb{P})$, and this space is not necessarily large enough to allow the definition of $Y$. The definition of $Y$ may require an enlargement of the initial probability space. This implicitly and sneakily breaks the flow of the mathematical reasoning. Of course this is not a problem in general, and we are often interested in (joint) distributions rather than in probability spaces. But it may produce serious bugs sometimes. The funny thing is that this is done silently everywhere, and many are not aware of the danger.
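A standard remedy, sketched here in our own notation, is to enlarge the space by a product construction. If $X$ is defined on $(\Omega,\mathcal{A},\mathbb{P})$ and we want an independent $Y$ with prescribed law $\nu$ on some measurable space $(E,\mathcal{E})$, we may replace the original space by the product $$(\Omega\times E,\ \mathcal{A}\otimes\mathcal{E},\ \mathbb{P}\otimes\nu),$$ and set $X'(\omega,y)=X(\omega)$ and $Y'(\omega,y)=y$. Then $X'$ has the law of $X$, $Y'$ has law $\nu$, and $X'$ and $Y'$ are independent, so every almost sure statement about $X$ transfers to $X'$. The sentence “let $Y$ be another random variable independent of $X$” silently performs this replacement of $(\Omega,\mathcal{A},\mathbb{P})$ by a richer space.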
Regarding probability space glitches, another common subtlety is the misuse of the Skorokhod representation theorem. This nice theorem states that if $(X_n)$ is a sequence of random variables taking values in, say, a separable metric space and such that $X_n\to X$ in law, then there exists a probability space $\Omega^*$ carrying a sequence $(X^*_n)$ and a random variable $X^*$ such that $X^*_n$ has the law of $X_n$ for all $n$, $X^*$ has the law of $X$, and $X^*_n\to X^*$ almost surely. This theorem is dangerous because it does not control the law of the sequence $(X^*_n)$ itself, in other words the correlations between the $X^*_n$: its proof plays with these correlations precisely in order to produce almost sure convergence! In particular, $(X_1,\ldots,X_n)$ and $(X^*_1,\ldots,X^*_n)$ do not have the same law in general when $n>1$. Moreover, even if the initial $X_n$ are independent, the $X^*_n$ are not independent in general. It is customary to say that if you prove something with the Skorokhod representation theorem, then it is likely that either your statement is wrong or you can find another proof.
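On the real line, the construction behind the theorem can be sketched with quantile functions, in our own notation: let $F_n$ and $F$ be the cumulative distribution functions of $X_n$ and $X$, let $U$ be a single uniform random variable on $(0,1)$, and set $$X^*_n=F_n^{-1}(U)\quad\text{and}\quad X^*=F^{-1}(U),\qquad\text{where}\quad F^{-1}(u)=\inf\{x\in\mathbb{R}:F(x)\geq u\}.$$ Then $X^*_n$ has the law of $X_n$ and $X^*$ the law of $X$, while the convergence in law $X_n\to X$ implies $F_n^{-1}(u)\to F^{-1}(u)$ at every continuity point $u$ of $F^{-1}$, hence for all but countably many $u$, and thus $X^*_n\to X^*$ almost surely. All the $X^*_n$ are deterministic functions of the same $U$: the coupling is completely rigid and retains nothing of the joint law of the original sequence.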
Note. The idea behind the proof of the Skorokhod representation theorem is that proximity of distributions implies the existence of a coupling close to the diagonal. For instance, it can easily be checked that if $\mu$ and $\nu$ are probability measures on, say, $\mathbb{Z}$, then $$\mathrm{d}_{\mathrm{TV}}(\mu,\nu)=\inf_{(X,Y)}\mathbb{P}(X\neq Y)$$ where the infimum runs over all pairs of random variables $(X,Y)$ with $X\sim\mu$ and $Y\sim\nu$.
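Here is a sketch of a coupling achieving this infimum, in our own notation, with the convention $\mathrm{d}_{\mathrm{TV}}(\mu,\nu)=\sup_{A\subset\mathbb{Z}}|\mu(A)-\nu(A)|=\frac{1}{2}\sum_k|\mu(k)-\nu(k)|$. Write $\mu\wedge\nu$ for the measure $k\mapsto\min(\mu(k),\nu(k))$ and set $$\alpha=\sum_k\min(\mu(k),\nu(k))=1-\mathrm{d}_{\mathrm{TV}}(\mu,\nu),$$ assuming $0<\alpha<1$ (the degenerate cases are immediate). With probability $\alpha$, draw a point from the normalized overlap $\frac{\mu\wedge\nu}{\alpha}$ and set $X=Y$ equal to it; with probability $1-\alpha$, draw $X$ from $\frac{\mu-\mu\wedge\nu}{1-\alpha}$ and $Y$ from $\frac{\nu-\mu\wedge\nu}{1-\alpha}$ independently. Then $X\sim\mu$, $Y\sim\nu$, and since the two residual measures have disjoint supports, $\mathbb{P}(X\neq Y)=1-\alpha=\mathrm{d}_{\mathrm{TV}}(\mu,\nu)$. The matching lower bound holds for every coupling, since $\mu(A)-\nu(A)=\mathbb{P}(X\in A)-\mathbb{P}(Y\in A)\leq\mathbb{P}(X\neq Y)$ for all $A\subset\mathbb{Z}$.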
Note. The idea of writing this micro-post came from a discussion with a PhD student.