# Month: November 2014

I have used recently the tool sloccount on the source code of the latest stable version of Open Journal Systems (OJS 2.4.5). The result is given below. It suggests that the development from scratch of such a software is a non trivial expensive task: 18M USD.

SLOC    Directory    SLOC-by-Language (Sorted)
193863  lib             php=115818,xml=77854,sh=160,perl=31
120061  plugins         xml=61444,php=58126,sh=453,perl=38
86143   locale          xml=86143
31919   classes         php=31919
21965   help            xml=21965
21619   rt              xml=21619
14607   pages           php=14607
5198    dbscripts       xml=5198
472     tools           php=472
419     registry        xml=419
373     controllers     php=373
1       top_dir         php=1
0       cache           (none)
0       docs            (none)
0       js              (none)
0       public          (none)
0       styles          (none)
0       templates       (none)

Totals grouped by language (dominant language first):
xml:         274642 (55.30%)
php:         221316 (44.56%)
sh:             613 (0.12%)
perl:            69 (0.01%)

Total Physical Source Lines of Code (SLOC)                = 496,640
Development Effort Estimate, Person-Years (Person-Months) = 135.48 (1,625.75)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 3.46 (41.51)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 39.17
Total Estimated Cost to Develop                           = $18,301,446 (average salary =$56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."

The source code of common operating systems kernels (Microsoft Windows, Apple Mac OS, Linux, etc) has dozens of millions of lines of code, mostly in C and in Assembler while for OJS it is approximately half a million of lines of code in PHP and in XML.

Certain probabilists and statisticians like to repeat that the Metropolis-Hastings algorithm is one of the most important algorithms on earth, and drop the name of Diaconis to reinforce the statement. Well, beyond earth, this might be true in the whole universe, who knows?

Let us recall that the Metropolis-Hastings algorithm is a remarkable recipe to produce a Markov kernel which can be simulated and which admits a prescribed probability distribution, known up to a multiplicative constant, as its unique reversible invariant distribution. The idea behind the Metropolis-Hastings algorithm is very simple, and can be traced back to an article published in 1953 entitled Equation of State Calculations by Fast Computing Machines, written by Nicholas Metropolis, Arianna & Marshall Rosenbluth, and Augusta & Edward Teller, and to an article published in 1970 entitled Monte Carlo Sampling Methods Using Markov Chains and Their Applications, written by W. Keith Hastings. It mixes the idea of rejection sampling of Georges-Louis Leclerc Comte de Buffon and John von Neumann with the Markov chain concept. By the way, Edward Teller is among other things credited as being the mother of the US H-bomb together with the father Stanislaw Ulam (who was by the way fond of the Monte-Carlo method). Nowadays, the Metropolis-Hastings algorithm is available in many versions and is at the heart of the so-called Monte-Carlo-Markov-Chain (MCMC) numerical algorithms.

Yes the Metropolis-Hastings algorithm is a remarkable widely used non exact algorithm for stochastic simulation. However, the most important algorithms used in the world are probably more related to information/signal processing,  number crunching, linear algebra, and optimization, such as the Huffman coding algorithm, the Lempel-Ziv-Welch coding algorithm, the Hamming coding algorithm, the Reed-Muller coding algorithm, the Gallager coding algorithm, the Gram-Schmidt algorithm, the QR algorithm, the LU algorithms,  the Gauss elimination algorithm, the Euclidean algorithm, the Newton algorithm, the Simplex algorithm, the Hungarian algorithm, the Fast Fourier Transform algorithm, etc.

Recently, a friend of mine, Arnaud Guyader, discovered by accident during a teaching session that the geometric distribution generator of GNU R, implemented in the rgeom function, is sub-optimal. The command help rgeom says rgeom uses the derivation as an exponential mixture of Poissons, see Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer-Verlag, New York. Page 480. Moreover, page 480 of this book says nothing about the geometric distribution.

/* Source code of rgeom.c in latest version of GNU R (2014-11-15).
*  Mathlib : A C Library of Special Functions
*  Copyright (C) 1998 Ross Ihaka and the R Core Team.
*  Copyright (C) 2000 The R Core Team
*
*  This program is free software; you can redistribute it and/or modify
*  the Free Software Foundation; either version 2 of the License, or
*  (at your option) any later version.
*
*  This program is distributed in the hope that it will be useful,
*  but WITHOUT ANY WARRANTY; without even the implied warranty of
*  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
*  GNU General Public License for more details.
*
*  You should have received a copy of the GNU General Public License
*  along with this program; if not, a copy is available at
*
*  SYNOPSIS
*
*    #include <Rmath.h>
*    double rgeom(double p);
*
*  DESCRIPTION
*
*    Random variates from the geometric distribution.
*
*  NOTES
*
*    We generate lambda as exponential with scale parameter
*    p / (1 – p).  Return a Poisson deviate with mean lambda.
*
*  REFERENCE
*
*    Devroye, L. (1986).
*    Non-Uniform Random Variate Generation.
*    New York: Springer-Verlag.
*    Page 480.
*/

#include “nmath.h”

double rgeom(double p)
{
if (!R_FINITE(p) || p <= 0 || p > 1) ML_ERR_return_NAN;

return rpois(exp_rand() * ((1 – p) / p));
}

The fact that the geometric distribution is an exponential mixture of Poisson is a special case of the more general fact that the negative binomial distribution, i.e. the convolution power of the geometric distribution also known as the Pascal distribution, is a gamma mixture of Poisson. Namely if ${X\sim\mathrm{Gamma}(n,\ell)}$ with ${\ell=p/(1-p)}$, and if ${Y|X\sim\mathrm{Poisson}(X)}$ then ${Y\sim\mathrm{NegativeBinomial}(n,p)}$, since for every ${k\geq n}$, ${\mathbb{P}(Y=k)}$ writes

$\int_0^\infty\!e^{-\lambda} \frac{\lambda^k}{k!} \frac{\ell^n\lambda^{n-1}}{\Gamma(n)}e^{-\lambda\ell}\,d\lambda =\frac{\ell^n}{k!\Gamma(n)} \int_0^\infty\!\lambda^{n+k-1}e^{-\lambda(\ell+1)}\,d\lambda =\frac{\Gamma(n+k)}{k!\Gamma(n)}(1-p)^kp^n.$

For ${n=1}$ we recover the geometric distribution. This mixture representation is useful for the simulation of negative binomial distributions with high integer parameter, but is not the best choice for the simulation of the special case of the geometric distribution.

The simulation of the geometric distribution is better done by using the so-called inversion method (generalized inverse of the cdf), which boils down to take the integer part of the standard exponential distribution generator: ceil(log(rand())/log(1-p)) since if ${X\sim\mathrm{Exponential}(\lambda)}$ with ${\lambda=-\log(1-p)}$ then

$\mathbb{P}(k\leq X\leq k+1) =e^{-k\lambda}-e^{-(k+1)\lambda} =e^{-\lambda k}(1-e^{-\lambda}) =(1-p)^kp.$

Patch. Here is a possible patch to the rgeom.c of GNU R.

31,32c31
<  *    We generate lambda as exponential with scale parameter
<  *    p / (1 – p).  Return a Poisson deviate with mean lambda.

>  *    returns 1+ integer part of exponential with scale parameter -1 / log(1 – p).
39c38
<  *    Page 480.

>  *    Section X.2 page 498.
48c47
<     return rpois(exp_rand() * ((1 – p) / p));

>     return ceil(log(rand()) / log(1 – p));

Further reading. The simulation of the geometric distribution is discussed in Section X.2 page 498 of the book of Devroye: the inversion method is considered, but no words on the exponential mixture of Poisson. The simulation of the negative binomial distribution is discussed in Section X.4.7 page 543, and the Gamma mixture of Poisson is considered.

Other software. The random generator for the geometric distribution proposed presently by the GNU Scientific Library, by the GNU Octave, and by the Scilab software are based on the inversion method. The Distributions.jl package for the Julia software also uses the inversion method. The good news with open source software is that even if the help does not mention the algorithm that is used, you can always check the source code.

In a finite graph ${G=(V,E)}$, the degree of a vertex ${i\in V}$ is the number ${d(i)}$ of vertices connected directly to ${i}$ by an edge:

$d(i)=\mathrm{card}\{j\in V:\{i,j\}\in E\}.$

We say that the graph ${G}$ is ${d}$-regular where ${d\geq0}$ is a fixed integer when ${d(i)=d}$ for all ${i\in V}$. The ${0}$-regular graphs are formed by isolated vertices and do not have any edge. The ${1}$-regular graphs are formed by disconnected edges. The ${2}$-regular graphs are formed by disconnected cycles and infinite chains.

The Erdős-Gallai theorem states that for any integers ${n\geq1}$ and ${d_1\geq\cdots\geq d_n\geq0}$, there exists a graph with ${n\geq1}$ vertices with respective degrees ${d_1,\ldots,d_n}$ if and only if the following two conditions hold true:

1. the integer ${d_1+\cdots+d_n}$ is even;
2. for any ${1\leq k\leq n}$, ${d_{1}+\cdots+d_{k}\leq k(k-1)+ \min(k,d_{k+1})+\cdots+\min(k,d_{n})}$.

In this case we say that ${d_1,\ldots,d_n}$ is graphic. The fact that the sum ${d_1+\cdots+d_n}$ is even comes from the fact that each edge counts twice. The quantity ${k(k-1)+\min(d_{k+1},k)+\cdots+\min(d_{n},k)}$ is the maximum contribution to ${d_1+\cdots+d_k}$ of the edges connected to the vertices ${1}$ to ${k}$: at most ${\binom{k}{2}=k(k-1)/2}$ edges (count twice) between the vertices ${1}$ to ${k}$, and at most ${\min(k,d_{k+i})}$ edges between the vertex ${i>k}$ and the vertices ${1}$ to ${k}$. We say that ${d_1,\ldots,d_n}$ is the degree sequence of the graph.

In particular, for a ${d}$-regular graph with ${n}$ vertices, we have ${d_1=\cdots=d_n=d}$ and the Erdős-Gallai theorem states that ${d}$ is even and ${kd\leq k(k-1)+(n-k)\min(k,d)}$ for every ${1\leq k\leq n}$, which gives in particular ${d\leq n-1}$ and equality is achieved for the complete graph.

The Erdős-Gallai theorem was proved by Paul Erdős and Tibor Gallai in a paper published in 1960 in hungarian. Nowadays, many proofs are available, including one by Claude Berge. A short and constructive proof can be found in a paper by Amitabha Tripathi, Sushmita Venugopalan, and Douglas West.

Further reading. Lecture notes on random graphs by Remco van der Hofstad, and by Charles Bordenave, none of which contains a proof of the Erdős-Gallai theorem.