A Few Interesting Connections Between Statistics and Computer Science

Palash Sarkar

Applied Statistics Unit, Indian Statistical Institute, Kolkata

[email protected]

Indian Statistical Institute, North-East Centre, Tezpur

24th July 2011

“Statistics is the universal tool of inductive inference, research in natural and social sciences, and technological applications.

Statistics, therefore, must always have purpose, either in the pursuit of knowledge or in the promotion of human welfare.”

– Prasanta Chandra Mahalanobis (2nd December, 1956)

Computer Science: Two View Points

“It is unworthy of excellent men to lose hours like slaves in the labour of calculation which could be relegated to anyone else if machines were used.”

– Gottfried von Leibniz

“A huge gap exists between what we know is possible with today’s machines and what we have so far been able to finish.”

– Donald Knuth

Tennyson wrote: “Every moment dies a man, Every moment one is born.”

Babbage wrote back: “Every moment dies a man, Every moment 1 1/16 is born.”

With the comment: “1 1/16 will be sufficiently accurate for poetry.”

Statistics and Computer Science: Some Connections

Jute survey and the travelling salesman problem (TSP).

Order statistics and clustering.

Design of experiments, coding theory and computer science.

A sampling problem.

Jute Survey and the Travelling Salesman Problem.

Variance and total cost are given by

$$V = \sum_{i=1}^{k} A_i v_i / y_i, \qquad T = \sum_{i=1}^{k} A_i t_i.$$

The problem that Mahalanobis considers is to fix the cost T at a certain value and then choose (x_i, y_i) such that V is minimised.

Order statistics and clustering.

Palash Sarkar (ISI, Kolkata) Stat-CS Connections Tezpur, 2011 13 / 31

M. Blum, R.W. Floyd, V. Pratt, R. Rivest and R. Tarjan, “Time bounds for selection,” J. Comput. System Sci. 7 (1973) 448–461.
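
The paper cited above is known for a worst-case linear-time algorithm for computing order statistics, usually called "median of medians". The following is only a minimal Python sketch of that idea, not the authors' exact procedure; the function name `select` and the group size of five are choices made for this illustration.

```python
def select(items, k):
    """Return the k-th smallest element of items (k is 0-indexed).

    Illustrative median-of-medians selection: the pivot is the median of
    the medians of groups of five, which guarantees a constant fraction
    of the elements is discarded in every round.
    """
    items = list(items)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # Median of each group of at most five elements.
        medians = []
        for i in range(0, len(items), 5):
            group = sorted(items[i:i + 5])
            medians.append(group[len(group) // 2])
        # Recursively pick the median of the group medians as the pivot.
        pivot = select(medians, len(medians) // 2)
        lows = [x for x in items if x < pivot]
        highs = [x for x in items if x > pivot]
        n_equal = len(items) - len(lows) - len(highs)
        if k < len(lows):
            items = lows
        elif k < len(lows) + n_equal:
            return pivot
        else:
            k -= len(lows) + n_equal
            items = highs
```

For example, `select(data, len(data) // 2)` returns a median of `data` without sorting the whole list.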

Problem: given a list of data items, partition it into “similar” groups.

A natural question when there is a list of data items.

Applications to data mining, information retrieval, image processing and web search.

k-means clustering: a widely used definition.

Given a set of points P, the k-means clustering problem seeks to find a set K of k centres such that

$$\sum_{p \in P} d(p, K)^2$$

is minimised.
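
To make the objective concrete, here is a small Python sketch that evaluates the k-means cost of a given set of centres. It assumes the standard reading in which d(p, K) is the Euclidean distance from p to its nearest centre in K; the names `squared_distance` and `kmeans_cost` are introduced only for this illustration.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_cost(points, centres):
    """Sum over all points of the squared distance to the nearest centre."""
    return sum(min(squared_distance(p, c) for c in centres) for p in points)


# Four points forming two tight pairs, and two candidate sets of centres.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_centres = [(0.0, 1.0), (10.0, 1.0)]
one_centre = [(5.0, 1.0)]
```

Here `kmeans_cost(points, two_centres)` is much smaller than `kmeans_cost(points, one_centre)`, reflecting that two centres describe this data far better than one.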

The centroid of a set of points can be very well approximated by sampling a constant number of points and finding the centroid of the sample (Inaba-Katoh-Imai, 1994).

Sampling O(k) points and considering all constant size subsets of the sample can give the centres of the largest clusters.

A more careful strategy is required to find the centres of the smaller clusters.
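
The first point above, approximating a centroid from a small sample, can be tried out with the sketch below. It draws m points uniformly with replacement and compares the 1-means cost at the sample centroid with the cost at the true centroid. The guarantee usually quoted for the Inaba-Katoh-Imai result is a cost ratio of at most 1 + 1/(δm) with probability at least 1 - δ; treat that statement, the synthetic Gaussian data, and all function names here as assumptions of this illustration.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / n for j in range(dim))


def one_mean_cost(points, centre):
    """Sum of squared Euclidean distances from every point to centre."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centre)) for p in points)


def sampled_centroid(points, m, rng):
    """Centroid of m points drawn uniformly with replacement."""
    sample = [rng.choice(points) for _ in range(m)]
    return centroid(sample)


# Synthetic data (an assumption for the demo): 10,000 Gaussian points in the plane.
rng = random.Random(1)
points_2d = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10000)]

approx = sampled_centroid(points_2d, 50, rng)
ratio = one_mean_cost(points_2d, approx) / one_mean_cost(points_2d, centroid(points_2d))
```

The value `ratio` can never fall below 1, because the true centroid minimises the sum of squared distances; how close it stays to 1 depends on the sample size m, not on the number of points.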

Design of experiments, coding theory and computer science.

A sampling problem

Internet Traffic Analysis: a Streaming Scenario

Consider an internet router.

A huge number of internet packets are flowing through the router. Only a sample of the packets can be stored.

Reservoir sampling, Knuth 1969.

Each packet has a weight which is the number of bytes in the packet.

Analysis based on the sample.

Estimate weights of arbitrary subsets of the packets that have gone by.

The subsets whose weights are to be analysed are not known at the time of sampling.
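
The reservoir sampling mentioned above for the router scenario can be sketched as follows. This is the classical uniform scheme (often called Algorithm R), written as a minimal Python illustration; the function name and the `rng` parameter are assumptions of the sketch. Note that this uniform version ignores packet weights.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are stored at any time, so the stream may be far larger
    than memory. After t items have been seen, each of them is in the
    reservoir with probability k / t.
    """
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number t + 1 replaces a random slot with probability k / (t + 1).
            j = rng.randrange(t + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

A call such as `reservoir_sample(packet_iterator, 1000)` would keep 1000 packets regardless of how many pass through; `packet_iterator` is a hypothetical source of packets.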

Chain Store Data: a Non-Streaming Scenario

All sales data are stored.

Later analysis.

A sample is checked against weather records to estimate the number of days of rain before a boom in rain-gear sales.

The questions for the analysis may not be known at the time of sampling.

Can be related to reservoir sampling.

A small reservoir of samples can be easily shared over the internet by analysts at different locations.

Priority Sampling

N. Duffield, C. Lund and M. Thorup. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM, 2007.
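
The following is a minimal Python sketch of priority sampling as it is commonly described: each item of weight w gets a priority w / α with α uniform in (0, 1], the k items of highest priority are kept, and the threshold τ is the (k+1)-th highest priority. Each kept item is assigned the weight estimate max(w, τ), and a subset sum is estimated by adding the estimates of the kept items that fall in the subset. The function names, the `(key, weight)` input format and the positive-weight assumption are choices made for this illustration.

```python
import heapq
import random


def priority_sample(weighted_items, k, rng=random):
    """Return (sample, tau) for a stream of (key, weight) pairs.

    sample is a list of (key, weight, estimate) for the k kept items,
    with estimate = max(weight, tau). Weights are assumed positive.
    """
    heap = []  # min-heap holding the k + 1 highest priorities seen so far
    for idx, (key, w) in enumerate(weighted_items):
        alpha = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        entry = (w / alpha, idx, key, w)  # idx breaks ties between equal priorities
        if len(heap) <= k:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)
    # With more than k items, the smallest retained priority is the threshold.
    tau = heapq.heappop(heap)[0] if len(heap) == k + 1 else 0.0
    sample = [(key, w, max(w, tau)) for (_, _, key, w) in heap]
    return sample, tau


def estimate_subset_sum(sample, in_subset):
    """Estimate the total weight of all stream items satisfying in_subset."""
    return sum(est for key, _, est in sample if in_subset(key))
```

The subset predicate is supplied only at query time, which matches the setting above where the subsets of interest are not known when the sample is taken.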

Thank you for your attention!

