Efficient estimation of the cardinality of large data sets

DMTCS Proceedings, Fourth Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities

Philippe Chassaing, Lucas Gerin

Abstract

Giroire [Gi] has recently proposed an algorithm that returns the approximate number of distinct elements in a large sequence of words, under strong constraints arising from the analysis of large databases. His estimator is based on statistical properties of uniform random variables in [0,1]. In this note we propose an optimal estimator, using Kullback information and estimation theory.
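
To illustrate the general idea of estimating a cardinality from uniform random variables in [0,1], the following is a minimal Python sketch. It is not the algorithm of [Gi], nor the optimal estimator proposed in this note: it uses the standard k-th minimum value estimator, assuming that a hash function maps each distinct word to an approximately uniform point of [0,1]. The names hash_to_unit and estimate_cardinality, the choice of SHA-256, and the default k = 256 are assumptions made for the example only.

```python
import hashlib
import heapq


def hash_to_unit(word: str) -> float:
    """Map a word to a pseudo-uniform value in [0, 1) (illustrative choice of hash)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(words, k: int = 256) -> float:
    """Estimate the number of distinct words from the k smallest hash values.

    Memory use is O(k), independent of the length of the sequence.
    """
    heap = []      # negated values: heap[0] is minus the largest value kept
    kept = set()   # the same values, for duplicate detection
    for word in words:
        h = hash_to_unit(word)
        if h in kept:
            continue  # repeated word (same hash): ignore
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # h replaces the current largest of the k smallest values
            removed = -heapq.heappushpop(heap, -h)
            kept.discard(removed)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hash values were seen: the count is exact
        return float(len(heap))
    # With n uniform points, the k-th smallest is about k/n, hence (k - 1) / M_k
    return (k - 1) / (-heap[0])
```

For instance, estimate_cardinality(open("words.txt").read().split()) would process a hypothetical file words.txt in a single pass. The relative error of this particular estimator decreases roughly like 1/sqrt(k), so a larger k trades memory for accuracy.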
