DMTCS Proceedings, Fourth Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities

Efficient estimation of the cardinality of large data sets

Philippe Chassaing, Lucas Gerin

Abstract


Giroire [Gi] has recently proposed an algorithm that returns the approximate number of distinct elements in a large sequence of words, under strong constraints arising from the analysis of large databases. His estimator is based on statistical properties of uniform random variables in [0,1]. In this note we propose an optimal estimator, using Kullback information and estimation theory.
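
To illustrate the general idea of counting distinct elements through uniform random variables in [0,1], here is a minimal sketch. It is not Giroire's algorithm and not the optimal estimator of this note: it is a generic "k minimum values" estimator, assuming that a hash function maps each word to a value that behaves like a uniform draw in [0,1]. The names `hash_to_unit` and `estimate_distinct`, the choice of SHA-256, and the parameter `k` are illustrative assumptions.

```python
import hashlib
import heapq


def hash_to_unit(word: str) -> float:
    """Map a word to a pseudo-uniform value in [0, 1] (assumption: SHA-256 behaves uniformly)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(words, k: int = 256) -> float:
    """Estimate the number of distinct words from the k smallest distinct hash values.

    Illustrative k-minimum-values estimator, not the estimator of the paper.
    Memory use is O(k), independent of the length of the sequence.
    """
    heap = []        # negated values: heap[0] is minus the largest value kept
    members = set()  # hash values currently kept, to skip repeated words
    for word in words:
        u = hash_to_unit(word)
        if u in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -u)
            members.add(u)
        elif u < -heap[0]:
            # Insert u and evict the largest value currently kept.
            evicted = -heapq.heappushpop(heap, -u)
            members.discard(evicted)
            members.add(u)
    if len(heap) < k:
        # Fewer than k distinct hash values were seen: the count is exact.
        return float(len(heap))
    # The k-th smallest of n uniform draws is close to k / n, hence n is close to (k - 1) / value.
    return (k - 1) / (-heap[0])
```

Repeated words hash to the same value, so they do not change the result; only the set of distinct words matters. The relative error of this kind of estimator typically decreases like 1/sqrt(k), which is the trade-off between memory and accuracy that motivates looking for better estimators.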
