### q-gram analysis and urn models

*Pierre Nicodème*

#### Abstract

Words of fixed size q are commonly referred to as q-grams. We consider the problem of q-gram filtration, a method commonly used to speed up sequence comparison. We are interested in the statistics of the number of q-grams common to two random texts (where multiplicities are not counted) in the non uniform Bernoulli model. In the exact and dependent model, when omitting border effects, a q-gram in a random sequence depends on the q-1 preceding q-grams. In an approximate and independent model, we draw randomly a q-gram at each position, independently of the others positions. Using ball and urn models, we analyze the independent model. Numerical simulations show that this model is an excellent first order approximation to the dependent model. We provide an algorithm to compute the moments.

Full Text: GZIP Compressed PostScript PostScript PDF original HTML abstract page