Zipf Law Revisited: A Model of Emergence and Manifestation

Lev B. Levitin
Boston University

Abstract
Zipf’s law is a famous empirical law that is observed in the behavior of many complex systems of surprisingly different nature. Zipf found a remarkable rank-frequency relationship in linguistics.
If we consider a long text and assign ranks r to all words that occur in it in order of decreasing frequency, then the frequency f of the word of rank r satisfies the empirical law:
$f = a r^{-b}$, where a and b are constants and b ≈ 1. Zipf’s law has been discovered independently in such diverse settings as the distributions of populations of biological species, income, city populations, families of paralogous genes, numbers of postings at websites, numbers of scientific publications, and numbers of requests on the Internet.
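As an illustration of the rank-frequency relationship stated above, the following minimal Python sketch ranks word counts and estimates a and b by ordinary least squares on the log-log form log f = log a − b log r. The function names `rank_frequency` and `fit_zipf` are hypothetical, and the least-squares fit is only one common way to estimate the exponent; it is not a procedure taken from the paper.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return word counts sorted in decreasing order (rank 1 first)."""
    counts = Counter(words)
    return sorted(counts.values(), reverse=True)


def fit_zipf(frequencies):
    """Fit f = a * r**(-b) by least squares on log f versus log r.

    `frequencies` must be positive and listed by rank (rank 1 first),
    with at least two entries. Returns the pair (a, b).
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # slope of the log-log line is -b; intercept is log a
    return math.exp(intercept), -slope
```

For a list of tokens `words`, calling `fit_zipf(rank_frequency(words))` returns the estimated pair (a, b); a value of b close to 1 corresponds to the classical form of the law.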

Most theoretical explanations of Zipf’s law are based on a principle of “least effort”, “minimum cost”, or “minimum energy”, or on other very specific assumptions that, in our opinion, themselves call for further explanation.

This paper presents a model of the development of an evolutionary system in the form of a non-stationary branching Markov random process. We formulate the model in the language of evolutionary dynamics. Here, a “species” means a class of objects (“individuals”) recognized as corresponding to a certain concept.

This analysis provides the expected values of species populations in order of their first appearance. It should be borne in mind, however, that the empirically observed data correspond to a single realization of a set of ranked random variables (“order statistics”) rather than chronologically ordered ones. The effect of this ranking has been analyzed as well, and it has been shown that the distribution of the ranked random variables is much narrower than that of the original (chronological) variables. Therefore, the observed realizations yield a good representation of the entire statistical ensemble. To a first approximation, the expected values of the ranked populations follow the same Zipf law. In contrast with the expected values of the unranked variables, however, they demonstrate a characteristic “staircase” behavior: the curve displays alternating flat and steep intervals. Thus, our model suggests the emergence of a second-tier structure of “superclasses”: groups of classes with almost equal populations. These deviations from the “ideal” Zipf law are specific to our model and provide an opportunity to verify (or to falsify) the proposed model empirically. Certain empirical data (related to the statistics of requests on the Internet) apparently support the theory.
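To make the distinction between chronological and ranked populations concrete, the sketch below sorts populations given in order of first appearance and then groups consecutive ranks with nearly equal populations, which is one simple way to look for flat intervals in a ranked curve. The function names, the relative `tolerance` threshold, and the grouping rule are all assumptions made for illustration; they are not the paper's definition of a “superclass”.

```python
def rank_populations(chronological):
    """Sort populations (listed in order of first appearance) by decreasing size."""
    return sorted(chronological, reverse=True)


def plateaus(ranked, tolerance=0.1):
    """Group consecutive ranks into candidate flat intervals.

    A population joins the current group if it is within `tolerance`
    (relative) of the first, largest population of that group;
    otherwise it starts a new group. The threshold is an illustrative
    assumption, not a criterion from the paper.
    """
    groups = []
    for population in ranked:
        if groups and population >= groups[-1][0] * (1 - tolerance):
            groups[-1].append(population)
        else:
            groups.append([population])
    return groups
```

With made-up populations `[100, 48, 52, 50, 24, 26, 25, 25]` in chronological order, ranking followed by `plateaus` separates the values into one group near 100, one near 50, and one near 25, the kind of flat-then-steep pattern that the staircase description refers to.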