Word frequencies and Zipf’s law

There is a well-known relationship between the frequency of words in a text and the ranking of those frequencies. The relationship is known as Zipf’s law and is one example of a relationship called a power law. Power laws crop up in many other settings. For example, they arise in investigating the distribution of populations between countries or between cities.

Power laws contain a variable that is raised to an exponent (power). This post summarises a paper that explains why commonly used methods over-estimate that exponent.

Zipf’s law

If we rank each word in a book by how many times that word appears, the number of times each word appears tends to be roughly inversely proportional to that word’s rank. Thus, the most frequent word appears about twice as often as the 2nd most frequent word, about 3 times as often as the 3rd most frequent word, and so on.

The common name for this relationship is Zipf’s law. This is a power law relationship between (a) a word’s frequency and (b) the word’s rank by frequency. Equation (1) gives the general form of power laws. The equation says that, for a word of empirical rank r_e, the number of times that word appears, n(r_e), is proportional to that rank r_e raised to the negative of an exponent (γ).

(1) n(r_e) ∝ r_e^{-γ}

For the example given above, the exponent (γ) is 1. If the exponent were 2, then the most frequent word would occur about 4 times as often as the 2nd most frequent word and about 9 times as often as the 3rd most frequent word.
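As a quick check of that arithmetic, here is a minimal Python sketch. The function name zipf_ratios is made up for this illustration. It assumes counts follow equation (1) exactly and returns how many times more often the top-ranked word appears than the word at each rank.

```python
def zipf_ratios(gamma, max_rank=3):
    """Return n(1) / n(r) for r = 1..max_rank, assuming n(r) is exactly
    proportional to r ** -gamma (an idealised Zipf distribution)."""
    # n(1) / n(r) = (1 ** -gamma) / (r ** -gamma) = r ** gamma
    return [rank ** gamma for rank in range(1, max_rank + 1)]


print(zipf_ratios(1))  # [1, 2, 3]
print(zipf_ratios(2))  # [1, 4, 9]
```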

Estimating the exponent

Researchers wanting to apply Zipf’s law to texts written in natural languages try to estimate the exponent from the frequencies actually observed in those texts. In their paper Bias in Zipf’s law estimators (Scientific Reports 11, 17309 (2021), https://www.nature.com/articles/s41598-021-96214-w), Charlie Pilgrim and Thomas T Hills argue that such estimates typically over-estimate the size of the exponent. The over-estimate is particularly strong if the estimate is less than about 1.5.

The observed empirical frequency rankings of data are generated by an underlying probability distribution. Pilgrim and Hills argue that the problem arises because the underlying probabilities—and the related ranks—are unknown. The authors contend that the equation to be estimated is not (1) but (2). Equation (2) is about the probability of a word occurring in a text. It is not about the actual empirically observed frequency of words in a particular text.

(2) p(r_p) ∝ r_p^{-λ}

In equation (2), r_p is a word’s rank in the underlying probability distribution, p(r_p) is the probability of that word occurring and λ is the exponent of the power law governing the underlying probability distribution. The probabilities p(r_p), ranks r_p and exponent λ are all unknowable and unobservable.

Source of bias

The root of the bias is that the existing estimators assume that the nth most frequent word (in the distribution actually found) is also the nth most likely word (in the underlying probability distribution). This assumption is not necessarily justified.

In some cases, there is a natural way to rank events, for example ranking earthquakes by size and frequency. But for other rank-frequency distributions, the rank is estimated empirically from the observed frequency distribution. As a result, the observed frequency is correlated with the observed rank and both depend on the same underlying mechanism. The authors say that the same data can look very different depending on whether we know its true rank, as shown in their figure 2 (not reproduced here).

The problem arises from low sampling in the tail. The observed empirical ranks tend to ‘bunch up’ above the line of the true probability distribution. They then decay sharply at the end of the observed tail because many events in the tail are not observed at all.
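The gap between true and empirical ranks can be seen in a small simulation. The sketch below is a simplified setup, not the estimator studied in the paper. It assumes a known vocabulary of 2,000 words, draws 3,000 tokens from a power law with exponent 1, and fits the exponent by maximum likelihood twice: once with every word at its true rank, and once with the words re-ranked by observed count. The function names, parameter values and seed are all illustrative choices.

```python
import math
import random
from collections import Counter


def fit_exponent(counts_by_rank, vocab_size):
    """Maximum-likelihood exponent g for the model p(r) proportional to r ** -g,
    r = 1..vocab_size, where counts_by_rank[i] is the count assigned to rank i + 1."""
    total = sum(counts_by_rank)
    # Average log-rank of the sampled tokens under the given rank assignment.
    target = sum(n * math.log(r)
                 for r, n in enumerate(counts_by_rank, start=1)) / total
    log_ranks = [math.log(r) for r in range(1, vocab_size + 1)]

    def mean_log_rank(g):
        # Expected log-rank under the model; this decreases as g grows.
        weights = [math.exp(-g * lr) for lr in log_ranks]
        return sum(w * lr for w, lr in zip(weights, log_ranks)) / sum(weights)

    # The likelihood peaks where the model's mean log-rank equals the sample's,
    # so bisect on g (assumes the answer lies between 0 and 10).
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if mean_log_rank(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def simulate(true_exponent=1.0, vocab_size=2000, n_tokens=3000, seed=0):
    """Return (fit using true ranks, fit using empirical ranks) for one sample."""
    rng = random.Random(seed)
    ranks = range(1, vocab_size + 1)
    weights = [r ** -true_exponent for r in ranks]
    counts = Counter(rng.choices(ranks, weights=weights, k=n_tokens))
    by_true_rank = [counts[r] for r in ranks]              # unseen words count 0
    by_empirical_rank = sorted(by_true_rank, reverse=True)  # re-rank by count
    return (fit_exponent(by_true_rank, vocab_size),
            fit_exponent(by_empirical_rank, vocab_size))


fit_true, fit_empirical = simulate()
print(f"fit using true ranks:      {fit_true:.3f}")
print(f"fit using empirical ranks: {fit_empirical:.3f}")
```

Sorting the counts in descending order pairs the largest counts with the smallest ranks, so in this setup the empirical-rank fit can never come out below the true-rank fit. How large the gap is depends on the assumed vocabulary size, sample size and exponent.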

Not enough data for the tail

I think the authors are saying that, for example, each of the least probable items will (by chance) be observed either once or not at all. Among those events:

  • the ones observed have an observed frequency (and observed ranking) higher than their intrinsic probability (and intrinsic ranking).
  • the ones not observed at all have an observed frequency (and observed ranking) lower than their intrinsic probability (and intrinsic ranking).

Compounding the effect of that problem, each of those observed low probability items will have the same observed frequency of 1 (and the same observed ranking). And each of those unobserved low probability items will have the same observed frequency of 0 (and the same observed ranking). Those rankings and probabilities will not reflect differences in intrinsic probability (and in intrinsic probability ranking) within those 2 classes (of observed and unobserved).
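A short simulation makes the two classes visible. This is an illustrative sketch under assumed values (a 2,000-word vocabulary, 3,000 sampled tokens, exponent 1), and the function name tail_summary is hypothetical. It lists which true ranks were observed exactly once and which were never observed.

```python
import random
from collections import Counter


def tail_summary(exponent=1.0, vocab_size=2000, n_tokens=3000, seed=0):
    """Sample tokens from p(r) proportional to r ** -exponent and return the
    true ranks that were seen exactly once and the true ranks never seen."""
    rng = random.Random(seed)
    ranks = range(1, vocab_size + 1)
    weights = [r ** -exponent for r in ranks]
    counts = Counter(rng.choices(ranks, weights=weights, k=n_tokens))
    seen_once = [r for r in ranks if counts[r] == 1]
    unseen = [r for r in ranks if counts[r] == 0]
    return seen_once, unseen


seen_once, unseen = tail_summary()
print(f"seen exactly once: {len(seen_once)} words, "
      f"true ranks {seen_once[0]} to {seen_once[-1]}")
print(f"never seen:        {len(unseen)} words, "
      f"true ranks {unseen[0]} to {unseen[-1]}")
```

Every word in the first list is tied at an observed count of 1 and every word in the second at 0, whatever their true ranks, so the observed counts carry no information about the ordering inside each list.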

Solving the problem?

Commonly used estimates of Zipf’s law exponents are based on empirically observed rank-frequency data. The authors show that those estimates are biased because they use an inappropriate likelihood function. This bias is particularly strong for frequency distributions of words in natural language. Exponents for natural language typically turn out close to 1.

The authors present one approach for reducing this bias. The approach considers all the possible rankings in the underlying probability distribution that could have generated the observed empirical frequencies. Thus, this method does not make the false assumption that ranking by empirical frequency is the same as by probability. The authors say this approach works well in an idealised situation where the true model is known and generates a power law distribution.
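To make "considering all the possible rankings" concrete, here is a toy brute-force sketch. It is not the authors' procedure, only an illustration of the idea under strong assumptions: five hypothetical word counts, a vocabulary of exactly five words, and every assignment of words to ranks treated as equally plausible beforehand. It averages the likelihood over all 120 assignments and compares the best exponent on a grid with the naive fit that trusts the empirical ranking.

```python
import math
from itertools import permutations


def log_likelihood(counts_by_rank, g):
    """Log-likelihood (up to a constant) when counts_by_rank[i] belongs to
    rank i + 1 and p(r) is proportional to r ** -g."""
    vocab_size = len(counts_by_rank)
    log_z = math.log(sum(r ** -g for r in range(1, vocab_size + 1)))
    return sum(n * (-g * math.log(r) - log_z)
               for r, n in enumerate(counts_by_rank, start=1))


def marginal_log_likelihood(counts, g):
    """Log of the likelihood averaged over every assignment of words to ranks."""
    logs = [log_likelihood(p, g) for p in permutations(counts)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    return top + math.log(sum(math.exp(x - top) for x in logs) / len(logs))


counts = [12, 5, 3, 1, 1]             # hypothetical counts for five words
grid = [i / 100 for i in range(301)]  # candidate exponents 0.00 to 3.00

# Naive fit: assume the empirical ranking is the true ranking.
ranked = sorted(counts, reverse=True)
naive_fit = max(grid, key=lambda g: log_likelihood(ranked, g))
# Marginal fit: average over all possible rankings instead.
marginal_fit = max(grid, key=lambda g: marginal_log_likelihood(counts, g))

print(f"naive fit:    {naive_fit:.2f}")
print(f"marginal fit: {marginal_fit:.2f}")
```

Enumerating every ranking is only feasible for a handful of words, because the number of assignments grows factorially with the vocabulary.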

The authors say this approach works less well for natural language (eg word counts in books). That is because generating natural language is more complex than drawing words from a simple power law probability distribution.

Pronouncing Zipf’s name

The Wikipedia article says Zipf’s name is pronounced /zɪf/ in English and [t͡sɪpf] in German. I would have expected the pronunciation /zɪpf/ in English, even though the combination /pf/ does not occur in native English words. But maybe I’m just being biased by the German pronunciation.
