Structure of numbers in Indo-European

How are numerals formed in Indo-European languages today, and how were they formed in the ancestral language Proto-Indo-European (PIE)? And do ordering patterns of components within numerals align with other word order patterns in the same languages?

Andreea S. Calude and Annemarie Verkerk considered those questions in a paper looking at how 81 present and past Indo-European languages encode numerals above 10 (11–99, 100s, 1,000s). Their paper The typology and diachrony of higher numerals in Indo-European: a phylogenetic comparative study appeared in the Journal of Language Evolution in 2016.

The rest of this post looks at:

  • structure of numerals
  • numerals in Indo-European languages
  • numerals in Proto-Indo-European
  • correlations with other word-order patterns

Structure of numerals

The authors divide numerals into 2 types:

  • atoms—‘items which receive simple lexical expression’
  • syntagms—forms ‘derived from lower numerals’.

A syntagm involves three components:

  • an atom, already used for a lower number.
  • a base, used serially to derive larger numerals. The base is typically an existing atom, but sometimes a purpose-made root, morphologically unrelated to any atom in the language. Common bases are decimal (10) and vigesimal (20, from Latin vīgintī (‘20’) and vīcēsimus (‘20th’)).
  • an arithmetic operation (typically addition or multiplication, occasionally subtraction or division). For numerals above 99, the most common operation is exponentiation (ie forming 100 and 1,000 as powers of 10).

For example:

  • English numeral twenty-eight is derived as (2×10) + 8: 8 is the atom, 10 is the base, and the operation is addition.
  • French numeral quatre-vingt neuf (89) is derived as (4×20) + 9: 9 is the atom, 20 is the base, and the operation is addition.
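
The two examples above can be written out as a small sketch. The class name Syntagm and its field names are hypothetical labels chosen for this illustration, not terms from the paper, and the model only covers the simple pattern (multiplier × base) + atom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syntagm:
    """Hypothetical model of a numeral built as (multiplier x base) + atom."""
    multiplier: int
    base: int
    atom: int

    def value(self) -> int:
        # Multiplication forms the crown, addition attaches the atom.
        return self.multiplier * self.base + self.atom


# English twenty-eight: (2 x 10) + 8
twenty_eight = Syntagm(multiplier=2, base=10, atom=8)
# French quatre-vingt neuf: (4 x 20) + 9
quatre_vingt_neuf = Syntagm(multiplier=4, base=20, atom=9)

print(twenty_eight.value())       # 28
print(quatre_vingt_neuf.value())  # 89
```

The same structure holds either value; only the base (10 or 20) and the multiplier differ.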

Some languages use a mixture of bases. For example, Albanian has a decimal-vigesimal system:

  • 30–39 and 50–99 use base 10 (30 tridhjetë, 50 pesëdhjetë, 60 gjashtëdhjetë, 70 shtatëdhjetë, 80 tetëdhjetë, 90 nëntëdhjetë, from 10 dhjetë)
  • 40–49 use base 20 (40 dyzet from 20 njëzet).

Other comments

The authors make some other comments:

  • Atoms are highly stable lexical items over time, probably because they are in frequent use. For example, all the forms for 2 in the Indo-European languages are related to the ancestral PIE reconstructed form *dwóh1, *dwó.
  • The morphological operations used to form syntagms are likewise stable. Closely related languages use similar strategies in the patterns used to form teens, crowns, and running-numbers.
  • Borrowing may play a role in some number systems. For example, Romanian and its relative Vlach encode numerals in ways that differ from other Romance languages and are more similar to neighbouring Slavonic languages. Similarly, some researchers report that Baluchi has borrowed numerals from Persian and Faroese has borrowed Danish numerals.
  • It is common for two (or more) counting systems to coexist side-by-side for a period of time, particularly if a prestige or standard language exists.

Numerals in Indo-European languages

The authors split the numerals above 10 into 4 groups, which they call:

  • teens, numerals 11–19
  • crowns, numerals 20, 30, 40, 50, 60, 70, 80, 90.
  • running-numbers, in-between numerals, namely 21–29, 31–39, 41–49, 51–59, 61–69, 71–79, 81–89, and 91–99.
  • hundreds (100, 200, 300, …) and thousands (1,000, 2,000, 3,000, …).
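
As a rough sketch, the four groups can be expressed as a classifier. The function name numeral_group is hypothetical, and capping hundreds at 900 and thousands at 9,000 is an assumption made here for simplicity, since the lists above end with an ellipsis.

```python
def numeral_group(n: int) -> str:
    """Assign n to one of the four groups described above (illustrative only)."""
    if 11 <= n <= 19:
        return "teen"
    if 20 <= n <= 90 and n % 10 == 0:
        return "crown"
    if 21 <= n <= 99:
        # Crowns were caught above, so only in-between numerals reach here.
        return "running-number"
    if 100 <= n <= 900 and n % 100 == 0:
        return "hundred"
    if 1000 <= n <= 9000 and n % 1000 == 0:
        # Assumption: the cap at 9,000 is not stated in the text.
        return "thousand"
    return "outside the four groups"


for n in (12, 40, 47, 300, 2000, 10):
    print(n, numeral_group(n))
```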

The authors say that this split fits the Indo-European languages well. That is because this language family mainly uses decimal and/or vigesimal bases. Other splits might be more useful for other language families.

The following subsections summarise the authors’ results, by category (teens, crowns, running-numbers, hundreds and thousands, overall comments).

Results: teens

The teens are the most diverse, both across and within languages.

  • The transition from atom to syntagm always happens during the teens, usually after numeral 10 (with 11 being the first syntagmatic form). For a few languages (such as Catalan and Marwari), it is after 11. For 15 languages, it occurs after 12. No language in the sample uses atoms to code numerals 13–99. But many do use atoms again for 20, 100, and 1,000.
  • Most languages form teens by using either the order of atom-then-10 (eg English, Kurdish, Macedonian), or conversely 10-then-atom (only 6 languages: Singhalese, Wakhi, Modern Greek, Modern Armenian, and Tocharian A and B).
  • Some languages change the order during the teens (Greek from 13, Sardinian, Ladin, Italian, Portuguese all from 17). The change is always from atom-then-10 to 10-then-atom, never the other way around. There is one exception: numeral 15 in Singhalese (atom [‘5’] + base).
  • Indo-Iranian languages use subtraction rather than addition to represent 19 as ‘1 from 20’ (20−1) rather than 10+9. The other languages use only addition.
  • Sanskrit, Ladin, and Welsh have alternative patterns for marking some teens. Either addition or subtraction can be used in Sanskrit (for 19) and in Ladin (for 18 and 19). In Welsh, there are: alternative orders for 11–15; an alternative base (15) for 16, 17, and 19; and for 18 an alternative using multiplication instead of addition (2×9), as shown below.
  • The other language using multiplication is Breton (another Celtic language), which uses it as the only structure for 18 as 3×6. The data for Welsh and Breton is consistent with a suggestion by Greenberg (1978): that a language system uses multiplication only if it also uses addition.
  • The majority of the Indo-European languages use either a base derived from the word ‘ten’ (eg English, Rumantsch, Italian, Riksmal) or a purpose-made base (eg Slovak, Czech, Lusatian).
  • Five of the Romance languages in the sample (Catalan, Spanish, Provencal, French, and Walloon) change during the teens from using a purpose-made base to one derived from the word for 10. No language changes from a base derived from the atom for 10 to a new base.

The form of numerals can change over time in a way that obscures their original structure. For example, the French numerals 11-16 (onze, douze, treize, quatorze, quinze, seize) seem to contain 2 components, making them syntagms. The first component is clearly linked to the atoms for 1-6 (un, deux, trois, quatre, cinq, six). The second component (-ze) is the same in each case. It is related to the atom for 10 (modern dix), but that link is not obvious to modern speakers.

Welsh numerals 11-19

One of the most complex systems is in Welsh. In Welsh, all numerals 11-19 can be formed using a decimal system of the form ‘one ten’ (un deg) + ‘atom’. Each of these numerals also has an alternative form, which:

  • for 11–15, is in the order atom-then-10.
  • for 16, 17, 19 is in the order atom (1, 2, 4) + purpose-made base bymtheg (meaning ‘15’, derived from pymtheg by consonant mutation). For more on consonant mutation in the Celtic languages, please see Consonant mutation in Manx – Language Miscellany
  • for 18, multiplies the atoms ‘two’ and ‘nine’.

Number   Decimal form     Alternative
15       un deg pump      pymtheg
16       un deg chwech    un ar bymtheg (‘1 on 15’)
17       un deg saith     dau ar bymtheg
18       un deg wyth      deunaw (‘2 9s’)
19       un deg naw       pedwar ar bymtheg

Table 1. Welsh numerals 15-19.
Forms for 17 and 19 are masculine.
Feminine forms contain dwy (instead of dau) and pedair (instead of pedwar).
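
The arithmetic behind the two Welsh systems can be checked with a short sketch. The dictionary name welsh_teens is hypothetical, and the decompositions are simply the ones described above (10 + atom for the decimal form; atom + 15 or 2×9 for the alternative).

```python
# Each entry: numeral -> (decimal form, alternative form), written as arithmetic.
welsh_teens = {
    15: (10 + 5, 5 + 10),   # un deg pump / pymtheg (atom-then-10 order)
    16: (10 + 6, 1 + 15),   # un deg chwech / un ar bymtheg
    17: (10 + 7, 2 + 15),   # un deg saith / dau ar bymtheg
    18: (10 + 8, 2 * 9),    # un deg wyth / deunaw
    19: (10 + 9, 4 + 15),   # un deg naw / pedwar ar bymtheg
}

for numeral, (decimal_form, alternative) in welsh_teens.items():
    # Both routes must arrive at the same value.
    assert decimal_form == alternative == numeral

print("all forms agree")
```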

Results: crowns

  • The crowns are typically formed by multiplying an atom and a base, or by using a purpose-made form, bearing no connections to either atom or base.
  • The base (for multiplication) derives from the form for either 10 (decimal systems) or 20 (vigesimal systems).
    One modern language retaining some vestiges of a vigesimal system (based on 20) is Danish, see Scandinavian language challenge day 16 – Language Miscellany
  • Many languages encode 20 as a purpose-made atomic form.
  • Beyond 20, languages continue the pattern of using as base either (1) an existing atom or (2) a purpose-made form together with an atom. The only exceptions are Byelorussian, Russian and Ukrainian (for 40), Gujarati and Singhalese (for 80), and Sindhi, Singhalese, and Ossetic (for 90), which all switch to purpose-made base forms for these numerals. For example, Russian uses the brand-new base sorok for 40 (and in forming the related running-numbers 41-49).  
  • Many languages use different bases as they move along the range of crown words. Consider Nepali t-īs ‘30’, cāl-īs ‘40’, pacc-ās ‘50’, sāṭ-hi ‘60’, sattar-i ‘70’, ass-ī ‘80’, and na-bbe ‘90’, with bases ending in -īs/ās (for 30–50) and -i/ī (for 60–80), and another base for 90.
  • Only a handful of languages swap from using an existing atom as base to using a purpose-made base form (Lusatian, Byelorussian, Russian, and Czech).
  • Four languages provide more than one way of encoding crowns:
    (1) Digor Ossetic uses either a vigesimal base system or a purpose-made base;
    (2 and 3) Afghan and Waziri have a decimal base for all crowns alongside a vigesimal system for the even crowns (20, 40, 60, and 80); and
    (4) Welsh has both a vigesimal base system and a decimal base system throughout. For example, in Welsh: 40 can be either deugain (2×20) or pedwar deg (4×10); and 50 can be either deg a deugain (10+[2×20]) or pum deg (5×10).
  • The only (limited) exception to reliance on multiplication occurs in Breton and Welsh. Those languages form 50 with the help of division as ‘half of a hundred’ (by using fractions). For example, Welsh 50 can be hanner cant (‘half hundred’), alongside the forms listed above. This is consistent with a suggestion in Greenberg (1978) that the only way languages use division is in the form of multiplication by a fraction.
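
The Welsh crown forms mentioned above can be laid side by side as arithmetic. This is only an illustrative sketch: the dictionary name welsh_crowns is hypothetical, and ‘half hundred’ is modelled as multiplication by the fraction 1/2, following the Greenberg suggestion quoted above.

```python
from fractions import Fraction

welsh_crowns = {
    "deugain": 2 * 20,                    # 40, vigesimal
    "pedwar deg": 4 * 10,                 # 40, decimal
    "deg a deugain": 10 + 2 * 20,         # 50, vigesimal
    "pum deg": 5 * 10,                    # 50, decimal
    "hanner cant": Fraction(1, 2) * 100,  # 50, division as multiplication by 1/2
}

for form, value in welsh_crowns.items():
    print(f"{form}: {value}")
```

Fraction keeps the ‘half’ exact, so the result compares equal to the integer 50.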

Running numbers

For the running-numbers:

  • most languages use one single and consistent pattern for forming these, either base-then-atom or atom-then-base.
  • but Welsh, Old High German, Latin, Riksmal Norwegian, Faroese, Czech, Digor Ossetic, and Ancient Greek permit both orders.
  • The order of atom and base morphemes becomes constant within each Indo-European language by the time 20 is reached, except that Vlach (a close relative of Romanian) switches at 31.
  • Western Germanic, Indo-Aryan and most of Celtic have atom-then-base order. Conversely, Romance, Northern Germanic, Balto-Slavonic (except Lusatian and Slovenian), Iranian (except Waziri and Afghan) and all other languages in the sample (Greek, Albanian, Armenian, and Tocharian) have base-then-atom order.
  • Languages which encode 19 by subtraction tend to code all other X9 numerals (29, 39, 49, 59, 69, 79, 89, 99) by subtraction as either the only form or an alternative.
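
The subtractive pattern for the X9 numerals can be sketched as follows. The function name subtractive_parts is hypothetical, and treating every X9 numeral (including 99) as ‘1 from the next round number’ is an assumption for illustration; the actual forms in individual languages may differ.

```python
from typing import Tuple


def subtractive_parts(n: int) -> Tuple[int, int]:
    """Return (higher_base, subtrahend) so that n == higher_base - subtrahend."""
    if n % 10 != 9 or not 19 <= n <= 99:
        raise ValueError("only the X9 numerals 19-99 are modelled here")
    # Assumption: the higher base is always the next multiple of 10.
    return n + 1, 1


for n in (19, 29, 99):
    higher, one = subtractive_parts(n)
    print(f"{n} = {higher} - {one}")
```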

Hundreds and thousands

All the languages in the sample form hundreds and thousands in the same manner (atom×100, atom×1,000). The authors say that this cross-linguistic consistency arose because it limits burdens on working memory. Further findings:

  • all the languages in the sample have atoms for 100 and 1,000—except that Gothic has only 10×base (the base not being 10).
  • the bases 100 and 1,000 are purpose-made forms, morphologically unrelated to any previous numerals—except that:
    (1) Old High German and Old English used a decimal base system (10 × 10); and
    (2) Digor Ossetic and Wakhi use a vigesimal base system (5 × 20), as an alternative to the purpose-made word for 100.
  • in deriving further hundreds, Marathi, Ancient Greek, and Modern Greek use a separate base, not the numeral 100.
  • Sanskrit, Cornish, Sardinian, Icelandic, and Oriya use two variant forms for 1,000: (1) a purpose-made word; or (2) a form that multiplies the form for 100 by a purpose-made base form. For more on Oriya (now known as Odia), please see Odia, a classical language in India – Language Miscellany

Overall comments on sequence of base and atom

I’ve summarised above the authors’ conclusions on each type of numeral separately (teens, crowns, running-numbers, hundreds, thousands). They also draw conclusions on the order of bases and atoms for all categories together:

  • A few languages in the sample (Wakhi, Armenian, Tocharian) use base-then-atom order throughout.
  • Some languages (Old Irish, Scots Gaelic, Lusatian, and others) use atom-then-base order throughout.
  • Some languages (Catalan, Ladin, Italian, Modern Greek, among others) start out with atom-then-base order for lower numbers before changing to base-then-atom order at some point for higher numbers. This changeover always happens during the teens: at 16, 17, or 19, except in Modern Greek, which changes at 13.
  • No languages change from base-then-atom order to atom-then-base order.

Numerals in Proto-Indo-European

The hypothesized common ancestor of the Indo-European languages is Proto-Indo-European (PIE). The authors used Bayesian probability methods to estimate the structure of numerals in PIE. They also estimated how the structures have changed over time.

The following comments summarise the authors’ conclusions on the most likely situation in PIE:

  • For teens, the pattern in PIE was probably atom+base (10), in that order. Later changes towards base-then-atom order placed the larger base component of the numeral first in Germanic (for 12), Romance (for 16, 17, 18, and 19), modern Greek, Armenian, and Tocharian. Later changes in Romance, Germanic, Balto-Slavic, and Indo-Iranian also replaced the base with a purpose-made base or eroded the base phonologically, making it no longer recognisable as 10.
  • Numeral 20 was probably an atom in PIE.
    Germanic and Balto-Slavic forms 2×10 and 2×base (not 10) are probably later simplifications.
  • For numeral 30, the estimate of the most probable ancestral PIE state is unstable. The structure could be atom×base (not 10). On the other hand, it could be atom×10. Albanian uses that form. It is also the most probable form for the ancestral node connecting Romance, Celtic, Germanic, and Balto-Slavic.
  • For the remaining crown words, 40–90, the PIE structure was probably multiplicative: atom×base (with the base not related to the atom for 10).
  • There are several changes simplifying away from the atom×base (not 10) structure towards existing base forms: atom×10 in Germanic, Balto-Slavic, and Romanian and Vlach; and towards a vigesimal base system in 5 languages (all in the Celtic or Iranian groups).
  • PIE probably switched from atom-then-10 coding to base-then-atom coding somewhere between the teens and running-numbers, but it is impossible to be sure that PIE didn’t also use atom-then-base coding for running-numbers. Ancient Greek, Latin, and some other contemporary languages use both structures.
  • The subtraction structure (1 taken from a higher base, eg 19 = 20−1) was a later development in Latin, Indo-Aryan, and some Iranian languages.

Correlations with other word-order patterns

The authors conclude that running-numbers with base-then-atom order are more likely in languages that also use the following orders:

  • preposition-then-noun (rather than noun-then-postposition);
  • noun-then-genitive (rather than genitive-then-noun); and
  • verb-then-object (rather than object-then-verb).

Previous work has shown that these 3 parameters for word order are themselves intercorrelated:

  • languages with preposition-then-noun order tend to also have noun-then-genitive and verb-then-object order. This type occurs most often in Romance, Celtic, Germanic, and Balto-Slavonic languages.
  • conversely, languages with noun-then-postposition order tend to have genitive-then-noun and object-then-verb order. This latter type occurs in Indo-Iranian, Armenian, Greek, and Tocharian.

Consistent ordering of heads and dependents

The authors suggest the base should be viewed as the structural head of a numeral and that the atom should be viewed as a dependent of the head. The order of the base and the atom in the languages studied reflects a general preference in those languages for the order of heads and their dependents.

The authors also found:

  • no correlation between (1) the order of atom and base components in running-numbers and (2) word order in 5 other types of combination: adjective-noun; demonstrative-noun; numeral-noun; relative clause-noun, and subject-verb
  • no correlation between (1) the order of base and atom in teens and (2) word order in any of the 8 types of combination mentioned above.

Larger component first

The authors refer to another factor that creates a preference for base-then-atom order. That order puts the larger component of the numeral first, giving a close approximation to the entire numeral. The opposite order leaves the listener in the dark until the base arrives.

The authors illustrate this point with a quotation from Joseph Greenberg (1978): ‘if I express a large number, say 10,253 in the order 10,000; 200; 50; 3; the very first element gives me a reasonably close approximation to the final result, and every successive item gives a further approximation’, whereas ‘the opposite order leaves the hearer in the dark till the last item is reached’.

The factor noted by Greenberg becomes more powerful as numerals become larger. That may explain why no language in the sample switches from base-then-atom order for smaller numerals over to atom-then-base order for larger numerals.
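
Greenberg’s point about successive approximation can be seen by computing running totals for his example, 10,253, in both orders. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from itertools import accumulate

# Components of 10,253 as listed in Greenberg's example.
components = [10_000, 200, 50, 3]

# Running totals as a hearer would build them up, item by item.
largest_first = list(accumulate(components))
smallest_first = list(accumulate(reversed(components)))

print(largest_first)   # [10000, 10200, 10250, 10253]
print(smallest_first)  # [3, 53, 253, 10253]
```

With the largest component first, the very first total is already within about 2.5% of the final value. With the opposite order, the totals stay far from 10,253 until the last item is added.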


The typology and diachrony of higher numerals in Indo-European: a phylogenetic comparative study, Andreea S. Calude and Annemarie Verkerk, Journal of Language Evolution (Volume 1, Issue 2, 2016).

Generalizations about Numeral Systems, Joseph Greenberg, in Universals of Human Language, Volume 3: Word Structure, edited by J. Greenberg, C. Ferguson and E. Moravcsik (1978). Quoted by Calude and Verkerk (2016).
