  • when did Indo-European languages separate from the rest of that language family, and in which order?
  • where did speakers of Proto Indo-European—the ancestral language of the Indo-European language family—live?

The paper ‘Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages’, by Paul Heggarty, Cormac Anderson and Matthew Scarborough (all of the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany) with 30 co-authors, was published in Science in July 2023.

The paper is also available as a free download from the IE-CoR home page, with supplementary materials available at Guide_to_Data_and_Results_Files.docx.

In future posts, I will look at the cognate database that this research team created and at the Bayesian approach they used.

Indo-European languages

Most European languages are members of the Indo-European language family, as are many of the languages spoken in India and Iran. Among European languages, Hungarian, Finnish, Estonian and Saami are members of another family—Finno-Ugric—and Basque has no known relationship to any other language.

The name Proto Indo-European (PIE) is used for the ancestral language of the family, which was of course spoken before the invention of writing. Although there are no written records of PIE, historical linguists have made plausible reconstructions of many of its features by analysing the descendant languages.  

12 branches of Indo-European are well established: Anatolian, Tocharian, Indic, Iranian, Baltic, Slavic, Germanic, Italic, Celtic, Armenian, Greek, Albanian. Of these, Anatolian and Tocharian are extinct.

Some scholars believe that there is also evidence for higher-level groupings of some branches, for example some or all of Indic-Iranian, Baltic-Slavic, Germanic-Celtic and Greco-Armenian.

Family tree

Heggarty and his co-authors created a database of 109 modern and 52 historical Indo-European languages. They then used Bayesian phylogenetic inference to create a family tree showing the order in which each language separated from the rest of the Indo-European family. I will discuss in a separate post the database they created and the Bayesian method they used.

For brevity, I will summarise in this post only results for the 12 main branches. Their paper presents results for all 161 languages tested, and not just for the branches containing those languages.

Figures 1 and 2 summarise the tree they produce for the branches. Figure 1 deals with 7 of the branches, as well as a higher-level grouping covering the Baltic, Slavic, Italic, Germanic and Celtic branches. For space reasons, I have labelled that grouping as BSIGC and show its history separately as Figure 2.

Figure 1. Branches of Indo-European, summarised from Heggarty et al
BSIGC = Balto-Slavic-Italo-Germanic-Celtic (see Figure 2)

Figure 2. Branches of Balto-Slavic-Italo-Germanic-Celtic (BSIGC)
summarised from Heggarty et al

Comments on Figures 1 and 2

Consistent with much prior research, Heggarty and co-authors:

  • identify each of the 12 established branches of Indo-European.
  • show Anatolian as the 1st branch to separate from the rest of the family and Tocharian as the 2nd.
  • identify higher-level groupings that are widely (though not unanimously) held to be valid: Indo-Iranian; Baltic-Slavic.

Heggarty and co-authors generate some groupings that are disputed or that are not found by some other research:

  • Germanic with Celtic
  • Germano-Celtic with Italic
  • Italo-Germano-Celtic with Balto-Slavic
  • Greek with Armenian

Probabilities, not a single estimate

The authors emphasise that their analysis produces a probability distribution of trees, not a single ‘Maximum Clade Credibility’ (MCC) tree. The tree I have shown in Figures 1 and 2 is the MCC tree. For several groupings, the analysis produces probabilities well below 100%, and in most cases below 50%:

  • for Anatolian (26%) and Tocharian (also 26%). The analysis found a marginally lower probability (25.9%) that there is a higher-level grouping covering both Anatolian and Tocharian.
  • for Indo-Iranic (24%)
  • for a higher-level grouping covering [Balto-Slavic]+[Italic+Germanic+Celtic] (24%) and then later for a split of that group into Balto-Slavic and Italic+Germanic+Celtic (63%)
  • for a grouping covering Greek and Armenian and excluding Albanian (49%).

There are high probabilities (but below 100%) for the split of Germanic+Celtic into Germanic and Celtic (87%) and for the split of Greco-Armenian into Greek and Armenian (86%). Probabilities for the other established branches (Indic, Iranic, Italic, Baltic, Slavic) are all 100%.
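
To make the idea of a probability for a grouping concrete, here is a minimal Python sketch. It assumes the usual Bayesian convention that the probability of a grouping is the share of sampled trees containing it, and that the MCC tree is the sampled tree whose groupings have the highest product of those shares. The four toy trees and all names in the code are invented for illustration; they are not the paper's data or the authors' software.

```python
from collections import Counter
from math import prod

# Invented toy sample of trees, NOT the paper's data.
# Each tree is represented as the set of groupings (clades) it contains.
GC = frozenset({"Germanic", "Celtic"})
IC = frozenset({"Italic", "Celtic"})
IGC = frozenset({"Italic", "Germanic", "Celtic"})

tree_a = frozenset({GC, IGC})  # Germanic+Celtic first, Italic joins above
tree_b = frozenset({IC, IGC})  # Italic+Celtic first, Germanic joins above
sample = [tree_a, tree_a, tree_a, tree_b]


def clade_support(trees):
    """Share of sampled trees that contain each grouping."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def mcc_tree(trees):
    """Sampled tree with the highest product of grouping shares."""
    support = clade_support(trees)
    return max(set(trees), key=lambda tree: prod(support[c] for c in tree))


support = clade_support(sample)
best = mcc_tree(sample)

print(support[GC])     # share of trees with Germanic+Celtic
print(support[IGC])    # share of trees with Italic+Germanic+Celtic
print(best == tree_a)  # whether the first toy tree is the MCC tree
```

In this toy sample, a grouping that appears in the MCC tree can still have a probability below 100%, because some sampled trees arrange the same branches differently.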

Comments on the probabilities

The analysis found no other tree to be more probable than the tree in Figures 1 and 2. Where a low probability is reported above for a branch or grouping, however, the evidence is not decisive that the tree split at the point shown in Figures 1 and 2 rather than at a different point. Some of the probabilities summarised above are well below 100%, but that is not surprising, because they all concern splits about which no consensus exists.

The low probabilities for the splits of Anatolian, Tocharian and Indo-Iranic presumably mirror the fact that these groups all split off around the same time. Thus, it is uncertain which ones split first, and whether any of them first split off together.

Similarly, I assume that:

  • the low probability for an initial higher-level grouping covering both [Balto-Slavic] and [Italic+Germanic+Celtic] reflects uncertainty about whether that grouping ever existed.
  • the lowish probability for a subsequent split into Balto-Slavic and Italic+Germanic+Celtic shows uncertainty about whether Italic ever constituted a single group together with only Germanic and Celtic.

Dating the splits

The paper suggests dates for when each branch split off from the rest of the family:

  • 4 branches split off from the rest of the Indo-European family around 7,000 years ago: Anatolian, Tocharian, Indo-Iranic and Baltic-Slavic-Italic-Germanic-Celtic.
  • Baltic-Slavic-Italic-Germanic-Celtic split into Baltic-Slavic and Italic-Germanic-Celtic around 6,500 years ago. Then:
    – Baltic and Slavic separated from each other around 3,500 years ago.
    – Italic separated from Germanic and Celtic about 5,500 years ago and those latter 2 groups split from each other some 500 years after that.
  • Indic and Iranic separated from each other around 5,500 years ago.    
  • The remaining branches (Albanian, Armenian and Greek) split from each other about 6,000 years ago (with Greek and Armenian perhaps remaining together until around 5,500 years ago).

The results above are the median results from Table 1 of the paper, rounded to the nearest 500 years. The analysis shows that most of the true dates could be up to 1,500 years earlier or later than these estimates.

When was the first split?

Although the split date for the first descendant (Anatolian) is around 7,000 years ago, the paper reports that Proto Indo-European started to diverge into descendant languages over 1,000 years earlier (median estimate of 8,120 years ago, with a 95% probability of being between 6,740 and 9,610 years ago).

I assume that this apparently large difference arises because the median split date for Anatolian reflects only those trees in which Anatolian split first. In contrast, the median date of 8,120 years ago for Proto Indo-European as a whole reflects all trees.
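
A toy calculation shows how such a difference could arise under that assumption. The seven sampled trees below are invented purely for illustration and are not taken from the paper; the point is only that a median over a subset of trees can differ from a median over all of them.

```python
from statistics import median

# Invented toy sample, NOT the paper's data. Each entry is one sampled tree:
# (first branch to split off, age of that first split in years ago).
toy_trees = [
    ("Anatolian", 7000),
    ("Anatolian", 6900),
    ("Anatolian", 7200),
    ("Tocharian", 8300),
    ("Indo-Iranic", 8500),
    ("Anatolian+Tocharian", 8200),
    ("Tocharian", 8100),
]

# Median age of the first split across every sampled tree.
first_split_all = median(age for _, age in toy_trees)

# Median age of the first split, using only trees where Anatolian went first.
first_split_anatolian = median(
    age for first, age in toy_trees if first == "Anatolian"
)

print(first_split_all, first_split_anatolian)
```

Whether the paper's figures were computed in this way is something I have not verified; the sketch only shows that the explanation is arithmetically possible.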

Steppes or Anatolia?

There are 2 main theories about where people spoke Proto Indo-European before any of its branches started diverging from the rest of the family:

  • Steppe hypothesis: all branches of Indo-European (including the Anatolian and Tocharian branches) ultimately go back to rapid long-range migrations by highly mobile, horse-borne pastoralists out of the Pontic-Caspian Steppe from about 6,000 years ago. The Steppe is grassland stretching from north of the Black Sea to north of the Caspian Sea.
  • Anatolian hypothesis: Indo-European arose farther south, in the southern Caucasus or northern Fertile Crescent and spread slowly with early farming some 9,000 years ago.

New hybrid theory

Heggarty and co-authors now propose a hybrid hypothesis:

  • Indo-European languages spread out of an initial homeland south of the Caucasus, in the northern Fertile Crescent.
  • one major branch then spread northward onto the steppe through the Caucasus and on across much of Europe.
  • Indo-Iranic spread directly eastward out of a South Caucasus homeland through the Iranian Plateau, south of the Caspian (or perhaps through the steppe and Central Asia, north around the Caspian Sea).

The paper provides useful maps illustrating the Steppe, Anatolian and hybrid hypotheses.

DNA evidence

The authors comment that the tree produced by their work fits well with recent DNA evidence:

  • ancient DNA evidence from the Fertile Crescent now suggests that agriculture spread in a more complex way than suggested by the Anatolian hypothesis. That hypothesis holds that Indo-European languages spread with the first farmers both west (into Balkan Europe) and east (to the Indus).
  • South of the Caucasus is where DNA results first locate the only ancestry component found at high proportions in populations (past and present) associated with both Indo-Iranic and the main European branches of Indo-European.
  • it now appears that a new ancestry component that spread into Central Europe from about 5,000 years ago did not come from populations of the Yamnaya culture on the grasslands of the Pontic-Caspian Steppe. Instead, it came from further north in the ‘forest steppe’ of the Middle Dnepr region, and towards the Baltic. And the split between distinct forest vs. grassland steppe ancestries expanding westwards into Europe may fit at least roughly with a split between the Balto-Slavic and Italic+Germanic+Celtic groupings within Indo-European.
  • an east-to-west expansion brought a predominantly Caucasian hunter-gatherer/Iranian DNA component into the Eastern Mediterranean and the Balkans some 2,000 years before the expansions from the forest and grassland Steppe. This matches chronologically and geographically with the Albanian, Greek and Armenian branches of Indo-European.
  • the hybrid hypothesis fits with DNA evidence that from about 7,000 years ago, populations of the Pontic-Caspian steppe derived approximately half of their ancestry from a source first found from the southern Caucasus to north-western Iran.
  • DNA results show no intrusion of Steppe ancestry into the region that spoke the Anatolian branch of Indo-European. But they do detect ‘an admixture event that biologically connected’ regions from Western Anatolia to the Zagros in northern Iran, some 8,500 years ago, causing a common ‘Anatolian/Iranian ancestry cline’. Possible scenarios include population movements westwards and eastwards out of the northern Fertile Crescent—perhaps linked to early splits to the Anatolian and Indo-Iranic branches in a hybrid hypothesis for Indo-European origins.

Comparison with dates discussed above

As summarised above, the paper reports results indicating that:

  • Proto Indo-European started to diverge into descendant languages around 8,000 years ago (6,700 to 9,600 years ago, at 95% probability).
  • the first descendants (Anatolian and Tocharian) started to diverge from the rest of the group around 7,000 years ago (5,500 to 8,600 years ago, at 95% probability).

Heggarty and his co-authors suggest that those dates are:

  • more compatible with the early initial split favoured under the Anatolian hypothesis (around 9,000 years ago) or with the initial early split south of the Caucasus proposed by their hybrid hypothesis.
  • less compatible with the later initial split favoured under the Steppe hypothesis (around 6,000 years ago).      

There is a discrepancy of about 1,000 years between the date the authors give for when Proto Indo-European started to diverge and the date when the first descendant began to diverge from the rest of the family. Before trying to assess which hypotheses are more compatible with the dates their model generates, I would want to understand better what drives this apparently large discrepancy.
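
To put the size of the discrepancy in context, the short Python sketch below compares the two 95% ranges quoted in this post. The variable and function names are my own, and the 7,000 figure is my rounded median rather than an exact value from the paper.

```python
# Ranges as quoted in this post (years ago, 95% probability).
pie_divergence = (6740, 9610)      # Proto Indo-European starts to diverge
first_branch_split = (5500, 8600)  # Anatolian and Tocharian split off


def overlap(a, b):
    """Return the range shared by two (low, high) ranges, or None."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


shared = overlap(pie_divergence, first_branch_split)

# 8,120 is the paper's median; 7,000 is this post's rounded median.
median_gap = 8120 - 7000

print(shared, median_gap)
```

The two ranges share a wide span, so the gap between the medians is small relative to the uncertainty on each date. That does not explain the discrepancy, because the two estimates come from the same analysis and are not independent of each other.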


The authors have produced an interesting theory. I will review their methodology in future posts.

