Database of Indo-European cognates

This is a follow-up to my post From the Steppes or from Anatolia? – Language Miscellany. That post explored a recent paper looking again at some old questions:

  • when did Indo-European languages separate from the rest of that language family, and in which order?
  • where did speakers of the ancestral language Proto Indo-European live?

The paper Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages, by Paul Heggarty, Cormac Anderson and Matthew Scarborough (all of the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany) with 30 co-authors, was published in Science in July 2023. The paper is available as a free download from IE-CoR – Home (clld.org), with supplementary materials available at Guide_to_Data_and_Results_Files.docx (science.org).

My earlier post summarised the authors’ conclusions on when and how Proto Indo-European started to diverge into descendant languages, and when and how those languages then spread.

This post looks at the cognate database that the authors created. In a future post, I will look at the Bayesian approach they used to estimate when (and how) the 12 branches (and the languages within them) separated.

IE-CoR database

Heggarty and his coauthors created a database of cognates for 109 modern and 52 older Indo-European languages. These include many new languages not covered by previous cognate databases for Indo-European.

The authors describe the database as the first of a new breed of language databases on Cognate Relationships across language families, applied here in the first instance to the Indo-European language family. They use the abbreviation IE-CoR. It is available at https://iecor.clld.org/

The database contains a set of basic reference meanings. For each meaning, the database records which languages express it with words that derive from the same original word.

Creating the database

Drawing up IE-CoR involved 3 main tasks:

  • drawing up a list of reference meanings
  • identifying the primary word (lexeme) used in each language
  • assessing which lexemes are cognates

Reference meanings

IE-CoR currently covers 170 meanings. The meanings are basic concepts such as body parts, numbers, colour terms, and so on. Examples are HAND, THREE, FIRE, NIGHT, SMALL and EAT. IE-CoR’s new ‘Jena 170’ meaning set is based on 3 sets widely used in linguistics: the Swadesh 100-meaning set, the Swadesh 200-meaning set, and the Leipzig-Jakarta 100-meaning set. The authors combined, adapted and optimised these 3 sets to produce the most consistent data set.

The language data was entered entirely anew. It does not continue from previous databases, which contained many data errors and inconsistencies.

Identifying the primary term

For each language and each of the 170 meanings, the database contains that language’s primary term for the precise IE-CoR definition of that meaning. IE-CoR used over 80 specialists to identify lexemes for the languages in which they have expertise.

One word per meaning

For each meaning, IE-CoR lists at most one primary word (or ‘lexeme’) in each language.

Previous cognate databases of Indo-European often listed several near-synonymous lexemes under a single meaning. For example, some databases listed for BLACK in Ancient Greek not just the word mélas (μέλας) but also the word kelainós (κελαινός). In contrast, IE-CoR lists only mélas (μέλας).

Listing more than one word for a meaning in a language led to widespread and subjective inconsistencies in the number of lexemes and cognate sets per language. For a reference set of 207 meanings, one previous study had 231 cognate sets for Modern Greek, but 364 for Ancient Greek. This inconsistency caused at least 133 extra ‘changes’ (364-231) between these languages: ‘losses’ of cognate sets on the branch to Modern Greek and/or supposed ‘gains’ on a theoretically distinct branch leading to Ancient Greek. This data inconsistency distorted estimates of when the lineages leading to these languages split.
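The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration using the two counts quoted above. The function name is hypothetical, and the simple difference is a lower bound on the mismatch, not the authors' own method.

```python
def min_spurious_changes(sets_a: int, sets_b: int) -> int:
    """Lower bound on the gains/losses needed to reconcile two cognate-set counts.

    If one language is coded with more cognate sets than another for the same
    list of meanings, at least the difference between the two counts has to be
    explained as 'gains' on one branch or 'losses' on the other.
    """
    return abs(sets_a - sets_b)


# Counts quoted above for a 207-meaning list in one earlier study.
modern_greek_sets = 231
ancient_greek_sets = 364

extra = min_spurious_changes(ancient_greek_sets, modern_greek_sets)
print(extra)  # 133
```

With at most one primary lexeme per meaning, as in IE-CoR, no language can have more cognate sets than there are meanings, so a gap of this kind cannot arise simply from listing extra synonyms.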

Identifying cognates

IE-CoR currently contains 5,013 cognate sets covering the 170 meanings in the database. Each cognate set includes only words (lexemes) assessed as being cognates. Words in different languages are cognate if they descend from the same source (directly, and not by borrowing from another language). For example:

  • in the meaning SALT, French sel, German Salz and English salt are all cognate with each other. They all derive from the same Proto-Indo-European source word, reconstructed as *sal.
  • in the meaning BLACK, French noir, German schwarz and English black are not cognate with each other.

To assess whether words are cognates (and not borrowed), IE-CoR relied on many specialists. The database includes references to the source documents, including many leading works in Indo-European linguistics—especially for all 1,600+ cognate sets that go back to Indo-European.

The database captures judgements about whether lexemes in different languages are cognates of each other. Those judgements often rely largely on interpreting information about sound systems and sound changes (phonology) and forms (morphology). But the database itself does not directly record phonological and morphological information.

Example

The following example shows how these basic concepts—languages, meanings, lexemes and cognate sets—all relate to each other.

For the reference meaning FIRE, IE-CoR defines the reference as ‘the most generic, basic and default noun for fire, preferably applicable both to the concept of fire in general, and to a concrete instance of (a) fire’. The appendix to this post shows the complete description of FIRE as it appears in IE-CoR.

IE-CoR lists the primary, predominant, basic lexeme for FIRE in each language. It contains 156 lexemes for FIRE, in 22 cognate sets.   

  • In the meaning FIRE, the German word Feuer and English fire both derive from the corresponding word *fōr in the Proto-Germanic language. That word itself originated in the more ancient Proto-Indo-European word reconstructed as *péh₂u̯r̥. That shared ancestry defines IE-CoR cognate set 219, i.e. the set of words (in 23 languages) that all originated in *péh₂u̯r̥. Like any cognate set, its entry in IE-CoR can be viewed at the corresponding IE-CoR website URL of the form: https://iecor.clld.org/cognatesets/219
  • Also in the meaning FIRE is another cognate set, 774, defined by descent from a different Proto-Indo-European word: *h₁n̥gni. This cognate set includes words for FIRE in 37 other languages, including Latin ignis, Lithuanian ugnìs and Early Vedic (Sanskrit) agníḥ (अग्निः).
  • Alongside the *péh₂u̯r̥ and *h₁n̥gni cognate sets, there are a further 20 independent cognate sets in this FIRE meaning, across the 161 languages in IE-CoR. Each of these remaining 20 cognate sets is limited (as the primary word for FIRE) to just one of the 10 major branches of Indo-European. Cognate set 877, for instance, is based on the Latin word focus (originally a hearth or fireplace). It is used only in 21 Romance languages within the Italic branch, e.g. Italian fuoco, Spanish fuego and French feu.
  • Each cognate set is either present (1) or absent (0) in a language, in a given target meaning. In the meaning FIRE, for cognate set 219 (*péh₂u̯r̥) English and German both have a value of 1 (present). But for cognate set 774 (*h₁n̥gni) they both have the value 0 (absent). Early Latin, Lithuanian and Vedic, conversely, have 0 for cognate set 219, and 1 for cognate set 774.
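The present/absent coding in the last bullet can be sketched in Python as follows. The language-to-set assignments are the ones given in the FIRE examples above. The function name and the dictionary layout are my own assumptions for illustration, not the format IE-CoR itself uses.

```python
# Primary FIRE lexeme per language -> cognate set ID, taken from the examples above.
fire_cognate_set = {
    "English": 219,
    "German": 219,
    "Latin": 774,
    "Lithuanian": 774,
    "Early Vedic": 774,
    "Italian": 877,
    "Spanish": 877,
    "French": 877,
}


def binary_coding(assignments):
    """Return {language: {cognate_set_id: 0 or 1}} for a single meaning.

    A language gets 1 for the cognate set its primary lexeme belongs to
    and 0 for every other cognate set attested in that meaning.
    """
    set_ids = sorted(set(assignments.values()))
    return {
        language: {set_id: int(set_id == own_set) for set_id in set_ids}
        for language, own_set in assignments.items()
    }


matrix = binary_coding(fire_cognate_set)
for language, row in matrix.items():
    print(f"{language:12}", " ".join(str(row[s]) for s in sorted(row)))
```

Because each language has at most one primary lexeme per meaning, every row in this sketch contains exactly one 1.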

The full linguistic data and citations can be explored at https://iecor.clld.org. The raw dataset can be downloaded from: https://github.com/lexibank/iecor/tree/master/cldf.
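For readers who want to work with the download, here is a minimal Python sketch of grouping lexemes by cognate set. It assumes the usual CLDF Wordlist layout: a forms table with ID, Language_ID, Parameter_ID and Form columns, and a cognates table with Form_ID and Cognateset_ID columns. I have not verified these names against the IE-CoR files, so check them against the dataset's metadata. The sample rows are tiny stand-ins with invented row IDs; only the words and set numbers come from the FIRE example above.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-ins for the forms and cognates tables (assumed layout).
# Row IDs such as "f1" and "c1" are invented for illustration only.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
f1,English,fire,fire
f2,German,fire,Feuer
f3,Latin,fire,ignis
f4,Lithuanian,fire,ugnìs
"""

COGNATES_CSV = """ID,Form_ID,Cognateset_ID
c1,f1,219
c2,f2,219
c3,f3,774
c4,f4,774
"""


def group_by_cognate_set(forms_file, cognates_file):
    """Map each cognate set ID to a list of (language, form) pairs."""
    forms = {row["ID"]: row for row in csv.DictReader(forms_file)}
    groups = defaultdict(list)
    for row in csv.DictReader(cognates_file):
        form = forms[row["Form_ID"]]
        groups[row["Cognateset_ID"]].append((form["Language_ID"], form["Form"]))
    return dict(groups)


groups = group_by_cognate_set(io.StringIO(FORMS_CSV), io.StringIO(COGNATES_CSV))
for set_id, members in groups.items():
    print(set_id, members)
```

With the real files, the same function should work on file objects opened with open(path, newline="", encoding="utf-8"), provided the column names match.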

Borrowings

If a cognate set originated in a borrowing, then lumping that set in with cognates derived from an earlier stage of a language family can distort an analysis of the family’s history. On the other hand, excluding borrowings discards information about changes after the time of the borrowing. IE-CoR lists borrowings in separate cognate sets, starting at the time of the borrowing.  

Of the 22 cognate sets for FIRE in IE-CoR, 6 originated in borrowings. For instance, cognate set 5348 is the form krak, found in both Eastern and Western Armenian. The entry explains that this lexeme derives from Classical Armenian krak (‘fire’, ‘flames’), with its further origin unknown, perhaps being a loan from Iranian.

Link to Concepticon

One interesting feature in IE-CoR is that it links the reference meanings to concept labels and concept lists in the database Concepticon. Concepticon tries to link the many different concept lists used in the linguistic literature. These range from Swadesh lists in historical linguistics to naming tests in clinical studies and psycholinguistics. See https://concepticon.clld.org/

Conclusion

IE-CoR is a useful repository of data about cognates. It shows which languages contain cognates for the reference meanings listed in the database.

The main purpose of the database is to create data for use in constructing phylogenetic language trees—family trees showing which languages descended from which other languages. In a future post, I will review the work Heggarty and his co-authors did with the IE-CoR data in their 2023 paper Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages.

Appendix
Reference meaning FIRE

Here is how the IE-CoR database describes the reference meaning FIRE.


Illustrative Context

He was cold, so he moved closer to the fire.

Target Sense
  • Most generic, basic and default noun for fire, preferably applicable both to the concept of fire in general, and to a concrete instance of (a) fire.
  • In many languages this will be the same term as used in the traditional Classical concept of the four ‘elements’, i.e. fire as opposed to earth, air and water.
  • The lexeme selected should fit the target context (as in the illustrative context sentence) of a relatively small fire, controlled and intentionally set, for heating or cooking.
  • The lexeme selected should normally refer to fire visible in the form of flame(s), but avoid terms that have the narrower and more specific meaning of flame rather than fire in general.
  • Avoid terms with specific narrower and limiting senses or usage of any kind, e.g.:
    • Terms that are more limited to the abstract concept of combustion. Indeed, avoid technical register terms such as English combustion.
    • Terms specific to particular types of fire, such as fire burning down a building, e.g. French incendie (rather than the correct generic word feu).
    • Terms with specific senses of intense, uncontrolled, damaging and dangerous fire, e.g. blaze.
    • Terms with a specific sense of a naturally occurring fire, e.g. wildfire, bush fire.
    • Terms specific to indoor fires, fireplaces or hearths, e.g. French foyer (even if ultimately derived from the same root as correct feu).

Note the illustrative case of modern Romance languages, which derive their modern default fire words not from the original Classical Latin ignis, but from the Latin word focus, which originally meant only ‘fireplace’. Regardless of that derivation and original sense, though, the cognates of focus have in modern Romance languages long since broadened semantically to displace the ignis root entirely, and take over its semantic slot as the basic fire word here. So the correct target lexemes for these languages in IE-CoR are indeed feu, fuoco, fuego, etc.

