Measuring how much languages differ

Is it possible to quantify how much one language differs from another? In 2015, two academic researchers tried to do that by creating what they called a ‘Language Friction Index’ (LFI). They describe the index in their paper Language friction and partner selection in cross-border R&D alliance formation, Amol M Joshi and Nandini Lahiri, Journal of International Business Studies, Vol 46, No 2 (2015).

This post:

  • lists the LFIs computed by Joshi and Lahiri for each language pair covered in their paper, with my comments on those LFIs.
  • describes how they computed the LFIs.
  • outlines what they mean by ‘language friction’ and how that differs from language barriers.
  • summarises how they used the LFIs in the research they present in their paper.

LFIs computed by Joshi and Lahiri

Tables 1 and 2 show the LFI Joshi and Lahiri present for each language covered in their paper. For ease of analysis, because they cover several Bantu languages of South Africa, I list those languages separately (table 2) from the other languages they discuss (table 1).

LFIs for language pairs (not South Africa)

Each LFI is for a language pair: English and one other language. For comparison, the table also includes the language pair English-English. As explained below, LFIs are between 0.00 and 1.00.

Table 1. Language Friction Index for pairs English and one other (not South Africa)

Comments on those LFIs

In some respects, the LFI scores for those languages do not surprise me:

  • the LFI for English is 0.00 because, of course, English does not differ from itself.
  • the LFI for some of English’s closer relatives (German, French) is lower than for some other languages with no known relationship to English (eg Korean, Tagalog, Japanese, Sami, Tamil, Malay, Taiwanese)
  • the LFIs for the closely related Norwegian, Swedish and (slightly more distantly related) Dutch are similar to each other.

But in some other respects, the LFIs surprised me:

  • there is no known relationship between English and either Finnish (a Finno-Ugric language) or Mandarin (a Sinitic language), yet the LFIs for those 2 languages are among the lowest of all, and close to the LFIs for English’s close relatives French and German.
  • Mandarin and Taiwanese are both Sinitic languages, yet the difference between their LFIs is huge. Indeed, Taiwanese has the highest LFI of all languages presented, whereas Mandarin has the 4th lowest LFI. Similarly, Finnish and Sami are both Finno-Ugric languages, yet Finnish has the lowest score of all, whereas Sami has one of the higher scores.
  • German and English are related closely to Dutch, and only slightly less closely to Swedish and Norwegian, but Dutch (0.57), Swedish (0.51) and Norwegian (0.58) have much higher LFIs than German (0.33).
  • Italian (0.52), Portuguese (0.61) and Romansch (0.81) have much higher LFIs than French (0.31), even though all 4 languages are members of the Romance sub-group of the Indo-European family.
  • Korean, Tagalog, Japanese, Sami, Tamil and Zulu (covered in table 2) are all unrelated to English but all have LFIs similar to those of English’s close relatives Swedish, Norwegian, Dutch, Italian and Portuguese.

LFIs and language families

Joshi and Lahiri point out that closely related languages from a single language family can differ in key structural features—and languages used in societies that are culturally and geographically distant can have key similarities.

For example, English and Dutch are Germanic languages that are part of the same larger family of Indo-European languages. However, in English, the dominant word order is subject-verb-object (SVO), whereas Dutch often uses both SVO (in main clauses) and SOV (in subordinate clauses). In Mandarin, SVO dominates. For this feature, English and Mandarin are more similar to each other than either language is to Dutch.

I accept that explanation up to a point, though I find it unintuitive that, for example, the LFI for English-Dutch is so much higher than the LFI for English-Mandarin. Would anyone really think Mandarin is closer to English than Dutch is?

LFIs for language pairs (South Africa)

Table 2 shows the LFI index Joshi and Lahiri compute for 10 languages of South Africa.

Northern Soto    0.79
Table 2. Language Friction Index for pairs English and one other (South African)

Comments on the LFIs (South Africa):

  • The LFI index for Zulu is 0.53, similar to the LFI index for Korean, Swedish, Italian, Tagalog, Japanese, Dutch, Norwegian and Portuguese.
  • 8 of the other 9 languages are members of the southern branch of the Bantu family. The LFI index for them is between 0.71 and 0.81, a range similar to the LFI index for Sami, Tamil, Afrikaans, Romansch, Taiwanese and Malay.
  • The LFI for Afrikaans (0.77) is much higher than for Zulu, but similar to the LFIs for the other 8 South African languages investigated. This seems consistent with the idea that Afrikaans developed from a language descended directly from Dutch, fused with a Dutch-based creole that arose in South Africa from descendants of people speaking various other languages spoken locally.

Computing LFI

Joshi and Lahiri created an index, LFI. The LFI applies to a pair of languages and quantifies the structural differences between those 2 languages. The index is based on data in the World Atlas of Language Structures (WALS).

They picked WALS because it is ‘the most comprehensive and authoritative database of linguistic features’. Joshi and Lahiri took their data from WALS in 2011, when it included 192 linguistic features, organised into 10 main categories, across 2678 human languages.

Joshi and Lahiri used the data in WALS to calculate an LFI for a language pair as follows:

  • they compared all 192 linguistic features listed in the WALS database for those 2 languages. In WALS, each feature is coded with one of between 2 and 28 possible values.
  • they assigned each feature a score of 0 (if WALS listed the same code for that feature for both languages) or 1 (if WALS listed different codes for the 2 languages).
  • they averaged the scores for all 192 features, giving each feature an equal weighting. That equally-weighted average is the LFI for that pair of languages. Thus, the LFI is 0 if all 192 codes are the same in both languages. The LFI is, for example, 0.2 if 80% of the codes are the same (so 20% differ). The LFI is 1 if all 192 codes are different.

Computed in this manner, LFI is an overall estimate of the proportion of identifiable linguistic features on which a pair of languages differ. An LFI of 0 means the 2 languages share every feature; an LFI of 1 means they share none.
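To make the arithmetic concrete, here is a minimal sketch of the calculation in Python. It is my own illustration, not Joshi and Lahiri's code. The function name, the dictionary layout and the feature codes are all invented, and I have assumed that only features coded for both languages are compared (the paper's treatment of missing codes is not covered in this post).

```python
def language_friction_index(features_a, features_b):
    """Share of compared features whose codes differ (0 = all match, 1 = all differ)."""
    # Assumption: compare only the features that are coded for both languages.
    shared = features_a.keys() & features_b.keys()
    if not shared:
        raise ValueError("no features coded for both languages")
    # Score 1 where the codes differ, 0 where they match, then take the plain mean.
    differing = sum(1 for f in shared if features_a[f] != features_b[f])
    return differing / len(shared)


# Invented feature codes, for illustration only (not real WALS data).
lang_a = {"word_order": "SVO", "f2": 1, "f3": 2, "f4": 1, "f5": 3}
lang_b = {"word_order": "SVO", "f2": 1, "f3": 2, "f4": 1, "f5": 4}

print(language_friction_index(lang_a, lang_b))  # 0.2: 1 of the 5 features differs
print(language_friction_index(lang_a, lang_a))  # 0.0: a language compared with itself
```

With 4 of the 5 invented features matching (80%), the score is 0.2, which mirrors the example in the list above.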

Equal weightings

The LFI that Joshi and Lahiri created is an equally weighted average of the 192 linguistic features covered by WALS. They acknowledge that using a linear average imposed ‘linearity on a language friction construct which may not be inherently linear’. They recognised that future research might make it seem preferable:

  • to construct sub-indices (if the relationship is not linear); or
  • to reweight some features (if equal weighting is not appropriate).

It seems intuitively unlikely that each feature covered in WALS is equally important. So, I suspect the use of equal weightings may be part of the reason why some of the LFIs listed above seem implausible measures of how much some pairs of languages differ.
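To show how much the choice of weights can matter, here is a small hypothetical sketch in Python of a reweighted index. Joshi and Lahiri did not compute anything like this; the function name, the feature codes and the weights are all invented for illustration, and I assume that only features coded for both languages and given a weight are compared.

```python
def weighted_lfi(features_a, features_b, weights):
    """Weighted share of differing features; equal weights give the plain average."""
    # Assumption: compare only features coded for both languages and given a weight.
    shared = features_a.keys() & features_b.keys() & weights.keys()
    total = sum(weights[f] for f in shared)
    if total <= 0:
        raise ValueError("weights of the compared features must sum to a positive number")
    differing = sum(weights[f] for f in shared if features_a[f] != features_b[f])
    return differing / total


# Invented codes and weights, for illustration only (not real WALS data).
lang_a = {"word_order": "SVO", "f2": 1, "f3": 2, "f4": 1}
lang_b = {"word_order": "SOV", "f2": 1, "f3": 2, "f4": 1}
equal = {"word_order": 1.0, "f2": 1.0, "f3": 1.0, "f4": 1.0}
heavier_word_order = {"word_order": 3.0, "f2": 1.0, "f3": 1.0, "f4": 1.0}

print(weighted_lfi(lang_a, lang_b, equal))               # 0.25
print(weighted_lfi(lang_a, lang_b, heavier_word_order))  # 0.5
```

In this toy case, the 2 languages differ only in word order. Tripling the weight on that single feature doubles the score, from 0.25 to 0.5, so a different weighting scheme could plausibly reorder some of the language pairs.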

Language friction

Joshi and Lahiri distinguish between:

  • language barriers: obstacles to the accurate and complete flow of knowledge between two parties who cannot use a shared language.
  • language friction: a form of cultural friction arising from structural differences in the respective languages used by potential partners to reason and solve problems together. For example, a native speaker of Mandarin and a native speaker of German may both be proficient in English as a second language and may deal with each other in English. Using that shared language may not change how each party thinks about the possible transaction in that party’s own, structurally different, native language.

Joshi and Lahiri’s paper focusses on language frictions between English and other languages, not on language barriers arising from people with only limited knowledge of English.

How Joshi and Lahiri used LFIs

Joshi and Lahiri posited that language friction may make it more difficult for potential partners with different native languages to form business alliances. They constructed their LFIs as part of their test of that hypothesis. They concentrated on cross-border business alliances in the industry of designing and fabricating semi-conductor chips. English is widely used as a lingua franca in that industry.

Among other things, they found that high levels of language friction (as measured by LFI) were associated with greater likelihood that cross-border research and development (R&D) alliances would come into existence.

Countries with more than one language

So far, I have mentioned only LFIs for pairs of languages. Joshi and Lahiri computed LFIs for countries and conducted their analyses at the country level. For linguistically diverse countries, they computed a weighted average LFI, by weighting the LFI for each language by the estimated number of speakers of that language in that country.  
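As a rough sketch of that weighting, again in Python: the function name, the language labels, the speaker counts and the LFIs below are all invented for illustration and are not taken from the paper.

```python
def country_lfi(speakers, lfi):
    """Speaker-weighted average of the language-level LFIs for one country."""
    total = sum(speakers.values())
    if total <= 0:
        raise ValueError("need a positive total number of speakers")
    # Weight each language's LFI by its estimated number of speakers in the country.
    return sum(n * lfi[lang] for lang, n in speakers.items()) / total


# Invented figures for a hypothetical 2-language country.
speakers = {"language_x": 3_000_000, "language_y": 1_000_000}
lfi = {"language_x": 0.5, "language_y": 0.75}

print(country_lfi(speakers, lfi))  # 0.5625
```

Because three quarters of the hypothetical speakers use language_x, the country score (0.5625) sits much closer to that language's LFI (0.5) than to the other's (0.75).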


Joshi and Lahiri’s Language Friction Index is an interesting attempt to quantify differences between languages. Used well, it may prove a useful tool for researchers testing hypotheses about whether differences between languages cause differences in behaviour.

Having said that, for some language pairs, the authors report LFIs that seem unintuitive and implausible as quantifications of differences between the 2 members of those pairs. I suspect it was not a good idea to give equal weighting to all 192 features listed in the World Atlas of Language Structures. It seems unlikely that all of those features are equally important and that they all affect cognition and behaviour to the same extent.

Finally, the authors identify language friction as something distinct from language barriers. This seems to be a useful concept.
