The most distinctive letter combinations in different European languages

8 comments
  1. Methodology: I extracted 100MB of article text from each of the different Wikipedias and analysed the letter frequencies using Python. The map shows the letters and letter combinations whose frequencies most exceed the average across all the languages that use the same script.

    **Clarification: the letters can appear in the middle of words, not just at the start or by themselves.**

    PS I agree that the algorithm is currently a bit too biased towards common combinations; I’ll have a think about how to pick up more distinct but less common combinations without having the maps dominated by really obscure garbage. Here’s [a first stab](https://i.imgur.com/xAoKUJy.png): I quite like the single letters, but the pairs and triples are very fragile and change completely if I so much as sneeze at the code.

  2. If “distinctive” means how many times each letter is repeated, shouldn’t the title be something like “The most repeated letter”? I’d say that in terms of distinctiveness, ours would probably be something like Ж, Щ or Ю, which not all Cyrillic-using countries use.

  3. “ent” for PT? I assume it comes from the “-mente” ending of adverbs; otherwise I have no idea, since only a few words start with “ent”.

  4. Seems about right for Slovakia, because ___ých is how we end the majority of plural adjectives in the Genitive, Accusative and Locative cases for masculine words, and in the Genitive and Accusative for feminine and neuter words.

  5. ă or ț should be the most distinctive for Romanian, ő or ű for Hungarian, while ść should be the most distinctive 2-letter combination for Polish.
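
The methodology in comment 1 (count letter combinations per language, then compare each frequency with the average across languages) can be sketched roughly as below. This is only an illustration, not the author's actual code: the function names `ngram_frequencies` and `most_distinctive` are hypothetical, the toy strings stand in for the 100MB Wikipedia extracts, and scoring by "frequency minus mean" is an assumption about what "most exceed the average" means. The sketch also averages over every language it is given rather than grouping by script.

```python
from collections import Counter


def ngram_frequencies(text, n):
    """Relative frequency of each run of n consecutive letters in text."""
    text = text.lower()
    counts = Counter(
        text[i:i + n]
        for i in range(len(text) - n + 1)
        if text[i:i + n].isalpha()  # skip runs containing spaces, digits, punctuation
    )
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def most_distinctive(corpora, n, top=1):
    """For each language, rank n-grams by frequency minus the cross-language mean."""
    freqs = {lang: ngram_frequencies(text, n) for lang, text in corpora.items()}
    all_grams = set().union(*freqs.values())
    mean = {
        g: sum(f.get(g, 0.0) for f in freqs.values()) / len(freqs)
        for g in all_grams
    }
    result = {}
    for lang, f in freqs.items():
        # Assumed score: absolute excess over the mean (favours common n-grams).
        ranked = sorted(f, key=lambda g: f[g] - mean[g], reverse=True)
        result[lang] = ranked[:top]
    return result


# Toy stand-ins for the Wikipedia extracts (hypothetical sample text).
corpora = {
    "en": "the thing that they thought",
    "de": "ich schreibe schnell und schön",
}
print(most_distinctive(corpora, 2))
```

Scoring by absolute difference favours combinations that are already common, which matches the bias mentioned in the PS. Dividing by the mean instead would surface rarer combinations, at the cost of being much noisier for low counts.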
