Content Languages Forum

What standard(s), what sense of ability to disambiguate?

Mr Pete on 21 June 2019, 25 days ago

I am inquiring to better understand the reliability and usefulness of the Content Languages section of this amazing long-term effort you've created for understanding the Web.

Currently, 192 content languages are represented in the list (https://w3techs.com/technologies/overview/content_language/all)

As I understand it, this is meant to say that:

* For the fewer than 10 million sites for which Alexa records some activity, and which qualify as distinct sites by your own criteria...

* 192 distinct content languages can be detected in some way.

* This is subject to the great limitation that it can be difficult to identify if there is content in multiple languages on a given website.

Some questions:

1) Is it valid to say that a language would be listed if even one single-language site in that language were discovered?

2) There are currently 184 ISO 639-1 languages, 487 ISO 639-2 languages, and thousands more at the ISO 639-3 level. You use a mix of two and three letter codes...

2a) Question: do you have the ability to identify a significant fraction of the 487 ISO 639-2 languages?

2b) Do you have any sense what is the limit of language identification for your system? Could it extend to a significant fraction of ISO 639-3?

I ask because the 192 distinct languages listed cover a certain fraction of the population of the world, but not even close to 90%, let alone 99% or more... and that's the direction my consideration goes.

Thanks for your thoughts!


Sam Soltano (site administrator) on 27 June 2019, 20 days ago

Hi Pete,

Thank you for sharing your considerations.

We use a combination of several techniques to determine the language of a website. Among others, we look at what language the website claims to be written in. However, this is often quite unreliable, partly because it's not obvious how to do it properly (see our somewhat outdated blog entry on that subject).
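As an illustration of what "the language the website claims to be written in" can mean in practice, here is a minimal Python sketch that collects the declarations from the `lang` attribute of the `html` element and from a `meta http-equiv="Content-Language"` tag. It is not W3Techs's actual method; the class and function names are hypothetical, and a real check would presumably also look at the HTTP Content-Language response header.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collects language declarations found in an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.declared = []  # (source, value) tuples, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values are left as written.
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.declared.append(("html lang", attrs["lang"].strip().lower()))
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            # The meta value may be a comma-separated list of language tags.
            for code in (attrs.get("content") or "").split(","):
                if code.strip():
                    self.declared.append(("meta content-language", code.strip().lower()))


def declared_languages(html_text):
    """Return every language declaration found in html_text."""
    parser = DeclaredLanguageParser()
    parser.feed(html_text)
    return parser.declared


page = (
    '<html lang="en-US"><head>'
    '<meta http-equiv="Content-Language" content="de, fr">'
    "</head><body></body></html>"
)
print(declared_languages(page))
```

The sample page deliberately declares three different languages in two places, which is one way such claims end up being unreliable.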

We do indeed support the ISO 639-1 and ISO 639-2 language codes, of which only 192 languages actually appear on any of the top 10 million sites. As part of our plausibility checks, we do translate some of the ISO 639-3 codes into our supported codes. We could extend our language set to the full range of the ISO 639-3 codes, but at the moment we do not see a real need for this, because these languages are very rarely used on the web.
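To make the idea of translating ISO 639-3 codes into a supported code set concrete, here is a small Python sketch. The mapping table, the supported set, and the function name are all assumptions made for illustration; they are a tiny hand-picked subset and not the table W3Techs uses.

```python
# Illustrative subset only (assumption): a real table would be derived from
# the published ISO 639-3 code tables, not written by hand.
ISO_639_3_TO_SUPPORTED = {
    "deu": "de",  # German
    "fra": "fr",  # French
    "cmn": "zh",  # Mandarin Chinese, folded into its macrolanguage Chinese
    "arz": "ar",  # Egyptian Arabic, folded into its macrolanguage Arabic
    "nob": "nb",  # Norwegian Bokmal
}

# Hypothetical set of codes the system reports on.
SUPPORTED = {"en", "de", "fr", "zh", "ar", "nb"}


def normalize_language_code(code):
    """Map a declared language tag to a supported code, or None if unknown."""
    # Keep only the primary subtag: "en-US" and "en_US" both become "en".
    primary = code.strip().lower().replace("_", "-").split("-")[0]
    if primary in SUPPORTED:
        return primary
    return ISO_639_3_TO_SUPPORTED.get(primary)


for declared in ["en-US", "CMN", "arz", "qaa"]:
    print(declared, "->", normalize_language_code(declared))
```

Codes that fall outside both tables come back as `None`, so a caller could treat the declaration as implausible and fall back to other detection techniques.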

Which brings me to your last remark: According to language distribution statistics such as this one at Wikipedia, we do cover the vast majority of the population. What brings you to the conclusion that the 192 languages cover less than 90%?

Copyright © 2009-2019 Q-Success