Content Languages Forum
Mr Pete on 21 June 2019, 25 days ago
I am inquiring to better understand the reliability and usefulness of the Content Languages section of this amazing long-term effort you've created for understanding the Web.
Currently, 192 content languages are represented in the list (https://w3techs.com/technologies/overview/content_language/all).
As I understand it, this is meant to say that:
* For the < 10 million sites for which Alexa notes some activity, and which qualify as distinct sites by your own criteria...
* 192 distinct content languages can be detected in some way.
* This is subject to the great limitation that it can be difficult to identify whether a given website has content in multiple languages.
1) Is it valid to say that a language would be listed if at least one single-language site in that language were discovered?
2) There are currently 184 ISO 639-1 languages, 487 ISO 639-2 languages, and thousands more at the ISO 639-3 level. You use a mix of two and three letter codes...
2a) Question: do you have the ability to identify a significant fraction of the 487 ISO 639-2 languages?
2b) Do you have any sense of what the limit of language identification is for your system? Could it extend to a significant fraction of ISO 639-3?
I ask because the 192 distinct languages listed cover a certain fraction of the population of the world, but not even close to 90%, let alone 99% or more... and that's the direction my consideration goes.
Thanks for your thoughts!
Sam Soltano (site administrator) on 27 June 2019, 20 days ago
Thank you for sharing your considerations.
We use a combination of several techniques to determine the language of a website. Among others, we look at what language the website claims to be written in. However, this is often quite unreliable, partly because it's not obvious how to do it properly (see our somewhat outdated blog entry on that subject).
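To illustrate why the claimed language can be ambiguous, here is a minimal sketch of collecting the places where a page can declare its language: the HTTP `Content-Language` header, the `lang` attribute on `<html>`, and a `<meta http-equiv="content-language">` tag. This is not W3Techs' implementation; the class and function names (`DeclaredLanguageParser`, `declared_languages`) are hypothetical, and the header value is assumed to be passed in by the caller.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collect language declarations found in the HTML markup itself."""

    def __init__(self):
        super().__init__()
        self.html_lang = None  # value of <html lang="...">
        self.meta_lang = None  # value of <meta http-equiv="content-language" content="...">

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        attrs = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attrs.get("lang")
        elif tag == "meta":
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.meta_lang = attrs.get("content")


def declared_languages(html_text, http_header=None):
    """Return every language claim separately, since they may disagree."""
    parser = DeclaredLanguageParser()
    parser.feed(html_text)
    return {
        "http_header": http_header,
        "html_lang": parser.html_lang,
        "meta": parser.meta_lang,
    }
```

A page can easily make conflicting claims, for example `<html lang="de">` together with a meta tag saying `en`, which is one reason a declared language alone would not be enough.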
We do indeed support the ISO 639-1 and ISO 639-2 language codes, of which only 192 languages actually appear on any of the top 10 million sites. As part of our plausibility checks, we do translate some of the ISO 639-3 codes into our supported codes. We could extend our language set to the full range of the ISO 639-3 codes, but at the moment we do not see a real need for this, because these languages are very rarely used on the web.
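The translation of other codes into a supported set, as described above, could be sketched roughly as follows. This is only an illustration under assumptions, not the actual W3Techs logic: the mapping table is a tiny hand-picked excerpt, and `SUPPORTED`, `TO_SUPPORTED`, and `normalize_language_code` are hypothetical names.

```python
# Tiny illustrative excerpt; a real table would be built from the ISO 639 registries.
TO_SUPPORTED = {
    "eng": "en",
    "deu": "de", "ger": "de",  # ISO 639-2/T and 639-2/B codes for German
    "fra": "fr", "fre": "fr",  # ISO 639-2/T and 639-2/B codes for French
    "zho": "zh", "chi": "zh",  # ISO 639-2/T and 639-2/B codes for Chinese
    "cmn": "zh",  # Mandarin (ISO 639-3 only), folded into its macrolanguage Chinese
    "arb": "ar",  # Standard Arabic (ISO 639-3 only), folded into Arabic
}

# Hypothetical supported set mixing two- and three-letter codes;
# "haw" (Hawaiian) has an ISO 639-2 code but no two-letter ISO 639-1 code.
SUPPORTED = {"en", "de", "fr", "zh", "ar", "haw"}


def normalize_language_code(declared):
    """Map a declared tag such as 'de-AT' or 'cmn' to a supported code, or None."""
    if not declared:
        return None
    # Keep only the primary language subtag, e.g. 'en_US' -> 'en'.
    primary = declared.strip().lower().replace("_", "-").split("-")[0]
    code = TO_SUPPORTED.get(primary, primary)
    return code if code in SUPPORTED else None
```

Returning `None` for anything outside the supported set leaves the decision to the other detection techniques instead of guessing.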
Which brings me to your last remark: according to language distribution statistics such as this one at Wikipedia, we do cover the vast majority of the population. What brings you to the conclusion that the 192 languages cover less than 90%?