Content language surveys
Summary: We now publish surveys on the usage of content languages on websites. Content languages are the natural languages of the text on a site, for example English or Chinese.
For our robot, determining the content language of a site is not as simple as one would think. The reasons for that are:
-
Many websites don't bother specifying the language. Although it is certainly not the most important thing to specify for a website, ignoring it is still bad practice. According to the W3C Internationalization Best Practices, language information is useful for things such as authoring tools, translation tools, accessibility, font selection, page rendering, search, and scripting. Specifying the language is as easy as putting a lang="en" attribute on the pages. One company that neglects language specifications on all its local sites is Google. But then, Google is probably one of the few companies that can afford not to be well ranked in Google.
-
Many websites specify the language incorrectly. We introduced content language quality alerts to indicate these problems, and we try to correct them whenever we can. For instance, we accept "italian", "italiano", and even "italy" as language codes, even though all of these are clearly outside any standard.
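To illustrate the kind of correction described above, here is a minimal sketch in Python. It reads the lang attribute from a page's html tag and maps non-standard values to a two- or three-letter code. The alias table, the function names, and the fallback rule are all assumptions made for this example, not a description of how our robot is actually implemented.

```python
from html.parser import HTMLParser

# Hypothetical alias table: illustrative only, not the survey's real mapping.
ALIASES = {
    "italian": "it",
    "italiano": "it",
    "italy": "it",
}


class LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if there is one."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def normalize(value):
    """Map a declared language value to a short code, or None if unusable."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    if not cleaned:
        return None
    # Tolerate known non-standard spellings.
    if cleaned in ALIASES:
        return ALIASES[cleaned]
    # Keep only the primary subtag of values such as "en-US" or "en_us".
    primary = cleaned.replace("_", "-").split("-")[0]
    if len(primary) in (2, 3) and primary.isascii() and primary.isalpha():
        return primary
    return None


def declared_language(html_text):
    """Return the normalized language declared on the <html> tag, if any."""
    parser = LangAttrParser()
    parser.feed(html_text)
    return normalize(parser.lang)
```

A page that declares lang="italy" would be treated as Italian here, while a page with no lang attribute yields None and would have to be classified by other means.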
We think that, despite these difficulties, we can still provide good statistics on the usage of natural languages. This is done partly by applying heuristics where specifications are doubtful, and partly by supplementing the specifications (or making up for missing ones) with other sources.