Content language quality alerts

Posted by Sam Soltano on 23 March 2010 in News, Content Languages

Properly defining the content language of a website (that is, the natural language used on the site, such as Japanese or Italian) is not as straightforward as one might think. As a matter of fact, it is a terrible mess.

Actually, in the overwhelming majority of cases, it would be as simple as writing lang="en" in the html tag of a page. But if you start reading the corresponding standards to find out what your options are, you feel like you have opened a can of worms. It starts with the current HTML standard (4.01), which is already quite aged and refers to another document when dealing with language codes. That other document has become obsolete in the meantime, so W3C recommends not using it. Instead, you are expected to read W3C notes and best practice documents, study related standards, and basically try to figure out for yourself what an appropriate approach would be. If you read all the "further readings" on the subject, you will never have time to build a website anyway. You would read about regular and irregular codes, private use language tags, and grandfathered codes (that is no joke, see RFC 5646).

It also doesn't help that the language can be defined in several places:

In the HTTP header:
Content-Language: en
As attributes of HTML tags.

In various HTML meta tags, where besides the http-equiv meta tag you have the rarely used Dublin Core notation meta tag and the common but undefined plain language meta tag.
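To see how scattered these declarations can be, here is a minimal Python sketch that gathers them from one response. It is only an illustration, not how our alerts are implemented. The meta tag names it looks for (`DC.language` for the Dublin Core notation and `language` for the plain variant) are assumptions based on common usage, and the names `LanguageDeclarationParser` and `collect_language_declarations` are made up for this example.

```python
from html.parser import HTMLParser


class LanguageDeclarationParser(HTMLParser):
    """Collects every language declaration found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.declarations = []  # list of (source, value) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attributes = {name: (value or "") for name, value in attrs}
        # A lang attribute may appear on any element, e.g. <html lang="en">.
        if "lang" in attributes:
            self.declarations.append((f"{tag} lang attribute", attributes["lang"]))
        if tag != "meta":
            return
        content = attributes.get("content", "")
        http_equiv = attributes.get("http-equiv", "").lower()
        name = attributes.get("name", "").lower()
        if http_equiv == "content-language":
            self.declarations.append(("http-equiv meta tag", content))
        elif name == "dc.language":  # assumed Dublin Core name
            self.declarations.append(("Dublin Core meta tag", content))
        elif name == "language":  # assumed plain meta tag name
            self.declarations.append(("plain language meta tag", content))


def collect_language_declarations(headers, html):
    """Return (source, value) pairs from the HTTP headers and the markup."""
    found = []
    for header_name, header_value in headers.items():
        if header_name.lower() == "content-language":
            found.append(("HTTP header", header_value.strip()))
    parser = LanguageDeclarationParser()
    parser.feed(html)
    parser.close()
    return found + parser.declarations


if __name__ == "__main__":
    sample_html = (
        '<html lang="en"><head>'
        '<meta http-equiv="Content-Language" content="en">'
        '<meta name="language" content="english">'
        "</head><body></body></html>"
    )
    for source, value in collect_language_declarations(
        {"Content-Language": "en"}, sample_html
    ):
        print(source, "->", value)
```

The sketch only lists what it finds. Deciding which declaration should win when they disagree is exactly the part the standards make hard.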
If you read further, you notice that there are subtle differences, which W3C describes as "primary language" vs. "text-processing language" vs. "language of the document's intended audience". And there are unintuitive priorities and relationships between these specifications.

If you want to be more precise in your definition of the languages, you open another Pandora's box. You learn that you may differentiate between country- or region-specific tags, tags that indicate dialects (take for example "sl-IT-nedis", the specific variant of the Nadiza dialect of Slovenian that is spoken in Italy), and tags that specify scripts, which must not be confused with the character encoding of those scripts.

No wonder that some webmasters just give up, write language="português" and hope for the best. Others simply take any old code that's lying around and use it as a language code. For example, I appreciate that Switzerland is a multilingual country, but the number of Swiss websites supposedly written in Chamorro (language code ch) is amazing. It seems as if every one of the 47,000 people in the Mariana Islands who speak this language runs a website or two in Switzerland. The amount of Breton (language code br) used in Brazil and Twi (language code tw) used in Taiwan would also be surprising if we relied on webmasters.

To be fair, natural languages are an extremely complex subject. But what web standardization bodies make of it is more than a typical webmaster can master.

Our new quality alerts indicate some of these problems and should help webmasters clean up their language specifications. You will find the new alerts at the end of the quality alerts overview and, of course, on the site info page of the websites where they apply.
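The country-code mix-ups mentioned above (ch, br, tw) are easy to catch mechanically. The following Python sketch splits a language tag into its subtags and flags the three examples from this post. It is a simplified illustration under stated assumptions: the lookup table contains only those three codes, the syntax check covers just the primary subtag, private use and grandfathered tags are ignored, and the language subtag registry is not consulted. The names `COUNTRY_LOOKALIKES`, `split_language_tag`, and `check_language_tag` are hypothetical.

```python
# Illustrative table only: language codes from this post that coincide
# with the country codes of places where another language dominates.
COUNTRY_LOOKALIKES = {
    "ch": ("Chamorro", "Switzerland"),
    "br": ("Breton", "Brazil"),
    "tw": ("Twi", "Taiwan"),
}


def split_language_tag(tag):
    """Split a tag such as 'sl-IT-nedis' into lowercase subtags."""
    return [part.lower() for part in tag.strip().split("-") if part]


def check_language_tag(tag):
    """Return a warning string for a suspicious tag, or None if it looks fine."""
    subtags = split_language_tag(tag)
    if not subtags:
        return "empty language tag"
    primary = subtags[0]
    # Simplified rule: the primary subtag is 2 to 8 ASCII letters.
    if not (primary.isascii() and primary.isalpha() and 2 <= len(primary) <= 8):
        return f"'{tag}' does not start with a valid primary language subtag"
    if primary in COUNTRY_LOOKALIKES:
        language, country = COUNTRY_LOOKALIKES[primary]
        return f"'{tag}' means {language}; was the country code for {country} intended?"
    return None


if __name__ == "__main__":
    for candidate in ["en", "sl-IT-nedis", "ch", "português"]:
        print(candidate, "->", check_language_tag(candidate))
```

A warning here is only a hint. A site written in Chamorro is perfectly legitimate, so a real check would also need to look at where the site is hosted or what the text actually says.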