W3Techs, provided by Q-Success

Content language quality alerts

Posted by Sam Soltano on 23 March 2010 in News, Content Languages

Summary:

We started analyzing the content languages of websites. It turned out that a large number of sites specify the language incorrectly, so we introduced new types of quality alerts to flag these problems.

Properly defining the content language of a website (that is, the natural language used on the site, such as Japanese or Italian) is not as straightforward as one might think. As a matter of fact, it is a terrible mess.

Actually, in the overwhelming majority of cases, it would be as simple as writing lang="en" in the html tag of a page. But if you start reading the corresponding standards to find out what your options are, you feel like you have opened a can of worms. It starts with the current HTML standard (4.01), which is already quite aged and refers to another document when dealing with language codes; that other document has become obsolete in the meantime, so the W3C recommends not using it. Instead, you are expected to read W3C notes and best practice documents, study related standards, and basically try to figure out for yourself what an appropriate approach would be. If you read all the "further readings" on the subject, you will never have time to build a website anyway. You would read about regular and irregular codes, private use language tags, and grandfathered codes (that is no joke, see RFC 5646).

It also doesn't help that the language can be defined in several places:

In the HTTP header

Content-Language: en

As attributes of HTML tags

<html lang="en">

In various HTML meta tags, where besides the http-equiv meta tag

<meta http-equiv="Content-Language" content="en">

you have the rarely used Dublin Core notation meta tag

<meta name="DC.language" content="en">

and you have the common but undefined plain language meta tag

<meta name="language" content="en">

If you read further, you notice that there are subtle differences, which the W3C describes as "primary language" vs. "text-processing language" vs. "language of the document's intended audience". And there are unintuitive priorities and relationships between these specifications.
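To see how these places can disagree on a single page, here is a minimal Python sketch that gathers every declaration side by side. It is only an illustration, not the analysis we actually run: the function and class names are made up for this example, and the meta names DC.language and language are my assumption about the Dublin Core and plain notations mentioned above. It deliberately does not try to decide which declaration wins.

```python
from html.parser import HTMLParser


class LanguageDeclarations(HTMLParser):
    """Collect language declarations found in the markup of one page."""

    def __init__(self):
        super().__init__()
        self.found = {}  # where it was declared -> declared value

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; values may be None.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "html" and "lang" in attrs:
            self.found["html lang attribute"] = attrs["lang"]
        elif tag == "meta":
            content = attrs.get("content", "")
            if attrs.get("http-equiv", "").lower() == "content-language":
                self.found["http-equiv meta tag"] = content
            elif attrs.get("name", "").lower() == "dc.language":  # assumed name
                self.found["Dublin Core meta tag"] = content
            elif attrs.get("name", "").lower() == "language":  # assumed name
                self.found["plain language meta tag"] = content


def collect_declarations(http_headers, html_text):
    """Return a dict of every language declaration, keyed by its source."""
    found = {}
    for name, value in http_headers.items():
        if name.lower() == "content-language":
            found["HTTP header"] = value
    parser = LanguageDeclarations()
    parser.feed(html_text)
    found.update(parser.found)
    return found
```

Feeding it a page whose header says en, whose http-equiv meta tag says de and whose plain meta tag says english gives you three different answers for the same document, which is exactly the situation described here.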

If you want to be more precise in your definition of the languages, you open another Pandora's box. You learn that you may differentiate between country- or region-specific tags, tags that indicate dialects (take for example "sl-IT-nedis" as the specific variant of the Nadiza dialect of Slovenian that is spoken in Italy), and tags that specify scripts, which must not be confused with the character encoding of those scripts.
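To make the structure of such tags a bit more tangible, here is a small Python sketch that splits a tag into language, script, region and variants. It is a simplification under stated assumptions: it only covers the common language-script-region-variant shape and ignores extended language subtags, extensions, private use tags and the grandfathered codes, and it checks only the form of a tag, not whether the subtags are actually registered. The function name is made up for this example.

```python
import re

# Simplified shape: language[-Script][-REGION][-variant...]
_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.ASCII | re.IGNORECASE,
)


def split_language_tag(tag):
    """Split a tag into its parts, or return None if the form is not recognized."""
    match = _TAG.match(tag.strip())
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "variants": [v.lower() for v in match.group("variants").split("-") if v],
    }
```

For "sl-IT-nedis" this yields the language sl, the region IT and the variant nedis, with no script. A value like "português" is rejected, because it is the name of a language and not a tag.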

No wonder that some webmasters just give up and write language="português" and hope for the best. Others simply take any old code that's lying around and use it as a language code. For example, I appreciate that Switzerland is a multilingual country, but the number of Swiss websites supposedly written in Chamorro (language code ch) is amazing. It seems like every one of the 47,000 people in the Mariana Islands who speak this language runs a website or two in Switzerland. The amount of Breton (language code br) used in Brazil and Twi (language code tw) used in Taiwan would also be surprising if we relied on what webmasters declare.
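A check for these two kinds of mistakes could look roughly like the following Python sketch. It is not the logic behind our alerts, just an illustration: the function name, the messages and the three-entry table (taken from the examples above) are assumptions, and the well-formedness test is deliberately loose.

```python
import re

# Language codes from the examples above that coincide with a country-code
# top level domain but name a language rarely used on sites in that country.
COUNTRY_LOOKALIKES = {
    "ch": "Chamorro",  # .ch is Switzerland
    "br": "Breton",    # .br is Brazil
    "tw": "Twi",       # .tw is Taiwan
}

# Loose shape test: a 2-3 letter primary subtag plus optional subtags.
_LOOKS_LIKE_TAG = re.compile(
    r"^[a-z]{2,3}(?:-[a-z0-9]{1,8})*$", re.ASCII | re.IGNORECASE
)


def check_declaration(declared, hostname):
    """Return a warning text for a suspicious declaration, or None."""
    value = declared.strip()
    if not _LOOKS_LIKE_TAG.match(value):
        return f"'{declared}' is not a well-formed language tag"
    primary = value.split("-")[0].lower()
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    if primary == tld and primary in COUNTRY_LOOKALIKES:
        return (
            f"'{primary}' means {COUNTRY_LOOKALIKES[primary]}; "
            f"the country code of .{tld} may have been used by mistake"
        )
    return None
```

Note that the sketch only complains about br on a .br domain, not on a .fr domain, since Breton on a French site is entirely plausible.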

To be fair, natural languages are an extremely complex subject. But what web standardization bodies have made of it is more than a typical webmaster can master.

Our new quality alerts indicate some of these problems and should help webmasters clean up their language specifications. You will find the new alerts at the end of the quality alerts overview and, of course, on the site info page of the websites where they apply.

Copyright © 2009-2013 Q-Success