Content Languages Forum
Anastasiia on 24 May 2018, 11 months ago
I think it would be nice to have another "Content Languages" survey.
Methodology: instead of the top 10,000 websites, use 10,000 (or any sufficiently large number) chosen uniformly at random.
Why? This is statistically more correct. The top 10,000 websites are a biased sample, and the distribution of languages (or of whatever statistic you want) most probably differs from the distribution across the whole Internet. If you choose a large enough number of websites uniformly at random, you get a sample whose distribution is close to that of the Internet.
It would be interesting to see these results alongside your existing ones.
Matthias Gelbmann on 24 May 2018, 11 months ago
You bring up a very important point.
If you say "to choose websites randomly is statistically more correct", then I must say it's not as simple as that. If we were counting people speaking languages, I would completely agree with you. However, counting websites is different. Netcraft estimates that out of the 1.8 billion websites, only 10% are "active", and their definition of "active" is very generous, as it doesn't require any activity. Why is that so? Because it's very easy to register a website and have a default page show up on it. People do this to reserve domain names, to participate in some black hat SEO, or for a number of other reasons. Depending on your definition of "website", you can create millions of sites with a mouse click. Now, if you happened to choose Neapolitan as the language on those sites, our statistics would be completely biased for no good reason at all.
While your approach is theoretically correct, it doesn't make sense when 90% or more of what you are measuring is garbage. Our way of filtering out that garbage is to require a minimal amount of traffic on the sites, as measured by the Alexa ranking. And, btw, we use the top 10 million, not the top 10k sites. We are convinced that this leads to much more useful statistics.
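To make the argument above concrete, here is a rough toy simulation in Python. Everything in it is made up for illustration: the population size, the language mix of the active sites, the assumption that every parked site is set to Neapolitan, and the `traffic > 0` filter, which is only a crude stand-in for a ranking-based cutoff and not how the real statistics are computed.

```python
import random
from collections import Counter


def build_population(n_sites, active_fraction, rng):
    """Return (language, traffic) pairs for a hypothetical web.

    Active sites get a language from an invented distribution and some
    traffic; parked sites all use one language and get no traffic.
    """
    active_languages = ["English", "Russian", "German", "Spanish"]
    active_weights = [0.6, 0.15, 0.15, 0.1]  # assumed, not measured
    sites = []
    for _ in range(n_sites):
        if rng.random() < active_fraction:
            language = rng.choices(active_languages, weights=active_weights)[0]
            traffic = rng.randint(1, 1000)
        else:
            language = "Neapolitan"  # assumed default on parked sites
            traffic = 0
        sites.append((language, traffic))
    return sites


def language_shares(sites):
    """Fraction of sites per language in the given list."""
    counts = Counter(language for language, _ in sites)
    total = sum(counts.values())
    return {language: count / total for language, count in counts.items()}


rng = random.Random(42)  # fixed seed so the run is repeatable
population = build_population(100_000, 0.10, rng)
sample_size = 5_000

# Approach 1: sample uniformly from all registered sites.
uniform_sample = rng.sample(population, sample_size)

# Approach 2: keep only sites with some traffic, then sample.
with_traffic = [site for site in population if site[1] > 0]
filtered_sample = rng.sample(with_traffic, sample_size)

uniform_shares = language_shares(uniform_sample)
filtered_shares = language_shares(filtered_sample)

print("uniform: ", sorted(uniform_shares.items(), key=lambda kv: -kv[1]))
print("filtered:", sorted(filtered_shares.items(), key=lambda kv: -kv[1]))
```

In this toy setup the uniform sample is dominated by whatever language the parked sites happen to use, while the traffic-filtered sample reflects the sites people actually visit. It says nothing about the real numbers, only about why the two sampling methods can diverge so much.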