
Content Languages Forum

Technology proposal: use uniform distribution

Anastasiia on 24 May 2018, 7 years ago

Hello! 

I think it would be nice to have another "Content Languages" survey. 

Technology: Instead of the top 10,000 websites, use 10,000 websites (or any sufficiently large number) chosen uniformly at random.

Why? This is statistically more correct. The top 10,000 websites are a biased sample, and the distribution of languages (or of any other statistic you measure) most probably differs from the distribution across the whole Internet. If you choose a sufficiently large number of websites uniformly at random, you get a sample whose distribution is close to that of the Internet.
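To give a rough sense of what "sufficiently large" means, here is a minimal sketch using the standard normal approximation for a sampled proportion. The function name `margin_of_error` and the example share of 0.5 are illustrative assumptions, not anything W3Techs uses.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a share p estimated
    from n sites drawn uniformly at random (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical example: a language with a true share of 50%.
for n in (100, 10_000, 1_000_000):
    print(n, margin_of_error(0.5, n))
```

Under these assumptions, 10,000 uniformly sampled sites give a margin of roughly one percentage point in the worst case (p = 0.5). This only covers sampling error; it says nothing about whether the sampled sites are meaningful.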

It would be interesting to see these results alongside your existing ones.

 

Best,

Anastasiia

Matthias Gelbmann on 24 May 2018, 7 years ago

Hi Anastasiia,

You bring up a very important point.

If you say that choosing websites randomly is statistically more correct, then I must say it's not as simple as that. If we were counting people speaking languages, I would completely agree with you. However, counting websites is different. Netcraft estimates that out of the 1.8 billion websites, only 10% are "active", and their definition of "active" is very generous, as it doesn't require any activity. Why is that so? Because it's very easy to register a website and have a default page showing up on it. People do this to reserve domain names, to participate in some black hat SEO, or for a number of other reasons. Depending on your definition of "website", you can create millions of sites with a mouse click. Now, if you happened to choose Neapolitan as the language on those sites, our statistics would be heavily skewed towards Neapolitan for no good reason at all.

While your approach is theoretically correct, it doesn't make sense when 90% or more of what you are measuring is garbage. Our way to filter out that garbage is to require a minimal amount of traffic on the sites, as measured by the Alexa ranking. And, btw, we use the top 10 million, not the top 10k sites. We are convinced that this leads to much more useful statistics.
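The effect described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the population size, the language list, the traffic numbers, and the threshold of 100 are all made-up assumptions, and the traffic filter merely stands in for a ranking-based cutoff.

```python
import random
from collections import Counter

def language_shares(sites):
    """Return {language: fraction of sites} for a list of (language, traffic) pairs."""
    counts = Counter(lang for lang, _ in sites)
    total = len(sites)
    return {lang: n / total for lang, n in counts.items()}

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10% active sites with real traffic, 90% parked pages.
active = [(rng.choice(["English", "Russian", "German"]), rng.randint(100, 10_000))
          for _ in range(1_000)]
parked = [("Neapolitan", 0) for _ in range(9_000)]  # bulk-created, no visitors
population = active + parked

# Approach 1: draw sites uniformly at random from everything that exists.
uniform_sample = rng.sample(population, 2_000)

# Approach 2: keep only sites above a minimal traffic threshold (assumed: 100).
traffic_filtered = [site for site in population if site[1] >= 100]

uniform_shares = language_shares(uniform_sample)
filtered_shares = language_shares(traffic_filtered)
print("Neapolitan share, uniform sample:  ", uniform_shares.get("Neapolitan", 0.0))
print("Neapolitan share, traffic-filtered:", filtered_shares.get("Neapolitan", 0.0))
```

In this toy setup the uniform sample is dominated by the parked pages, while the traffic filter removes them entirely. Real data would be far messier, but the direction of the distortion is the same.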
