Content Languages Forum

Technology proposal: use uniform distribution

Anastasiia on 24 May 2018

Hello!

I think it would be nice to have another "Content Languages" survey.

Technology: Instead of the top 10,000 websites, use 10,000 websites (or any sufficiently large number) chosen uniformly at random.

Why? This is statistically more sound. The top 10,000 websites are a biased sample, and the distribution of languages (or of whatever statistic you measure) most probably differs from the distribution across the whole Internet. If you choose a large enough number of websites uniformly at random, you get a sample whose distribution is close to that of the Internet.

It would be interesting to see these results alongside your existing ones.
Best,
Anastasiia

Matthias Gelbmann on 24 May 2018

Hi Anastasiia,

You bring up a very important point. If you say "to choose websites randomly is statistically more correct", then I must say it's not as simple as that. If we were counting people speaking languages, I would completely agree with you. However, counting websites is different.

Netcraft estimates that out of the 1.8 billion websites, only 10% are "active", and their definition of "active" is very generous, as it doesn't require any activity. Why is that so? Because it's very easy to register a website and have a default page show up on it. People do this to reserve domain names, to participate in some black hat SEO, or for a number of other reasons. Depending on your definition of "website", you can create millions of sites with a mouse click. Now, if you happened to choose Neapolitan as the language on those sites, our statistics would be completely skewed for no good reason at all.

While your approach is theoretically correct, it doesn't make sense when 90% or more of what you are measuring is garbage. Our way to filter out that garbage is to require a minimal amount of traffic on the sites, as measured by the Alexa ranking. And, btw, we use the top 10 million, not the top 10k sites.

We are convinced that this leads to much more useful statistics.
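To make the trade-off discussed in this thread concrete, here is a minimal simulation sketch. It is not W3Techs' methodology, and every number in it is an assumption chosen for illustration: a synthetic population of 100,000 sites, 90% of them parked pages with zero traffic and all set to one language ("nap", standing in for the Neapolitan example), and an invented language mix for the remaining active sites. The function names are hypothetical as well. The sketch compares the language shares seen by a uniform random sample with those seen by a traffic-ranked sample.

```python
import random
from collections import Counter


def build_population(n_sites=100_000, parked_share=0.9, seed=0):
    """Build a synthetic list of (traffic, language) pairs. All values are made up."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        if rng.random() < parked_share:
            # Parked or default pages: no traffic, all set to one arbitrary language.
            sites.append((0.0, "nap"))
        else:
            # Active sites: heavy-tailed traffic (always >= 1) and an assumed language mix.
            traffic = rng.paretovariate(1.2)
            lang = rng.choices(["en", "ru", "de", "es"], weights=[55, 20, 15, 10])[0]
            sites.append((traffic, lang))
    return sites


def language_shares(sites):
    """Return the fraction of sites per language."""
    counts = Counter(lang for _, lang in sites)
    total = sum(counts.values())
    return {lang: count / total for lang, count in counts.items()}


def uniform_sample(sites, k, seed=1):
    """Pick k sites uniformly at random, without replacement."""
    return random.Random(seed).sample(sites, k)


def top_by_traffic(sites, k):
    """Pick the k sites with the highest traffic."""
    return sorted(sites, key=lambda site: site[0], reverse=True)[:k]


population = build_population()
uniform_shares = language_shares(uniform_sample(population, 5_000))
top_shares = language_shares(top_by_traffic(population, 5_000))

print("uniform sample:", uniform_shares)
print("traffic-ranked:", top_shares)
```

Under these assumptions the uniform sample is dominated by the parked-page language, while the traffic-ranked sample contains none of it. The sketch only shows the mechanism behind the filtering argument; whether a traffic cut-off introduces a bias of its own is a separate question that this toy model does not address.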