Wikimedia Traffic Analysis Report - Google requests

Monthly requests or daily averages, for period: 1 Jun 2015 - 30 Jun 2015

This analysis is based on a 1:1000 sampled server log (squids)

Notes on reliability of these data

Unresolved Bugzilla bugs: 46190, 46191, 46195, 46201, 46265, (46267), 46268, 46269, 46271, 46273, 46274, 46275, 46277, 46278, 46279, 57376

Aug 2015: This report has been discontinued until further notice, due to lack of maintenance.
Only some of the Wikistats traffic reports will be migrated to the Wikimedia Foundation's new Hadoop-based infrastructure.
If you want this particular report included in the migration, let us know (the id for this report is '3').

This report shows all requests to Wikimedia servers in which a Google server or service was involved in any way: the GoogleBot crawler or FeedFetcher collector scripts that run on Google servers, a user who follows a link from a Google Web or Google Desktop search results page, or a request that comes from Google Maps, Google Earth, and so on.
Technically speaking, three fields in the squid log records are checked for this: client IP address, referer header and user agent header. A request can originate from an IP address that is registered to Google, and/or carry a referer header showing that a user clicked a link on a Google results page, and/or carry an agent string that mentions a Google application and can reasonably be assumed to be genuinely Google's. See bottom of page for further details.
In total, Google was somehow involved in 37.6% of daily external page* requests.
Through its services, including search, Maps and Google Earth, Google referred 227,525,700 page views per day to our sites, representing 35.7% of our external page requests.
Including all of its different search crawlers and the services hosted on its servers, Google itself requested another 12,600,210 pages per day, representing 2.0% of our external page requests.
* = mime type text/html only
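
As a rough cross-check of the figures above, the sketch below works out the range in which the combined share must lie, given that the published 35.7% is itself rounded to one decimal. The function name and the `half_step` parameter are illustrative; this is not how the report computed its percentages.

```python
# Figures quoted in the report (daily averages, text/html requests only).
REFERRED_PER_DAY = 227_525_700   # page views referred by Google services
CRAWLED_PER_DAY = 12_600_210     # pages requested by Google's own servers
REFERRED_SHARE = 0.357           # published share, rounded to one decimal


def combined_share_bounds(referred, crawled, referred_share, half_step=0.0005):
    """Return (low, high) bounds for the combined share of external page requests.

    The published share is rounded, so the true value lies within
    +/- half_step of it. The combined share is the referred share
    scaled by (referred + crawled) / referred.
    """
    factor = (referred + crawled) / referred
    return (referred_share - half_step) * factor, (referred_share + half_step) * factor


low, high = combined_share_bounds(REFERRED_PER_DAY, CRAWLED_PER_DAY, REFERRED_SHARE)
print(f"combined share between {low:.2%} and {high:.2%}")
```

The bounds come out at roughly 37.6% to 37.7%, so the published total is consistent with the two component figures once rounding is taken into account.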

In order of request volume

Requests originating from a Google IP address

Table columns: Service, Total, Pages, Images, Other (figures not preserved in this copy).
Rows: GoogleBot, Other, Web search, Wireless, Translate, FeedFetcher, Total

Requests originating from elsewhere

Table columns: Service, Total, Pages, Images, Other (figures not preserved in this copy).
Rows: Web search, Other, GoogleBot?, Desktop, Wireless, Translate, FeedFetcher, Image search, Toolbar, Mail, Earth, Maps, Total

Top level domains

Rows (figures not preserved in this copy):
.com, .de, .jp, .uk, .in, .it, .fr, .org, .mx, .br, .es, .ru, .ca, .pl, .ar, .au, .co, .tr, .nl, .ph,
.tw, .cl, .id, .ua, .pe, .se, .th, .at, .be, .ch, .ve, .il, .hk, .ro, .fi, .sg, .gr, .pt, .hu, .dk,
.ec, .cz, .sa, .no, .ie, .pk, .za, .rs, .nz, .kr, .eg, .bg, .ae, .hr, .kz, .gt, .sk, .do, .by, .uy,
.bo, .ng, .dz, .vn, .py, .ba, .bd, .si, .sv, .iq, .pr, .az, .lt, .tn, .lk, .my, .hn, .jo, .ee, .lv,
.lb, .ge, .kw, .ni, .gh, .np, .am, .cy, .lu, .al, .jm, .tt, .tz, .bh, .ps, .mz, .net, .me, .mt, .mu,
.sn, .cr, .kh, .ly, .ht, .et, .is, .cd, .cm, .gp, .mm, .af, .zw, .mn, .ma, .bn, .zm, .pa, .mg, .tj,
.na, .sr, .bs, .mv, .la, .ir, .tm, .ml, .md, .bw, .as, .gy, .bf, .so, .rw, .bj, .cat, .pg, .bz, .cv,
.cu, .sl, .cg, .ad, .biz, .bt, .tg, .ci, .ag, .qa, .uz, .li, .ls, .vc, .tl, .dm, .bi, .td, .dj, .ke,
.kg, .us, .gm, .gi, .vg, .gl, .io, .om, .cn, .mk, .ai, .st, .ws, .tk, .sc, .je, .ao, .eu, .fm, .tv,
.ga, .gg, .sb, .fj, .su, .cf, .ug, .to, .im, .edu, .sm, .cc, .ck, .mw, .ne, .ac, .sh, .ki, .vu, .nu,
.pw, .vi, .ms, .gq, .nr, .nf, .pro, undefined, Total

IP ranges: known IP ranges for Google are 64.233.[160.0-191.255], 66.249.[64.0-95.255], 66.102.[0.0-15.255], 72.14.[192.0-255.255],
74.125.[0.0-255.255], 209.85.[128.0-255.255], 216.239.[32.0-63.255] and a few minor other subranges.
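
A membership test against the seven main ranges listed above could look like the following minimal sketch. Each range happens to align with a single CIDR block, which is the form used here. The "few minor other subranges" are not listed in this report and are therefore not covered, and the names `GOOGLE_NETWORKS` and `is_google_ip` are illustrative, not taken from the report's scripts.

```python
import ipaddress

# CIDR equivalents of the seven ranges listed in the report.
# The unlisted "minor other subranges" are deliberately left out.
GOOGLE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "64.233.160.0/19",   # 64.233.160.0  - 64.233.191.255
        "66.249.64.0/19",    # 66.249.64.0   - 66.249.95.255
        "66.102.0.0/20",     # 66.102.0.0    - 66.102.15.255
        "72.14.192.0/18",    # 72.14.192.0   - 72.14.255.255
        "74.125.0.0/16",     # 74.125.0.0    - 74.125.255.255
        "209.85.128.0/17",   # 209.85.128.0  - 209.85.255.255
        "216.239.32.0/19",   # 216.239.32.0  - 216.239.63.255
    )
]


def is_google_ip(address):
    """Return True if the address falls inside one of the listed ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Not a parseable IP address (for example an anonymised log field).
        return False
    return any(ip in network for network in GOOGLE_NETWORKS)


print(is_google_ip("66.249.64.1"))   # inside 66.249.64.0/19
print(is_google_ip("66.249.96.0"))   # just past the end of that range
```

These ranges reflect what this 2015 report lists; they should not be read as a current list of Google's address space.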
Agents: too many crawlers identify themselves as 'GoogleBot' for that agent string to be taken at face value. Such requests are accepted as genuine Google crawler requests only when the IP address matches a known range (see above). Other records that mention GoogleBot are counted as GoogleBot? (with a question mark, as this may include partners such as DoCoMo). However, when the agent string mentions Google Desktop or Google Earth, this is always accepted.
Service: the service name is based on the agent string (plus, for GoogleBot, a check of the IP address, see above). If this is inconclusive, it is based on the referer string.
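
The decision order described in these notes (agent string first, with an IP check for GoogleBot, then the referer as a fallback) can be sketched as below. This is only an illustration under stated assumptions: the agent markers and referer host prefixes are guesses, not the patterns the report's scripts actually used, and `from_google_ip` is assumed to have been computed beforehand against the known ranges.

```python
from urllib.parse import urlsplit

# Hypothetical agent-string markers; the report does not list the real ones.
AGENT_MARKERS = [
    ("google desktop", "Desktop"),
    ("google earth", "Earth"),
    ("feedfetcher", "FeedFetcher"),
]

# Hypothetical referer host prefixes, also assumptions for illustration.
REFERER_HOST_PREFIXES = [
    ("translate.google.", "Translate"),
    ("images.google.", "Image search"),
    ("mail.google.", "Mail"),
    ("maps.google.", "Maps"),
]


def classify_service(agent, referer, from_google_ip):
    """Return a service label, or None when no Google indicator is found."""
    agent_lc = (agent or "").lower()

    # GoogleBot is only trusted when the request comes from a Google IP.
    if "googlebot" in agent_lc:
        return "GoogleBot" if from_google_ip else "GoogleBot?"

    # Other agent markers are accepted regardless of the IP address.
    for marker, service in AGENT_MARKERS:
        if marker in agent_lc:
            return service

    # Agent string inconclusive: fall back to the referer host.
    host = (urlsplit(referer or "").hostname or "").lower()
    for prefix, service in REFERER_HOST_PREFIXES:
        if host.startswith(prefix):
            return service
    if host.startswith("www.google.") or host.startswith("google."):
        return "Web search"

    # Only the IP address points to Google.
    return "Other" if from_google_ip else None


print(classify_service("Mozilla/5.0 (compatible; Googlebot/2.1)", "-", False))
```

With the agent string above and a non-Google IP address, the request is labelled `GoogleBot?`, mirroring the question-mark category in the tables.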
Here is a detailed breakdown, per service, of the indicators that pointed to Google (total ≥ 3). The Total column of the original table is not preserved in this copy.

Service        Originating from Google IP address   Referer mentions Google URL   Agent mentions Google service
Desktop        -                                    -                             Y
Earth          -                                    -                             Y
FeedFetcher    -                                    -                             Y
FeedFetcher    Y                                    -                             Y
GoogleBot      Y                                    -                             Y
GoogleBot?     -                                    -                             Y
GoogleBot?     -                                    Y                             Y
Image search   -                                    Y                             -
Mail           -                                    Y                             -
Other          -                                    -                             Y
Other          -                                    Y                             Y
Other          Y                                    -                             -
Other          Y                                    -                             Y
Other          Y                                    Y                             Y
Toolbar        -                                    -                             Y
Translate      -                                    Y                             -
Translate      Y                                    Y                             -
Web search     -                                    Y                             -
Web search     Y                                    Y                             -
Wireless       -                                    -                             Y
Wireless       -                                    Y                             Y
Wireless       Y                                    -                             Y
Wireless       Y                                    Y                             Y

Top Level Domain 'undefined': requests with top level domain 'undefined' are nearly all requests from anonymous ip addresses (crawler and other services)
Note: averages below 1 are always rounded up to 1
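
Since the report is built from a 1:1000 sampled log and covers the 30 days of June 2015, a daily average can be estimated from a sampled record count roughly as in this sketch. The function name, the parameter defaults and the order of scaling and rounding are assumptions; the report does not spell out its exact arithmetic. Only the floor at 1 follows the note above.

```python
def daily_average(sampled_count, sample_rate=1000, days=30):
    """Estimate a daily average from a count taken in a 1:sample_rate sampled log.

    Assumption: each sampled record stands for `sample_rate` real requests,
    and the period total is spread evenly over `days` days.
    """
    if sampled_count <= 0:
        return 0
    estimate = sampled_count * sample_rate / days
    # Per the report's note, non-zero averages below 1 are shown as 1.
    return max(1, round(estimate))


print(daily_average(30))   # 30 sampled records in the month
```

For example, 30 sampled records in the month correspond to an estimated 30,000 requests, or 1,000 per day.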

Errata: WMF traffic logging service suffered from server capacity problems in Aug/Sep/Oct 2011.
Absolute traffic counts for October 2011 are approximately 7% too low.
Data loss only occurred during peak hours. It therefore may have had a somewhat different impact on traffic from different parts of the world,
and may also have skewed relative figures like share of traffic per browser or operating system.

From mid September till late November squid log records for mobile traffic were in invalid format.
Data could be repaired for logs from mid October onwards. Older logs were no longer available.

Generated on Wed, Aug 5, 2015 9:11
Author: Erik Zachte
Mail: ezachte@### (no spam: ### = wikimedia.org)
All data and images on this page are in the public domain.

Note: this page may load more slowly in Microsoft Internet Explorer than in other major browsers.