National Centers for Environmental Prediction

About the Server Usage Statistics

The following information was taken from the documentation for The Webalizer. It goes into the details concerning how the statistics are created and what they mean.


What is The Webalizer?

The Webalizer is a web server log file analysis program which produces usage statistics in HTML format for viewing with a browser. The results are presented in both columnar and graphical format, which facilitates interpretation. Yearly, monthly, daily and hourly usage statistics are presented, along with the ability to display usage by site, URL, referrer, user agent (browser), search string, entry/exit page, username and country (some information is only available if supported and present in the log files being processed). Processed data may also be exported into most database and spreadsheet programs that support tab delimited data formats.

The Webalizer supports CLF (common log format) log files, as well as Combined log formats as defined by NCSA and others, and variations of these which it attempts to handle intelligently. In addition, wu-ftpd xferlog formatted logs and squid proxy logs are supported.

Gzip compressed logs may now be used as input directly. Any log filename that ends with a '.gz' extension will be assumed to be in gzip format and uncompressed on the fly as it is being read. In addition, the Webalizer also supports DNS lookup capabilities if enabled at compile time. See the file DNS.README for additional information.
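The on-the-fly decompression described above can be illustrated with a minimal Python sketch. This is not The Webalizer's own code; the function name `open_log` and the choice of the latin-1 encoding are assumptions made for the example.

```python
import gzip

def open_log(path):
    """Open a log file as text (hypothetical helper, not part of The Webalizer).

    Mirrors the rule in the text: a filename ending in '.gz' is assumed to be
    gzip compressed and is decompressed while it is being read.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")
```

Either kind of file can then be read line by line with the same loop, for example `for line in open_log("access_log.gz"): ...`.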

This documentation applies to The Webalizer Version 2.01

Output Produced

The Webalizer produces several reports (HTML) and graphics for each month processed. In addition, a summary page is generated for the current and previous months (up to 12), a history file is created and, if incremental mode is used, the current month's processed data is saved. The exact location and names of these files can be changed using configuration files and command line options. The files produced (default names) are:

index.html              - Main summary page (extension may be changed)
usage.png               - Yearly graph displayed on the main index page
usage_YYYYMM.html       - Monthly summary page (extension may be changed)
usage_YYYYMM.png        - Monthly usage graph for specified month/year
daily_usage_YYYYMM.png  - Daily usage graph for specified month/year
hourly_usage_YYYYMM.png - Hourly usage graph for specified month/year
site_YYYYMM.html        - All sites listing (if enabled)
url_YYYYMM.html         - All urls listing (if enabled)
ref_YYYYMM.html         - All referrers listing (if enabled)
agent_YYYYMM.html       - All user agents listing (if enabled)
search_YYYYMM.html      - All search strings listing (if enabled)
webalizer.hist          - Previous month history (may be changed)
webalizer.current       - Incremental Data (may be changed)
site_YYYYMM.tab         - tab delimited sites file
url_YYYYMM.tab          - tab delimited urls file
ref_YYYYMM.tab          - tab delimited referrers file
agent_YYYYMM.tab        - tab delimited user agents file
user_YYYYMM.tab         - tab delimited usernames file
search_YYYYMM.tab       - tab delimited search string file


The yearly (index) report shows statistics for a 12 month period, and links to each month. The monthly report has detailed statistics for that month with additional links to any URLs and referrers found. The various totals shown are explained below.

Hits

Any request made to the server which is logged is considered a 'hit'. The requests can be for anything: HTML pages, graphic images, audio files, CGI scripts, etc. Each valid line in the server log is counted as a hit. This number represents the total number of requests that were made to the server during the specified report period.

Files

Some requests made to the server require that the server then send something back to the requesting client, such as an HTML page or graphic image. When this happens, it is considered a 'file' and the files total is incremented. The relationship between 'hits' and 'files' can be thought of as 'incoming requests' and 'outgoing responses'.
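As a rough illustration of the difference between hits and files, the following Python sketch counts both from Common Log Format lines. It is not The Webalizer's implementation: the regular expression, the function name and the sample lines are assumptions, and a 'file' is taken here to be a request with a 200 response code.

```python
import re

# Common Log Format: host ident user [time] "request" status bytes
CLF_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def count_hits_and_files(lines):
    """Return (hits, files) for an iterable of log lines (hypothetical helper)."""
    hits = files = 0
    for line in lines:
        match = CLF_LINE.match(line)
        if match is None:
            continue  # only valid log lines count as hits
        hits += 1
        if match.group("status") == "200":
            files += 1  # assumption: a 200 response means something was sent back
    return hits, files

sample = [
    '192.0.2.1 - - [10/May/2005:13:00:00 +0000] "GET /index.html HTTP/1.0" 200 2048',
    '192.0.2.1 - - [10/May/2005:13:00:01 +0000] "GET /logo.png HTTP/1.0" 304 -',
    'not a log line',
]
print(count_hits_and_files(sample))  # (2, 1)
```

The second sample request is a hit but not a file, because the server answered 304 and sent no content back.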

Pages

Pages are, well, pages! Generally, any HTML document, or anything that generates an HTML document, would be considered a page. This does not include the other stuff that goes into a document, such as graphic images, audio clips, etc... This number represents the number of 'pages' requested only, and does not include the other 'stuff' that is in the page. What actually constitutes a 'page' can vary from server to server. The default action is to treat anything with the extension '.htm', '.html' or '.cgi' as a page. A lot of sites will probably define other extensions, such as '.phtml', '.php3' and '.pl' as pages as well. Some people consider this number as the number of 'pure' hits... I'm not sure if I totally agree with that viewpoint. Some other programs (and people :) refer to this as 'Pageviews'.
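The default page test described above can be sketched in Python as a simple extension check. The names `DEFAULT_PAGE_EXTENSIONS` and `is_page` are made up for this example, and matching on the end of the URL path is an assumption about how such a check might be written, not a description of The Webalizer's internals.

```python
from urllib.parse import urlsplit

# Default page extensions named in the text above.
DEFAULT_PAGE_EXTENSIONS = (".htm", ".html", ".cgi")

def is_page(url, page_extensions=DEFAULT_PAGE_EXTENSIONS):
    """Return True if the URL path ends with one of the page extensions."""
    path = urlsplit(url).path  # drops any query string before testing
    return path.endswith(tuple(page_extensions))

print(is_page("/docs/index.html"))         # True
print(is_page("/cgi-bin/search.cgi?q=x"))  # True
print(is_page("/images/logo.png"))         # False
# A site that also treats '.php3' as a page would extend the list:
print(is_page("/news.php3", DEFAULT_PAGE_EXTENSIONS + (".php3",)))  # True
```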

Sites

Each request made to the server comes from a unique 'site', which can be referenced by a name or ultimately, an IP address. The 'sites' number shows how many unique IP addresses made requests to the server during the reporting time period. This DOES NOT mean the number of unique individual users (real people) that visited, which is impossible to determine using just logs and the HTTP protocol (however, this number might be about as close as you will get).

Visits

Whenever a request is made to the server from a given IP address (site), the amount of time since the previous request from that address (if any) is calculated. If the time difference is greater than a pre-configured 'visit timeout' value, or if the address has never made a request before, the request is considered a 'new visit', and this total is incremented (both for the site, and the IP address). The default timeout value is 30 minutes (can be changed), so if a user visits your site at 1:00 in the afternoon, and then returns at 3:00, two visits would be registered.

Note: In the 'Top Sites' table, the visits total should be discounted on 'Grouped' records, and thought of as the "Minimum number of visits" that came from that grouping instead.

Note: Visits only occur on PageType requests, that is, for any request whose URL is one of the 'page' types defined with the PageType option.

Due to the limitations of the HTTP protocol, log rotations and other factors, this number should not be taken as absolutely accurate; rather, it should be considered a pretty close "guess".
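The visit rule can be sketched in Python as follows. This is an illustration of the timeout logic only, not The Webalizer's code; the function name, the input format (time-ordered pairs of IP address and timestamp, already limited to page requests) and the sample addresses are assumptions.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # default timeout stated in the text

def count_visits(requests, timeout=VISIT_TIMEOUT):
    """Count visits per IP address.

    `requests` is an iterable of (ip, timestamp) pairs for page requests,
    sorted by time (an assumption of this sketch).
    """
    last_seen = {}
    visits = {}
    for ip, when in requests:
        previous = last_seen.get(ip)
        # New visit: first request from this address, or timeout exceeded.
        if previous is None or when - previous > timeout:
            visits[ip] = visits.get(ip, 0) + 1
        last_seen[ip] = when
    return visits

day = datetime(2005, 5, 10)
requests = [
    ("192.0.2.1", day.replace(hour=13, minute=0)),   # 1:00 pm, first visit
    ("192.0.2.1", day.replace(hour=13, minute=10)),  # same visit
    ("192.0.2.1", day.replace(hour=15, minute=0)),   # 3:00 pm, second visit
    ("192.0.2.7", day.replace(hour=15, minute=5)),
]
print(count_visits(requests))  # {'192.0.2.1': 2, '192.0.2.7': 1}
```

The first address reproduces the example from the text: a visit at 1:00 and a return at 3:00 register as two visits.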

KBytes

The KBytes (kilobytes) value shows the amount of data, in KB, that was sent out by the server during the specified reporting period. This value is generated directly from the log file, so it is up to the web server to produce accurate numbers in the logs (some web servers do stupid things when it comes to reporting the number of bytes). In general, this should be a fairly accurate representation of the amount of outgoing traffic the server had, regardless of the web server's reporting quirks.

Note: A kilobyte is 1024 bytes, not 1000 :)
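A minimal Python sketch of the arithmetic, assuming the byte counts have already been extracted from the log as strings and that a request with no body is logged as "-" (the Common Log Format convention). The function name is made up for the example.

```python
def total_kbytes(byte_fields):
    """Sum the logged byte counts and convert to kilobytes (1 KB = 1024 bytes)."""
    total_bytes = sum(int(field) for field in byte_fields if field != "-")
    return total_bytes / 1024

print(total_kbytes(["2048", "-", "512"]))  # 2.5
```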

Top Entry and Exit Pages

The Top Entry and Exit tables give a rough estimate of what URLs are used to enter your site, and what the last pages viewed are. Because of limitations in the HTTP protocol, log rotations, etc., these numbers should be considered a good "rough guess" of the actual numbers; however, they will give a good indication of the overall trend in where users come into, and exit, your site.

Notes on FTP Log Files

The Webalizer now supports ftp logs produced by wu-ftpd and others, as a standard 'xferlog'. To process an ftp log, you must either use the -Ff command line option or have "LogType ftp" in your configuration file. Support for additional formats may be forthcoming; however, a future version of the Webalizer is in the works that will allow user defined log formats, so this will become a non-issue. It is recommended that you create a separate configuration file for ftp analysis, since the values used for your web server will most likely not be suited for ftp log analysis (i.e. page types, hostname, etc. should be different).

Because of the difference in web and ftp logs, there are a few limitations:

  • Because there is no concept of a 'response code' in the ftp world, response codes are restricted to either 200 (OK) or 206 (Partial Content), based on the completion status found in xferlog (for wu-ftpd, 'i'=incomplete and will generate a 206, 'c'=complete and will generate a 200). If your ftp server doesn't supply the completion status, all requests will be assigned a response code of 200. This allows the usage graph to display all transfer requests (hits), and how many of those completed successfully (files, i.e. 200 response codes).

  • Page totals won't accurately reflect reality, since there isn't really the concept of a 'page' in regards to ftp services. I have found that setting the PageType value to "README", "FIRST", etc... seems to work fairly well however, and will give a pretty good indication of how many 'non-binary' files were requested. Of course, the content of your ftp site will be different, so your results may vary.

  • Visit totals also won't accurately reflect reality, since visits are triggered on PageType requests (see above). What you usually wind up with is visits=sites in most cases.

  • Entry/Exit pages will not be calculated for ftp logs.

  • For obvious reasons, referrers and user agents are not supported.

  • You _cannot_ analyze both web and ftp logs at the same time; they must be done separately in different runs.
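The response code rule from the first limitation above can be written as a small Python sketch. The function name is hypothetical; the mapping itself follows the text ('i' gives 206, 'c' or a missing status gives 200).

```python
def ftp_response_code(completion_status=None):
    """Map an xferlog completion status to the response code used for reporting.

    'i' (incomplete) -> 206, 'c' (complete) -> 200.
    A missing status is treated as 200, as described in the text.
    """
    if completion_status == "i":
        return 206
    return 200

print(ftp_response_code("c"))  # 200
print(ftp_response_code("i"))  # 206
print(ftp_response_code())     # 200
```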

Notes on Referrers

Referrers are weird critters... They take many shapes and forms, which makes them much harder to analyze than a typical URL, which at least has some standardization. What is contained in the referrer field of your log files varies depending on many factors, such as what site did the referral, what type of system it comes from and how the actual referral was generated. Why is this? Well, because a user can get to your site in many ways... They may have your site bookmarked in their browser, they may simply type your site's URL into their browser, they could have clicked on a link on some remote web page or they may have found your site from one of the many search engines and site indexes found on the web. The Webalizer attempts to deal with all this variation in an intelligent way by doing certain things to the referrer string which make it easier to analyze. Of course, if your web server doesn't provide referrer information, you probably don't really care and are asking yourself why you are reading this section...

Most referrers will take the form of "http://somesite.com/somepage.html", which is what you will get if the user clicks on a link somewhere on the web in order to get to your site. Some will be a variation of this, and look something like "file:/some/such/sillyname", which is a reference from an HTML document on the user's local machine. Several variations of this can be used, depending on what type of system the user has, if he/she is on a local network, the type of network, etc... To complicate things even more, dynamic HTML documents and HTML documents that are generated by CGI scripts or external programs produce lots of extra information which is tacked on to the end of the referrer string in an almost infinite number of ways. If the user just typed your URL into their browser or clicked on a bookmark, there won't be any information in the referrer field, and it will take the form "-".

In order to handle all these variations, The Webalizer parses the referrer field in a certain way. First, if the referrer string begins with "http", it assumes it is a normal referral and converts the "http://" and following hostname to lowercase in order to simplify hiding if desired. For example, the referrer "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will become "http://www.myhost.com/This/Is/A/HTML/Document.html". Notice that only the "http://" and hostname are converted to lower case... The rest of the referrer field is left alone. This follows standard convention, as the actual method (HTTP) and hostname are always case insensitive, while the document name portion is case sensitive.
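The lowercasing step can be illustrated with a short Python sketch. It is written for this page and is not taken from The Webalizer's source; the function name is an assumption.

```python
def lowercase_scheme_and_host(referrer):
    """Lowercase only the 'http://' prefix and hostname of a referrer."""
    if not referrer.lower().startswith("http"):
        return referrer  # not a normal referral, leave untouched
    scheme_end = referrer.find("://")
    if scheme_end == -1:
        return referrer
    # The hostname ends at the first '/' after '://', or at the end of the string.
    host_end = referrer.find("/", scheme_end + 3)
    if host_end == -1:
        host_end = len(referrer)
    return referrer[:host_end].lower() + referrer[host_end:]

print(lowercase_scheme_and_host("HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html"))
# http://www.myhost.com/This/Is/A/HTML/Document.html
```

The document portion after the hostname keeps its original case, matching the example in the text.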

Referrers that come from search engines, dynamic HTML documents, CGI scripts and other external programs usually have additional information tacked on that was used to create the page. A common example of this can be found in referrals that come from search engines and site indexes common on the web. Sometimes, these referrer URLs can be several hundred characters long and include all the information that the user typed in to search for your site. The Webalizer deals with this type of referrer by stripping off all the query information, which starts with a question mark '?'. The referrer "http://search.yahoo.com/search?p=usa%26global%26link" will be converted to just "http://search.yahoo.com/search".
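In Python, the query stripping could be sketched like this (the function name is an assumption for the example):

```python
def strip_query(referrer):
    """Drop everything from the first '?' onward."""
    return referrer.split("?", 1)[0]

print(strip_query("http://search.yahoo.com/search?p=usa%26global%26link"))
# http://search.yahoo.com/search
```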

When a user comes to your site by using one of their bookmarks or by typing in your URL directly into their browser, the referrer field is blank, and looks like "-". Most sites will get more of these referrals than any other type. The Webalizer converts this type of referral into the string "- (Direct Request)". This is done in order to make it easier to hide via a command line option or configuration file option. This is because the character "-" is a valid character elsewhere in a referrer field, and if not turned into something unique, could not be hidden without possibly hiding other referrers that shouldn't be.
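The reason for the unique label can be seen in a small Python sketch. It assumes, purely for illustration, that hiding works by simple substring matching; the function names and the sample referrer are made up.

```python
DIRECT_REQUEST = "- (Direct Request)"

def label_direct_request(referrer):
    """Replace a blank referrer ("-") with the unique label from the text."""
    return DIRECT_REQUEST if referrer == "-" else referrer

def visible(referrers, hidden_substring):
    """Keep referrers that do not contain the hidden string (assumed matching rule)."""
    return [r for r in referrers if hidden_substring not in r]

raw = ["-", "http://my-site.example/page.html"]
labelled = [label_direct_request(r) for r in raw]

# Hiding the bare "-" also hides a legitimate referrer that contains a hyphen.
print(visible(raw, "-"))                  # []
# Hiding the unique label removes only the direct requests.
print(visible(labelled, DIRECT_REQUEST))  # ['http://my-site.example/page.html']
```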



Notes on Visits/Entry/Exit Figures

The majority of data analyzed and reported on by The Webalizer is as accurate and correct as possible based on the input log file. However, due to the limitations of the HTTP protocol, the use of firewalls, proxy servers, multi-user systems, the rotation of your log files, and a myriad of other conditions, some of these numbers cannot be calculated with absolute accuracy. In particular, Visits, Entry Pages and Exit Pages are subject to random errors due to the above and other conditions. The reason for this is twofold: 1) log files are finite in size and time interval, and 2) there is no way to tell individual users apart given only an IP address.

Because log files are finite, they have a beginning and ending, which can be represented as a fixed time period. There is no way of knowing what happened previous to this time period, nor is it possible to predict future events based on it. Also, because it is impossible to tell individual users apart, multiple users that have the same IP address all appear to be a single user, and are treated as such. This is most common where corporate users sit behind a proxy/firewall to the outside world, and all requests appear to come from the same location (the address of the proxy/firewall itself). Dynamic IP assignment (used with dial-up internet accounts) also presents a problem, since the same user will appear to come from multiple places.

For example, suppose two users visit your server from XYZ company, which has its network connected to the Internet by a proxy server 'fw.xyz.com'. All requests from the network look as though they originated from 'fw.xyz.com', even though they were really initiated by two separate users on different PCs. The Webalizer would see these requests as coming from the same location, and would record only 1 visit, when in reality, there were two. Because entry and exit pages are calculated in conjunction with visits, this situation would also only record 1 entry and 1 exit page, when in reality, there should be 2.

As another example, say a single user at XYZ company is surfing around your website. They arrive at 11:52pm on the last day of the month, and continue surfing until 12:30am, which is now a new day (in a new month). Since a common practice is to rotate (save then clear) the server logs at the end of the month, you now have the user's visit logged in two different files (current and previous months). Because of this (and the fact that the Webalizer clears history between months), the first page the user requests after midnight will be counted as an entry page. This is unavoidable, since it is the first request seen from that particular IP address in the new month.

For the most part, the numbers shown for visits, entry and exit pages are pretty good 'guesses', even though they may not be 100% accurate. They do provide a good indication of overall trends, and shouldn't be far enough off from the real numbers to matter much. You should probably consider them as the 'minimum' amount possible, since the actual (real) values should always be equal or greater in all cases.


NOAA/ National Weather Service
National Centers for Environmental Prediction
5830 University Research Court
College Park, MD 20740
NCEP Internet Services Team
Page last modified: Wednesday, 11-May-2005 12:56:36 UTC