[Editor’s note: This post is part 2 of a series of posts discussing Log File Management. For more on this topic, be sure to read Tyler’s other posts.]
Log file formatting, naming conventions and compression are all aspects of proper data management. Of the three sub-topics covered in this entry, log file formatting is clearly the most important. Names can always be changed afterwards, and compression is done purely to save disk space, so both matter less.
While there are many different types of Web servers, this entry focuses on Windows IIS and Apache log file formatting. In both cases, the goal is to ensure that the proper data fields are captured consistently in the log files. The main difference between the two server types is that Windows IIS logging is configured and managed through a visual interface, while Apache logging is managed through text directives in configuration files. This entry assumes that people managing either server type know where to find these settings and does not discuss that in detail; instead, it outlines some best practices with respect to the fields required in either case.
First and foremost, several fields are critical to proper log file analysis, while a few others are optional but useful to have.
Critical Fields: Date, Time, Client IP Address, Method, URI Stem, URI Query, Protocol Status, Time Taken, User Agent and Referrer
Optional Fields: Bytes Sent, Bytes Received, Cookie
To configure a log file on a typical Windows IIS (version 5 or 6 for example), do the following steps:
Should you have more than one site on the same server, or add new sites in the future, this process has to be completed for each site. Note that the default settings in IIS do not include Bytes Sent, Bytes Received, Time Taken, Cookie or Referrer. Of these, Referrer is the most critical omission: without it, you cannot review where traffic is coming from or which keywords visitors used in search engines to find your site (clearly a flaw in the default settings native to Windows, in my opinion).
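One way to confirm which fields a site is actually logging is to read the `#Fields:` header that IIS writes at the top of each W3C extended log file. The following is a minimal Python sketch, not part of the original configuration steps: the mapping from the field names used in this entry to W3C identifiers, the function name `missing_critical_fields` and the sample header are all assumptions made for illustration.

```python
# Illustrative mapping from the critical fields listed in this entry to the
# W3C extended log identifiers that appear in an IIS "#Fields:" header.
CRITICAL_FIELDS = {
    "Date": "date",
    "Time": "time",
    "Client IP Address": "c-ip",
    "Method": "cs-method",
    "URI Stem": "cs-uri-stem",
    "URI Query": "cs-uri-query",
    "Protocol Status": "sc-status",
    "Time Taken": "time-taken",
    "User Agent": "cs(User-Agent)",
    "Referrer": "cs(Referer)",
}


def missing_critical_fields(log_lines):
    """Return the critical field names absent from the first #Fields header."""
    for line in log_lines:
        if line.startswith("#Fields:"):
            logged = set(line[len("#Fields:"):].split())
            return [name for name, field in CRITICAL_FIELDS.items()
                    if field not in logged]
    raise ValueError("no #Fields header found")


# Hypothetical header from a site where Time Taken and Referrer are not logged.
sample_header = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem cs-uri-query sc-status cs(User-Agent)",
]
print(missing_critical_fields(sample_header))
```

For the sample header this reports Time Taken and Referrer as missing. Running the same check against the first few lines of a real log file for each site is a quick way to spot a site that was left on its default settings.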
To configure a log file on a typical Apache server, use the default ‘Combined’ log format. The following code would represent the best practice for the log file configuration which would be acceptable for most log file analyzers:
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
The above entry would translate to the following:
HOST IDENT USERNAME DATE/TIME REQUEST STATUS BYTES REFERRER USERAGENT
In the above example, IDENT and USERNAME are not typically required, so the %l and %u fields can be removed from the format. Should the log format be changed while the server is live, a graceful restart is required for the changes to take effect. A graceful restart allows the server to complete all requests in progress before restarting, so no data is lost.
More information on this topic can be read at the following address: http://httpd.apache.org/docs/1.3/logs.html#accesslog
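To illustrate how the Combined format maps onto the fields listed above, here is a rough Python sketch that splits one log line into named parts. The regular expression, the function name `parse_combined` and the sample line are assumptions for illustration only; the pattern does not handle escaped quotes inside the request, referrer or user agent values.

```python
import re

# One named group per field of the Combined format:
# HOST IDENT USERNAME DATE/TIME REQUEST STATUS BYTES REFERRER USERAGENT
COMBINED_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of field values, or None if the line does not match."""
    match = COMBINED_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log line using documentation-only addresses and hosts.
sample = ('192.0.2.10 - - [01/Jan/2009:12:00:00 -0500] '
          '"GET /index.html?id=1 HTTP/1.1" 200 5120 '
          '"http://www.example.com/search?q=log+files" "Mozilla/5.0"')

fields = parse_combined(sample)
print(fields["host"], fields["status"], fields["referrer"])
```

If %l and %u are removed from the format as described above, the `ident` and `user` groups would have to be dropped from the pattern as well.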
Naming conventions are fairly straightforward. All in all, it's nice to have a simple format that lets you manage your logs in chronological order. To accomplish this, the log name can combine a general name with the year, month and day of the log.
IIS servers will typically show names as follows:
EXYYYYMMDD.log – can be sorted chronologically by date when viewed in Windows Explorer.
Apache servers will typically show names as follows:
access_log-YYYY-MM-DD-WXX.log
The ‘WXX’ part of this log name refers to the week of the calendar year; for example, ‘access_log-2009-01-01-W01.log’ would be the log file for January 1st, 2009, which falls in the first week of the calendar year. While the ‘WXX’ data is neither required nor overly useful, it is also not harmful in any way.
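As a small illustration of this naming scheme, the Python sketch below builds the file name for a given day. The function name `log_file_name` is hypothetical, and the use of ISO week numbering is an assumption, since the scheme above does not say which week-numbering rule is in use.

```python
from datetime import date


def log_file_name(day, prefix="access_log"):
    """Build a name such as access_log-YYYY-MM-DD-WXX.log for the given day."""
    # Assumption: WXX is the ISO 8601 week number of that day.
    iso_week = day.isocalendar()[1]
    return f"{prefix}-{day:%Y-%m-%d}-W{iso_week:02d}.log"


print(log_file_name(date(2009, 1, 1)))
```

Because the year, month and day come first and are zero-padded, a plain alphabetical sort of these names is also a chronological sort. Note that under ISO numbering the first days of January can belong to the last week of the previous year, which is one more reason the week suffix is best treated as informational only.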
Like naming conventions, compression isn’t overly complex either. Whether you use a Zip compression tool or GZip, the principle is the same: make the files smaller so that they use less hard disk space. Certain log file analyzer programs are optimized for a specific type of compression, so we’d recommend you review the documentation of your log analyzer to see what it recommends. In our case, we use GZip and compress log files on a daily basis through a simple batch file script. The log file from the example above would then be called ‘access_log-2009-01-01-W01.gz’, and compression would typically reduce the file to between 10% and 50% of its original raw size.
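For readers who would rather script this step than use a batch file, here is a minimal sketch using Python's standard `gzip` module. It is not the batch script mentioned above; the function name `compress_log` is hypothetical, and it appends `.gz` to the full file name (giving `.log.gz`), which differs slightly from the name quoted above.

```python
import gzip
import shutil
from pathlib import Path


def compress_log(path):
    """Write a gzip copy of the log next to the original and return its path."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    # Stream the data in chunks so large log files are not read into memory.
    with source.open("rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return target


# Example call (assumes the file exists in the current directory):
# compress_log("access_log-2009-01-01-W01.log")
```

The sketch leaves the original file in place, so a daily job could delete it only after confirming that the compressed copy was written.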
In short, it is important to review the data being captured in the log files for your site. You may be surprised to find that certain critical information, such as referrer (which also holds the keyword values), is not present. While naming convention and compression are also factors to keep in mind, they are not as critical. Names of the log files can always be changed in the future and compression can be done or undone at any time.
Don’t get caught by surprise! Data not captured on a daily basis is lost immediately so please check your log file configuration (or have your IT people do it for you) if you are unsure what is being logged.