Beginner's Guide to Understanding Raw Server Logs

Managing a website involves handling several technical issues at both the front end and the back end. One of the most important administration tasks for webmasters is monitoring traffic patterns for suspicious or unusual activity. This is generally best done with the help of raw server logs. Many popular hosting environments run the Apache web server, which generates server logs in a standard format. These raw server logs are available in most hosting accounts and let you examine traffic activity at the micro level. But how many of us are able to analyze them correctly? A large number of bloggers and website owners don't even know they exist. Today, we're going to learn how to analyze the most common entries found in these raw server logs. This will not only help you spot unusual traffic patterns but can also help you optimize your website.

The most common formats used for these logs are the Common Log Format and the Combined Log Format. A typical combined log entry looks something like this:

127.0.0.1 - - [22/Aug/2012:13:55:36 -0700] "GET /freshtechtips.com/2012/08/panda-penguin-seo-optimization.html HTTP/1.0" 200 25962 "http://www.google.com.ph/url?sa=t&rct=j&q=panda%20optimization&source=web&cd=9&cad=rja&ved=0CHQQFjAI&url=http%3A%2F%2Fwww.freshtechtips.com%2F2008%2F12%2F10-panda-penguin-seo-optimization.html&ei=KqM1UM-2JOaOmQWbv4GwAQ&usg=AFQjCNFphYnPYWbufh2vk2xZkM4bksjDrQ" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1"

Don't panic, it's very easy to read and interpret this cryptic entry. We're looking at a type of record that is commonly examined to identify traffic patterns and sources. Let's dissect this sample log entry and see what information is hidden in its different parts. For demo purposes, I've used the localhost IP address.
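
If you prefer to let a script do the dissecting, here is a minimal sketch in Python that splits a combined log entry into named fields. The pattern and the names `COMBINED_LOG_PATTERN` and `parse_log_line` are my own choices for illustration, the referrer in the sample is shortened for readability, and the pattern assumes that no field contains an escaped double quote.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status size "referrer" "user-agent"
COMBINED_LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED_LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Sample entry from this article (referrer shortened for readability).
sample = (
    '127.0.0.1 - - [22/Aug/2012:13:55:36 -0700] '
    '"GET /freshtechtips.com/2012/08/panda-penguin-seo-optimization.html HTTP/1.0" '
    '200 25962 "http://www.google.com.ph/url?sa=t&q=panda%20optimization" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1"'
)

entry = parse_log_line(sample)
for name, value in entry.items():
    print(f"{name:>10}: {value}")
```

Every value comes back as a string, so fields such as the status code and size still need converting before you do arithmetic on them.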

IP Address

The first entry '127.0.0.1' denotes the IP address of the remote host (visitor, crawler or bot) that requested an object hosted on your web server. I'm saying object because every file or directory hosted on your web server is considered a single object. Images, scripts and every other kind of file hosted on your web server are objects. Remember, if the user is accessing your website through a proxy server, you won't get their actual IP address. In such cases, the IP address logged will be that of the proxy server through which the request passed.
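
As a rough illustration, Python's standard `ipaddress` module can tell you whether a logged address is a loopback address (like the demo one used here), a private network address, or a public one. The helper name `describe_ip` is my own, and the three labels are just one simple way to group addresses.

```python
import ipaddress


def describe_ip(text):
    """Classify a logged IP address as loopback, private or public."""
    ip = ipaddress.ip_address(text)
    # Check loopback first: loopback addresses also count as private.
    if ip.is_loopback:
        return "loopback"
    if ip.is_private:
        return "private"
    return "public"


for address in ("127.0.0.1", "192.168.1.10", "8.8.8.8"):
    print(address, "->", describe_ip(address))
```

Keep in mind that this says nothing about proxies: a public address in your log may still belong to a proxy server rather than to the visitor.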

Identification of Client Machine

You can see a dash '-' symbol just after the IP address. This field is meant to log the identity of the client as reported by the identd service on the client's machine (RFC 1413). A dash symbol denotes that the corresponding information is not available. In most general purpose web hosting environments, this information is not gathered by the servers. You can safely ignore this field as it's highly unreliable.

User ID of the Requester

In the sample entry shown above, you can see another dash '-' symbol after the field discussed above. This is the user ID of the person requesting an object from your web server, as determined by HTTP authentication. If the requested page is not password-protected, there is no user ID to record, which is why in most cases this field shows a dash. Even when a password-protected document is requested, treat the value with caution: if the status code for the request is 401 (authentication failed), the user has not actually been authenticated and the logged user ID should not be trusted.

Time Stamp

The next field '[22/Aug/2012:13:55:36 -0700]' denotes the server time when the request was received. It contains the day, month, year, hour, minute, second and the time zone offset of the web server from UTC. The time is in 24-hour format.

This is an extremely important field which can be used to work out exactly when a request was made. The offset at the end ('-0700' in the sample) tells you that the server's clock is 7 hours behind UTC, so you can convert the logged time to UTC or to any other time zone with a time zone converter. If you also want to know where your web server is physically located, you can either ask your hosting provider's customer support or use a reliable IP tracer.
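
The conversion can also be done in a few lines of Python. This is a minimal sketch that assumes an English locale (the month abbreviation 'Aug' is matched by `%b`, which is locale-dependent); the names `LOG_TIME_FORMAT` and `parse_log_time` are mine.

```python
from datetime import datetime, timezone

# Day/Month/Year:Hour:Minute:Second followed by the UTC offset.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_log_time(raw):
    """Turn a log timestamp (with or without brackets) into an aware datetime."""
    return datetime.strptime(raw.strip("[]"), LOG_TIME_FORMAT)


stamp = parse_log_time("[22/Aug/2012:13:55:36 -0700]")
utc_stamp = stamp.astimezone(timezone.utc)

print(stamp.isoformat())      # 2012-08-22T13:55:36-07:00
print(utc_stamp.isoformat())  # 2012-08-22T20:55:36+00:00
```

Because the parsed value carries its own offset, you can compare entries from servers in different time zones without any manual arithmetic.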

Request Line from the Client

The next entry displays the request made by the client, "GET /freshtechtips.com/2012/08/panda-penguin-seo-optimization.html HTTP/1.0", as shown in the sample record. GET is the HTTP method; it means the client asked the server to send the object. The next part shows the object that was requested. The last part shows the protocol and version used by the client to request the object.

Let me explain this in layman's terms. This part of the entry says that a browser on the visitor's computer fetched (GET) an HTML page (/freshtechtips.com/2012/08/panda-penguin-seo-optimization.html) using the HTTP protocol, version 1.0 (HTTP/1.0).
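
Here is a small sketch of the same breakdown in Python. The function name `split_request_line` is my own, and it assumes a well-formed request line with exactly three space-separated parts; malformed requests (which bots do send) simply return None.

```python
def split_request_line(request):
    """Split a request line into method, path and protocol."""
    parts = request.split()
    if len(parts) != 3:
        return None  # malformed or truncated request
    method, path, protocol = parts
    return {"method": method, "path": path, "protocol": protocol}


request = "GET /freshtechtips.com/2012/08/panda-penguin-seo-optimization.html HTTP/1.0"
info = split_request_line(request)
print(info["method"])    # GET
print(info["path"])      # /freshtechtips.com/2012/08/panda-penguin-seo-optimization.html
print(info["protocol"])  # HTTP/1.0
```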

Status Code

The next entry in the sample record is the numeric value 200. This is the status code sent back to the client by your web server. These status codes tell you whether the requested object was served normally, whether a redirection took place, or whether some kind of error occurred while serving it.

The most common status code is 200, which shows that the object was successfully served to the client. If you see a status code of 404, you can immediately tell that the client requested an object that does not exist on your web server. A complete list of HTTP status codes can help you interpret the other entries in your log files.
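
If you don't have a status code list at hand, Python's standard library knows the common ones. The sketch below groups a code by its first digit and looks up its standard phrase; the helper name `describe_status` and the wording of the group labels are my own.

```python
from http import HTTPStatus

# Status codes are grouped by their first digit.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return (class label, standard phrase) for a status code."""
    code = int(code)
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code not known to the standard library
    return STATUS_CLASSES.get(code // 100, "unknown"), phrase


print(describe_status("200"))  # ('success', 'OK')
print(describe_status("404"))  # ('client error', 'Not Found')
```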

Object Size

You will find one more numeric entry, 25962, immediately after the status code in the sample record shown above. This is the size, in bytes, of the object returned to the client, not including the response headers. The larger this value, the more data was sent to the client. If you find a zero or a dash symbol in this field, it simply means that no content was sent to the client machine. In such cases, you should also check the status code to find out why the requested resource was not sent.
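
When you process this field in a script, the dash needs special handling because it is not a number. A minimal sketch, with a helper name (`parse_size`) of my own choosing:

```python
def parse_size(field):
    """Convert the size field to an integer, treating '-' as zero bytes."""
    return 0 if field == "-" else int(field)


# Total bytes sent across a few size fields, including one dash.
sizes = ["25962", "-", "512"]
total = sum(parse_size(field) for field in sizes)
print(total)  # 26474
```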

Referrer HTTP Request Header

The next long cryptic entry 'http://www.google.com.ph/url?sa=t&rct=j&q=panda%20optimization&source=web&cd=9&cad=rja&ved=0CHQQFjAI&url=http%3A%2F%2Fwww.freshtechtips.com%2F2008%2F12%2F10-panda-penguin-seo-optimization.html&ei=KqM1UM-2JOaOmQWbv4GwAQ&usg=AFQjCNFphYnPYWbufh2vk2xZkM4bksjDrQ' is the referrer: the URL of the page (referring site) that contains the link to the requested object.

It has several important parts that tell you about the referrer. The host 'http://www.google.com.ph' shows that the visitor used the Google search engine and is most probably in the Philippines (.ph extension). The next important part, 'q=panda%20optimization', shows that the search query was 'panda optimization' (%20 is a URL-encoded space). So, this record is telling you that a visitor, probably from the Philippines, used a search query on Google to reach a specific web page on this blog.
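
You don't have to pick these parts out by eye. Here is a minimal sketch using Python's standard `urllib.parse` module; the function name `summarize_referrer` is mine, the referrer is a shortened version of the sample, and it assumes the search engine puts the query in a `q` parameter, as Google does in this record.

```python
from urllib.parse import parse_qs, urlsplit


def summarize_referrer(referrer):
    """Return the referring host and the decoded search query, if any."""
    parts = urlsplit(referrer)
    query = parse_qs(parts.query)  # also decodes %20 and similar escapes
    return {
        "host": parts.hostname,
        "search_terms": query.get("q", [""])[0],
    }


# Shortened version of the referrer from the sample record.
referrer = "http://www.google.com.ph/url?sa=t&rct=j&q=panda%20optimization&source=web&cd=9"
summary = summarize_referrer(referrer)
print(summary["host"])          # www.google.com.ph
print(summary["search_terms"])  # panda optimization
```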

User-Agent HTTP Request Header

The last entry "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1" in the log record tells you about the browser on the visitor's computer that requested the object. The sample record shows that the request was made from a Windows machine through the Firefox browser, version 14.0.1.

This field greatly helps in identifying problems with fetching objects that are limited to specific browsers. You can combine it with the status code discussed above to identify such problems.
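
One simple way to combine the two fields is to count error responses per user agent. The sketch below assumes each log entry has already been turned into a dict with 'status' and 'user_agent' keys (my own naming); the three entries are made up for illustration and are not real log data.

```python
from collections import Counter


def errors_by_user_agent(entries):
    """Count responses with a 4xx or 5xx status code for each user agent."""
    counts = Counter()
    for entry in entries:
        if int(entry["status"]) >= 400:
            counts[entry["user_agent"]] += 1
    return counts


# Made-up entries for illustration only.
entries = [
    {"status": "200", "user_agent": "Firefox/14.0.1"},
    {"status": "404", "user_agent": "ExampleBrowser/1.0"},
    {"status": "500", "user_agent": "ExampleBrowser/1.0"},
]

for agent, count in errors_by_user_agent(entries).most_common():
    print(agent, count)
```

A user agent that shows up with far more errors than the others is a good candidate for a closer look.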

Now It's Your Turn

Now that you've gone through the interpretation of a sample raw log entry, it's time to fetch a similar file from your hosting account and try analyzing it to identify different kinds of log entries.
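
As a starting point, here is a rough sketch that tallies the status codes found in a set of log lines. The pattern simply looks for the three digits that follow the closing quote of the request, which is an assumption that holds for the Common and Combined Log Formats; the two demo lines are made up for illustration.

```python
import re
from collections import Counter

# The status code is the three-digit number right after the quoted request.
STATUS_PATTERN = re.compile(r'" (\d{3}) ')


def count_status_codes(lines):
    """Tally status codes over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up lines for illustration only.
demo_lines = [
    '127.0.0.1 - - [22/Aug/2012:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 25962 "-" "Mozilla/5.0"',
    '127.0.0.1 - - [22/Aug/2012:13:55:40 -0700] "GET /missing.html HTTP/1.0" 404 512 "-" "Mozilla/5.0"',
]

counts = count_status_codes(demo_lines)
print(counts)
```

Since the function accepts any iterable of lines, you can pass it an open file object for the log you downloaded from your hosting account.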

Remember, understanding these raw server logs is critical for troubleshooting in emergencies when your website goes down. In such situations, the raw server log is the best place to find out exactly what brought down your website. Happy analyzing!