Unix commands for logfile parsing
Although many web analysts use a JavaScript-tagged solution, some of us still do log analysis on one or more sites. Even when JS data is used, sometimes you have a troubleshooting situation that requires you to go back to your logs. If you have access to a Unix environment, commands like grep, cut, and awk are invaluable for prowling through large files. You can also download these commands to use in a PC/DOS environment, although I’ve found the DOS version to be a little more awkward to use.
Here is an introduction to some of my favorite commands:
grep – used to find lines that contain a certain string or regular expression (note that regular expressions are not fully supported in the default grep command for some Unix systems)
cut – used to pull out specific columns from a file based on a specified delimiter; most logs are space-delimited
awk – a programming language that can parse through text files; short pieces of code can be used on the command line
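sort – sort the output; the -n modifier sorts numerically; the -t modifier sets the field delimiter so that, together with a key option, you can sort on something other than the beginning of the line
uniq – eliminate duplicate lines (only adjacent duplicates are collapsed, so sort the input first); the -c modifier shows a count of how many times each line appears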
In order to make “cut” work, you need to know which fields contain your data of interest. If you use the “combined” log format, the following table lists the fields where data is located. Cutting out cookie data can be a bit more difficult: we’re using cut with a space delimiter, but spaces can be contained in the user agent field so pulling out cookie values takes a little more work.
| Field # | Information |
|---|---|
| 1 | IP address |
| 3 | Auth (userid) field; note it’s not always populated |
| 4 | Timestamp |
| 7 | Request URL |
| 9 | Status code |
| 11 | Referrer |
| 12- | User agent (the dash means go through the end of the line; UA can contain spaces and thus spans several columns) |
| (varies) | Cookies |
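The field numbers can be checked against a single log line. The sketch below uses a made-up line in combined format (every value in it is invented) and pulls out a few fields with cut. It also shows one possible way around the spaces-in-the-user-agent problem: splitting on double quotes with awk instead of on spaces. If your server logs cookies as an extra quoted field at the end of the line, that field would be $8 under this split, but that depends on your log format and is an assumption here.

```shell
# A made-up log line in "combined" format (all values are invented).
line='1.2.3.4 - angie [12/Jan/2009:08:16:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "http://www.example.com/start.html" "Mozilla/5.0 (X11; Linux) Firefox/3.0"'

# Space-delimited fields, numbered as in the table.
ip=$(printf '%s\n' "$line" | cut -d' ' -f1)
url=$(printf '%s\n' "$line" | cut -d' ' -f7)
status=$(printf '%s\n' "$line" | cut -d' ' -f9)
agent=$(printf '%s\n' "$line" | cut -d' ' -f12-)   # keeps the surrounding quotes

# Alternative: split on double quotes so quoted fields stay whole.
# With this delimiter $2 is the request, $4 the referrer, $6 the user agent.
agent_q=$(printf '%s\n' "$line" | awk -F'"' '{ print $6 }')

echo "$ip $url $status"
echo "$agent"
echo "$agent_q"
```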
In Unix environments, you can view your results page by page on the screen, or save them to a file. To page through the results on screen, pipe the command through “more” as shown:
command | more
To save the results to a file, redirect your output to a file of your choice using the greater than symbol:
command > outputfile
To open a logfile, use gunzip -c if the filename ends in .gz, which indicates it is compressed (the -c writes the uncompressed data to the screen instead of uncompressing and saving your file). Use the “cat” command if the logfile is not compressed. To take a peek at one of your logfiles, you would do one of the following:
gunzip -c file.gz | more
cat file | more
The remainder of our examples assume we are examining a compressed logfile.
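If you switch between compressed and uncompressed logs a lot, a small wrapper can pick the right command from the file name. This is only a sketch: logcat is a hypothetical name, not a standard command, and the two-line sample file is invented.

```shell
# logcat is a hypothetical helper: print a logfile to stdout whether or not
# it is gzip-compressed, based only on the file name.
logcat() {
  case "$1" in
    *.gz) gunzip -c "$1" ;;   # decompress to stdout, leave the file untouched
    *)    cat "$1" ;;
  esac
}

# Build a tiny sample file in both forms to try it out.
tmpdir=$(mktemp -d)
printf 'line one\nline two\n' > "$tmpdir/access.log"
gzip -c "$tmpdir/access.log" > "$tmpdir/access.log.gz"

plain=$(logcat "$tmpdir/access.log")
packed=$(logcat "$tmpdir/access.log.gz")
echo "$packed"

rm -r "$tmpdir"
```

Either form can then be piped into the commands in the rest of the examples, for instance logcat file.gz | more.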
To pull out all records from one IP address (1.2.3.4, for example):
gunzip -c file.gz | grep "1.2.3.4" > outputfile
To pull out all records from any IP address that begins with 12:
gunzip -c file.gz | grep "^12\." > outputfile
Notes:
- A caret (^) is how you specify the beginning of a line in a regular expression
- The backslash (\) tells the regular expression you are looking for an actual period instead of a wildcard
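The difference the backslash makes is easy to see on a handful of made-up addresses (the four IPs below are invented and stand in for the start of real log lines):

```shell
# Invented IP addresses, one per line.
sample='12.0.0.1
120.5.5.5
12.34.56.78
212.1.1.1'

# Unescaped dot: "." matches any character, so 120.5.5.5 slips through.
loose=$(printf '%s\n' "$sample" | grep '^12.')

# Escaped dot: only addresses whose first octet is exactly 12.
strict=$(printf '%s\n' "$sample" | grep '^12\.')

echo "without backslash:"; echo "$loose"
echo "with backslash:";    echo "$strict"
```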
To look at the requests made by one IP address (1.2.3.4, for example):
gunzip -c file.gz | grep "^1\.2\.3\.4 " | cut -d' ' -f7 > outputfile
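The caret and the trailing space in that pattern matter more than they might look. A short sketch on invented log lines shows what a bare pattern picks up by accident (the addresses and pages are made up):

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [12/Jan/2009:08:16:00 -0500] "GET /a.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
1.2.3.45 - - [12/Jan/2009:08:16:05 -0500] "GET /b.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
11.2.3.4 - - [12/Jan/2009:08:16:09 -0500] "GET /c.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
1.2.3.4 - - [12/Jan/2009:08:17:00 -0500] "GET /d.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF

# Bare pattern: also matches 1.2.3.45 and 11.2.3.4.
loose=$(grep '1.2.3.4' "$log" | cut -d' ' -f7)

# Anchored at the start of the line and closed off with a space.
exact=$(grep '^1\.2\.3\.4 ' "$log" | cut -d' ' -f7)

echo "bare pattern:";     echo "$loose"
echo "anchored pattern:"; echo "$exact"
rm -f "$log"
```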
Pull out “page” requests only (status code = 200, and not an image, css, or javascript file):
gunzip -c file.gz | grep " 200 " | grep -v "\.jpg " | grep -v "\.gif " | grep -v "\.png " | grep -v "\.css " | grep -v "\.ico " | grep -v "\.js " > outputfile
Notes:
- You can exclude any other file extensions you wish by piping another grep -v into your command; ending the grep string with a space ensures you will only eliminate lines where those extensions end the request, and not lines where they are embedded in a query string value.
- If you do a lot of logfile parsing, you may wish to put all the grep –v commands into a script so you don’t have to type all the commands every time you want to limit your output to pages.
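One way to follow the script suggestion is a small shell function that folds the grep -v chain into a single extended regular expression. Treat this as a sketch rather than a finished tool: the name pages_only and the sample log lines are made up, and the extension list is simply the one used in the command above.

```shell
# pages_only is a hypothetical helper. It reads log lines on stdin and keeps
# status-200 requests that do not end in one of the listed extensions.
pages_only() {
  grep ' 200 ' | grep -Ev '\.(jpg|gif|png|css|ico|js) '
}

log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [12/Jan/2009:08:16:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
1.2.3.4 - - [12/Jan/2009:08:16:01 -0500] "GET /logo.gif HTTP/1.1" 200 900 "http://www.example.com/index.html" "Mozilla/5.0"
1.2.3.4 - - [12/Jan/2009:08:16:02 -0500] "GET /site.css HTTP/1.1" 200 300 "http://www.example.com/index.html" "Mozilla/5.0"
5.6.7.8 - - [12/Jan/2009:08:17:00 -0500] "GET /missing.html HTTP/1.1" 404 512 "-" "Mozilla/5.0"
5.6.7.8 - - [12/Jan/2009:08:18:00 -0500] "GET /search?file=a.js&x=1 HTTP/1.1" 200 700 "-" "Mozilla/5.0"
EOF

# The query string containing "a.js" survives because ".js" is not
# followed by a space there.
pages=$(pages_only < "$log" | cut -d' ' -f7)
echo "$pages"
rm -f "$log"
```

With a real log the function would sit in the middle of a pipeline, for example gunzip -c file.gz | pages_only > outputfile.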
Make a list of the most popular referrer fields for the /index.html page:
gunzip -c file.gz | grep "GET /index.html" | cut -d' ' -f11 | sort | uniq -c | sort -nr > outputfile
Notes:
- The output will be a sorted list of lines with a number and a URL; the number is how many times the referrer occurred, and the URL is the referrer
- The uniq command must be executed on sorted input, which is why we sort the output first
- The second sort command lists the output by most to least popular referrer; -n is numeric and -r is reverse order
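To see what that output looks like, here is the same pipeline run against four invented log lines (the referrer URLs are made up). The exact amount of padding uniq -c puts in front of each count varies between systems, so the sketch normalizes it with awk before printing the top entry.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [12/Jan/2009:08:16:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "http://www.example.com/a.html" "Mozilla/5.0"
5.6.7.8 - - [12/Jan/2009:08:17:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "http://www.example.com/b.html" "Mozilla/5.0"
9.9.9.9 - - [12/Jan/2009:08:18:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "http://www.example.com/a.html" "Mozilla/5.0"
9.9.9.9 - - [12/Jan/2009:08:19:00 -0500] "GET /about.html HTTP/1.1" 200 800 "http://www.example.com/b.html" "Mozilla/5.0"
EOF

# Count referrers for /index.html only, most frequent first.
counts=$(grep 'GET /index.html' "$log" | cut -d' ' -f11 | sort | uniq -c | sort -nr)

# First line of the result, with the leading padding squeezed out.
top=$(printf '%s\n' "$counts" | head -n 1 | awk '{ print $1, $2 }')

echo "$counts"
echo "top referrer: $top"
rm -f "$log"
```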
Pull out all the records from userid “angie”, and sort them by timestamp:
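gunzip -c file.gz | grep " angie " | sort -t' ' -k4 > outputfile
Note:
The sort command is modified as follows: -t' ' says the input is space-delimited, while -k4 says to sort starting at the fourth column, which holds the timestamp (the default is to sort from the beginning of the line). Older versions of sort wrote this as +3, meaning skip three columns; many current versions no longer accept that form. Because the timestamp is compared as text, the ordering is only reliable within a single month's logs.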
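A quick check of the timestamp sort on a few invented, out-of-order lines, written with the -k key option (-k4 starts the sort key at the fourth field). The userids and pages are made up, and LC_ALL=C is added only to make the comparison byte-by-byte and predictable across systems.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - angie [12/Jan/2009:10:30:00 -0500] "GET /c.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
5.6.7.8 - bob [12/Jan/2009:08:00:00 -0500] "GET /x.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
1.2.3.4 - angie [12/Jan/2009:08:16:00 -0500] "GET /a.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
1.2.3.4 - angie [12/Jan/2009:09:05:00 -0500] "GET /b.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF

# Keep angie's records, order them by the timestamp field, show the pages.
ordered=$(grep ' angie ' "$log" | LC_ALL=C sort -t' ' -k4 | cut -d' ' -f7)

echo "$ordered"
rm -f "$log"
```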
Find all requests that are more than 1000 characters long:
Very long requests are often a sign that something is wrong: they can indicate a problem with your website’s code or they can be indicative of someone trying to hack into your website (especially if the requests contain any SQL code words).
gunzip -c file.gz | awk 'length > 1000' > outputfile
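Note that a bare length in awk measures the whole log line, so a record with a short request but a very long user agent is also caught. If only the request URL matters, one option is to test field 7 instead (awk splits on whitespace by default, so its field 7 lines up with the cut field number). The sketch below builds three invented lines to show the difference; the 1200-character filler is arbitrary.

```shell
log=$(mktemp)
# 1200 "a" characters to use as filler.
long=$(awk 'BEGIN { for (i = 0; i < 1200; i++) printf "a" }')

{
  printf '1.2.3.4 - - [12/Jan/2009:08:16:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"\n'
  printf '5.6.7.8 - - [12/Jan/2009:08:17:00 -0500] "GET /search?q=%s HTTP/1.1" 200 700 "-" "Mozilla/5.0"\n' "$long"
  printf '9.9.9.9 - - [12/Jan/2009:08:18:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "-" "Agent/%s"\n' "$long"
} > "$log"

# Whole-line test: counts both the long URL and the long user agent.
long_lines=$(awk 'length > 1000 { n++ } END { print n + 0 }' "$log")

# URL-only test: prints the IP of records whose request URL is over 1000 characters.
long_urls=$(awk 'length($7) > 1000 { print $1 }' "$log")

echo "lines over 1000 characters: $long_lines"
echo "IPs with URLs over 1000 characters: $long_urls"
rm -f "$log"
```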
Stay tuned for more posts with additional commands.
January 12th, 2009 at 8:16 am
Hmmm I wish my shell scripts gave me more CARATS, i’d be rich after a couple thousand code lines
Great work Angie, this definitely goes into my bookmarks!
Cheers,
Julien
January 13th, 2009 at 1:20 am
LOL! I guess a girl can hope for a Unix command that returns jewelry at the beginning of each line.
Darn spell checker!