Analytics from the Show Me State

Unix commands for logfile parsing

Although many web analysts use a JavaScript-tagged solution, some of us still do log analysis on one or more sites. Even when JS data is used, sometimes you have a troubleshooting situation that requires you to go back to your logs. If you have access to a Unix environment, commands like grep, cut, and awk are invaluable for prowling through large files. You can also download these commands to use in a PC/DOS environment, although I’ve found the DOS version to be a little more awkward to use.

Here is an introduction to some of my favorite commands:

grep – used to find lines that contain a certain string or regular expression (note that regular expressions are not fully supported in the default grep command for some Unix systems)

cut – used to pull out specific columns from a file based on a specified delimiter; most logs are space-delimited

awk – a programming language that can parse through text files; short pieces of code can be used on the command line

sort – sort the output; the -n modifier sorts numerically; the -t modifier sets the field delimiter so that, together with a field number, you can sort on something other than the beginning of the line

uniq – eliminate duplicate lines; the -c modifier shows a count of how many times each line appears

In order to make “cut” work, you need to know which fields contain your data of interest. If you use the “combined” log format, the following table lists the fields where data is located. Cutting out cookie data can be a bit more difficult: we’re using cut with a space delimiter, but spaces can be contained in the user agent field so pulling out cookie values takes a little more work.

Field #    Information
1          IP address
3          Auth (userid) field; note it’s not always populated
4          Timestamp
7          Request URL
9          Status code
11         Referrer
12-        User agent (the dash means go through the end of the line; UA can contain spaces and thus spans several columns)
(varies)   Cookies
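As a rough illustration of the field numbers above, here is a small sketch that runs cut against a single made-up log line. The line itself, the variable names, and the awk alternative at the end are my own assumptions for the example, not something from a real log. The awk variant splits on double quotes instead of spaces, which keeps the whole user agent in one field; if your server logs cookies in quotes after the user agent, the same trick would likely put them in field 8, but check that against your own log format.

```shell
# Hypothetical sample line in combined log format (all values are made up).
line='1.2.3.4 - angie [10/Oct/2023:13:55:36 -0500] "GET /index.html HTTP/1.1" 200 2326 "http://www.example.com/start.html" "Mozilla/5.0 (X11; Linux x86_64)"'

# Space-delimited fields, numbered as in the table above.
ip=$(printf '%s\n' "$line" | cut -d' ' -f1)
userid=$(printf '%s\n' "$line" | cut -d' ' -f3)
url=$(printf '%s\n' "$line" | cut -d' ' -f7)
status=$(printf '%s\n' "$line" | cut -d' ' -f9)
referrer=$(printf '%s\n' "$line" | cut -d' ' -f11)
agent=$(printf '%s\n' "$line" | cut -d' ' -f12-)

# Alternative for quoted fields: split on double quotes, so the user agent
# is a single field (the sixth) no matter how many spaces it contains.
agent_q=$(printf '%s\n' "$line" | awk -F'"' '{print $6}')

printf '%s\n' "$ip" "$userid" "$url" "$status" "$referrer" "$agent" "$agent_q"
```

Note that cut keeps the surrounding double quotes on the referrer and user agent, while the quote-delimited awk version drops them.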

In Unix environments, you can view your results page by page on the screen, or save them to a file. To page through the results on screen, pipe the command through “more” as shown:

command | more

To save the results to a file, redirect your output to a file of your choice using the greater than symbol:

command > outputfile

To open a logfile, use gunzip -c if your file ends in .gz, which indicates it is compressed (the -c writes the uncompressed data to the screen instead of uncompressing and saving your file). Use the “cat” command if the logfile is not compressed. To take a peek at one of your logfiles, you would do one of the following:

gunzip -c file.gz | more

or

cat file | more

The remainder of our examples assume we are examining a compressed logfile.
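If you work with a mix of compressed and uncompressed logs, a small helper can pick the right command for you. This is only a sketch: the function name open_log and the throwaway demo files are made up for illustration, and it decides purely on the .gz extension.

```shell
# Hypothetical helper: print a logfile to standard output whether or not it is gzipped.
open_log() {
    case "$1" in
        *.gz) gunzip -c "$1" ;;   # decompress to stdout; the file on disk is untouched
        *)    cat "$1" ;;
    esac
}

# Demo with throwaway files (made-up content).
tmpdir=$(mktemp -d)
printf '%s\n' 'line one' 'line two' > "$tmpdir/access.log"
gzip -c "$tmpdir/access.log" > "$tmpdir/access.log.gz"

plain=$(open_log "$tmpdir/access.log")
zipped=$(open_log "$tmpdir/access.log.gz")
[ "$plain" = "$zipped" ] && echo "same content either way"
rm -rf "$tmpdir"
```

With the function defined, open_log file.gz | more and open_log file | more both work.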

To pull out all records from one IP address (1.2.3.4, for example):

gunzip -c file.gz | grep "1.2.3.4" > outputfile

To pull out all records from any IP address that begins with 12:

gunzip -c file.gz | grep "^12\." > outputfile

Notes:

  • A caret (^) is how you specify the beginning of a line in a regular expression
  • The backslash (\) tells the regular expression you are looking for an actual period instead of a wildcard
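To see why the backslash matters, here is a quick sketch using a handful of made-up IP addresses in place of a real logfile. It counts matching lines with grep -c, once with the period left as a wildcard and once with it escaped.

```shell
# Made-up IP addresses, one per line, standing in for the start of log lines.
sample=$(printf '%s\n' '12.34.56.78 -' '120.1.1.1 -' '125.9.9.9 -' '212.0.0.1 -')

# Unescaped dot: matches any character after "12", so 120.x and 125.x slip in.
loose=$(printf '%s\n' "$sample" | grep -c '^12.')

# Escaped dot: only addresses whose first octet is exactly 12.
strict=$(printf '%s\n' "$sample" | grep -c '^12\.')

echo "unescaped: $loose  escaped: $strict"
```

The unescaped pattern counts three lines and the escaped one counts a single line. The 212.0.0.1 address is excluded both times because of the caret.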

To look at the requests made by one IP address (1.2.3.4, for example):

gunzip -c file.gz | grep "^1\.2\.3\.4 " | cut -d' ' -f7 > outputfile

Pull out “page” requests only (status code = 200, and not an image, css, or javascript file):

gunzip -c file.gz | grep " 200 " | grep -v "\.jpg " | grep -v "\.gif " | grep -v "\.png " | grep -v "\.css " | grep -v "\.ico " | grep -v "\.js " > outputfile

Notes:

  • You can exclude any other file extensions you wish by piping another grep -v into your command; ending each grep string with a space ensures you only eliminate lines where the extension ends the requested URL, not lines where it is embedded in a query string value.
  • If you do a lot of logfile parsing, you may wish to put all the grep -v commands into a script so you don’t have to type all the commands every time you want to limit your output to pages.
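One way to package the page filter, sketched as a shell function. The name pages_only, the extension list, and the shortened sample lines are all assumptions for the example. It also collapses the chain of grep -v commands into a single extended regular expression (grep -E), which is a swap of my own rather than the approach shown above.

```shell
# Hypothetical helper: keep only successful page requests.
# The extension list is an assumption; extend it to suit your site.
pages_only() {
    grep ' 200 ' | grep -Ev '\.(jpg|gif|png|css|ico|js) '
}

# Made-up sample lines (shortened to the fields that matter here).
sample=$(printf '%s\n' \
    '1.2.3.4 - - [t z] "GET /index.html HTTP/1.1" 200 100' \
    '1.2.3.4 - - [t z] "GET /logo.png HTTP/1.1" 200 100' \
    '1.2.3.4 - - [t z] "GET /site.css HTTP/1.1" 200 100' \
    '1.2.3.4 - - [t z] "GET /missing.html HTTP/1.1" 404 100' \
    '1.2.3.4 - - [t z] "GET /view?img=a.jpg&x=1 HTTP/1.1" 200 100')

# Filter, then show just the request URL (field 7).
pages=$(printf '%s\n' "$sample" | pages_only | cut -d' ' -f7)
printf '%s\n' "$pages"
```

The last sample line survives because its .jpg sits inside the query string and is not followed by a space. In day-to-day use the call would look like gunzip -c file.gz | pages_only > outputfile.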

Make a list of the most popular referrer fields for the /index.html page:

gunzip -c file.gz | grep "GET /index\.html" | cut -d' ' -f11 | sort | uniq -c | sort -nr > outputfile

Notes:

  • The output will be a sorted list of lines with a number and a URL; the number is how many times the referrer occurred, and the URL is the referrer
  • The uniq command must be executed on sorted input, which is why we sort the output first
  • The second sort command lists the output by most to least popular referrer; -n is numeric and -r is reverse order
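Here is the referrer count run against a few made-up log lines, so you can see the shape of the output without a real logfile. The sample hosts and variable names are invented for the example. The last command leaves out the first sort to show why it is needed: uniq only collapses duplicates that sit next to each other.

```shell
# Made-up sample: three hits on /index.html (two from the same referrer)
# and one hit on another page.
sample=$(printf '%s\n' \
    '1.1.1.1 - - [t z] "GET /index.html HTTP/1.1" 200 10 "http://a.example/" "UA"' \
    '2.2.2.2 - - [t z] "GET /index.html HTTP/1.1" 200 10 "http://b.example/" "UA"' \
    '3.3.3.3 - - [t z] "GET /index.html HTTP/1.1" 200 10 "http://a.example/" "UA"' \
    '4.4.4.4 - - [t z] "GET /other.html HTTP/1.1" 200 10 "http://a.example/" "UA"')

# Full pipeline: count referrers for /index.html, most popular first.
top=$(printf '%s\n' "$sample" | grep 'GET /index\.html' | cut -d' ' -f11 | sort | uniq -c | sort -nr)
printf '%s\n' "$top"

# Without the first sort, the two a.example lines are not adjacent,
# so uniq reports three separate groups instead of two.
unsorted_groups=$(printf '%s\n' "$sample" | grep 'GET /index\.html' | cut -d' ' -f11 | uniq -c | wc -l | tr -d ' ')
echo "groups without sorting first: $unsorted_groups"
```

The amount of padding uniq -c puts in front of each count differs between systems, so treat the leading spaces as cosmetic.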

Pull out all the records from userid “angie”, and sort them by timestamp:

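gunzip -c file.gz | grep " angie " | sort -t' ' -k4 > outputfile

Note:

The sort command is modified as follows: -t' ' says the input is space-delimited, while -k4 says to sort starting at the fourth column, the timestamp (the default is to sort from the beginning of the line). Older versions of sort wrote this as +3, meaning “skip three columns”, but current versions may reject that form. Because this is a text sort, the result is in true time order only when the records all fall within the same month.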
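A small sketch of the userid sort on made-up data. It assumes a sort that accepts the POSIX -k option, where -k4 starts the sort key at the fourth field (the same starting point as the older +3 form). The userids, timestamps, and URLs are invented, and all fall on a single day so that a plain text sort matches time order.

```shell
# Made-up sample, deliberately out of time order, with one record from another user.
sample=$(printf '%s\n' \
    '1.2.3.4 - angie [10/Oct/2023:13:55:36 -0500] "GET /c.html HTTP/1.1" 200 10' \
    '1.2.3.4 - bob [10/Oct/2023:09:00:00 -0500] "GET /x.html HTTP/1.1" 200 10' \
    '1.2.3.4 - angie [10/Oct/2023:08:15:02 -0500] "GET /a.html HTTP/1.1" 200 10' \
    '1.2.3.4 - angie [10/Oct/2023:11:30:47 -0500] "GET /b.html HTTP/1.1" 200 10')

# Keep angie's records, sort from the timestamp field onward,
# then show only the request URL to make the order easy to read.
ordered=$(printf '%s\n' "$sample" | grep ' angie ' | sort -t' ' -k4 | cut -d' ' -f7)
printf '%s\n' "$ordered"
```

The URLs come out as /a.html, /b.html, /c.html, which is the order in which the sample requests were made.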

Find all requests that are more than 1000 characters long:

Very long requests are often a sign that something is wrong: they can indicate a problem with your website’s code or they can be indicative of someone trying to hack into your website (especially if the requests contain any SQL code words).

gunzip -c file.gz | awk 'length > 1000' > outputfile
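Strictly speaking, awk's bare length is the length of the whole log line, not just the request. The sketch below shows that form next to a variant that measures only the request URL (field 7 when split on whitespace). The sample lines, the 1200-character query string, and the variable names are made up for the example.

```shell
# Build a made-up 1200-character query string.
long_query=$(awk 'BEGIN { for (i = 0; i < 1200; i++) printf "A" }')

# Two sample lines: one ordinary request and one with the very long query string.
sample=$(printf '%s\n' \
    '1.2.3.4 - - [t z] "GET /index.html HTTP/1.1" 200 10' \
    "5.6.7.8 - - [t z] \"GET /search?q=$long_query HTTP/1.1\" 200 10")

# Whole-line length, as in the command above; show the IP of each match.
long_lines=$(printf '%s\n' "$sample" | awk 'length > 1000' | cut -d' ' -f1)

# Request URL only: length of field 7.
long_requests=$(printf '%s\n' "$sample" | awk 'length($7) > 1000 { print $1 }')

echo "long lines from: $long_lines"
echo "long requests from: $long_requests"
```

Both versions flag only 5.6.7.8 here. On real logs the two can differ, since a long referrer or user agent can push a line past 1000 characters even when the request itself is short.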

Stay tuned for more posts with additional commands.


2 Responses to “Unix commands for logfile parsing”

  1. Julien Coquet Says:

    Hmmm I wish my shell scripts gave me more CARATS, i’d be rich after a couple thousand code lines ;-)

    Great work Angie, this definitely goes into my bookmarks!

    Cheers,

    Julien

  2. angie Says:

    LOL! I guess a girl can hope for a Unix command that returns jewelry at the beginning of each line. :)

    Darn spell checker!
