<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>showmeanalytics.com &#187; grep</title>
	<atom:link href="http://showmeanalytics.com/tag/grep/feed/" rel="self" type="application/rss+xml" />
	<link>http://showmeanalytics.com</link>
	<description>Analytics from the Show Me State</description>
	<lastBuildDate>Wed, 05 May 2010 23:56:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Unix commands for logfile parsing</title>
		<link>http://showmeanalytics.com/2009/01/unix-commands-for-logfile-parsing/</link>
		<comments>http://showmeanalytics.com/2009/01/unix-commands-for-logfile-parsing/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 04:18:41 +0000</pubDate>
		<dc:creator>angie</dc:creator>
				<category><![CDATA[Unix]]></category>
		<category><![CDATA[awk]]></category>
		<category><![CDATA[cut]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[Logfiles]]></category>
		<category><![CDATA[server logs]]></category>
		<category><![CDATA[sort]]></category>
		<category><![CDATA[uniq]]></category>

		<guid isPermaLink="false">http://showmeanalytics.com/?p=19</guid>
		<description><![CDATA[Although many web analysts use a JavaScript-tagged solution, some of us still do log analysis on one or more sites. Even when JS data is used, sometimes you have a troubleshooting situation that requires you to go back to your logs. If you have access to a Unix environment, commands like grep, cut, and awk [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">Although many web analysts use a JavaScript-tagged solution, some of us still do log analysis on one or more sites. Even when JS data is used, sometimes you have a troubleshooting situation that requires you to go back to your logs. If you have access to a Unix environment, commands like grep, cut, and awk are invaluable for prowling through large files. You can also download these commands to use in a PC/DOS environment, although I’ve found the DOS version to be a little more awkward to use.</p>
<p class="MsoNormal">Here is an introduction to some of my favorite commands:</p>
<p class="MsoNormal"><strong>grep</strong> – used to find lines that contain a certain string or regular expression (note that regular expressions are not fully supported in the default grep command for some Unix systems)</p>
<p class="MsoNormal"><strong>cut</strong> – used to pull out specific columns from a file based on a specified delimiter;  most logs are space-delimited</p>
<p class="MsoNormal"><strong>awk</strong> – a programming language that can parse through text files; short pieces of code can be used on the command line</p>
<p class="MsoNormal"><strong>sort</strong> – sort the output; the -n modifier sorts numerically; the -t modifier sets the field delimiter and -k picks the field to sort on, so you can sort on something other than the beginning of the line</p>
<p class="MsoNormal"><strong>uniq</strong> – collapse adjacent duplicate lines (so the input should be sorted first); the -c modifier shows a count of how many times each line appears</p>
<p class="MsoNormal">In order to make “cut” work, you need to know which fields contain your data of interest. If you use the <a href="http://httpd.apache.org/docs/1.3/logs.html">“combined” log format</a>, the following table lists the fields where data is located. Cutting out cookie data can be a bit more difficult: we’re using cut with a space delimiter, but spaces can be contained in the user agent field so pulling out cookie values takes a little more work.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="73" valign="top">
<p class="MsoNormal"><strong>Field #</strong></p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal"><strong>Information</strong></p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">1</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">IP address</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">3</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Auth (userid) field; note it’s not always populated</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">4</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Timestamp</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">7</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Request URL</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">9</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Status code</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">11</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Referrer</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">12-</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">User agent (the dash means go through the end of the line; the UA can contain spaces and thus spans several columns)</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">(varies)</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Cookies</p>
</td>
</tr>
</tbody>
</table>
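<p class="MsoNormal">As a rough sketch of how the field numbers in the table map onto a real line, the snippet below runs cut against a single made-up log entry. The sample line and the variable names are assumptions for illustration only. Because the user agent contains spaces, the last step splits on double quotes with awk instead of on spaces, which is one possible way around the problem rather than the only one.</p>

```shell
# Hypothetical sample line in Apache "combined" format (all values made up).
line='1.2.3.4 - angie [12/Jan/2009:04:18:41 +0000] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/start" "Mozilla/5.0 (X11; Linux)"'

# Space-delimited fields, numbered as in the table above.
ip=$(printf '%s\n' "$line" | cut -d' ' -f1)
url=$(printf '%s\n' "$line" | cut -d' ' -f7)
status=$(printf '%s\n' "$line" | cut -d' ' -f9)
referrer=$(printf '%s\n' "$line" | cut -d' ' -f11)

# The user agent contains spaces, so split on double quotes instead:
# with '"' as the delimiter, the user agent is the sixth field.
agent=$(printf '%s\n' "$line" | awk -F'"' '{ print $6 }')

printf '%s\n' "$ip" "$url" "$status" "$referrer" "$agent"
```

<p class="MsoNormal">Note that cut keeps the surrounding double quotes on the referrer, since they are part of the space-delimited field.</p>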
<p class="MsoNormal">In Unix environments, you can view your results page by page on the screen, or save them to a file. To page through the results on screen, pipe the command through “more” as shown:</p>
<p class="MsoNormal"><span style="color: #800080;"><em>command</em> | more</span></p>
<p class="MsoNormal">To save the results to a file, redirect your output to a file of your choice using the greater than symbol:</p>
<p class="MsoNormal"><span style="color: #800080;"><em>command</em> &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">To open a logfile whose name ends in .gz, which indicates it is compressed, use gunzip -c (the -c writes the uncompressed data to the screen instead of uncompressing and saving your file). Use the “cat” command if the logfile is not compressed. To take a peek at one of your logfiles, you would do the following:</p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | more</span> or <span style="color: #800080;">cat <em>file</em> | more</span></p>
<p class="MsoNormal">The remainder of our examples assume we are examining a compressed logfile.</p>
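<p class="MsoNormal">If your logs are a mix of compressed and uncompressed files, a small wrapper can pick the right command for you. The following is a minimal sketch; the function name show_log and the throwaway sample file are assumptions made for the example.</p>

```shell
# show_log: hypothetical helper that prints a logfile to standard output,
# decompressing on the fly when the name ends in .gz.
show_log() {
    case "$1" in
        *.gz) gunzip -c "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Demonstrate with a throwaway file (made-up content) in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '1.2.3.4 - - [t +0] "GET / HTTP/1.1" 200 10 "-" "UA"' > "$tmpdir/access.log"
gzip -c "$tmpdir/access.log" > "$tmpdir/access.log.gz"

plain=$(show_log "$tmpdir/access.log")
unzipped=$(show_log "$tmpdir/access.log.gz")
rm -r "$tmpdir"

echo "$unzipped"
```

<p class="MsoNormal">Either form can then be piped into more, grep, or cut in the same way as the commands in this post.</p>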
<p class="MsoNormal"><strong>To pull out all records from one IP address (1.2.3.4, for example):</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "1\.2\.3\.4" &gt; <em>outputfile</em></span></p>
<p class="MsoNormal"><strong>To pull out all records from any IP address that begins with 12: </strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "^12\." &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Notes:</p>
<ul>
<li>A caret (^) is how you specify the beginning of a line with a regular expression</li>
<li>The backslash (\) tells the regular expression you are looking for an actual period instead of a wildcard</li>
</ul>
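<p class="MsoNormal">To see why the backslash matters, here is a small sketch that counts matches with and without it. The three sample lines are made up for the comparison.</p>

```shell
# Three made-up log line prefixes: only the first starts with "12."
sample=$(printf '%s\n' '12.0.0.1 - - x' '123.4.5.6 - - x' '112.0.0.1 - - x')

# Unescaped period: "." matches any character, so 123.4.5.6 also matches.
loose=$(printf '%s\n' "$sample" | grep -c '^12.')

# Escaped period: only addresses whose first octet is exactly 12 match.
strict=$(printf '%s\n' "$sample" | grep -c '^12\.')

echo "unescaped: $loose, escaped: $strict"
```

<p class="MsoNormal">The unescaped pattern counts two lines and the escaped pattern counts one; the caret keeps 112.0.0.1 out of both.</p>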
<p class="MsoNormal"><strong>To look at the requests made by one IP address (1.2.3.4, for example):</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "^1\.2\.3\.4 " | cut -d' ' -f7 &gt; <em>outputfile</em></span></p>
<p class="MsoNormal"><strong>Pull out “page” requests only (status code = 200, and not an image, css, or javascript file):</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep " 200 " | grep -v "\.jpg " | grep -v "\.gif " | grep -v "\.png " | grep -v "\.css " | grep -v "\.ico " | grep -v "\.js " &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Notes:</p>
<ul style="margin-top: 0in;" type="disc">
<li class="MsoNormal">You can exclude any other file extensions you wish by piping another grep -v into your command. Ending each grep string with a space means you only eliminate lines where the extension ends the request, not lines where it is embedded earlier in a query string value.</li>
<li class="MsoNormal">If you do a lot of logfile parsing, you may wish to put all the grep -v commands into a script so you don’t have to type all the commands every time you want to limit your output to pages.</li>
</ul>
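<p class="MsoNormal">One possible shape for such a script is a shell function that folds the exclusions into a single extended regular expression. This is only a sketch: the name pages_only and the sample lines are hypothetical, and it assumes your grep accepts the -E option.</p>

```shell
# pages_only: hypothetical helper that reads log lines on standard input and
# keeps only status-200 requests that are not images, CSS, icons, or JavaScript.
pages_only() {
    grep ' 200 ' | grep -v -E '\.(jpg|gif|png|css|ico|js) '
}

# Made-up sample lines to show the filter in action.
kept=$(printf '%s\n' \
    '1.2.3.4 - - [t +0] "GET /index.html HTTP/1.1" 200 10 "-" "UA"' \
    '1.2.3.4 - - [t +0] "GET /logo.gif HTTP/1.1" 200 10 "-" "UA"' \
    '1.2.3.4 - - [t +0] "GET /search?q=logo.gif+banner HTTP/1.1" 200 10 "-" "UA"' \
    '1.2.3.4 - - [t +0] "GET /gone.html HTTP/1.1" 404 10 "-" "UA"' \
    | pages_only | cut -d' ' -f7)

printf '%s\n' "$kept"
```

<p class="MsoNormal">With these sample lines the image request and the 404 are dropped, while the search request survives because its ".gif" is not followed by a space. You would use the function as gunzip -c <em>file.gz</em> | pages_only.</p>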
<p class="MsoNormal"><strong>Make a list of the most popular referrer fields for the /index.html page:</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "GET /index\.html" | cut -d' ' -f11 | sort | uniq -c | sort -nr &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Notes:</p>
<ul style="margin-top: 0in;" type="disc">
<li class="MsoNormal">The output will be a sorted list of lines with a number and a URL; the number is how many times the referrer occurred, and the URL is the referrer</li>
<li class="MsoNormal">The uniq command must be executed on sorted input, which is why we sort the output first</li>
<li class="MsoNormal">The second sort command lists the output by most to least popular referrer; -n is numeric and -r is reverse order</li>
</ul>
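<p class="MsoNormal">The counting half of that pipeline can be tried on its own. The sketch below feeds it a handful of made-up referrer values, standing in for what cut would produce from a real log, and keeps only the top line.</p>

```shell
# Made-up referrer values standing in for the output of cut -d' ' -f11.
referrers=$(printf '%s\n' \
    '"http://a.example/"' '"-"' '"http://a.example/"' \
    '"http://b.example/"' '"http://a.example/"' '"-"')

# sort groups identical lines, uniq -c counts each group,
# and sort -nr puts the largest count first.
top=$(printf '%s\n' "$referrers" | sort | uniq -c | sort -nr | head -n 1)

echo "$top"
```

<p class="MsoNormal">The top line shows a count of 3 next to "http://a.example/"; the amount of leading whitespace before the count varies between systems.</p>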
<p class="MsoNormal"><strong>Pull out all the records from userid “angie”, and sort them by timestamp:</strong></p>
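<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep " angie " | sort -t' ' -k4,4 &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Note:</p>
<p class="MsoNormal">The sort command is modified as follows: -t' ' says the input is space-delimited, while -k4,4 says to sort on the fourth column only (the default is to sort on the whole line). Older versions of sort wrote this as +3, meaning “skip three columns”; many current versions no longer accept that form. The comparison is a plain text comparison of the timestamp, so it gives chronological order within a single day’s log but not across different months.</p>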
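<p class="MsoNormal">Here is a small sketch of sorting on the timestamp column, assuming a sort that accepts the POSIX -k option (-k4,4 selects the fourth space-delimited field, the same column the older +3 form refers to). The two sample lines are made up and deliberately out of order.</p>

```shell
# Two made-up entries for the same user on the same day, out of order.
sample=$(printf '%s\n' \
    '1.2.3.4 - angie [12/Jan/2009:09:30:00 +0000] "GET /b.html HTTP/1.1" 200 10 "-" "UA"' \
    '1.2.3.4 - angie [12/Jan/2009:04:18:41 +0000] "GET /a.html HTTP/1.1" 200 10 "-" "UA"')

# Sort on field 4 (the timestamp), then show only the request URL (field 7).
ordered=$(printf '%s\n' "$sample" | sort -t' ' -k4,4 | cut -d' ' -f7)

printf '%s\n' "$ordered"
```

<p class="MsoNormal">The 04:18:41 request for /a.html now comes before the 09:30:00 request for /b.html.</p>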
<p class="MsoNormal"><strong>Find all requests that are more than 1000 characters long:</strong></p>
<p class="MsoNormal">Very long requests are often a sign that something is wrong: they can indicate a problem with your website’s code or they can be indicative of someone trying to hack into your website (especially if the requests contain any SQL code words).</p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | awk 'length &gt; 1000' &gt; <em>outputfile</em></span></p>
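<p class="MsoNormal">The command above measures the whole log line, including the referrer and user agent. If you want to measure only the request URL, one variation (a sketch, not the only way to do it) is to test the length of field 7, which awk splits out on whitespace. The sample lines below are made up, with one artificially long request.</p>

```shell
# Build a made-up 1200-character request path: "/" followed by 1199 zeros.
long_path=$(printf '/%01199d' 0)

sample=$(printf '%s\n' \
    "1.2.3.4 - - [t +0] \"GET $long_path HTTP/1.1\" 200 10 \"-\" \"UA\"" \
    '5.6.7.8 - - [t +0] "GET /index.html HTTP/1.1" 200 10 "-" "UA"')

# Print the IP address and request length when field 7 exceeds 1000 characters.
flagged=$(printf '%s\n' "$sample" | awk 'length($7) > 1000 { print $1, length($7) }')

echo "$flagged"
```

<p class="MsoNormal">Only the first entry is reported, along with the length of its request.</p>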
<p class="MsoNormal">Stay tuned for more posts with additional commands.</p>
]]></content:encoded>
			<wfw:commentRss>http://showmeanalytics.com/2009/01/unix-commands-for-logfile-parsing/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
