<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>showmeanalytics.com &#187; Logfiles</title>
	<atom:link href="http://showmeanalytics.com/tag/logfiles/feed/" rel="self" type="application/rss+xml" />
	<link>http://showmeanalytics.com</link>
	<description>Analytics from the Show Me State</description>
	<lastBuildDate>Wed, 05 May 2010 23:56:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>One visit, two user agents</title>
		<link>http://showmeanalytics.com/2009/07/one-visit-two-user-agents/</link>
		<comments>http://showmeanalytics.com/2009/07/one-visit-two-user-agents/#comments</comments>
		<pubDate>Tue, 14 Jul 2009 12:50:58 +0000</pubDate>
		<dc:creator>angie</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[browsers]]></category>
		<category><![CDATA[Logfiles]]></category>
		<category><![CDATA[visits]]></category>

		<guid isPermaLink="false">http://showmeanalytics.com/?p=110</guid>
		<description><![CDATA[I found out recently that visitors using Internet Explorer 8 on a site that is not compatible with that browser, can exhibit multiple user agent strings during one visit. This is because of a compatibility view provided in IE8 that makes it look and act mostly (but not exactly) like IE7, for sites that don’t [...]]]></description>
			<content:encoded><![CDATA[<p>I found out recently that visitors using Internet Explorer 8 on a site that is not compatible with that browser can exhibit multiple user agent strings during one visit. This is because of a <a href="http://blogs.msdn.com/ie/archive/2008/08/27/introducing-compatibility-view.aspx">compatibility view</a> provided in IE8 that makes it look and act mostly (<a href="http://blogs.msdn.com/ie/archive/2009/03/12/site-compatibility-and-ie8.aspx">but not exactly</a>) like IE7, for sites that don’t play nicely with the newer browser. If you are trying to provide a proper browser breakdown in support of a site redesign, or if you are troubleshooting browser-related data or user problems, the compatibility view will complicate things.</p>
<p>I assume that most web analytics tools identify the IE version by looking for <em>MSIE X.Y</em> in the browser string. That approach no longer works for IE8, because the IE8 user agent string reports <em>MSIE 7.0</em> when in compatibility mode. What distinguishes the “real” IE7 from IE8 in compatibility mode is the word <em>Trident</em>, which is included in both variants of IE8:</p>
<p><em>Example of a regular IE8 user agent: </em>Mozilla/4.0 (compatible; <strong>MSIE 8.0</strong>; Windows NT 6.0; <strong>Trident</strong>/4.0; SLCC1; Media Center PC 5.0; .NET CLR 3.5.21022)</p>
<p><em>Example of IE8 in compatibility mode:</em> Mozilla/4.0 (compatible; <strong>MSIE 7.0</strong>; Windows NT 6.0; <strong>Trident</strong>/4.0; SLCC1; Media Center PC 5.0; .NET CLR 3.5.21022)</p>
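<p>If you have the raw strings, the rule above can be sketched as a small shell function. This is only an illustration: the function name <em>classify_ie</em> and the labels it prints are made up for this example, and it assumes one user agent string per line.</p>

```shell
# classify_ie: read one user agent string per line on stdin and label it.
# The function name and the labels are illustrative, not from any tool.
classify_ie() {
  awk '
    /MSIE 8\.0/ && /Trident/ { print "IE8\t" $0; next }         # native IE8
    /MSIE 7\.0/ && /Trident/ { print "IE8-compat\t" $0; next }  # IE8 in compatibility view
    /MSIE 7\.0/              { print "IE7\t" $0; next }         # no Trident token
                             { print "other\t" $0 }
  '
}

# Demo on the two sample strings from this post; prints IE8, then IE8-compat
printf '%s\n' \
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; Media Center PC 5.0; .NET CLR 3.5.21022)' \
  'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; Media Center PC 5.0; .NET CLR 3.5.21022)' |
  classify_ie | cut -f1
```

<p>The label comes first and is tab-separated from the original string, so the output can be piped straight into sort and uniq for a quick browser breakdown.</p>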
<p>Literally thousands of web sites are not compatible with IE8. A list of <a href="http://www.microsoft.com/downloads/thankyou.aspx?familyId=b885e621-91b7-432d-8175-a745b87d2588&amp;displayLang=en">more than 3,000 incompatible sites</a> is maintained by Microsoft.  This list can be downloaded by IE8 users so that the browser can automatically switch itself into compatibility view when a site is encountered that has previously been identified by IE8 users as incompatible. Many more sites are not compatible, but are not on the list because they have lower traffic levels.</p>
<p>Because a visitor can have multiple user agents in one visit, this raises a number of questions:</p>
<ul>
<li>Does your analytics tool keep the user agent string from each individual page view, or does it associate one browser with the entire visit?</li>
<li>If the browser is associated with the entire visit, which browser is recorded? If the tool keeps the string from the entry page, then IE8 is likely represented correctly in your data, but you won’t know if users are resorting to compatibility mode in order to view your site. If your analytics tool keeps the last browser string encountered in the visit, then your numbers are likely biased toward IE7 unless your tool is properly grouping this traffic as IE8.</li>
<li>If browser is associated with page views instead of the visit, then adding up visits in your browser report would give you more than the total visits for your site. In other words, browser visits would not be “summable” the way they are when one can assume that each visit has only one browser. This is not the end of the world, just something to be aware of because it’s not intuitive.</li>
<li>Does your analytics tool properly group the browsers with both <em>MSIE 7.0</em> and <em>Trident</em> as IE8? If not, does it expose the entire string so you can do the calculations yourself to see if your site has IE8 issues?</li>
<li>If you are doing logfile analysis without cookies, sessionization is probably based on IP + User Agent. For sites where I’ve transitioned from logfiles to tags in the same tool, my experience has been that IP/User Agent sessionization tends to over-count visits: this issue will increase that inflation even more. Bear in mind that many tag-based tools resort to IP/UA when cookies are blocked, so there could be a small inflation effect regardless of the type of data-collection you use.</li>
</ul>
<p>I examined a few of my sites and found the percentage of visits with IE8 to be roughly between 5% and 15%, depending on the site. My B2B sites tend to have lower IE8 penetration, while sites that attract high-tech users will tend to show a higher percentage of the latest browsers.</p>
<p>If your web analytics tool exposes the entire browser string (Google Analytics does not), I recommend you search through your user agent strings looking for <em>Trident</em>, and see for yourself if this is an issue for the sites you analyze. One metric I’m looking at is the percentage of my <em>Trident</em> browser visits that also contain <em>MSIE 7</em>, assuming that sites that are not compatible with IE8 will show a higher percentage of users resorting to compatibility mode. For a site with known IE8 issues I calculated 25%, while another site I chose at random came out at 12%. I haven’t examined enough sites yet to know if that means the second site also has IE8 issues, or if it just means it&#8217;s &#8220;normal&#8221; for a certain percentage of IE8 users to surf in compatibility mode. Clearly I have more work to do.</p>
<p><strong>Update</strong>: Last night I received an email from a colleague who had read this post, asking why they should care. It&#8217;s a fair question so I thought I&#8217;d answer it publicly.</p>
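<p>For anyone working from raw strings, the metric could be computed along these lines. Treat it as a rough sketch: the function name <em>compat_share</em> is invented for this example, and it assumes the input is one user agent string per line, ideally one per visit rather than one per hit.</p>

```shell
# compat_share: read one user agent string per line and print the
# percentage of Trident strings that also report MSIE 7.
# (compat_share is a hypothetical name used only for this sketch.)
compat_share() {
  awk '
    /Trident/             { trident++ }
    /Trident/ && /MSIE 7/ { compat++ }
    END {
      if (trident > 0) printf "%.1f%%\n", 100 * compat / trident
      else             print "no Trident user agents found"
    }
  '
}

# Demo: four Trident strings, one of them in compatibility mode, plus one
# plain IE7 string that is ignored; prints 25.0%
printf '%s\n' \
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)' \
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)' \
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)' \
  'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0)' \
  'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' |
  compat_share
```

<p>Fed directly from a logfile, this counts hits rather than visits, so the result is only a rough proxy for the visit-based number discussed above.</p>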
<p>First, if you&#8217;re asking then you probably aren&#8217;t in a situation where you need to care. That&#8217;s OK: the lowly browser report isn&#8217;t the most important report in your web analytics tool, not by a long shot.</p>
<p>But I can think of a couple of situations where it&#8217;s important:</p>
<p>1. When deciding whether or not to fund development changes to enable compatibility with certain browsers, &#8220;fewer than 5% of our visits use that browser&#8221; is a lot different than &#8220;nearly 10% of our visits use that browser&#8221;.  The numbers you use for those decisions should be as accurate as practical.</p>
<p>2. Your customer service department may receive emails or phone calls from visitors complaining that they are unable to perform certain tasks on your site (like complete a transaction). When they receive multiple complaints that sound similar but are unable to reproduce the problem in house they may ask you, the analytics ninja, for help defining the scope of the problem. These intermittent issues are difficult to troubleshoot because they&#8217;re often environment-related. One starting point is to examine the user experience through that transaction &#8212; transaction page views per visit is sometimes sufficient, or you may want to look at a funnel chart for the process &#8212; and segment it by different browser versions. If the issue is due to a browser incompatibility, you can sometimes pinpoint it quickly with this type of analysis.</p>
]]></content:encoded>
			<wfw:commentRss>http://showmeanalytics.com/2009/07/one-visit-two-user-agents/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Perverts Make My Job Interesting</title>
		<link>http://showmeanalytics.com/2009/07/perverts-make-my-job-interesting/</link>
		<comments>http://showmeanalytics.com/2009/07/perverts-make-my-job-interesting/#comments</comments>
		<pubDate>Sun, 05 Jul 2009 21:57:58 +0000</pubDate>
		<dc:creator>angie</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Logfiles]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[search keywords]]></category>

		<guid isPermaLink="false">http://showmeanalytics.com/?p=103</guid>
		<description><![CDATA[If you are a web analyst, and you have ever had to Google “zoo porn” as part of your job, you would understand why I loathe the idea of targeted advertising based on user searches. The terms I’ve searched as part of my job have gotten me on the net-nanny list of every employer I’ve [...]]]></description>
			<content:encoded><![CDATA[<p>If you are a web analyst, and you have ever had to Google “zoo porn” as part of your job, you would understand why I loathe the idea of targeted advertising based on user searches. The terms I’ve searched as part of my job have gotten me on the net-nanny list of every employer I’ve had since working in this field. It’s the perverts: they really affect my data.</p>
<div id="attachment_106" class="wp-caption aligncenter" style="width: 451px"><a href="http://showmeanalytics.com/wp-content/uploads/2009/07/fark.jpg"><img class="size-full wp-image-106" title="Screenshot: www.fark.com" src="http://showmeanalytics.com/wp-content/uploads/2009/07/fark.jpg" alt="Screenshot: www.fark.com" width="441" height="66" /></a><p class="wp-caption-text">If Fark is to be believed, the Internet is all about porn anyway.</p></div>
<p>For the record, I don’t analyze porn sites for a living. While I admit I have done analysis for at least one adult-oriented site in the past, this is different. This is the effect of sexually-oriented search terms on websites that have little or nothing to do with sex, websites that I would happily show to my mother. But if you analyze a wide enough variety of sites, you will find that fetishes come in a surprising variety of shapes and sizes, and you’ll be surprised where they, um, pop up.</p>
<p>There are three ways that these “thrill-seekers” may affect your data.</p>
<p style="padding-left: 30px;">1. <strong>By causing a one-time traffic spike</strong>. This is more likely to happen for a blog or a news site, when an article mentions something sexual in a fairly innocuous way. For example, this article contains plenty of keywords that may attract traffic that is not part of my target audience (and if you haven’t bounced by now, welcome to the world of web analytics!). This can happen on news or magazine sites that run features on a variety of subjects, and it can often catch the web analyst off guard. For example, consider the more-or-less legitimate &#8212; if somewhat sensational &#8212; news articles that were all the rage a couple months ago, talking about teens sending naked pictures of themselves to each other on cell phones. When you mention “teens” and “sex” and “naked pictures”  in the same article, you’re bound to attract some of <em>that</em> kind of traffic.</p>
<p style="padding-left: 30px;">This usually only becomes an issue when the traffic spike for a single article is large enough to influence aggregate numbers for the entire week or month. Any sudden spike (or dip) in traffic should always be investigated: it may have been due to a simple editorial choice instead of that awesome marketing campaign that your HiPPO designed.</p>
<p style="padding-left: 30px;">2. <strong>By inflating search engine visits long-term</strong>. Perhaps “inflating” isn’t the best term, since the traffic is real, it’s human, and it’s coming from search engines. This situation happens when there are articles or images on your site that are intended for one audience but end up attracting another audience – the kind that’s not likely to become a customer – and it can wreak havoc with your conversion rates. A prime example is a site that publishes medical information intended for a professional medical audience. A thorough enough site will likely contain pictures of certain body parts or descriptions of rare medical procedures, and a glance through some of your top search terms can yield insights into the human psyche that you wish you didn’t know.</p>
<p style="padding-left: 30px;">Always look past the “Top X” keyword report that is spit out of your web analytics package by default. Look for terms that seem over-represented on a site like yours. Pay careful attention to image searches, and ensure that you can separate image search keywords from text search keywords if necessary.</p>
<p style="padding-left: 30px;">3. <strong>By logging visits that never really happened</strong>. This is fairly rare, and you will likely only catch it if a) your analytics are based on server logs instead of JavaScript tags, and b) your site contains one or more unprotected redirect URLs, “pages” that contain a URL as a value in the query string. The symptom is a sudden appearance in your keyword reports of sexually-oriented phrases that have absolutely nothing to do with your site. The cause is a search engine ranking hack, where a site-of-ill-repute manages to get itself indexed by means of your redirect URLs, using your site’s good reputation to increase its rankings. You can confirm by looking at the entry pages for the offending terms to see if they are the redirect pages.</p>
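<p style="padding-left: 30px;">If you have the server logs, one way to find candidate redirect pages might look like the sketch below. It assumes the “combined” log format (request URL in the seventh space-delimited field) and assumes the redirect target appears as a query string value beginning with http or https, plain or URL-encoded. The function name <em>redirect_hits</em>, the <em>/go.php</em> page, and the example.net target are all made up for illustration.</p>

```shell
# redirect_hits: read combined-format log lines on stdin and count requests
# whose query string carries an absolute URL as a parameter value.
# (Hypothetical helper; adjust the pattern to how your redirects are built.)
redirect_hits() {
  awk '$7 ~ /[?&][^=]*=https?(:|%3[Aa])/ { print $7 }' |
    sort | uniq -c | sort -nr
}

# Demo with made-up log lines: two hits on a redirect page, one normal page
printf '%s\n' \
  '1.2.3.4 - - [05/Jul/2009:21:57:58 +0000] "GET /go.php?url=http://example.net/x HTTP/1.1" 302 0 "-" "Mozilla/4.0"' \
  '1.2.3.5 - - [05/Jul/2009:21:58:01 +0000] "GET /go.php?url=http://example.net/x HTTP/1.1" 302 0 "-" "Mozilla/4.0"' \
  '1.2.3.6 - - [05/Jul/2009:21:58:09 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"' |
  redirect_hits
```

<p style="padding-left: 30px;">The output lists each matching request URL with its hit count, most frequent first, which gives you a short list of pages to compare against the entry pages for the offending search terms.</p>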
<p>As with any traffic that is obviously unqualified, you very likely want to segment out the perverts from some of your conversion rate calculations, especially if you are doing optimization efforts on one or more areas of your site. Unqualified traffic volume can be more than enough to skew results and mask changes to real customer behavior. However, I don’t recommend you filter this traffic from your entire data set. If your linking, advertising, or SEO efforts are bringing in the wrong kind of traffic, this is something you really need to know.</p>
]]></content:encoded>
			<wfw:commentRss>http://showmeanalytics.com/2009/07/perverts-make-my-job-interesting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unix commands for logfile parsing</title>
		<link>http://showmeanalytics.com/2009/01/unix-commands-for-logfile-parsing/</link>
		<comments>http://showmeanalytics.com/2009/01/unix-commands-for-logfile-parsing/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 04:18:41 +0000</pubDate>
		<dc:creator>angie</dc:creator>
				<category><![CDATA[Unix]]></category>
		<category><![CDATA[awk]]></category>
		<category><![CDATA[cut]]></category>
		<category><![CDATA[grep]]></category>
		<category><![CDATA[Logfiles]]></category>
		<category><![CDATA[server logs]]></category>
		<category><![CDATA[sort]]></category>
		<category><![CDATA[uniq]]></category>

		<guid isPermaLink="false">http://showmeanalytics.com/?p=19</guid>
		<description><![CDATA[Although many web analysts use a JavaScript-tagged solution, some of us still do log analysis on one or more sites. Even when JS data is used, sometimes you have a troubleshooting situation that requires you to go back to your logs. If you have access to a Unix environment, commands like grep, cut, and awk [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">Although many web analysts use a JavaScript-tagged solution, some of us still do log analysis on one or more sites. Even when JS data is used, sometimes you have a troubleshooting situation that requires you to go back to your logs. If you have access to a Unix environment, commands like grep, cut, and awk are invaluable for prowling through large files. You can also download these commands to use in a PC/DOS environment, although I’ve found the DOS version to be a little more awkward to use.</p>
<p class="MsoNormal">Here is an introduction to some of my favorite commands:</p>
<p class="MsoNormal"><strong>grep</strong> – used to find lines that contain a certain string or regular expression (note that regular expressions are not fully supported in the default grep command for some Unix systems)</p>
<p class="MsoNormal"><strong>cut</strong> – used to pull out specific columns from a file based on a specified delimiter;  most logs are space-delimited</p>
<p class="MsoNormal"><strong>awk</strong> – a programming language that can parse through text files; short pieces of code can be used on the command line</p>
<p class="MsoNormal"><strong>sort</strong> – sort the output; the -n modifier sorts numerically; combine the -t modifier (which sets the field delimiter) with a column option to sort on something other than the beginning of the line</p>
<p class="MsoNormal"><strong>uniq</strong> – eliminate duplicate lines; the -c modifier shows a count of how many times each line appears</p>
<p class="MsoNormal">In order to make “cut” work, you need to know which fields contain your data of interest. If you use the <a href="http://httpd.apache.org/docs/1.3/logs.html">“combined” log format</a>, the following table lists the fields where data is located. Cutting out cookie data can be a bit more difficult: we’re using cut with a space delimiter, but spaces can be contained in the user agent field so pulling out cookie values takes a little more work.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="73" valign="top">
<p class="MsoNormal"><strong>Field #</strong></p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal"><strong>Information</strong></p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">1</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">IP address</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">3</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Auth (userid) field; note   it’s not always populated</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">4</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Timestamp</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">7</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Request URL</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">9</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Status code</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">11</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Referrer</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">12-</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">User agent (the dash means   go through the end of the line; UA can contain spaces and thus spans several   columns)</p>
</td>
</tr>
<tr>
<td style="padding: 0in 5.4pt; width: 54.9pt;" width="73" valign="top">
<p class="MsoNormal">(varies)</p>
</td>
<td style="padding: 0in 5.4pt; width: 238.5pt;" width="318" valign="top">
<p class="MsoNormal">Cookies</p>
</td>
</tr>
</tbody>
</table>
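<p class="MsoNormal">When the user agent or referrer is needed as a single column, an alternative to cut is to split each line on double quotes with awk, since the combined format wraps the request, referrer, and user agent in quotes. The sketch below is one way to do it, not the only one: the function name <em>log_fields</em> is invented here, and it assumes none of the quoted values contain an embedded double quote.</p>

```shell
# log_fields: print IP, request URL, status, referrer and user agent as
# tab-separated columns, splitting each line on double quotes instead of spaces.
# (Hypothetical helper; assumes the combined log format.)
log_fields() {
  awk -F'"' '{
    split($1, head, " ")   # head[1] = IP address
    split($2, req,  " ")   # req[2]  = request URL
    split($3, resp, " ")   # resp[1] = status code
    printf "%s\t%s\t%s\t%s\t%s\n", head[1], req[2], resp[1], $4, $6
  }'
}

# Demo with a made-up log line
printf '%s\n' \
  '1.2.3.4 - angie [12/Jan/2009:04:18:41 +0000] "GET /index.html HTTP/1.1" 200 512 "http://example.com/" "Mozilla/4.0 (compatible; MSIE 7.0)"' |
  log_fields
```

<p class="MsoNormal">Because the output is tab-delimited, the user agent stays in one column even though it contains spaces, and cut -f5 will pull it out whole.</p>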
<p class="MsoNormal">In Unix environments, you can view your results page by page on the screen, or save them to a file. To page through the results on screen, pipe the command through “more” as shown:</p>
<p class="MsoNormal"><span style="color: #800080;"><em>command</em> | more</span></p>
<p class="MsoNormal">To save the results to a file, redirect your output to a file of your choice using the greater than symbol:</p>
<p class="MsoNormal"><span style="color: #800080;"><em>command</em> &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">To open a logfile whose name ends in .gz (which indicates it is compressed), use gunzip -c; the -c writes the uncompressed contents to the screen instead of uncompressing and saving your file. Use the “cat” command if the logfile is not compressed. To take a peek at one of your logfiles, you would do the following:</p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | more</span> or <span style="color: #800080;">cat <em>file</em> | more</span></p>
<p class="MsoNormal">The remainder of our examples assume we are examining a compressed logfile.</p>
<p class="MsoNormal"><strong>To pull out all records from one IP address (1.2.3.4, for example):</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "1.2.3.4" &gt; <em>outputfile</em></span></p>
<p class="MsoNormal"><strong>To pull out all records from any IP address that begins with 12: </strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "^12\." &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Notes:</p>
<ul>
<li>A caret (^) is how you specify the beginning of a line with a regular expression</li>
<li>The backslash (\) tells the regular expression you are looking for an actual period instead of a wildcard</li>
</ul>
<p class="MsoNormal"><strong>To look at the requests made by one IP address (1.2.3.4, for example):</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "^1\.2\.3\.4 " | cut -d' ' -f7 &gt; <em>outputfile</em></span></p>
<p class="MsoNormal"><strong>Pull out “page” requests only (status code = 200, and not an image, css, or javascript file):</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep " 200 " | grep -v "\.jpg " | grep -v "\.gif " | grep -v "\.png " | grep -v "\.css " | grep -v "\.ico " | grep -v "\.js " &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Notes:</p>
<ul style="margin-top: 0in;" type="disc">
<li class="MsoNormal">You can exclude any other file extensions you wish by piping another grep -v into your command; ending the grep string with a space ensures you will only eliminate lines where those extensions are the request, and not embedded in a query string value.</li>
<li class="MsoNormal">If you do a lot of logfile parsing, you may wish to put all the grep -v commands into a script so you don’t have to type all the commands every time you want to limit your output to pages.</li>
</ul>
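<p class="MsoNormal">A script along those lines might look like the following. It is only a sketch: the function name <em>pages_only</em> is made up, and it folds the separate grep -v commands into one extended regular expression, which assumes your grep supports the -E option.</p>

```shell
# pages_only: keep requests with a 200 status that are not images,
# stylesheets, icons or scripts. Reads log lines on stdin.
# (Hypothetical helper name; extend the extension list as needed.)
pages_only() {
  grep ' 200 ' |
    grep -v -E '\.(jpg|gif|png|css|ico|js) '
}

# Demo with made-up log lines; prints only /index.html
printf '%s\n' \
  '1.2.3.4 - - [12/Jan/2009:04:18:41 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"' \
  '1.2.3.4 - - [12/Jan/2009:04:18:42 +0000] "GET /logo.gif HTTP/1.1" 200 900 "-" "Mozilla/4.0"' \
  '1.2.3.4 - - [12/Jan/2009:04:18:43 +0000] "GET /missing.html HTTP/1.1" 404 0 "-" "Mozilla/4.0"' |
  pages_only | cut -d' ' -f7
```

<p class="MsoNormal">Saved in a file that your shell loads at startup, the function can then be dropped into any pipeline, for example: gunzip -c <em>file.gz</em> | pages_only | more</p>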
<p class="MsoNormal"><strong>Make a list of the most popular referrer fields for the /index.html page:</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep "GET /index.html" | cut -d' ' -f11 | sort | uniq -c | sort -nr &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Notes:</p>
<ul style="margin-top: 0in;" type="disc">
<li class="MsoNormal">The output will be a sorted list of lines with a number and a URL; the number is how many times the referrer occurred, and the URL is the referrer</li>
<li class="MsoNormal">The uniq command must be executed on sorted input, which is why we sort the output first</li>
<li class="MsoNormal">The second sort command lists the output by most to least popular referrer; -n is numeric and -r is reverse order</li>
</ul>
<p class="MsoNormal"><strong>Pull out all the records from userid “angie”, and sort them by timestamp:</strong></p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | grep " angie " | sort -t' ' -k4 &gt; <em>outputfile</em></span></p>
<p class="MsoNormal">Note:</p>
<p class="MsoNormal">The sort command is modified as follows: -t' ' says the input is space-delimited, while -k4 says to start sorting at the fourth column (sort defaults to the first column). Older versions of sort spell this +3, meaning “skip three columns”.</p>
<p class="MsoNormal"><strong>Find all requests that are more than 1000 characters long:</strong></p>
<p class="MsoNormal">Very long requests are often a sign that something is wrong: they can indicate a problem with your website’s code or they can be indicative of someone trying to hack into your website (especially if the requests contain any SQL code words).</p>
<p class="MsoNormal"><span style="color: #800080;">gunzip -c <em>file.gz</em> | awk 'length &gt; 1000' &gt; <em>outputfile</em></span></p>
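<p class="MsoNormal">The command above measures the whole log line, including the referrer and user agent. If you only care about the length of the request URL itself, a variation like the one below may be closer to what you want. It is a sketch that assumes the combined log format (URL in the seventh field); the function name <em>long_requests</em> and its optional threshold argument are invented for this example.</p>

```shell
# long_requests: print length, IP and URL for requests whose URL (field 7)
# is longer than the given threshold (default 1000 characters).
# (Hypothetical helper; reads combined-format log lines on stdin.)
long_requests() {
  awk -v max="${1:-1000}" 'length($7) > max { print length($7) "\t" $1 "\t" $7 }'
}

# Demo with made-up log lines and a low threshold of 20 characters;
# only the second request is reported
printf '%s\n' \
  '1.2.3.4 - - [12/Jan/2009:04:18:41 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"' \
  '5.6.7.8 - - [12/Jan/2009:04:18:42 +0000] "GET /search?q=1+UNION+SELECT+password+FROM+users HTTP/1.1" 200 512 "-" "Mozilla/4.0"' |
  long_requests 20
```

<p class="MsoNormal">Printing the IP address next to each long URL makes it easy to follow up with the grep commands shown earlier and see what else that visitor requested.</p>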
<p class="MsoNormal">Stay tuned for more posts with additional commands.</p>
]]></content:encoded>
			<wfw:commentRss>http://showmeanalytics.com/2009/01/unix-commands-for-logfile-parsing/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
