<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Google, I love you but you&#8217;re killing the vibe</title>
	<atom:link href="http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/feed/" rel="self" type="application/rss+xml" />
	<link>http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/</link>
	<description></description>
	<lastBuildDate>Sat, 24 Jul 2010 01:10:05 -0700</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: P&#8217;unk Avenue Window &#187; Blog Archive &#187; Google Responds, Love Resumes</title>
		<link>http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/comment-page-1/#comment-66060</link>
		<dc:creator>P&#8217;unk Avenue Window &#187; Blog Archive &#187; Google Responds, Love Resumes</dc:creator>
		<pubDate>Wed, 18 Feb 2009 23:16:34 +0000</pubDate>
		<guid isPermaLink="false">http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/#comment-66060</guid>
		<description>[...] had fun with last week&#8217;s post, but the topic was actually serious stuff regarding the way search engines behave when they arrive [...]</description>
		<content:encoded><![CDATA[<p>[...] had fun with last week&#8217;s post, but the topic was actually serious stuff regarding the way search engines behave when they arrive [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Boutell</title>
		<link>http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/comment-page-1/#comment-66038</link>
		<dc:creator>Tom Boutell</dc:creator>
		<pubDate>Mon, 16 Feb 2009 17:29:27 +0000</pubDate>
		<guid isPermaLink="false">http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/#comment-66038</guid>
		<description>Hi John,

Thanks for the reply!

Blocking all of /filtered/ is definitely not what I want.
I don&#039;t want to block ALL Google access to filters. In fact, that would be Very Bad. I definitely want Google to index things like &quot;everything that&#039;s happening at Brasils Nightclub.&quot;

What I *don&#039;t* want Google to do is index &lt;i&gt;combinations of two or more filters&lt;/i&gt;. End users have a pretty good idea which combinations are useful, but of course search engines try to explore the whole space... combinatorial explosion! Kaboom!

I separate filters with semicolons, so my original thought was to do this in robots.txt, following the globbing syntax that robots.txt resembles (but doesn&#039;t quite follow):

Disallow: *;*

But everything I read up to that point led me to believe that this was not supported. So I wound up taking a different approach.

Now that I&#039;ve read the article you linked to, I&#039;m aware that I can in fact write just:

Disallow: *;

And I don&#039;t need a trailing * because the expanded Robot Exclusion Protocol doesn&#039;t really work like a shell &quot;glob&quot; pattern: you don&#039;t have to match the whole thing. In that sense it works more like a Perl-style regular expression, although it is confusing because the use of * strongly suggests it will act like globbing. But this is a pre-existing issue in the REP since:

Disallow: /foo

Has been around for a while.

I also see that you&#039;ve added $, which is a logical thing to offer if the REP is going to match substrings beginning from the root by default and not complete URLs. Nice.
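
To make the matching behavior described above concrete, here is a rough sketch in Python of how I understand it. The function names are made up for illustration, and this is an assumption about the semantics based on the blog article, not anyone's actual crawler code:

```python
import re

def rep_pattern_to_regex(pattern):
    # Hypothetical helper: turn a robots.txt path pattern into a regex.
    # A trailing $ anchors the end of the path; without it, the rule
    # only has to match a prefix of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces so that every *
    # becomes "any run of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(pattern, path):
    # re.match anchors at the start of the path only, which gives the
    # "match from the root, not the whole URL" behavior.
    return rep_pattern_to_regex(pattern).match(path) is not None
```

Under those assumptions, is_disallowed("*;", "/filtered/genre,bachata;venue,brasils") is true and is_disallowed("*;", "/filtered/venue,brasils") is false, which is the split between multiple filters and single filters that I am after.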

I will try Disallow: *; and see how that plays out. Though I&#039;m not sure it will matter since there is already a rule on Google&#039;s side. I see that MSN is grinding through my site too, although not in a CPU-pinning fashion, so I&#039;ll watch and see if they respond well to the new rule.

I did not find the Google Webmaster Blog article because I was searching for robots.txt and not Robot Exclusion Protocol. Ironically, Google didn&#039;t offer me that blog article as a result for the former. It did offer me several pages that fail to mention the new features and don&#039;t use the term Robot Exclusion Protocol. They use &quot;Robot Exclusion Standard&quot; and other not-quite-the-same terminology. Perhaps Google could reach out to the webmasters of those pages. It would also be more than acceptable to give a bump to your own webmaster blog article on this, IMHO, as it&#039;s clearly what people need to know.

The Wikipedia article doesn&#039;t cover your improvements to the Robot Exclusion Protocol either, although it does at least reference the term:

http://en.wikipedia.org/wiki/Robots.txt

* * *

My hack certainly cut down on the Googlebot traffic to the site. But so far the Google results for the filters are still pretty messy. A search for:

salsadelphia brasils

Yields this as the first result:

http://salsadelphia.com/filtered/genre,bachata;venue,brasils

Which narrows things too much by also filtering for bachata, and gives a weird impression of what is available there.

When what I want to see here is just:

http://salsadelphia.com/filtered/venue,brasils

I am hoping that the multiple-filter results will age off your search indexes soon?

In the meantime, I&#039;ve been told that a regexp was added on Google&#039;s end, and sure enough Googlebot has calmed down. Today&#039;s Googlebot traffic looks like this:

/filtered/instructor,joe_figueroa
/filtered/studio,take_the_lead
/filtered/dj,rumbero

... And so on. Lots of single filters, no multiple filters. This is exactly what I want.

I&#039;ll be completely happy when the results for multiple filters stop showing up in results.

Thanks again for the response!</description>
		<content:encoded><![CDATA[<p>Hi John,</p>
<p>Thanks for the reply!</p>
<p>Blocking all of /filtered/ is definitely not what I want.<br />
I don&#8217;t want to block ALL Google access to filters. In fact, that would be Very Bad. I definitely want Google to index things like &#8220;everything that&#8217;s happening at Brasils Nightclub.&#8221;</p>
<p>What I *don&#8217;t* want Google to do is index <i>combinations of two or more filters</i>. End users have a pretty good idea which combinations are useful, but of course search engines try to explore the whole space&#8230; combinatorial explosion! Kaboom!</p>
<p>I separate filters with semicolons, so my original thought was to do this in robots.txt, following the globbing syntax that robots.txt resembles (but doesn&#8217;t quite follow):</p>
<p>Disallow: *;*</p>
<p>But everything I read up to that point led me to believe that this was not supported. So I wound up taking a different approach.</p>
<p>Now that I&#8217;ve read the article you linked to, I&#8217;m aware that I can in fact write just:</p>
<p>Disallow: *;</p>
<p>And I don&#8217;t need a trailing * because the expanded Robot Exclusion Protocol doesn&#8217;t really work like a shell &#8220;glob&#8221; pattern: you don&#8217;t have to match the whole thing. In that sense it works more like a Perl-style regular expression, although it is confusing because the use of * strongly suggests it will act like globbing. But this is a pre-existing issue in the REP since:</p>
<p>Disallow: /foo</p>
<p>Has been around for a while.</p>
<p>I also see that you&#8217;ve added $, which is a logical thing to offer if the REP is going to match substrings beginning from the root by default and not complete URLs. Nice.</p>
<p>I will try Disallow: *; and see how that plays out. Though I&#8217;m not sure it will matter since there is already a rule on Google&#8217;s side. I see that MSN is grinding through my site too, although not in a CPU-pinning fashion, so I&#8217;ll watch and see if they respond well to the new rule.</p>
<p>I did not find the Google Webmaster Blog article because I was searching for robots.txt and not Robot Exclusion Protocol. Ironically, Google didn&#8217;t offer me that blog article as a result for the former. It did offer me several pages that fail to mention the new features and don&#8217;t use the term Robot Exclusion Protocol. They use &#8220;Robot Exclusion Standard&#8221; and other not-quite-the-same terminology. Perhaps Google could reach out to the webmasters of those pages. It would also be more than acceptable to give a bump to your own webmaster blog article on this, IMHO, as it&#8217;s clearly what people need to know.</p>
<p>The Wikipedia article doesn&#8217;t cover your improvements to the Robot Exclusion Protocol either, although it does at least reference the term:</p>
<p><a href="http://en.wikipedia.org/wiki/Robots.txt" rel="nofollow">http://en.wikipedia.org/wiki/Robots.txt</a></p>
<p>* * *</p>
<p>My hack certainly cut down on the Googlebot traffic to the site. But so far the Google results for the filters are still pretty messy. A search for:</p>
<p>salsadelphia brasils</p>
<p>Yields this as the first result:</p>
<p><a href="http://salsadelphia.com/filtered/genre,bachata;venue,brasils" rel="nofollow">http://salsadelphia.com/filtered/genre,bachata;venue,brasils</a></p>
<p>Which narrows things too much by also filtering for bachata, and gives a weird impression of what is available there.</p>
<p>When what I want to see here is just:</p>
<p><a href="http://salsadelphia.com/filtered/venue,brasils" rel="nofollow">http://salsadelphia.com/filtered/venue,brasils</a></p>
<p>I am hoping that the multiple-filter results will age off your search indexes soon?</p>
<p>In the meantime, I&#8217;ve been told that a regexp was added on Google&#8217;s end, and sure enough Googlebot has calmed down. Today&#8217;s Googlebot traffic looks like this:</p>
<p>/filtered/instructor,joe_figueroa<br />
/filtered/studio,take_the_lead<br />
/filtered/dj,rumbero</p>
<p>&#8230; And so on. Lots of single filters, no multiple filters. This is exactly what I want.</p>
<p>I&#8217;ll be completely happy when the results for multiple filters stop showing up in results.</p>
<p>Thanks again for the response!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: JohnMu</title>
		<link>http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/comment-page-1/#comment-66025</link>
		<dc:creator>JohnMu</dc:creator>
		<pubDate>Fri, 13 Feb 2009 22:30:34 +0000</pubDate>
		<guid isPermaLink="false">http://window.punkave.com/2009/02/13/google-i-love-you-but-youre-killing-the-vibe/#comment-66025</guid>
		<description>Hi Tom!
I&#039;m a webmaster trends analyst at Google and I thought I&#039;d drop a few short comments regarding things that could be done to improve this situation :). 

First off, yes, we - and I believe most of the other big search engines - support wildcards in robots.txt files. You can find a blog post about this at http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html . That said, you don&#039;t need wildcards to block the /filtered/ subdirectory; just use &quot;disallow: /filtered/&quot; in your robots.txt file and search engine crawlers will automatically know to ignore everything that starts with /filtered/ :). To block everything with /filters/ in the URL, you could use &quot;disallow: /*/filters/&quot;. 

Give it a shot &amp; remove that redirect. I&#039;m pretty sure it&#039;ll work how you want it to work :).
John</description>
		<content:encoded><![CDATA[<p>Hi Tom!<br />
I&#8217;m a webmaster trends analyst at Google and I thought I&#8217;d drop a few short comments regarding things that could be done to improve this situation <img src='http://window.punkave.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . </p>
<p>First off, yes, we &#8211; and I believe most of the other big search engines &#8211; support wildcards in robots.txt files. You can find a blog post about this at <a href="http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html" rel="nofollow">http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html</a> . That said, you don&#8217;t need wildcards to block the /filtered/ subdirectory; just use &#8220;disallow: /filtered/&#8221; in your robots.txt file and search engine crawlers will automatically know to ignore everything that starts with /filtered/ <img src='http://window.punkave.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . To block everything with /filters/ in the URL, you could use &#8220;disallow: /*/filters/&#8221;. </p>
<p>Give it a shot &amp; remove that redirect. I&#8217;m pretty sure it&#8217;ll work how you want it to work <img src='http://window.punkave.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .<br />
John</p>
]]></content:encoded>
	</item>
</channel>
</rss>
