Google, I love you but you’re killing the vibe
February 13th, 2009 by Tom
Once upon a time there was a web site. And the web site offered filters. And lo, the filters were good.
They let you browse for events at Brasils Nightclub. Or classes featuring Darlin Garcia. Or DJ nights within the city limits. Or any combination thereof.
And this was awesome. Until Googlebot arrived.
Googlebot, I love you. I really do. Hot, burning, profitable, whuffie-ridden love. But when you multiply all of the event types by all of the locations by all of the venues by all of the studios by all of the instructors by all of the genres…
Hoo boy, that’s a lot of URLs. And Googlebot wanted to index all of them.
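To put a rough number on that multiplication, here is a minimal Python sketch. The per-dimension counts are invented for illustration (the post gives no real figures), and it assumes each filter dimension is either absent or set to exactly one value, with filters always written in one canonical order.

```python
from math import prod

# Hypothetical number of values per filter dimension (not real site data).
filter_values = {
    "type": 6,
    "location": 10,
    "venue": 40,
    "studio": 15,
    "instructor": 50,
    "genre": 8,
}

def count_filter_urls(counts):
    """Count every non-empty filter combination.

    Each dimension contributes (n + 1) choices: one of its n values,
    or "not filtered". Subtract 1 for the case where nothing is filtered.
    """
    return prod(n + 1 for n in counts.values()) - 1

def count_single_filter_urls(counts):
    """Count the URLs that use exactly one filter."""
    return sum(counts.values())

if __name__ == "__main__":
    print("single-filter URLs:", count_single_filter_urls(filter_values))
    print("all filter URLs:   ", count_filter_urls(filter_values))
```

With these made-up counts there are 129 single-filter pages but more than 23 million combinations overall, which is the gap a crawler falls into.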
Googlebot, I tried to break it to you gently. I whispered little suggestions in your ear. Suggestions like:
<meta name="robots" content="noindex, nofollow" />
… In pages with URLs that contain more than one filter. That way Google could index the listings for Brasils Nightclub but not every combination of Brasils and everything else. I want you to index me, Googlebot, I just don’t want you to index me to death. I do have other interests, you know.
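As a rough illustration of that rule, a page template could decide whether to emit the tag by checking the path for a semicolon, since filters are separated by semicolons. This is only a sketch in Python; the function name `robots_meta_for` is hypothetical and is not how the site actually generates its pages.

```python
# The tag quoted in the post, emitted only on multi-filter pages.
NOINDEX_TAG = '<meta name="robots" content="noindex, nofollow" />'

def robots_meta_for(path):
    """Return the robots meta tag for multi-filter pages, else an empty string.

    Assumes filters are joined with ';', so any semicolon in the path
    means two or more filters are in play.
    """
    return NOINDEX_TAG if ";" in path else ""

if __name__ == "__main__":
    for path in ("/filtered/venue,brasils",
                 "/filtered/type,intermediate_class;venue,brasils"):
        print(path, "->", robots_meta_for(path) or "(no tag)")
```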
But Google said “ohhh, you don’t want people to KNOW I followed that link! Okay. It’ll be our secret. I’m still going to follow it though. Because I LURV U.”
TOM: [Pained expression]
But Googlebot is so good to me, I couldn’t part ways with it lightly. I tried again:
<a rel="nofollow" href="http://salsadelphia.com/filtered/type,intermediate_class;venue,brasils">Tougher Classes</a>
But the more I played hard to get, the more ardent Googlebot’s love became. Googlebot honored my wishes in a sense— it didn’t kiss and tell— but it sure wouldn’t back off either. It became difficult to hold a conversation with anybody else.
My processor load was spiking. If you know what I mean. And I think you do.
Finally I turned to an old friend famous for her bluntness: Apache’s mod_rewrite module. And I begged mod_rewrite to do what I could not: cut Googlebot off at the knees.
And she obliged with panache:
# Google has been ignoring my subtle hints not to
# beat the crap out of the database by indexing
# multiple filter combos. So be blunt about it and
# explicitly redirect all attempts to access a
# double filter (or worse) to the home page if they
# come from Googlebot.
# Filters are separated by semicolons, so this ain't hard.
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule \; http://salsadelphia.com/ [L,R]
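For readers who don't speak mod_rewrite, the same decision can be sketched in Python. This is an approximation of what the two directives above do, not a replacement for them; the function name `googlebot_redirect` is made up for the example. The 302 status reflects the default for the R flag when no code is given.

```python
HOME_URL = "http://salsadelphia.com/"

def googlebot_redirect(path, user_agent):
    """Mimic the rewrite rule: return (status, location) or None.

    RewriteCond: the User-Agent must contain "Googlebot".
    RewriteRule: the requested path must contain a semicolon,
    i.e. it combines two or more filters.
    """
    if "Googlebot" in user_agent and ";" in path:
        return 302, HOME_URL
    return None  # serve the page normally

if __name__ == "__main__":
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(googlebot_redirect("/filtered/genre,rueda;venue,family_tavern", bot))
    print(googlebot_redirect("/filtered/venue,brasils", bot))
```

Single-filter pages and all non-Googlebot visitors pass through untouched, which is the point: only the combinatorial crawl gets bounced to the cached home page.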
Did Googlebot take the hint? Well… sort of. It’s still banging the gong for me. But mod_rewrite tirelessly deflects Googlebot’s passion in a more appropriate direction:
66.249.72.65 - - [13/Feb/2009:08:55:06 -0600] "GET /filtered/genre,rueda;instructor,victor_colon;studio,prince_of_salsa;type,basic_class;venue,family_tavern HTTP/1.1" 302 302 www.salsadelphia.com "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
That’s right Googlebot— I totally freakin’ heart you, but if you push me too far you’re gonna be looking at my home page like everyone else.
Because my home page is cached for mass consumption. It’s something I can afford to give. If you know what I mean. And I think you do.
Edit: nope, robots.txt would not help with this issue. I remembered this from the Old Days, but thought I’d check in and make sure it’s still the case. It certainly seems to be: “note also that globbing and regular expression are not supported in either the User-agent or Disallow lines… you cannot have lines like ‘User-agent: *bot*’, ‘Disallow: /tmp/*’ or ‘Disallow: *.gif’.”
February 13th, 2009 at 5:30 pm
Hi Tom!
I’m a webmaster trends analyst at Google and I thought I’d drop a few short comments regarding things that could be done to improve this situation.
First off, yes, we – and I believe most of the other big search engines – support wildcards in robots.txt files. You can find a blog post about this at http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html . That said, you don’t need wildcards to block the /filtered/ subdirectory: just use “disallow: /filtered/” in your robots.txt file and search engine crawlers will automatically know to ignore everything that starts with /filtered/. To block everything with /filters/ in the URL, you could use “disallow: /*/filters/”.
Give it a shot & remove that redirect. I’m pretty sure it’ll work how you want it to work.
John
February 16th, 2009 at 12:29 pm
Hi John,
Thanks for the reply!
Blocking all of /filtered/ is definitely not what I want.
I don’t want to block ALL Google access to filters. In fact, that would be Very Bad. I definitely want Google to index things like “everything that’s happening at Brasils Nightclub.”
What I *don’t* want Google to do is index combinations of two or more filters. End users have a pretty good idea which combinations are useful, but of course search engines try to explore the whole space… combinatorial explosion! Kaboom!
I separate filters with semicolons, so my original thought was to do this in robots.txt, following the globbing syntax that robots.txt resembles (but doesn’t quite follow):
Disallow: *;*
But everything I read up to that point led me to believe that this was not supported. So I wound up taking a different approach.
Now that I’ve read the article you linked to, I’m aware that I can in fact write just:
Disallow: *;
And I don’t need a trailing * because the expanded Robot Exclusion Protocol doesn’t really work like a shell “glob” pattern: you don’t have to match the whole thing. In that sense it works more like a Perl-style regular expression, although it is confusing because the use of * strongly suggests it will act like globbing. But this is a pre-existing issue in the REP since:
Disallow: /foo
Has been around for a while.
I also see that you’ve added $, which is a logical thing to offer if the REP is going to match substrings beginning from the root by default and not complete URLs. Nice.
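To make the "prefix match, not glob" behavior concrete, here is a small Python sketch of how such patterns can be interpreted: * matches any run of characters, a trailing $ anchors the end of the URL, and otherwise the pattern only has to match from the start of the path. It is my own approximation under those assumptions, not Google's actual implementation, and it ignores details such as percent-encoding and how Allow and Disallow rules are weighed against each other. The helper names are invented for the example.

```python
import re

def rep_pattern_to_regex(pattern):
    """Translate a Disallow pattern into a compiled regular expression.

    '*' becomes '.*'; a trailing '$' anchors the end of the path.
    Without '$' the pattern only needs to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_disallowed(path, pattern):
    """Return True if the path is matched by the Disallow pattern."""
    return rep_pattern_to_regex(pattern).match(path) is not None

if __name__ == "__main__":
    print(is_disallowed("/filtered/genre,bachata;venue,brasils", "*;"))
    print(is_disallowed("/filtered/venue,brasils", "*;"))
    print(is_disallowed("/foobar", "/foo"))
```

Under this reading, a pattern of *; catches every multi-filter URL while leaving single-filter pages crawlable, with no trailing * required.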
I will try Disallow: *; and see how that plays out. Though I’m not sure it will matter since there is already a rule on Google’s side. I see that MSN is grinding through my site too, although not in a CPU-pinning fashion, so I’ll watch and see if they respond well to the new rule.
I did not find the Google Webmaster Blog article because I was searching for robots.txt and not Robot Exclusion Protocol. Ironically, Google didn’t offer me that blog article as a result for the former. It did offer me several pages that fail to mention the new features and don’t use the term Robot Exclusion Protocol. They use “Robot Exclusion Standard” and other not-quite-the-same terminology. Perhaps Google could reach out to the webmasters of those pages. It would also be more than acceptable to give a bump to your own webmaster blog article on this, IMHO, as it’s clearly what people need to know.
The Wikipedia article doesn’t cover your improvements to the Robot Exclusion Protocol either, although it does at least reference the term:
http://en.wikipedia.org/wiki/Robots.txt
* * *
My hack certainly cut down on the Googlebot traffic to the site. But so far the Google results for the filters are still pretty messy. A search for:
salsadelphia brasils
Yields this as the first result:
http://salsadelphia.com/filtered/genre,bachata;venue,brasils
Which narrows things too much by also filtering for bachata, and gives a weird impression of what is available there.
When what I want to see here is just:
http://salsadelphia.com/filtered/venue,brasils
I am hoping that the multiple-filter results will age off your search indexes soon?
In the meantime, I’ve been told that a regexp was added on Google’s end, and sure enough Googlebot has calmed down. Today’s Googlebot traffic looks like this:
/filtered/instructor,joe_figueroa
/filtered/studio,take_the_lead
/filtered/dj,rumbero
… And so on. Lots of single filters, no multiple filters. This is exactly what I want.
I’ll be completely happy when the results for multiple filters stop showing up in results.
Thanks again for the response!
February 18th, 2009 at 6:16 pm
[...] had fun with last week’s post, but the topic was actually serious stuff regarding the way search engines behave when they arrive [...]