[ic] search engine indexing scan/ MM=0f73bb47ac44f4e422.....

Fri Jun 16 22:18:07 EDT 2006

Steve Graham <icdev at mrlock.com> wrote:
> I just noticed that Google is reindexing our site after the upgrade to IC 5.4
> 
> Among the normal results are some of these:
> 
> www.mrlock.com/eshop/locks/scan/ 
> MM=0f73bb47ac44f4e422fab7057f73d0c0:250:299:50.html?mv_more_ip=1&mv_n... 
> - 58k - 
> <http://64.233.187.104/search?q=cache:1wYAQPUy_g4J:www.mrlock.com/eshop/locks/scan/MM%3D0f73bb47ac44f4e422fab7057f73d0c0:250:299:50.html%3Fmv_more_ip%3D1%26mv_nextpage%3Dresults%26mv_arg%3D+cat+60+lock&hl=en&gl=us&ct=clnk&cd=3>Cached 
> - 
> <http://www.google.com//search?hl=en&lr=&q=related:www.mrlock.com/eshop/locks/scan/MM%3D0f73bb47ac44f4e422fab7057f73d0c0:250:299:50.html%3Fmv_more_ip%3D1%26mv_nextpage%3Dresults%26mv_arg%3D>Similar 
> pages
> 
> If I click on the link on the google site - it returns nothing, but 
> if I click on their cached page it does show the result the spider 
> obtained originally.
> 
> It seems that the spider did a search and returned a good chunk of 
> our inventory, which is ok (I guess) ....
> 
You only just noticed the problem after your upgrade to 5.4, but I
can assure you that it would have been exactly the same before you
upgraded. :-)

The problem is that the "more" key (the MM= value in the URL) is tied
to the session that ran the search.  Anyone who clicks the link in
Google's results gets a new session, not the spider's, so the key is
invalid and the page comes back empty.  Spiders cannot follow the
"more" links for the same reason: they don't maintain a session while
crawling your website.
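
To make the failure concrete, here is a toy model in Python.  It is
not Interchange's code: the MoreCache class and its method names are
made up for illustration, and treating 250:299 as an inclusive row
range is my assumption.  It only shows why a key saved under one
session finds nothing when presented under another.

```python
import secrets


class MoreCache:
    """Toy model (hypothetical): paging keys are stored per session."""

    def __init__(self):
        # session_id -> {more_key -> full result list}
        self._by_session = {}

    def save_search(self, session_id, results):
        """Store a result list under a fresh key owned by this session."""
        key = secrets.token_hex(16)
        self._by_session.setdefault(session_id, {})[key] = list(results)
        return key

    def fetch_page(self, session_id, key, first, last):
        """Return rows first..last (inclusive), or None if the key is
        unknown to this particular session."""
        results = self._by_session.get(session_id, {}).get(key)
        if results is None:
            return None
        return results[first:last + 1]


cache = MoreCache()
key = cache.save_search("spider-session", range(300))

# The session that ran the search can page through the results.
spider_page = cache.fetch_page("spider-session", key, 250, 299)

# A visitor arriving from a search engine has a different session,
# so the same key finds nothing.
visitor_page = cache.fetch_page("visitor-session", key, 250, 299)
```

In this sketch spider_page holds 50 rows and visitor_page is None,
which matches the empty page you get from the Google link.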

There are a few things you can do:

    1. Recognise spiders and don't paginate the list for them.  Set
       ml=999999, or whatever seems large enough.

    2. Use the Google Sitemap facility to allow the GoogleBot to find
       all of your products (that's Google-specific, of course).

    3. Try the new (in 5.4+, as luck would have it) "permanent more"
       facility.  You can activate this by adding pm=1 to your search
       spec or [query] args etc., or setting "mv_more_permanent" true
       in a form.
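
As a rough sketch of option #1, the decision can be as simple as a
User-Agent check.  The Python below is illustrative only and is not
how Interchange does it: the match_limit function, the pattern list
and the default of 50 rows per page are all my assumptions, and the
list of spider names is nowhere near complete.

```python
import re

# Hypothetical, incomplete list of substrings seen in spider User-Agents.
SPIDER_PATTERN = re.compile(
    r"googlebot|slurp|msnbot|spider|crawler", re.IGNORECASE
)


def match_limit(user_agent, normal_limit=50, spider_limit=999999):
    """Pick the ml (match limit) value for a request.

    Spiders get one huge page so they never need a "more" link;
    everyone else gets the normal paginated list.
    """
    if user_agent and SPIDER_PATTERN.search(user_agent):
        return spider_limit
    return normal_limit


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/1.5"

spider_ml = match_limit(googlebot)   # unpaginated list for the spider
visitor_ml = match_limit(browser)    # normal page size for a person
```

User-Agent strings are trivially faked, so treat this as a
convenience for well-behaved spiders rather than any kind of
access control.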

A warning about option #3 is probably in order:

When your data changes, your "permanent more" cache can go stale.
Cleaning it out is fine, as long as you don't need the old links to
keep working from a search engine's cache or a user's bookmark.
Cleaning the cache might also annoy any current visitors who were
happily following "more" links until you cleaned them out.

You can ask Google to not keep a "cached" copy of the page by adding
the following to your HTML <head>:

    <meta name="robots" content="index,follow,noarchive">

You can't force users to not bookmark pages unless you're standing
next to them with a cricket bat.

Option #1, perhaps combined with the "noarchive" flag in the "robots"
<meta> tag, is usually sufficient.  You can be one of the first to try
option #3 if you're feeling adventurous.

-- 
   _/   _/  _/_/_/_/  _/    _/  _/_/_/  _/    _/
  _/_/_/   _/_/      _/    _/    _/    _/_/  _/   K e v i n   W a l s h
 _/ _/    _/          _/ _/     _/    _/  _/_/    kevin at cursor.biz
_/   _/  _/_/_/_/      _/    _/_/_/  _/    _/