[ic] Inktomi/Yahoo Search Engine Results include Session ID's
prtyof5 at attglobal.net
Fri Feb 20 23:53:52 EST 2004
> Gary Norton [gnorton at broadgap.com] wrote:
> > I was wondering if anyone else was experiencing any problems with this as
> > well.
> > To illustrate, if you go to yahoo and search for "Toyota lift kits"
> > (http://search.yahoo.com/search?p=toyota+lift+kits&ei=UTF-8&fr=fp-tab-web-t&n=20&fl=0&x=wrt)
> > And look at the current #2 listing (suspensionconnection.com) you will
> > notice that the session id has been indexed.
> > If you go even further and click "More pages from this site"
> > (http://search.yahoo.com/search?p=toyota+lift+kits&ei=UTF-8&n=20&fl=0&fr=fp-tab-web-t&vst=0&vs=www.suspensionconnection.com)
> > It will display the "TOP 20 WEB RESULTS out of about 15,700". Altogether,
> > this site should have fewer than 3,000 pages. If you look at many of the
> > links you can find the same page listed several times with a different
> > session ID.
> My guess is that you have upgraded to Interchange 5, from 4.8 or lower,
> and these entries are artifacts from previous spider runs. If a spider
> is identified, Interchange 5 (and some 4.9s) will prevent session IDs
> from being encoded into the URI args, so you get nice clean index entries.
> Interchange versions 4.8 and earlier didn't have any spider-trap code
> at all.
> If a search engine already has a URI with a session ID in its index
> then it will attempt to check if the URI is still valid. To do this,
> it will simply request the page as part of its crawl. Interchange will
> happily serve the page, so the search engine will assume that the
> index entry is correct.
> It is relatively easy to clean out the "invalid" search engine index
> entries with a small change to the Interchange core. Once your website
> has been re-crawled (perhaps a month later) and the indexes are clean,
> the extra Interchange core code can be removed.
> At least, with Interchange 5, you will not see any new session IDs in
> the indexes. Google, of course, is more sensible and tends to simply
> not follow URIs with arguments at all.
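The spider behaviour described above can be sketched roughly as follows.
This is Python rather than Interchange's own Perl, and it is not
Interchange's actual code. The parameter names in SESSION_PARAMS, the
markers in ROBOT_UA_MARKERS, and the function names are all assumptions
made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical values for illustration; not Interchange's real configuration.
SESSION_PARAMS = {"id", "mv_session_id"}
ROBOT_UA_MARKERS = ("slurp", "googlebot")


def is_robot(user_agent: str) -> bool:
    """Return True if the User-Agent contains a known robot marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_UA_MARKERS)


def strip_session(url: str) -> str:
    """Drop session-ID arguments from the query string, keeping the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def link_for(url: str, user_agent: str) -> str:
    """Give robots a clean link; ordinary visitors keep their session ID."""
    return strip_session(url) if is_robot(user_agent) else url
```

The same strip_session() idea could also be used to group index entries
that differ only by session ID, which is the duplication Gary describes.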
One other possibility to consider is that Yahoo may not be crawling with
the Inktomi UA. One of my pages cached at Yahoo has a change that occurred
after Inktomi last visited. This is part of the reason I went looking down
the IP path for Google.
I've also read, but only seen once, that Googlebot has visited without
identifying itself in the UA. The IP address was Googlebot's, though. Also,
based on what others have stated in the "RobotIPs" email thread, would it be
safe and appropriate to add 64.68.82? to the RobotIP list? I don't know the
IPs of Yahoo, though.
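
To show what matching on IP rather than UA would look like, here is a
minimal Python sketch. It is not how Interchange implements RobotIP. It
assumes the prefix above is meant to cover the whole 64.68.82.x block,
written here as 64.68.82.0/24, which is my guess and not a confirmed
Googlebot range. ROBOT_NETWORKS and is_robot_ip() are made-up names.

```python
import ipaddress

# Assumption: treat the 64.68.82 prefix as the whole /24 block.
ROBOT_NETWORKS = [ipaddress.ip_network("64.68.82.0/24")]


def is_robot_ip(remote_addr: str) -> bool:
    """Return True if the client address falls inside a listed robot network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in ROBOT_NETWORKS)
```

A visit from 64.68.82.10 would then count as a robot even with a browser-like
UA, while 64.68.83.10 would not.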