[ic] RobotUA Problems

Kevin Walsh kevin at cursor.biz
Sat Mar 20 13:06:18 EST 2004


Jamie Neil [jamie at versado.net] wrote:
> Just been doing some site optimisation for spiders (disabling "more" in
> search results etc.) and I've stumbled across a problem with the default
> robot detection settings. 
> 
> RobotUA matches on substrings in the HTTP User Agent. This is fine for
> things like "Googlebot" or "Slurp", but I've noticed when trawling
> through the logs that some users have customised user agent strings
> after installing "branded" browsers or toolbars. A couple of examples:
> 
> 	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; AskBar 3.00; YPC
> 3.0.2; yplus 4.3.01b)
> 
> 	Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; sureseeker.com;
> searchengine2000.com) 
> 
UA spam - that's nasty.  Browsers should identify themselves correctly
and should make the identification immutable so that neither the plug-ins
nor the end users can modify the value.

>
> Both of these will match the default RobotUA list (Ask and seek) and so
> won't get a sessionid (which I assume means the basket won't work).
> 
> I'm not sure whether this is a widespread problem, but searching through
> the usertrack log with: 
> 
> 	tail -n 100000 usertrack | grep 'nsession.*ADD'
> 
> showed up 7 users in the last week without a sessionid who tried to add
> stuff to the basket. 
> 
> I've replaced "Ask" with "Ask?Jeeves?Teoma" (I assume spaces and / are
> not allowed so I've used wildcards), but I'm not sure what to do with
> the more generic matches like "seek" or "search".
>
If you, or anyone else, wants to research a list of "seek" and "search"
(and other) spiders, then the results will be considered for inclusion
in the distributed list.
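
If it helps, something along these lines (a rough, untested sketch,
assuming an Apache-style combined access log with the User-Agent as the
last quoted field) could pull candidate UA strings out of a log for
that research:

    #!/usr/bin/perl -w
    #
    # Rough sketch: print unique User-Agent strings containing "seek"
    # or "search" from an Apache combined-format access log.
    #
    use strict;

    my %seen;
    while (<>) {
        # The User-Agent is the last double-quoted field on the line.
        next unless /"([^"]*)"\s*$/;
        my $ua = $1;
        next unless $ua =~ /seek|search/i;
        print "$ua\n" unless $seen{$ua}++;
    }

Run it as "perl ua-research.pl access_log" (or whatever you call it)
and eyeball the output for anything that's obviously a browser rather
than a spider.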

My list currently looks like this:

RobotUA <<EOR
    ADSAComponent, ASPseek, ATN_Worldwide, Almaden, AltaVista, Appie,
    Arachnoidea, Aranha, Architext, Ask, Atomz, AvantGo, BackRub, Builder,
    Bumblebee, CMC, Contact, Cosmos, Digital*Integrity, Directory, Download,
    EasyDL, EZResult, Excite, FAST, Ferret, Fireball, GMX, Google, Gromit,
    Gulliver, Harvest, Hitwise, Hubater, Htdig, HTTPGet, H?m?h?kki, IlTrovatore,
    Infoseek, Ingrid, Inktomi, IncyWincy, Interarchy, Jack, JoBo, KIT*Fireball,
    Knowledge, Kototoi, Larbin, LeechGet, Libwww, LWP, Lycos, MegaSheep,
    Mercator, MOO, MyCrawler, Nazilla, NetAnts, NetMechanic, NetResearch,
    Netcraft, NetScoop, NG, NPBot, Nutch, Offline, Organica, ParaSite, Pavuk,
    PingALink, Pompos, Popdexter, Progressive, Pverify, QuepasaCreep, Reifier,
    Refiner, RepoMonkey, Rico, RMA, RoboDude, Robozilla, Rotondo, Rover,
    Rumours, Rutgers, Scooter, Scrubby, Sherlock, SiteSnagger, SiteWinder,
    SiteXpert, Slarp, Slurp, Spade, Spyder, Stamina, Steeler, SurferF3, Szukacz,
    TECOMAC, Teleport, T-H-U-N-D-E-R-S-T-O-N-E, Toutatis, TulipChain, Tv*Merc,
    Tygo, URLBlaze, URLGetFile, UtilMind, Vagabondo, Valkyrie,
    Voyager, WIRE, Walker, WebCompass, WebCopier, WebCraft, WebQL, WebRACE,
    Webspinne, WebStripper, WebTrends, WebVal, WebZIP, WFARC, Wget, WhizBang,
    Willow, Wire, Wombat, Xinu, Yahoo, Yandex, Zeus, Zippy, ZyBorg, agent,
    ah-ha, appie, archiver, asterias, bot, contact, collector, crawl, curl,
    eXtractor, fetch, fido, find, gazz, grabber, griffon, grub, ia_archiver,
    index, ip3000, legs, link, lwp, marvin, mirago, moget, monitor, puf, rabaz,
    reap, roach, scan, search, seek, speedy, spider, sitecheck, suke, tarantula,
    targetblaster, teomaagent, webbandit, webcollage, webhack, whowhere, winona,
    worm, xtreme, zao,
EOR

You'll notice that I have the "generic" keywords in there too, so I'll
be interested in any posted lists.  I'd prefer to see a list, rather
than a bunch of links to websites where the information can be found;
I already have that.
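
For what it's worth, here's a rough illustration (not the actual
Interchange code) of the sort of matching that list implies: each
keyword becomes a case-insensitive pattern against the User-Agent,
with "?" standing for any single character and "*" for any run of
characters, which is why "AskBar" gets caught by "Ask":

    # Illustration only -- build one case-insensitive regex from a
    # comma-separated keyword list, treating "?" as any single
    # character and "*" as any run of characters.
    use strict;

    sub robot_regex {
        my @keywords = grep { length } split /\s*,\s*/, shift;
        my @patterns;
        for my $kw (@keywords) {
            my $pat = quotemeta $kw;    # escape everything ...
            $pat =~ s/\\\?/./g;         # ... then turn "?" into "."
            $pat =~ s/\\\*/.*/g;        # ... and "*" into ".*"
            push @patterns, $pat;
        }
        my $alt = join '|', @patterns;
        return qr/$alt/i;
    }

    my $re = robot_regex('Ask, seek, search');
    my $ua = 'Mozilla/4.0 (compatible; MSIE 6.0; AskBar 3.00)';
    print "flagged as a robot\n" if $ua =~ $re;   # "AskBar" matches "Ask"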

On the other hand, perhaps we could have another directive, such as
the following:

    NonRobotUA <<EOR
        AskBar, YPC, yplus, sureseeker.com, searchengine2000.com
    EOR

The idea would be to double-check against NonRobotUA whenever RobotUA
finds a match.  That would work, assuming that the spiders don't use
identical keywords in their UA identification.
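
In other words, something along these lines (just a sketch of the
intended logic, not a patch):

    # Sketch: treat the UA as a robot only if it matches the RobotUA
    # pattern and does NOT match the NonRobotUA pattern.
    sub is_robot {
        my ($ua, $robot_re, $nonrobot_re) = @_;
        return 0 unless $ua =~ $robot_re;
        return 0 if defined($nonrobot_re) and $ua =~ $nonrobot_re;
        return 1;
    }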

-- 
   _/   _/  _/_/_/_/  _/    _/  _/_/_/  _/    _/
  _/_/_/   _/_/      _/    _/    _/    _/_/  _/   K e v i n   W a l s h
 _/ _/    _/          _/ _/     _/    _/  _/_/    kevin at cursor.biz
_/   _/  _/_/_/_/      _/    _/_/_/  _/    _/


