[ic] RobotUA Problems
Kevin Walsh
kevin at cursor.biz
Sat Mar 20 13:06:18 EST 2004
Jamie Neil [jamie at versado.net] wrote:
> Just been doing some site optimisation for spiders (disabling "more" in
> search results etc.) and I've stumbled across a problem with the default
> robot detection settings.
>
> RobotUA matches on substrings in the HTTP User Agent. This is fine for
> things like "Googlebot" or "Slurp", but I've noticed when trawling
> through the logs that some users have customised user agent strings
> after installing "branded" browsers or toolbars. A couple of examples:
>
> Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; AskBar 3.00; YPC
> 3.0.2; yplus 4.3.01b)
>
> Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; sureseeker.com;
> searchengine2000.com)
>
UA spam - that's nasty. Browsers should identify themselves correctly
and should make the identification immutable so that neither the plug-ins
nor the end users can modify the value.
>
> Both of these will match the default RobotUA list (Ask and seek) and so
> won't get a sessionid (which I assume means the basket won't work).
>
> I'm not sure whether this is a widespread problem, but searching through
> the usertrack log with:
>
> tail -n 100000 usertrack | grep 'nsession.*ADD'
>
> showed up 7 users in the last week without a sessionid who tried to add
> stuff to the basket.
>
> I've replaced "Ask" with "Ask?Jeeves?Teoma" (I assume spaces and / are
> not allowed so I've used wildcards), but I'm not sure what to do with
> the more generic matches like "seek" or "search".
>
If you, or anyone else, want to research a list of "seek" and "search"
(and other) spiders, then the results will be considered for inclusion
in the distributed list.

My list currently looks like this:

RobotUA <<EOR
ADSAComponent, ASPseek, ATN_Worldwide, Almaden, AltaVista, Appie,
Arachnoidea, Aranha, Architext, Ask, Atomz, AvantGo, BackRub, Builder,
Bumblebee, CMC, Contact, Cosmos, Digital*Integrity, Directory, Download,
EasyDL, EZResult, Excite, FAST, Ferret, Fireball, GMX, Google, Gromit,
Gulliver, Harvest, Hitwise, Hubater, Htdig, HTTPGet, H?m?h?kki, IlTrovatore,
Infoseek, Ingrid, Inktomi, IncyWincy, Interarchy, Jack, JoBo, KIT*Fireball,
Knowledge, Kototoi, Larbin, LeechGet, Libwww, LWP, Lycos, MegaSheep,
Mercator, MOO, MyCrawler, Nazilla, NetAnts, NetMechanic, NetResearch,
Netcraft, NetScoop, NG, NPBot, Nutch, Offline, Organica, ParaSite, Pavuk,
PingALink, Pompos, Popdexter, Progressive, Pverify, QuepasaCreep, Reifier,
Refiner, RepoMonkey, Rico, RMA, RoboDude, Robozilla, Rotondo, Rover,
Rumours, Rutgers, Scooter, Scrubby, Sherlock, SiteSnagger, SiteWinder,
SiteXpert, Slarp, Slurp, Spade, Spyder, Stamina, Steeler, SurferF3, Szukacz,
TECOMAC, Teleport, T-H-U-N-D-E-R-S-T-O-N-E, Toutatis, TulipChain, Tv*Merc,
Tygo, URLBlaze, URLGetFile, UtilMind, Vagabondo, Valkyrie,
Voyager, WIRE, Walker, WebCompass, WebCopier, WebCraft, WebQL, WebRACE,
Webspinne, WebStripper, WebTrends, WebVal, WebZIP, WFARC, Wget, WhizBang,
Willow, Wire, Wombat, Xinu, Yahoo, Yandex, Zeus, Zippy, ZyBorg, agent,
ah-ha, appie, archiver, asterias, bot, contact, collector, crawl, curl,
eXtractor, fetch, fido, find, gazz, grabber, griffon, grub, ia_archiver,
index, ip3000, legs, link, lwp, marvin, mirago, moget, monitor, puf, rabaz,
reap, roach, scan, search, seek, speedy, spider, sitecheck, suke, tarantula,
targetblaster, teomaagent, webbandit, webcollage, webhack, whowhere, winona,
worm, xtreme, zao,
EOR
You'll notice that I have the "generic" keywords in there too, so I'll
be interested in any posted lists. I'd prefer to see a list, rather
than a bunch of links to websites where the information can be found;
I already have that.
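
To make the false positives concrete, here is a rough Python sketch of how a list like the one above could be turned into a single pattern and tested against the two user agents Jamie quoted. This is not Interchange's code: the function name compile_ua_list, the wildcard semantics ("?" is any one character, "*" is any run of characters) and the case-sensitive matching are all assumptions made for illustration.

```python
import re

def compile_ua_list(entries):
    """Compile a comma-separated keyword list into one regex.

    Assumed semantics (not taken from Interchange itself):
    '?' matches any single character, '*' matches any run of
    characters, everything else is literal and case-sensitive.
    """
    parts = []
    for item in entries.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma, as in the list above
        escaped = re.escape(item)
        # re.escape turns the wildcards into \? and \*; map them back
        escaped = escaped.replace(r"\?", ".").replace(r"\*", ".*")
        parts.append(escaped)
    return re.compile("|".join(parts))

# The two user agents from Jamie's logs
ASKBAR_UA = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; "
             "AskBar 3.00; YPC 3.0.2; yplus 4.3.01b)")
SEEKER_UA = ("Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; "
             "sureseeker.com; searchengine2000.com)")

broad = compile_ua_list("Ask, seek, Googlebot,")
narrow = compile_ua_list("Ask?Jeeves?Teoma, Googlebot")

for ua in (ASKBAR_UA, SEEKER_UA):
    hit = broad.search(ua)
    print("broad :", hit.group(0) if hit else None)
    print("narrow:", bool(narrow.search(ua)))
```

With the broad list, "Ask" is found inside "AskBar" and "seek" inside "sureseeker.com", which is the problem described. Narrowing "Ask" to "Ask?Jeeves?Teoma" avoids the first case, but a bare "seek" or "search" keyword has no such obvious replacement.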
On the other hand, perhaps we could have another directive, such as
the following:
NonRobotUA <<EOR
AskBar, YPC, yplus, sureseeker.com, searchengine2000.com
EOR
The idea would be to double-check with NonRobotUA whenever RobotUA finds
a match, and to treat the client as a normal browser if NonRobotUA
matches as well. That would work, assuming that the spiders don't use
identical keywords in their UA identification.
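
A minimal sketch of that double-check, in Python rather than anything Interchange actually ships. NonRobotUA is only a proposal at this point, so the function is_robot, the two keyword lists (a short excerpt of RobotUA plus the NonRobotUA example above) and the plain case-sensitive substring matching are all assumptions for illustration.

```python
# Excerpt of the RobotUA keywords (assumed; the real list is much longer)
ROBOT_KEYWORDS = ["Ask", "seek", "search", "Googlebot", "Slurp"]

# The proposed NonRobotUA exemptions
NON_ROBOT_KEYWORDS = ["AskBar", "YPC", "yplus",
                      "sureseeker.com", "searchengine2000.com"]

def is_robot(user_agent, robots=ROBOT_KEYWORDS, non_robots=NON_ROBOT_KEYWORDS):
    """Return True only if a robot keyword matches and no exemption does."""
    if not any(keyword in user_agent for keyword in robots):
        return False  # no robot keyword at all: ordinary browser
    # A robot keyword matched, so double-check against the exemptions
    return not any(keyword in user_agent for keyword in non_robots)

samples = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; AskBar 3.00; "
    "YPC 3.0.2; yplus 4.3.01b)",
    "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows 98; "
    "sureseeker.com; searchengine2000.com)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
]
for ua in samples:
    print(is_robot(ua), ua)
```

The exemption list is only consulted after a robot keyword has matched, so browsers that never trip RobotUA cost nothing extra. The caveat is the one already mentioned: a spider that puts one of the exempted keywords in its own UA string would be handed a session like any browser.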
--
_/ _/ _/_/_/_/ _/ _/ _/_/_/ _/ _/
_/_/_/ _/_/ _/ _/ _/ _/_/ _/ K e v i n W a l s h
_/ _/ _/ _/ _/ _/ _/ _/_/ kevin at cursor.biz
_/ _/ _/_/_/_/ _/ _/_/_/ _/ _/