[interchange-bugs] [interchange-core] [rt.icdevgroup.org #344] 80legs webcrawler not recognized as Robot due to NotRobotUA
Stefan Hornburg via RT
interchange at rt.icdevgroup.org
Wed Mar 2 14:33:14 UTC 2011
<URL: http://rt.icdevgroup.org/Ticket/Display.html?id=344 >
On 03/02/2011 03:15 PM, David Christensen wrote:
>> The 80legs webcrawler identifies itself as:
>> Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620
>> Because of the NotRobotUA entry 'Gecko', this crawler is not identified as a robot.
>> Blocking via RobotIP will not work either, since the crawler runs on a distributed network of IPs. So it will crawl the site, creating a bunch of session IDs from many different IP addresses.
> Yeah, I've been reconsidering the NotRobotUA change. I like it in principle, but then you end up with cases like this. Short of a JustKiddingThisIsReallyARobotUA directive, I'm not sure how to do this generally—it starts to feel like an arms race. I think in the general case, we'd rather users always be able to have a session/checkout, so basically we'd run into cases like this as the exception to handle.
> Perhaps a suitable negative lookahead/behind pattern would help in this specific case. I'm also open to ideas/other thoughts.
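For illustration, here is a rough Python sketch of the negative-lookahead idea (the real NotRobotUA matching is in Interchange's Perl configuration handling; the pattern below is only a proposal, not an existing directive value). The idea is to let 'Gecko' exempt a UA only when the UA does not also advertise a known crawler URL:

```python
import re

# Hypothetical NotRobotUA pattern: treat 'Gecko' as a browser marker only
# when the UA does not also mention the 80legs crawler URL.
not_robot = re.compile(r'^(?!.*80legs\.com).*Gecko')

browser_ua = ("Mozilla/5.0 (X11; Linux x86_64; rv:2.0) "
              "Gecko/20110303 Firefox/4.0")
crawler_ua = ("Mozilla/5.0 (compatible; 008/0.83; "
              "http://www.80legs.com/webcrawler.html) Gecko/2008032620")

print(bool(not_robot.search(browser_ua)))  # True: real browser stays exempt
print(bool(not_robot.search(crawler_ua)))  # False: 80legs loses the exemption
```

With this shape of pattern the 80legs UA falls through to the RobotUA checks, while ordinary Gecko browsers keep their session.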
* Allow multiple RobotUA/NotRobotUA configuration directives in interchange.cfg.
* Compile the regexes after configuration is completed.
* Add a lookup hash to determine which match in a regex wins.
* Break the check out into a subroutine which can be overridden/supplemented.
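The combination of those ideas could look roughly like this (a Python sketch under assumed names; Interchange itself would do this in Perl, and none of these identifiers exist in the codebase):

```python
import re

# Patterns as they might be collected from multiple RobotUA / NotRobotUA
# directives in interchange.cfg (illustrative values only).
ROBOT_UA = [r'008/0\.83', r'Googlebot', r'80legs\.com']
NOT_ROBOT_UA = [r'Gecko', r'MSIE']

# Lookup hash deciding which kind of match wins when both lists hit.
PRIORITY = {'RobotUA': 2, 'NotRobotUA': 1}

# Compile once after configuration is complete, not per request.
_COMPILED = {
    'RobotUA': [re.compile(p, re.I) for p in ROBOT_UA],
    'NotRobotUA': [re.compile(p, re.I) for p in NOT_ROBOT_UA],
}

def is_robot(ua):
    """Return True if the UA should be treated as a robot.

    Broken out as a standalone function so it could be overridden or
    supplemented, e.g. per catalog.
    """
    hits = [kind for kind, pats in _COMPILED.items()
            if any(p.search(ua) for p in pats)]
    if not hits:
        return False
    return max(hits, key=PRIORITY.get) == 'RobotUA'

ua = ('Mozilla/5.0 (compatible; 008/0.83; '
      'http://www.80legs.com/webcrawler.html) Gecko/2008032620')
print(is_robot(ua))  # True: matches both lists, and RobotUA outranks NotRobotUA
```

Because RobotUA outranks NotRobotUA in the priority map, the 80legs UA is classified as a robot even though it also matches 'Gecko', while a plain Firefox UA (matching only NotRobotUA) keeps its session.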
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team