[interchange-bugs] [interchange-core] [rt.icdevgroup.org #344] 80legs webcrawler not recognized as Robot due to NotRobotUA
david at endpoint.com
Wed Mar 2 15:12:22 UTC 2011
On Mar 2, 2011, at 8:33 AM, Stefan Hornburg via RT wrote:
> <URL: http://rt.icdevgroup.org/Ticket/Display.html?id=344 >
> On 03/02/2011 03:15 PM, David Christensen wrote:
>>> The 80legs webcrawler identifies itself as:
>>> Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620
>>> Because of the NotRobotUA entry 'Gecko', this crawler is not identified as such.
>>> Blocking via RobotIP will not work either, since it crawls from a distributed network of IPs; it will create a bunch of session IDs across many different IP addresses.
>> Yeah, I've been reconsidering the NotRobotUA change. I like it in principle, but then you end up with cases like this. Short of a JustKiddingThisIsReallyARobotUA directive, I'm not sure how to do this generally; it starts to feel like an arms race. In the general case we'd rather users always be able to get a session and check out, so cases like this become the exceptions we have to handle.
>> Perhaps a suitable negative lookahead/behind pattern would help in this specific case. I'm also open to ideas/other thoughts.
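To make the lookahead idea concrete, here's the sort of thing I mean, as a quick standalone sketch; the pattern and UA strings are illustrative only, not a tested robots.cfg change:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: treat 'Gecko' as a NotRobotUA hit only when the UA
# doesn't also advertise the 80legs crawler URL.
my $ua_80legs  = 'Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620';
my $ua_firefox = 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20110301 Firefox/4.0';

# NotRobotUA-style pattern with a negative lookahead over the whole UA:
# match Gecko, but only if '80legs.com' appears nowhere in the string.
my $not_robot_re = qr{^(?!.*80legs\.com).*Gecko}s;

printf "80legs  looks like a browser? %s\n",
    ($ua_80legs  =~ $not_robot_re) ? 'yes' : 'no';   # no
printf "Firefox looks like a browser? %s\n",
    ($ua_firefox =~ $not_robot_re) ? 'yes' : 'no';   # yes
```

That handles this one crawler, but every new offender would need another hand-tuned pattern, which is exactly the arms-race problem.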
> What about:
> * Allowing multiple RobotUA/NotRobotUA configuration directives in interchange.cfg.
> * Compiling the regexes after configuration is completed.
> * Adding a lookup hash to determine which regex match wins.
> * Breaking it out into a subroutine which can be overridden/supplemented.
Okay, building on this:
* keep the existing robots.cfg RobotUA/NotRobotUA directives, maintained as usual. This preserves backwards compatibility with the existing install base, so people running older IC versions can still pull in the latest robots.cfg.
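For reference, the current robots.cfg usage would stay as it is today, along these lines (the directive names match the stock file; the UA substrings here are just illustrative, and the exact list syntax may differ from the stock file):

```
# robots.cfg fragment -- illustrative entries only
RobotUA     wget, libwww, crawl, spider
NotRobotUA  Gecko, MSIE
```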
* the single code path which determines whether to hand out the session or not (in Vend::Dispatch::dispatch, IIRC) will be refactored to call a specific sub/subref to determine Robot/NotRobot status.
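Roughly this shape, as a self-contained sketch; all of the names here (default_is_robot, $cfg->{robot_check}) are hypothetical placeholders, not settled API:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the refactored decision point: dispatch() consults a single
# overridable subref instead of inlining the regex logic.
sub default_is_robot {
    my ($ctx) = @_;
    return 1 if $ctx->{useragent} =~ /crawl|spider|robot/i;
    return 0;
}

# A catalog/site could install its own coderef here to override or
# supplement the default behavior.
my $cfg = { robot_check => \&default_is_robot };

my $check    = $cfg->{robot_check} || \&default_is_robot;
my $is_robot = $check->({ useragent => 'ExampleCrawler/1.0 (spider)' });

print $is_robot ? "robot: skip session\n" : "browser: hand out session\n";   # robot: skip session
```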
* the new default implementation will be more or less the existing logic, factored out; however, we'll add the hooks/overrides and perhaps examples to the latest dev version. Users can then augment the subroutine definition or replace it outright.
* move the existing robot-related logic into a new class Vend::Robot(Detection|UA|) with a simple C<is_robot()> function. Factor any relevant params out into a context hash to pass in, so it's suitable for unit testing, use in external scripts, etc. (These values could default from $Vend::blah globals if absolutely necessary, but since there's currently a single caller, it seems like we should fix the call site rather than clutter up the API further.)
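Pulling the pieces together, here's a strawman of what such a module could look like; the module name, compile_config/is_robot signatures, the ForceRobotUA override directive, and the precedence rule (override beats NotRobot beats Robot) are all up for discussion, not settled behavior:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Strawman Vend::Robot-style module: patterns are compiled once after
# configuration is complete, and an explicit precedence decides which
# kind of match wins. All names here are hypothetical.
{
    package Vend::Robot;

    my (@robot_res, @not_robot_res, @force_robot_res);

    sub compile_config {
        my (%cfg) = @_;
        @robot_res       = map { qr/\Q$_\E/i } @{ $cfg{RobotUA}      || [] };
        @not_robot_res   = map { qr/\Q$_\E/i } @{ $cfg{NotRobotUA}   || [] };
        @force_robot_res = map { qr/\Q$_\E/i } @{ $cfg{ForceRobotUA} || [] };
    }

    sub is_robot {
        my ($ctx) = @_;
        my $ua = $ctx->{useragent} // '';
        # Targeted overrides win, then NotRobot, then Robot.
        return 1 if grep { $ua =~ $_ } @force_robot_res;
        return 0 if grep { $ua =~ $_ } @not_robot_res;
        return 1 if grep { $ua =~ $_ } @robot_res;
        return 0;
    }
}

Vend::Robot::compile_config(
    RobotUA      => [ 'crawl', 'spider' ],
    NotRobotUA   => [ 'Gecko' ],
    ForceRobotUA => [ '80legs.com' ],
);

my $ua = 'Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620';
print Vend::Robot::is_robot({ useragent => $ua }) ? "robot\n" : "not a robot\n";   # robot
```

With the 80legs case, the override fires before the NotRobotUA 'Gecko' entry can exempt it, which is exactly the behavior missing today.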
* since this affects users' ability to place orders/get sessions, backport these Robot-related changes to 5.6 and 5.4 as well; that way users of those versions can take advantage of any additional changes.
* pie-in-the-sky: some web service API to let IC detect on restart (within some short minimal timeout) whether an updated version of robots.cfg is available, and optionally download it. Perhaps even just a contributed script that could be run from cron as a weekly job or similar.
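Strawman for the cron variant; the URL, paths, and schedule are all made up, and a real version would need to validate the downloaded file and trigger a reconfig rather than just overwrite in place:

```
# Hypothetical weekly crontab entry -- illustrative only
0 4 * * 0  wget -q -O /usr/lib/interchange/robots.cfg.new http://example.org/robots.cfg && mv /usr/lib/interchange/robots.cfg.new /usr/lib/interchange/robots.cfg
```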
End Point Corporation
david at endpoint.com