[ic] Rolling big tables (mysql)

Grant emailgrant at gmail.com
Thu May 24 15:36:10 EDT 2007


> > > > > > I do keep a separate table of robot UAs and match traffic rows to them
> > > > > > with op=eq to populate another table with robot IPs and non-robot IPs
> > > > > > for the day to speed up the report.  Don't you think it would be
> > > > > > slower to match/no-match each IC request to a known robot UA and write
> > > > > > to the traffic table based on that, instead of unconditionally writing
> > > > > > all requests to the traffic table?  If not, excluding the robot
> > > > > > requests from the traffic table would mean a lot less processing for
> > > > > > the report and a lot fewer records for the traffic table.
> > > > > >
> > > > > Perhaps you should create a column called "spider" in the traffic table
> > > > > and save a true or false value depending upon the [data session spider]
> > > > > value.  You can then generate reports "WHERE spider = 0", for ordinary
> > > > > users, or "WHERE spider = 1" for robots etc.  An index on the spider column
> > > > > would be nice, of course.
> > > > >
> > > > I let this roll around in my head for quite a while and I ended up
> > > > writing the IC page accesses to my traffic table based on [data
> > > > session spider] like you suggested.  This should mean a much smaller
> > > > traffic table and less processing when running a report on it.  We'll
> > > > see how much time it buys me before running the report takes too long
> > > > again.  I also need to set up indexes.
> > > >
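The "spider" column idea quoted above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table layout, the index name and the helper functions (log_request, count_requests) are all assumptions, not the actual traffic table schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL traffic table.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per page request, with a 0/1 spider flag.
conn.execute(
    "CREATE TABLE traffic ("
    " id INTEGER PRIMARY KEY,"
    " ip TEXT NOT NULL,"
    " page TEXT NOT NULL,"
    " spider INTEGER NOT NULL DEFAULT 0)"
)

# Index on the flag so reports can filter on it without a full scan.
conn.execute("CREATE INDEX idx_traffic_spider ON traffic (spider)")


def log_request(ip, page, is_spider):
    """Record one request, storing the spider flag as 0 or 1."""
    conn.execute(
        "INSERT INTO traffic (ip, page, spider) VALUES (?, ?, ?)",
        (ip, page, int(is_spider)),
    )


def count_requests(spider):
    """Count requests for robots (spider=True) or ordinary users (spider=False)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM traffic WHERE spider = ?", (int(spider),)
    ).fetchone()
    return row[0]


# Sample data using documentation-only IP addresses.
log_request("192.0.2.1", "index", False)
log_request("192.0.2.2", "index", True)
log_request("192.0.2.1", "cart", False)
```

The variant described in the quoted reply, where robot requests are not written at all, would amount to skipping the insert when the flag is true. That keeps the table smaller but gives up the ability to report on robots later.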
> > > Also, you may as well grab the latest robots.cfg file from CVS and
> > > "include" it into your interchange.cfg file.
> > >
> > I just had a look at robots.cfg and I think I see a few opportunities
> > for false positives.  I would think a pattern as generic as "agent"
> > could match ordinary browsers, and there are browser toolbars for
> > GetRight and Yahoo which probably alter the UA.  Is there a crucial
> > set of NotRobotUA entries to go along with robots.cfg?
> >
> > Is anyone using robots.cfg and actively watching for false positives?
> >
> I'll look into those.  Do you know the UA names for the various toolbars?
> I can probably look that up somewhere.  I wouldn't have thought that
> "agent" would be a risk.

I have a good system set up for rooting out false positives.  I'll put
the current robots.cfg file into effect and keep a close eye on
things.  I'll have some specific info for you soon.

- Grant

