[drupal-devel] [feature] Filtering of statistics ("real" visitors only)

Bèr Kessels drupal-devel at drupal.org
Mon Aug 22 08:27:46 UTC 2005


Issue status update for 
http://drupal.org/node/29328
Post a follow up: 
http://drupal.org/project/comments/add/29328

 Project:      Drupal
 Version:      cvs
 Component:    statistics.module
 Category:     feature requests
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  mikeryan
 Updated by:   Bèr Kessels
 Status:       patch (code needs review)

I know it is optional. But still: filtering your logs on *save* is
unacceptable. Logs should contain *everything*. If you do not want to
show certain entries, you should filter them on *output*.
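
To illustrate filtering on output, here is a minimal sketch. It is not
code from the patch or from statistics.module; it is plain Python
rather than Drupal PHP, and the names AccessLogEntry and report_views
are hypothetical. It assumes each stored entry already records the
visitor's role and whether it was a crawler.

```python
from dataclasses import dataclass


@dataclass
class AccessLogEntry:
    # Hypothetical shape of one stored log row (not the real accesslog schema).
    path: str
    role: str
    is_crawler: bool


def report_views(log, ignored_roles=frozenset(), ignore_crawlers=False):
    """Count views per path, filtering at report time only.

    The log itself is never modified, so every access stays stored.
    """
    counts = {}
    for entry in log:
        if entry.role in ignored_roles:
            continue
        if ignore_crawlers and entry.is_crawler:
            continue
        counts[entry.path] = counts.get(entry.path, 0) + 1
    return counts


log = [
    AccessLogEntry("node/1", "anonymous", False),
    AccessLogEntry("node/1", "anonymous", True),
    AccessLogEntry("node/1", "admin", False),
    AccessLogEntry("node/2", "anonymous", False),
]

unfiltered = report_views(log)
humans_only = report_views(log, ignored_roles={"admin"}, ignore_crawlers=True)
```

Both reports are computed from the same complete log; only the view of
the data changes.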




Bèr Kessels



Previous comments:
------------------------------------------------------------------------

Sun, 21 Aug 2005 18:08:32 +0000 : mikeryan

Attachment: http://drupal.org/files/issues/statistics.module_2.patch (6.73 KB)

Justification:


For all but the most heavily-trafficked sites, the statistics reported
by Drupal are severely skewed by visits from crawlers, and from the
administrators themselves. Assuming that the purpose of the statistics
is to inform administrators about visits from human beings other than
themselves, it is highly desirable to do our best to ignore other
visits. To that end, I developed the statistics_filter module [1] (and
its spinoff, the browscap module [2]).


Why core?


There's enough concern over the logging the statistics module does in
the exit hook for the performance issues to be detailed in the help. To
work as a contributed module, the statistics_filter module needs to undo
what the statistics module did, essentially doubling the overhead for
accesses that are meant to be ignored. If incorporated into the
statistics module directly, the filtering functionality will actually
reduce the database overhead (no database queries at all for ignored
roles).
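
The point about avoiding database work can be sketched as follows.
This is an illustration only, in Python rather than Drupal PHP, and it
is not the attached patch: FakeAccessLog and log_access are
hypothetical names, and the "database" is a stand-in that merely counts
writes.

```python
class FakeAccessLog:
    """Stand-in for the accesslog table; counts writes instead of running SQL."""

    def __init__(self):
        self.rows = []
        self.writes = 0

    def insert(self, row):
        self.writes += 1
        self.rows.append(row)


def log_access(db, path, user_roles, ignored_roles):
    """Exit-hook sketch: return before any write if the visitor has an ignored role."""
    if set(ignored_roles) & set(user_roles):
        return False  # no database work at all for ignored visitors
    db.insert({"path": path, "roles": sorted(user_roles)})
    return True


db = FakeAccessLog()
admin_logged = log_access(db, "node/1", {"admin"}, {"admin"})
writes_after_admin = db.writes
visitor_logged = log_access(db, "node/1", {"anonymous"}, {"admin"})
```

A contributed module that runs after the write would instead need a
second query to remove the row, which is the doubled overhead described
above.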


Open issue


Ignoring crawlers (which are the biggest part of the issue for most
sites - my own site, with modest volume, gets 40% of its raw traffic
from the Google crawler) requires the browscap database to identify
crawlers. Currently I have maintenance of the browscap data (as well as
provision for browser/crawler statistics) encapsulated in a separate
module. Should this support be submitted to core as a separate module,
or integrated into the statistics module?


Attached is a patch to statistics.module implementing filtering by
roles, with filtering out crawlers dependent on an external browscap
module. I hope this patch can be accepted into Drupal 4.7 - if the
feeling is that the browscap code should be incorporated into
statistics.module, I can do that.


Thanks.
[1] http://drupal.org/node/18013
[2] http://drupal.org/node/26569




------------------------------------------------------------------------

Sun, 21 Aug 2005 19:43:11 +0000 : Bèr Kessels

a big -1.


We should STORE all (read: absolutely all) logs, yet FILTER them in the
reports.


What makes you think crawlers are not users? Or that I am not
interested in crawlers? 


I think you might be more interested in adding value to xstatistics,
which wants to be a more advanced stats module. 


And last, but not least, adding checks for contrib modules in core
(if_module_exists) is a no-go. In that case, you could try to introduce
a hook, but hardcoded checks for modules will simply not do.


Let us hear more comments and then decide the status of this patch.




------------------------------------------------------------------------

Sun, 21 Aug 2005 20:45:23 +0000 : mikeryan

Of course, the filtering is optional - if an administrator wants to
count crawlers as if they were users, then they just don't turn on
filtering of crawlers. Personally, I don't find it useful to see that a
node has 100 views when I know some unknown (but substantial) portion of
those views were from Google/Inktomi/etc - I want to know what the human
beings are reading, not what the crawlers' algorithms picked to index
today.


Yes, I know better than to reference contrib code from core - as I
said, the question is whether (if this is to go into core) it would be
better to keep browscap as a separate (core) module or incorporate it
into statistics.module. Probably the latter, but I figured I'd raise
the issue before putting the integration work in...


The advantage of filtering at the point of logging is performance -
reduced overhead in the exit hook, plus a substantially smaller
accesslog table. The disadvantage is, of course, losing the log entries
for crawlers and ignored roles, but if you're not interested in them
anyway it's a win.


So, the question is whether others are interested in filtering accesses
from the log entirely, or it's just me....




------------------------------------------------------------------------

Sun, 21 Aug 2005 23:55:21 +0000 : dopry

+1 for this patch


If I remember correctly, the popular content block and similar features
rely on the statistics module and the data it logs, so some sites may
not want this data skewed by administrators and search engines. If you
still want full logging capabilities, you can use the Apache access
logs. For larger sites there may be a slight performance advantage,
saving some db access time with smaller log tables, even though admin
and bot access would be a negligible percentage. As an option, I think
it's a nice one.


I think Bèr's objection to core requiring a contrib module check is an
important one, though, and if other people think this is something that
should go into core, then it should be addressed.






