Issue status update for http://drupal.org/node/29328
Post a follow up: http://drupal.org/project/comments/add/29328

 Project:      Drupal
 Version:      cvs
 Component:    statistics.module
 Category:     feature requests
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  mikeryan
 Updated by:   Boris Mann
 Status:       patch (code needs review)

+1 for this. Full logs should be the job of the HTTP layer (i.e. Apache logs), not Drupal. Even Apache logs get rolled over -- there is no way to do this in Drupal other than to discard the logs outright.

If people are so against it, then go for the admin option that Mike includes.

The goal of a statistics module is to give real results to people. These filtered results give a better picture of what is actually going on.

Much like archive, I'd almost like to see stats module removed from core rather than be in the decrepit state it is today (often because of the difficulty of getting core commits -- so as Ber suggests, maybe xstatistics; and Ber, you maintain a lot of modules, maybe give that one to Mike?). As it is, everyone just needs to install a separate stats package to get real results for their Drupal site.

Boris Mann

Previous comments:
------------------------------------------------------------------------

Sun, 21 Aug 2005 18:08:32 +0000 : mikeryan

Attachment: http://drupal.org/files/issues/statistics.module_2.patch (6.73 KB)

Justification:
For all but the most heavily-trafficked sites, the statistics reported by Drupal are severely skewed by visits from crawlers, and from the administrators themselves. Assuming that the purpose of the statistics is to inform administrators about visits from human beings other than themselves, it is highly desirable to do our best to ignore other visits. To that end, I developed the statistics_filter module [1] (and its spinoff, the browscap module [2]).

Why core?
There's enough concern over the logging the statistics module does in the exit hook for the performance issues to be detailed in the help.
To work as a contributed module, the statistics_filter module needs to undo what the statistics module did, essentially doubling the overhead for accesses that are meant to be ignored. If incorporated into the statistics module directly, the filtering functionality will actually reduce the database overhead (no database queries at all for ignored roles).

Open issue
Ignoring crawlers (which are the biggest part of the issue for most sites - my own site, with modest volume, gets 40% of its raw traffic from the Google crawler) requires the browscap database to identify crawlers. Currently I have maintenance of the browscap data (as well as provision for browser/crawler statistics) encapsulated in a separate module. Should this support be submitted to core as a separate module, or integrated into the statistics module?

Attached is a patch to statistics.module implementing filtering by roles, with filtering out crawlers dependent on an external browscap module. I hope this patch can be accepted into Drupal 4.7 - if the feeling is that the browscap code should be incorporated into statistics.module, I can do that.

Thanks.

[1] http://drupal.org/node/18013
[2] http://drupal.org/node/26569

------------------------------------------------------------------------

Sun, 21 Aug 2005 19:43:11 +0000 : Bèr Kessels

A big -1. We should STORE all (read: absolutely all) logs, yet FILTER them in the reports. What makes you think crawlers are not users? Or that I am not interested in crawlers?

I think you might be more interested in adding value to xstatistics, which wants to be a more advanced stats module.

And last, but not least, adding checks for contrib modules in core (if_module_exists) is a no-go. In that case, you could try to introduce a hook, but hardcoded checks for modules will simply not do.

Let us hear more comments and then decide the status of this patch.
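
For illustration only, the hook idea raised in the comment above might look roughly like the sketch below. It is plain PHP, not Drupal code and not part of the attached patch: `sketch_invoke_all()`, the `statistics_ignore` hook name, and the `browscap_sketch` module are all hypothetical, and the user-agent pattern is a crude stand-in for a real browscap lookup.

```php
<?php
// Hypothetical stand-in for a hook dispatcher: call "<module>_<hook>" for
// every listed module that implements it and collect the return values.
function sketch_invoke_all($modules, $hook, $context) {
  $results = array();
  foreach ($modules as $module) {
    $function = $module . '_' . $hook;
    if (function_exists($function)) {
      $results[] = $function($context);
    }
  }
  return $results;
}

// Hypothetical hook implementation that a browscap-like module could supply.
// The pattern is an assumption for the example, not real browscap data.
function browscap_sketch_statistics_ignore($context) {
  return (bool) preg_match('/bot|crawl|slurp|spider/i', $context['user_agent']);
}

// The statistics code asks "does any module want this hit ignored?" without
// naming a specific module.
function sketch_should_log($modules, $context) {
  $votes = sketch_invoke_all($modules, 'statistics_ignore', $context);
  return !in_array(TRUE, $votes, TRUE);
}

// Usage: with no implementing module enabled, every hit is logged.
$enabled = array('browscap_sketch');
var_dump(sketch_should_log($enabled, array('user_agent' => 'Googlebot/2.1')));  // bool(false)
var_dump(sketch_should_log(array(), array('user_agent' => 'Googlebot/2.1')));   // bool(true)
```

The point of the design is that the statistics side only knows the hook name, so a crawler detector can live in core or contrib without a hardcoded module check.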

------------------------------------------------------------------------

Sun, 21 Aug 2005 20:45:23 +0000 : mikeryan

Of course, the filtering is optional - if an administrator wants to count crawlers as if they were users, then they just don't turn on filtering of crawlers. Personally, I don't find it useful to see that a node has 100 views when I know some unknown (but substantial) portion of those views were from Google/Inktomi/etc - I want to know what the human beings are reading, not what the crawlers' algorithms picked to index today.

Yes, I know better than to reference contrib code from core - as I said, the question is whether (if this is to go into core) it would be better to keep browscap as a separate (core) module or incorporate it into statistics.module. Probably the latter, but I figured I'd raise the issue before putting the integration work in...

The advantage of filtering at the point of logging is performance - reduced overhead in the exit hook, plus a substantially smaller accesslog table. The disadvantage is, of course, losing the log entries for crawlers and ignored roles, but if you're not interested in them anyway it's a win.

So, the question is whether others are interested in filtering accesses from the log entirely, or it's just me....

------------------------------------------------------------------------

Sun, 21 Aug 2005 23:55:21 +0000 : dopry

+1 for this patch

If I remember correctly, the popular content block etc. are linked to the statistics module and the data it logs, so some sites may not want this data skewed by administrators and search engines. If you still want full logging capabilities you can use the Apache access logs. For larger sites it may be a slight performance advantage and save some db access time with smaller log tables, even though admin access and bot access would be a negligible percentage. As an option I think it's a nice one.
I think Bèr's objection to core requiring a contrib module check is an important one, though, and if other people think this is something that should go into core, then it should be addressed.

------------------------------------------------------------------------

Mon, 22 Aug 2005 08:27:44 +0000 : Bèr Kessels

I know it is optional. But still: filtering your logs on *save* is unacceptable. Logs should contain *everything*. If you want to not show certain entries, you should filter them on *output*.

------------------------------------------------------------------------

Mon, 22 Aug 2005 08:33:36 +0000 : robertDouglass

I agree with Bèr.

------------------------------------------------------------------------

Mon, 22 Aug 2005 09:17:45 +0000 : varunvnair

+1 for what mikeryan is suggesting.

I use Drupal to power my blog (http://www.thoughtfulchaos.com). I have a shared webhosting package, and 100s of other websites are also hosted on the machine that hosts my blog (and the machine that hosts my database also has 100s or 1000s of other databases).

Sometimes my site seems to be quite slow. This is probably because the machine is receiving too much traffic. I cannot move to a better package because I cannot afford it. I often look to squeeze every ounce of performance I can from my installation, and one way of doing this is by reducing the number of SQL queries.

What to log and what not to log should be at the discretion of the site admin. After all, s/he is the one who is going to decide what to do with the logs. There is no one golden rule that applies to all installations. There is no sense in logging everything if the site admin has to go to extra lengths to ignore what s/he doesn't need.

Anyway, all accesses are logged by the provider, and most people can access the Apache logs and use them for more detailed analysis (I can). For a CMS like Drupal, capturing everything in a log is probably unnecessary.

------------------------------------------------------------------------

Mon, 22 Aug 2005 09:20:29 +0000 : Kobus

I can't see any reason besides "taking up space" for not logging everything. I say -1 for not logging everything, +1 for filtering logs on output, with full logs available on demand.

Regards,

Kobus

------------------------------------------------------------------------

Tue, 23 Aug 2005 01:13:19 +0000 : mikeryan

Hmm, didn't expect the proposal to be so controversial...

I'd like to point out a couple of things you can do when filtering at log time that you can't at output time:

* Leave "ignored" hits out of the node counter table.
* Ignore crawlers, unless a user agent column is added to accesslog.

Either one makes filtering at output time unacceptable for my purposes.

Since opinion is divided, how about...

<?php
// The form field name matches the variable_get() key so that the saved
// setting is the one read back as the default.
$group = form_radios(
  t('When to apply filters'),
  'statistics_filter_apply_when_logging',
  variable_get('statistics_filter_apply_when_logging', 0),
  array('1' => t('At logging time'), '0' => t('At display time')),
  t('If applied at display time, filtered accesses are logged to the database but ignored by default in reports. '.
    'If applied at logging time, they are not written to the database.'));
?>

P.S. Is Preview broken on drupal.org? I hit Preview and just get the edit form back...
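
As a rough illustration of the two modes offered by the radio setting above, the sketch below contrasts filtering at logging time with filtering at display time. It is plain PHP with an array standing in for the accesslog table; every function name and the hit structure are hypothetical and are not taken from the patch or from statistics.module.

```php
<?php
// A hit is ignored when the visitor holds any of the ignored roles.
function sketch_is_ignored($hit, $ignored_roles) {
  return count(array_intersect($hit['roles'], $ignored_roles)) > 0;
}

// Logging step: in "at logging time" mode, ignored hits are never stored.
function sketch_log_hit(&$accesslog, $hit, $ignored_roles, $apply_when_logging) {
  if ($apply_when_logging && sketch_is_ignored($hit, $ignored_roles)) {
    return;
  }
  $accesslog[] = $hit;
}

// Report step: in "at display time" mode, ignored hits are stored but
// skipped when the report is built.
function sketch_report($accesslog, $ignored_roles, $apply_when_logging) {
  if ($apply_when_logging) {
    return $accesslog;
  }
  $visible = array();
  foreach ($accesslog as $hit) {
    if (!sketch_is_ignored($hit, $ignored_roles)) {
      $visible[] = $hit;
    }
  }
  return $visible;
}

// Usage: one administrator hit and one ordinary hit, in display-time mode.
$ignored = array('administrator');
$log = array();
sketch_log_hit($log, array('path' => 'node/1', 'roles' => array('administrator')), $ignored, 0);
sketch_log_hit($log, array('path' => 'node/1', 'roles' => array('authenticated user')), $ignored, 0);
echo count($log), ' stored, ', count(sketch_report($log, $ignored, 0)), " reported\n";  // 2 stored, 1 reported
```

Both modes give the same report here; the difference is what gets stored. Anything updated at logging time, such as a running per-node counter, would still include the ignored hits in display-time mode, which is the first limitation listed above.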