Arnab Nandi wrote:
On 8/11/06, Gerhard Killesreiter <gerhard@killesreiter.de> wrote:
Furthermore, as has been pointed out by me, drupal.org is a large, untapped datamine. If we make sure the data is properly anonymized, we can certainly use that data.
You should extrapolate from your own customs to everybody else's.
Logs are of this form:
ip_address | log information
ip_address | log information
ip_address | log information
ip_address | log information
ip_address | log information
Now if you discard the first column and use only the second column, imho, I don't see how you can invade someone's privacy here. (Please provide examples against this if you can.) The data is not as useful as before, but is still quite worthy of analysis.
The data becomes completely useless. If I wanted this kind of data, I would look at Apache logs. What makes the accesslog table interesting is that it can track the complete path a user takes through your site.

  sid varchar(32) NOT NULL default '',
  path varchar(255) default NULL,
  url varchar(255) default NULL,
  hostname varchar(128) default NULL,
  uid int(10) unsigned default '0',

These are the interesting columns of the table: sid is the session id, which is unique even for anonymous users; path stores the Drupal system path; url stores the referrer; hostname stores the IP.

Now you can take the path of one log entry, look up where it has been used as referrer, and readily find out where the user clicked. The only "problem" is that the referrer is stored as a fully qualified URL.

So, in order to anonymize the data, I should probably hash the IPs with md5 using a secret salt and do the same with the sids. All that matters is that they are unique, after all. Paths like user/nnn/edit should probably also be anonymized, since usually only the author can edit his account. Of course the uid column should be omitted or similarly anonymized.

Actually, IPs are only interesting for bots, which don't keep session data. Would be interesting to see how Google traverses our site. :p

Cheers,
Gerhard
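
A minimal sketch of the anonymization described above, written in Python rather than as anything that exists in Drupal. The names SECRET_SALT, pseudonymize, anonymize_path and anonymize_row are hypothetical, and the row is assumed to be a dict keyed by the column names listed above. It hashes sid and hostname with salted md5, replaces the numeric id in user/nnn paths with a hashed token, and drops uid.

```python
import hashlib
import re

# Hypothetical secret; it must stay private and must not ship with the dump.
SECRET_SALT = b"replace-with-a-private-random-value"

# Matches Drupal system paths such as "user/123" or "user/123/edit".
USER_PATH = re.compile(r"^user/(\d+)(/.*)?$")


def pseudonymize(value, salt=SECRET_SALT):
    """Salted md5: equal inputs give equal tokens, so uniqueness is preserved."""
    return hashlib.md5(salt + value.encode("utf-8")).hexdigest()


def anonymize_path(path):
    """Replace the user id in user/nnn[/...] paths; leave other paths alone."""
    if path is None:
        return None
    match = USER_PATH.match(path)
    if not match:
        return path
    return "user/" + pseudonymize(match.group(1)) + (match.group(2) or "")


def anonymize_row(row):
    """Return a copy of one accesslog row that is safer to publish."""
    return {
        "sid": pseudonymize(row["sid"]),
        "path": anonymize_path(row["path"]),
        "url": row["url"],  # referrer kept as-is in this sketch
        "hostname": pseudonymize(row["hostname"]),
        # uid is deliberately omitted
    }
```

Because the same input always maps to the same token, entries can still be grouped by session and a click path reconstructed. The sketch leaves the url column untouched; since the referrer is a fully qualified URL, it may itself contain user/nnn paths and would need the same treatment before release.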