[development] Heatmaps on Drupal.org

Gerhard Killesreiter gerhard at killesreiter.de
Fri Aug 11 14:14:44 UTC 2006


Arnab Nandi wrote:
> On 8/11/06, Gerhard Killesreiter <gerhard at killesreiter.de> wrote:
>> Furthermore, as has been pointed out by me, drupal.org is a large,
>> untapped datamine. If we make sure the data is properly anonymized, we
>> can certainly use that data.
>>
>> You shouldn't extrapolate from your own customs to everybody else's.
> 
> Logs are of this form:
> 
> ip_address | log information
> ip_address | log information
> ip_address | log information
> ip_address | log information
> ip_address | log information
> 
> Now if you discard the first column and use only the second column,
> IMHO I don't see how you can invade someone's privacy here. (Please
> provide examples against this if you can.) The data is not as useful as
> before, but is still quite worthy of analysis.

The data becomes completely useless. If I wanted this kind of data, I
would look at the Apache logs.

What makes the accesslog table interesting is that it can track the 
complete path a user takes through your site.

   sid varchar(32) NOT NULL default '',
   path varchar(255) default NULL,
   url varchar(255) default NULL,
   hostname varchar(128) default NULL,
   uid int(10) unsigned default '0',

These are the interesting columns of the table: sid is the session ID,
which is unique even for anonymous users; path stores the Drupal system
path; url stores the referrer; and hostname stores the IP. You can take
the path of one log entry, look up where it appears as a referrer, and
readily find out where the user clicked next. The only "problem" is that
the referrer is stored as a fully qualified URL.
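
A minimal sketch of that lookup in Python, under some assumptions: rows
arrive as (sid, path, url) tuples, the site host is drupal.org, and
referrers use either clean URLs or the ?q= form. The names
referrer_to_path and click_edges are hypothetical helpers, not part of
Drupal, and a referrer that uses a URL alias would not match the system
path without an extra alias lookup.

```python
from urllib.parse import urlsplit, parse_qs

SITE_HOST = "drupal.org"  # assumption: the host whose referrers count as internal


def referrer_to_path(url, host=SITE_HOST):
    """Map a fully qualified referrer to a Drupal-style path.

    Returns None for empty or external referrers.
    """
    if not url:
        return None
    parts = urlsplit(url)
    if parts.hostname != host:
        return None
    # Non-clean URLs carry the path in the q parameter: /?q=node/1
    q = parse_qs(parts.query).get("q")
    if q:
        return q[0]
    # Clean URLs carry it in the URL path itself: /node/1
    return parts.path.strip("/")


def click_edges(rows):
    """Return (sid, from_path, to_path) for every internal click."""
    edges = []
    for sid, path, url in rows:
        source = referrer_to_path(url)
        if source is not None:
            edges.append((sid, source, path))
    return edges
```

Grouping the resulting edges by sid would give the path each session
took through the site.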

So, in order to anonymize the data, I should probably hash the IPs with
md5 using a secret salt and do the same with the sids. All that matters
is that they are unique, after all. Paths like user/nnn/edit should
probably also be anonymized, since usually only the account owner can
edit that account. Of course, the uid column should be omitted or
similarly anonymized.
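
The anonymization steps above could look roughly like this in Python.
This is only a sketch: the salt value is a placeholder, the helper names
(pseudonym, anonymize_path, anonymize_row) are hypothetical, and rows
are assumed to be dicts keyed by column name. Referrers in the url
column are passed through unchanged here, although they may contain
user/nnn paths and need the same treatment.

```python
import hashlib
import re

# Placeholder: a real export would use a private random value, kept secret.
SECRET_SALT = b"replace-with-a-private-random-value"

USER_PATH = re.compile(r"^user/(\d+)(/.*)?$")


def pseudonym(value, salt=SECRET_SALT):
    """Salted md5 of a value; equal inputs still map to equal outputs."""
    if value is None:
        return None
    return hashlib.md5(salt + value.encode("utf-8")).hexdigest()


def anonymize_path(path, salt=SECRET_SALT):
    """Replace the numeric id in user/nnn[/...] paths with its pseudonym."""
    if path is None:
        return None
    match = USER_PATH.match(path)
    if not match:
        return path
    return "user/" + pseudonym(match.group(1), salt) + (match.group(2) or "")


def anonymize_row(row, salt=SECRET_SALT):
    """Return a copy of an accesslog row with sid and IP hashed, uid dropped."""
    return {
        "sid": pseudonym(row["sid"], salt),
        "path": anonymize_path(row["path"], salt),
        "url": row["url"],  # left as-is in this sketch
        "hostname": pseudonym(row["hostname"], salt),
    }
```

Because the same salt is used for every row, a session keeps one
pseudonym throughout the export, so click paths remain traceable.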

Actually, IPs are only interesting for bots, which don't keep session
data. It would be interesting to see how Google traverses our site. :p

Cheers,
	Gerhard

