[development] Heatmaps on Drupal.org
Gerhard Killesreiter
gerhard at killesreiter.de
Fri Aug 11 14:14:44 UTC 2006
Arnab Nandi wrote:
> On 8/11/06, Gerhard Killesreiter <gerhard at killesreiter.de> wrote:
>> Furthermore, as has been pointed out by me, drupal.org is a large,
>> untapped datamine. If we make sure the data is properly anonymized, we
>> can certainly use that data.
>>
>> You shouldn't extrapolate from your own customs to everybody else's.
>
> Logs are of this form:
>
> ip_address | log information
> ip_address | log information
> ip_address | log information
> ip_address | log information
> ip_address | log information
>
> Now if you discard the first column, and use only the second column.
> imho, i don't see how you can invade someone's privacy here. (please
> provide examples against this if you can) The data is not as useful as
> before, but is still quite worthy of analysis.
The data becomes completely useless. If I wanted this kind of data, I
would look at apache logs.
What makes the accesslog table interesting is that it can track the
complete path a user takes through your site.
    sid varchar(32) NOT NULL default '',
    path varchar(255) default NULL,
    url varchar(255) default NULL,
    hostname varchar(128) default NULL,
    uid int(10) unsigned default '0',
These are the interesting columns of the table: sid is the session id,
which is unique even for anonymous users; path stores the Drupal system
path; url stores the referrer; hostname stores the IP. Now you can take
the path of one log entry, look up where it has been used as referrer,
and readily find out where the user clicked next. The only "problem" is
that the referrer is stored as a fully qualified URL.
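
To illustrate the lookup described above, here is a rough sketch in
Python. It assumes the rows have already been read from accesslog into
dicts keyed by column name, that clean URLs are enabled, and that the
site host is "drupal.org". BASE_HOST, referrer_to_path and click_edges
are names made up for this example and are not part of Drupal.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: host name of the site whose accesslog is being analysed.
BASE_HOST = "drupal.org"


def referrer_to_path(url):
    """Map a fully qualified referrer to a site-relative path.

    Returns None for empty or external referrers.
    Assumes clean URLs (no ?q= style paths).
    """
    if not url:
        return None
    parsed = urlparse(url)
    if parsed.netloc != BASE_HOST:
        return None
    return parsed.path.lstrip("/")


def click_edges(rows):
    """Collect (from_path, to_path) pairs per session id, in row order."""
    edges = defaultdict(list)
    for row in rows:
        source = referrer_to_path(row["url"])
        if source is not None:
            edges[row["sid"]].append((source, row["path"]))
    return dict(edges)
```

Each pair is one click inside the site; chaining the pairs of one sid in
timestamp order gives the path that session took.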
So, in order to anonymize the data, I should probably hash the IPs with
md5 using a secret salt and do the same with the sids. All that matters
is that they are unique, after all. Paths like user/nnn/edit should
probably also be anonymized, since usually only the account owner can
edit his account. Of course the uid column should be omitted or
similarly anonymized.
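
A minimal sketch of that anonymization step, again in Python and again
assuming rows are dicts keyed by column name. The function names and the
user/nnn pattern are only illustrative. The salt has to stay secret and
must be the same for the whole export, otherwise the hashed sids no
longer match up across rows.

```python
import hashlib
import re

# Matches user/nnn and user/nnn/anything; illustrative pattern only.
USER_PATH = re.compile(r"^user/(\d+)(/.*)?$")


def salted_md5(value, salt):
    """Return the hex md5 of salt + value, or None for a NULL column."""
    if value is None:
        return None
    return hashlib.md5((salt + value).encode("utf-8")).hexdigest()


def anonymize_path(path, salt):
    """Replace the numeric id in user/nnn/... paths with a salted hash."""
    if path is None:
        return None
    match = USER_PATH.match(path)
    if not match:
        return path
    return "user/" + salted_md5(match.group(1), salt) + (match.group(2) or "")


def anonymize_row(row, salt):
    """Hash sid and hostname, rewrite user paths, and drop uid."""
    return {
        "sid": salted_md5(row["sid"], salt),
        "path": anonymize_path(row["path"], salt),
        # The referrer is passed through unchanged here; it can contain
        # user/nnn paths too and may need the same treatment.
        "url": row["url"],
        "hostname": salted_md5(row["hostname"], salt),
    }
```

Because the same input always gives the same hash under one salt, the
click paths per session stay intact after anonymizing.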
Actually, IPs are only interesting for bots which don't keep session
data. Would be interesting to see how google traverses our site. :p
Cheers,
Gerhard