[development] Scalable Internal Activity Stream Module

Josh Koenig josh at chapterthree.com
Wed Dec 31 18:42:38 UTC 2008


Greetings Drupalistas,

I'm working on what feels like the hundredth (but is really the 4th or  
5th) project that includes some variety of Facebook-like "Activity  
Stream" for a Drupal-based community. Having tackled this problem in a  
number of different ways in the past couple years -- none of which  
I've really ever loved -- I'm tempted to launch a new module project  
to solve this thing once and for all.

My primary concerns are modularity -- such that anything can
potentially be an Activity -- and scalability to work with hundreds of
thousands of actions and users.

After some initial review, I found this existing module:

http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/activitystream/

However, this seems like a different concept since A) it's focused on  
external sources and B) it's based on turning each activity into a  
node. Basically, it seems better suited for an individual site  
aggregating internet-wide activity than a community reporting on itself.

On the upside, this module (and many others, e.g. userpoints) shows a
nice way to handle things modularly. I'm not too worried about that,
really. But I am worried about scaling, and the architecture of
activitystream got me thinking about whether the activity-as-node
architecture is workable. I'd love opinions here.

IN FAVOR:

Nodes are functional. Facebook already lets you comment on every  
little thing that goes on. This is fun! It would be good for this  
module to do that too. It also makes future integration with  
notification/messaging updates possible, as well as every other  
wonderful thing nodes can do.

AGAINST:

This will mean a huge number of nodes, bloating the table. It also
means more overhead when logging activity (node_save vs a single  
optimized db_query). I'm also skeptical that the core node table  
structure has the right stuff to be queried with maximum efficiency  
(e.g. nothing to group similar queries by unless I make a ton of node  
types, etc).

There's also the question of unwanted node functions. We don't really  
want anyone to edit activities. We also don't want them to start  
showing up in search queries.

I could see a possible solution in maintaining an optimized index  
table for queries, as well as nodes for functionality, individual page  
views, etc. The bloat problem could conceivably be solved by giving  
activity nodes (and index entries) a maximum TTL, à la watchdog and
other big tables.
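
To make the index-plus-TTL idea concrete, here is a rough sketch in Python with SQLite rather than real Drupal code. The table name (activity_index), its columns, and the helpers log_activity and prune are all made up for illustration. The point is only that logging is a single insert and expiry is a single ranged delete that also reports which nodes would need deleting.

```python
import sqlite3
import time

# Hypothetical schema: table and column names are illustrative assumptions,
# not an existing Drupal table.
SCHEMA = """
CREATE TABLE activity_index (
    aid     INTEGER PRIMARY KEY,
    uid     INTEGER NOT NULL,   -- user who performed the action
    type    TEXT    NOT NULL,   -- activity type, e.g. 'comment'
    nid     INTEGER,            -- optional companion node id
    created INTEGER NOT NULL    -- unix timestamp
);
CREATE INDEX activity_created ON activity_index (created);
"""


def open_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def log_activity(conn, uid, type_, nid=None, now=None):
    """Record one activity with a single INSERT; returns the new row id."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        "INSERT INTO activity_index (uid, type, nid, created) "
        "VALUES (?, ?, ?, ?)",
        (uid, type_, nid, now),
    )
    return cur.lastrowid


def prune(conn, max_age, now=None):
    """Delete index rows older than max_age seconds.

    Returns the node ids attached to the expired rows so the caller can
    delete the companion nodes as well.
    """
    now = int(time.time()) if now is None else now
    cutoff = now - max_age
    expired = [
        row[0]
        for row in conn.execute(
            "SELECT nid FROM activity_index "
            "WHERE created < ? AND nid IS NOT NULL ORDER BY aid",
            (cutoff,),
        )
    ]
    conn.execute("DELETE FROM activity_index WHERE created < ?", (cutoff,))
    conn.commit()
    return expired
```

In a Drupal setting something like prune would presumably run from cron, the way watchdog is trimmed, but that wiring is not shown here.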

I've already got a table/query design down for indexing that seems to  
scale very well to 200k activity entries grouped over 20 types and  
5000 users. I suppose the next step is to do some testing around what  
the overall effects are of having short-lived nodes, and whether or  
not the other edge cases can be solved.
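
For anyone who wants something to poke at, here is a minimal sketch of the kind of index queries I mean, again in Python with SQLite and not the actual design. The table, the composite (uid, created) index, and the function names recent_for_users and counts_by_type are assumptions made up for this example.

```python
import sqlite3


def build(conn):
    """Create a hypothetical activity index with a composite index."""
    conn.executescript("""
    CREATE TABLE activity_index (
        aid     INTEGER PRIMARY KEY,
        uid     INTEGER NOT NULL,
        type    TEXT    NOT NULL,
        created INTEGER NOT NULL
    );
    -- Lets per-user, time-ordered lookups avoid a full table scan.
    CREATE INDEX activity_uid_created ON activity_index (uid, created);
    """)


def recent_for_users(conn, uids, limit=10):
    """Newest activities by any of the given users (e.g. a friend list)."""
    if not uids:
        return []
    marks = ",".join("?" for _ in uids)
    sql = (
        "SELECT aid, uid, type, created FROM activity_index "
        f"WHERE uid IN ({marks}) "
        "ORDER BY created DESC, aid DESC LIMIT ?"
    )
    return conn.execute(sql, (*uids, limit)).fetchall()


def counts_by_type(conn, uids, since):
    """Number of activities per type for the given users since a timestamp."""
    if not uids:
        return []
    marks = ",".join("?" for _ in uids)
    sql = (
        "SELECT type, COUNT(*) FROM activity_index "
        f"WHERE uid IN ({marks}) AND created >= ? "
        "GROUP BY type ORDER BY type"
    )
    return conn.execute(sql, (*uids, since)).fetchall()
```

Whether an index like this holds up at 200k rows on MySQL is exactly the kind of thing that needs measuring; the sketch only shows the shape of the queries.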

I'm wondering if anyone has done any of their own thinking along these  
lines and has any comments to add.

Happy New Year!
-josh

------------------------------------------
Josh Koenig, Partner, CTO
http://www.chapterthree.com
AOL IM: chap3josh
1-888-496-3238


