[development] Scratching an itch: Machine Learning

Mark Fredrickson mfredrickson at ppmns.org
Mon Dec 5 19:11:11 UTC 2005


I have an interest in machine learning that I would like to bring to bear on
Drupal, and I am hoping to enlist the help of some other people who share
this interest - or can help me by providing data.

Briefly, machine learning is the algorithmic application of statistical
principles. A classic example is the Bayesian spam filter in your email
program/gateway/etc. Using a learned model, this filter classifies
incoming mail as either SPAM or NOT SPAM based on a vector of data drawn
from the message.
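
To make the spam example concrete, here is a rough sketch of a naive Bayes
classifier in Python. It is only an illustration, not a proposal for the
actual module code: the class name, the whitespace tokenizer, the add-one
smoothing, and the tiny training set are all assumptions of mine. The
"vector of data" here is simply the bag of words in the message.

```python
import math
from collections import Counter

LABELS = ("SPAM", "NOT SPAM")


def tokenize(text):
    # Hypothetical feature extraction: lowercase, whitespace-separated words.
    return text.lower().split()


class NaiveBayesFilter:
    """Toy multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.message_counts = {label: 0 for label in LABELS}

    def train(self, text, label):
        # "Learning the model" is just counting words per class.
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def _log_score(self, words, label, vocab_size):
        # log P(label) + sum of log P(word | label), smoothed so that
        # unseen words do not produce a zero probability.
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        total_words = sum(self.word_counts[label].values())
        for word in words:
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        # Assumes at least one training message was seen for each label.
        words = tokenize(text)
        vocab = set()
        for label in LABELS:
            vocab.update(self.word_counts[label])
        scores = {
            label: self._log_score(words, label, len(vocab))
            for label in LABELS
        }
        return max(scores, key=scores.get)


# Made-up training messages, for illustration only.
spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap pills now", "SPAM")
spam_filter.train("cheap pills cheap prices", "SPAM")
spam_filter.train("drupal module patch review", "NOT SPAM")
spam_filter.train("please review my patch for the forum module", "NOT SPAM")

print(spam_filter.classify("cheap pills"))
print(spam_filter.classify("patch for the forum module"))
```

A real filter would need far more training data and better features than
raw words, which is exactly why the data gathering phases below matter.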

I am looking for interested parties to join me in developing a series of
machine learning modules for Drupal. These modules will use data that Drupal
can collect to predict outcomes. Examples might include smart "What's
Related" type modules, better troll and spam bot protection, better
searching, auto categorization, and a wide variety of other predictive
tasks.

I envision the following phases to this project:

0. Pre-model research: See what data Drupal captures, what other similar
modules exist for this purpose, and what work needs to be done to capture
the appropriate data. Also: create a wish list of machine learning related
tasks to choose from later.

1. Data gathering: Gather information from various sources to use as the
basis for generation and testing of models.

2. Model/Tool evaluation: Gather available tools to see if they can be
leveraged to generate useful models. Evaluate different modeling algorithms
for appropriateness to tasks.

3. Focus research: Pick one task from the wish list on which to concentrate.

4. Initial modeling and testing: Begin creating models to evaluate for
suitability to task.

5. Refine data collection: Address shortcomings in the data acquisition to
create better models.

6. Test model in live setting: Empirical testing of the model.

7. Module creation: Turn test code into publicly releasable module.

8. Review, evaluate, publish: Gather findings into a document for use by the
Drupal and machine learning communities.

At this time, I would place myself in Phase 0. I hope to find site admins,
or consultants who work with them, who are willing to provide data for this
project. User privacy is important, and I am very mindful of your needs in
this department. Please do not think this is an attempt at stealing users
or personal data. I would also hope the admins would be willing to install
custom data gathering modules and help conduct later empirical tests.

This has been a long email already, and I don't want to tie up the list with
messages relating to my specific pet project. If you are interested in
helping me with this project, please email me off list.

I look forward to creating something really interesting and useful with you,

-Mark Fredrickson

