Scratching an itch: Machine Learning
Hello,
I have an interest in machine learning that I would like to bring to bear on Drupal, and I am hoping to enlist the help of some other people who share this interest - or can help me by providing data.
Briefly, machine learning is the algorithmic application of statistical principles. A classic example is the Bayesian spam filter in your email program/gateway/etc. Using a learned model, this filter classifies incoming mail as either SPAM or NOT SPAM based on a vector of data drawn from the message.
I am looking for interested parties to join me in developing a series of machine learning modules for Drupal. These modules will use data that Drupal can collect to predict outcomes. Examples might include smart "What's Related" type modules, better troll and spam bot protection, better searching, auto categorization, and a wide variety of other predictive tasks.
I envision the following phases to this project:
0. Pre-model research: See what data Drupal captures, what other similar modules exist for this purpose, and what work needs to be done to capture the appropriate data. Also: create a wish list of machine learning related tasks to choose from later.
1. Data gathering: Gather information from various sources to use as the basis for generation and testing of models.
2. Model/Tool evaluation: Gather available tools to see if they can be leveraged to generate useful models. Evaluate different modeling algorithms for appropriateness to tasks.
3. Focus research: Pick one task from the wish list on which to concentrate.
4. Initial modeling and testing: Begin creating models to evaluate for suitability to task.
5. Refine data collection: Address shortcomings in the data acquisition to create better models.
6. Test model in live setting: Empirical testing of the model.
7. Module creation: Turn test code into a publicly releasable module.
8. Review, evaluate, publish: Gather findings into a document for use by the Drupal and machine learning communities.
At this time, I would place myself in Phase 0. I hope to find site admins, or consultants who work with admins, who are willing to provide data for this project. User privacy is important, and I am very mindful of your needs in this department. Please do not think this is an attempt at stealing users or personal data. I would also hope the admins would be willing to install custom data-gathering modules and help conduct later empirical tests.
This has been a long email already, and I don't want to tie up the list with messages relating to my specific pet project. If you are interested in helping me with this project, please email me off list.
I look forward to creating something really interesting and useful with you,
-Mark Fredrickson
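The Bayesian spam filter mentioned above can be illustrated with a minimal sketch. This is a toy naive Bayes classifier over whitespace-separated words, not the code of any particular mail filter or Drupal module; the class name `NaiveBayesFilter`, the labels, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy two-class naive Bayes text classifier (hypothetical example)."""

    LABELS = ("ham", "spam")  # "ham" means NOT SPAM

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        # The "vector of data drawn from the message" is just word counts here.
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.LABELS:
            # Smoothed log prior, so an untrained class does not hit log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing, with one extra slot for unseen words.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab) + 1))
            scores[label] = score
        return max(scores, key=scores.get)


spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap pills online", "spam")
spam_filter.train("drupal module release notes", "ham")
spam_filter.train("patch for the node module", "ham")
```

A real filter would use better tokenization and far more training data; the point is only that the learned model is a table of counts and classification is a sum of log probabilities.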
On Mon, 2005-12-05 at 13:11 -0600, Mark Fredrickson wrote:
Hello,
I have an interest in machine learning that I would like to bring to bear on Drupal, and I am hoping to enlist the help of some other people who share this interest - or can help me by providing data.
I am interested. But everything depends on time and ability.
Briefly, machine learning is the algorithmic application of statistical principles. A classic example is the Bayesian spam filter in your email program/gateway/etc. Based on a learned model, this filter classifies incoming mail as either SPAM or NOT SPAM based on a vector of data drawn from the message.
Jeremy uses a Bayesian classifier in spam.module
I am looking for interested parties to join me in developing a series of machine learning modules for Drupal. These modules will use data that Drupal can collect to predict outcomes. Examples might include smart "What's Related" type modules, better troll and spam bot protection, better searching, auto categorization, and a wide variety of other predictive tasks.
Actually, this is why I started doing the relations stuff I'm currently coding. For a primitive, non-learning feasibility test based on some simple metrics, have a look at http://dikini.net/30.11.2005/relations_battle_plan_ii_and_first_results and the similar things block.
I envision the following phases to this project:
I think the plan may be good, but it looks very lengthy and very general.
What I learned about machine learning and data mining over the years is that they are most successful when you have a very well-defined target for what you want to achieve/find. With Drupal, we have a multitude of applications, a zillion data models, and an infinite number of "this thing is in my head, but I'll do it" todos. Having a generic catch-all module is going to fail badly. What might be useful is a framework of basic methods (Bayesian learner, rule-based learner, etc.) which can be used in concrete applications; but if it is not used, it is a waste. Or take the evolutionary approach: pick a target, adaptive behaviour for example, so that a website adapts to the user preferences and the current trends and presents the most relevant and up-to-date information.
It's a good idea overall.
And good luck.
Cheers, Vlado
I am interested. A looong time ago I developed a relations.module that used the search index + algorithms to define relations between nodes. http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/ber/related/
On Tuesday 06 December 2005 10:32, vlado wrote:
On Mon, 2005-12-05 at 13:11 -0600, Mark Fredrickson wrote:
Hello,
I have an interest in machine learning that I would like to bring to bear on Drupal, and I am hoping to enlist the help of some other people who share this interest - or can help me by providing data.
I am interested. But everything depends on time and ability.
Briefly, machine learning is the algorithmic application of statistical principles. A classic example is the Bayesian spam filter in your email program/gateway/etc. Based on a learned model, this filter classifies incoming mail as either SPAM or NOT SPAM based on a vector of data drawn from the message.
Jeremy uses a Bayesian classifier in spam.module
I am looking for interested parties to join me in developing a series of machine learning modules for Drupal. These modules will use data that Drupal can collect to predict outcomes. Examples might include smart "What's Related" type modules, better troll and spam bot protection, better searching, auto categorization, and a wide variety of other predictive tasks.
Actually, this is why I started doing the relations stuff I'm currently coding. For a primitive, non-learning, feasibility test based on some simple metrics have a look at http://dikini.net/30.11.2005/relations_battle_plan_ii_and_first_results and the similar things block.
I envision the following phases to this project:
I think the plan may be good, but it looks very lengthy and very general.
What I learned about machine learning and data mining over the years is that they are most successful when you have a very well-defined target for what you want to achieve/find. With Drupal, we have a multitude of applications, a zillion data models, and an infinite number of "this thing is in my head, but I'll do it" todos. Having a generic catch-all module is going to fail badly. What might be useful is a framework of basic methods (Bayesian learner, rule-based learner, etc.) which can be used in concrete applications; but if it is not used, it is a waste. Or take the evolutionary approach: pick a target, adaptive behaviour for example, so that a website adapts to the user preferences and the current trends and presents the most relevant and up-to-date information.
It's a good idea overall.
And good luck.
Cheers, Vlado
I am interested. But everything depends on time and ability.
Excellent. For others sitting on the fence, I am not asking for a lot of time or coding (though I will not turn it away), but more for help gathering data. I do not have an active Drupal installation at my fingertips (yet), so I need assistance with data collection.
Jeremy uses a Bayesian classifier in spam.module
I'll check it out. Perhaps the classifiers should be factored out into a separate component module. I'll look at the feasibility of that.
Actually, this is why I started doing the relations stuff I'm currently coding. For a primitive, non-learning, feasibility test based on some simple metrics have a look at http://dikini.net/30.11.2005/relations_battle_plan_ii_and_first_results and the similar things block.
Thanks. I'll investigate.
I envision the following phases to this project:
I think the plan may be good, but it looks very lengthy and very general.
I agree it is long, but I want to be honest about the project. My experience has been that creating a successful machine learning model is a slow, iterative process. One refines the data collection, and then modifies the model, tests, and repeats. Depending on your hunches at the beginning this can be either a quick process or a painfully lengthy one.
What I learned about machine learning and data mining over the years is that they are most successful when you have a very well-defined target for what you want to achieve/find.
This is good advice. I hope to go from "I have an itch" to "I have a concrete task on which to concentrate" soon. If anyone has suggestions, I'm all ears. -Mark
I hope to go from "I have an itch" to "I have a concrete task on which to concentrate" soon. If anyone has suggestions, I'm all ears.
Aggregator add-on - feed item classification. The problem: you have chunks of text, some feeds provide tags, and you want that filtered and mapped to your own website tags/classification/taxonomy.
This screams for an AI-based approach. It is text classification: a very regular stream of very small text chunks, pre-classified. What learning methods can be used? SOM/WebSOM (maybe, but too static), Bayesian learners, LVQ and other vector space methods, ... You can have a lot of different models and scenarios to play with, and tons of data - just hook into technorati, drupal.org/planet, icerocket, ... choose your preferences. You can play with both supervised and unsupervised learning algorithms - there is space for both kinds here.
Actually, search has improved its data model in HEAD. You might want to have a look at that. The data from the search table can be used without conversion with most of the learning algorithms in the literature out there.
Wide open, and best of all, this is really needed.
Cheers, Vlado
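As a rough illustration of the feed item classification idea, here is a minimal sketch of one of the simplest vector space methods: a nearest-centroid classifier using cosine similarity over word counts. It is not LVQ or a SOM, and it does not read Drupal's search table; the function names, the sample items, and the taxonomy terms are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    # Bag-of-words vector: word -> count.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_centroids(labelled_items):
    # Sum the vectors of all pre-classified items per taxonomy term.
    # Cosine similarity ignores scale, so the sum works as a centroid.
    centroids = {}
    for text, term in labelled_items:
        centroids.setdefault(term, Counter()).update(vectorize(text))
    return centroids


def classify_item(text, centroids):
    # Assign the item to the term whose centroid is most similar.
    vec = vectorize(text)
    return max(centroids, key=lambda term: cosine(vec, centroids[term]))


# Hypothetical pre-classified feed items mapped to site taxonomy terms.
training = [
    ("new drupal module released for cck", "drupal"),
    ("drupal core patch review", "drupal"),
    ("python web framework release", "python"),
    ("python decorators tutorial", "python"),
]
centroids = build_centroids(training)
```

Because the model is just one summed vector per term, it can be updated incrementally as new pre-classified items arrive, which suits a regular stream of small text chunks.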
During my studies I did some exciting simulations of fuzzy logic control systems (simulating control of a gas turbine and control of an ABS for an airplane). About half a year ago I did some small tests to see if it was possible to get some AI using fuzzy logic into Drupal. After investigating this for a short time, I concluded that PHP is not suited for this, and that Drupal is too HTML/content-oriented (as opposed to data-oriented) to handle this.
In order to get any sort of AI and/or learning system in, we need:
* external libraries. PHP is Just Not Ready for this (performance-, memory-, and library-wise).
* to approach Drupal pages more as objects and as data models, rather than the current way of passing around glued-together strings. (theme/module/nodeapi/database all have private gluing systems; there is no central place where one can access *all* data, including blocks etc., on a data level, where you can poke around in metadata or objects.)
Things that are interesting are:
* theme_list lists: they tell us that any item in any such list has a certain relation to the other items.
* node teaser lists: same as above.
* user lists: any user in any list has a certain relation to the other users in these lists. We use several such lists. We must find a way to filter out tabbed browsing; then we have very valuable data on the behaviour of users on a site.
* comment - node relations: the content of a comment tells a lot about a post.
This is very interesting. :)
Ber
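The observation that items appearing together in a list are related can be sketched as simple co-occurrence counting. This is only an illustration under stated assumptions: the lists are given as plain Python lists of identifiers, and the names `cooccurrence` and `related_to` as well as the sample node paths are hypothetical, not part of Drupal.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(lists):
    # Count how often each unordered pair of items shares a list.
    counts = Counter()
    for items in lists:
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def related_to(item, counts, limit=3):
    # Rank the other items by how often they co-occur with `item`.
    scores = Counter()
    for (a, b), n in counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(limit)]


# Hypothetical teaser lists as they might have been rendered on three pages.
teaser_lists = [
    ["node/1", "node/2", "node/3"],
    ["node/1", "node/2"],
    ["node/2", "node/4"],
]
counts = cooccurrence(teaser_lists)
```

The same counting applies to user lists or comment-node pairs; the hard part raised above, getting at the lists as data rather than as rendered strings, is not addressed by this sketch.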
In order to get any sort of AI and / or learning system in, we need: * external libraries, PHP is Just Not Ready for this. (performance, memory, and library wise)
Hmm. If not PHP, then where is the logic? I'm most familiar with the Weka toolkit: http://www.cs.waikato.ac.nz/ml/weka/
Weka is a series of Java classes that implement different learning algorithms (and some helper classes to display, format, and test datasets). Would users be able to install these .jars? Or is this asking too much of Joe P. Blogger?
(Weka is GPL if you're wondering.)
-M
Mark Fredrickson wrote:
Hmm. If not PHP, then where is the logic? I'm most familiar with the Weka [...] Would users be able to install these .jars? Or is this asking too much of Joe P. Blogger?
(Weka is GPL if you're wondering.)
Definitely not. If you want Joe Blogger to use your work, you need to either write a normal Drupal module that requires, at most, a database upgrade and possibly an external PHP library that can be copied into the module's directory, or offer a web service that can accept POST data on cron runs and return analysis. On the other hand, a large number of Drupal power users would have no problem using Java libraries.
-Robert
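The web service option can be sketched as a small HTTP endpoint that accepts POSTed JSON and returns an analysis. This is a minimal illustration using only Python's standard library; the payload shape (a list of `{"id", "text"}` objects), the `analyze` placeholder, and the port are assumptions, and a real service would run a trained model where the placeholder counts words.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def analyze(items):
    # Placeholder "analysis": a real service would apply a trained model here.
    return [
        {"id": item["id"], "word_count": len(item["text"].split())}
        for item in items
    ]


class AnalysisHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            items = json.loads(self.rfile.read(length))
            body = json.dumps(analyze(items)).encode("utf-8")
            status = 200
        except (ValueError, KeyError, TypeError, AttributeError):
            # Malformed JSON or a payload that is not a list of id/text objects.
            body = b'{"error": "bad request"}'
            status = 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    # Blocks forever; call this from a script to run the service locally.
    HTTPServer(("127.0.0.1", port), AnalysisHandler).serve_forever()
```

On the Drupal side, a module's cron hook would POST new content to such an endpoint and store the returned results; that half is not shown here.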
In order to get any sort of AI and / or learning system in, we need: * external libraries, PHP is Just Not Ready for this. (performance, memory, and library wise)
Depends. PHP is perfectly feasible for a lot of AI tasks - look at the spam.module. It is a classifier, and it performs reasonably well.
It is a challenge as well. Depends. From some of the evaluations I did some time ago, it is perfectly reasonable to expect a PHP system to be used for user profiling and generating adaptive websites. Learning algorithms are not necessarily heavy.
Hmm. If not PHP, then where is the logic? I'm most familiar with the Weka toolkit: http://www.cs.waikato.ac.nz/ml/weka/
Have a look at Triana: http://www.trianacode.org . With it you can get a web services environment wrapping the Weka classes, and the guys are experimenting a lot with it. But it is heavy. It is not for Joe X. There is Orange as well: http://www.ailab.si/orange
And I agree with your logic. There is a place for such things in a system like Drupal. That is power at your fingertips.
On Tuesday 06 December 2005 17:46, Mark Fredrickson wrote:
In order to get any sort of AI and / or learning system in, we need: * external libraries, PHP is Just Not Ready for this. (performance, memory, and library wise)
No, I was referring to something like the way we use ImageMagick now. We could use bogofilter, or some other binary. In the end, a big part of the logic does need to be in the module; just leave the heavy lifting to binaries that are good at that.
Bèr
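Delegating the heavy lifting to an external binary can be sketched as follows. The example pipes a message to a command on standard input and interprets its exit status. The exit-code mapping (0 spam, 1 non-spam, 2 unsure) follows the bogofilter man page as best I recall it and should be checked against the installed version; the function name and the default command are assumptions.

```python
import subprocess

# Assumed exit-code convention, taken from the bogofilter man page:
# 0 = spam, 1 = non-spam, 2 = unsure; anything else is treated as an error.
VERDICTS = {0: "spam", 1: "ham", 2: "unsure"}


def classify_with_binary(message, command=("bogofilter",)):
    """Pipe `message` to an external classifier and map its exit code."""
    try:
        result = subprocess.run(
            list(command),
            input=message.encode("utf-8"),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The binary is missing or not executable.
        return "error"
    return VERDICTS.get(result.returncode, "error")
```

The module keeps the glue logic (what to send, what to do with the verdict) while the binary does the statistics, much like the ImageMagick arrangement mentioned above.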
Mark, Your goals are very exciting and I'll provide data when the time comes, though I don't have a great number of sites that will be particularly useful. cheers, Robert Douglass
participants (4)
- Bèr Kessels
- Mark Fredrickson
- Robert Douglass
- vlado