Jeremy, I have a suggestion for a small change to the code of the Bayesian filter. Do you want me to post it as a patch/feature, or send you an email? It is not ready as a patch at the moment (it is part of that classifier I was mumbling about a month ago), but it might(tm) speed up the evaluation of the spam probability. I can't benchmark it properly at the moment.
What the hell, I think it is better to paste the code; it might need more work.

====================================
function _naiveBayes($tokens) {
  // A rewrite of the original. It should not reduce the quality of the
  // predictions, while it has the potential to increase speed. The speed
  // difference depends on the speed of execution of asort(), the number
  // of interesting tokens considered, and the added comparison and
  // evaluation in SQL.
  //
  // $drift - minimal drift from the median (50), a soft-ish shoulder of
  //          the filter
  // $max   - maximum number of tokens to evaluate, a hard shoulder of
  //          the filter
  $drift = variable_get('spam_min_drift', 40);
  $max = variable_get('spam_max_tokens', 40);
  $unknown = variable_get('spam_unknown_probability', 40);
  $weight = 0;  // not defined anywhere in this snippet; 0 is a placeholder
  $probs = 0;   // running sum of probabilities (was wrongly an array)
  $num = 0;

  foreach ($tokens as $token) {
    // 1. It may be beneficial not to include $max, or to make it larger.
    // 2. The SQL syntax should be relatively portable to postgres, but
    //    I haven't checked the details.
    // 3. An "AND last >= %d" condition was here too, but no value was
    //    passed for it, so it is left out for now.
    $result = db_query("SELECT probability FROM {spam_tokens} WHERE token = '%s' AND ABS(50 - probability) >= %d ORDER BY ABS(50 - probability) DESC LIMIT %d", $token, $drift, $max);
    while ($p = db_fetch_object($result)) {
      // Fall back to the default only when no probability is stored.
      if (!$p->probability) {
        $p->probability = $unknown;
      }
      $probs += $p->probability;
      $num++;
    }
  }

  // No interesting tokens at all: avoid dividing by zero. Returning the
  // "unknown" probability here is an assumption, not a settled choice.
  if ($num == 0) {
    return $unknown;
  }

  $rating = ($probs + $weight) / $num;
  if ($rating > 99) {
    $rating = 99;
  }
  else if ($rating < 1) {
    $rating = 1;
  }
  return $rating;
}
=======================================

--
Vladimir Zlatanov <vlado@dikini.net>
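
Since the function above cannot be benchmarked or checked without a populated {spam_tokens} table, a database-free sketch of the same selection logic may help when verifying the behaviour by hand. It keeps only tokens whose probability is at least $drift away from 50, takes the $max most extreme of those, averages them, and clamps the result to 1..99. Everything named here is hypothetical: `sketch_rating` is an illustrative name, the `$known` array stands in for the table, and returning `$unknown` when nothing qualifies is an assumption. Note one deliberate difference: here `$max` caps the total number of tokens, whereas the query above applies its LIMIT per token.

```php
<?php
// Hypothetical sketch, not part of the spam module.
// $known maps token => spam probability (0..100) and stands in for
// the {spam_tokens} table.
function sketch_rating($tokens, $known, $drift = 40, $max = 40, $unknown = 40) {
  $interesting = array();
  foreach ($tokens as $token) {
    if (!isset($known[$token])) {
      continue; // unseen token: same as the query returning no row
    }
    $p = $known[$token];
    // Soft shoulder: keep only probabilities far enough from 50.
    if (abs(50 - $p) >= $drift) {
      $interesting[] = $p;
    }
  }

  if (count($interesting) == 0) {
    return $unknown; // assumption: nothing interesting means "unknown"
  }

  // Most extreme probabilities first.
  usort($interesting, function ($a, $b) {
    $da = abs(50 - $a);
    $db = abs(50 - $b);
    if ($da == $db) {
      return 0;
    }
    return ($da > $db) ? -1 : 1;
  });

  // Hard shoulder: evaluate at most $max tokens in total.
  $interesting = array_slice($interesting, 0, $max);

  $rating = array_sum($interesting) / count($interesting);
  return max(1, min(99, $rating));
}

// Made-up probabilities, for illustration only.
$known = array('viagra' => 99, 'hello' => 50, 'meeting' => 5, 'free' => 92);
$tokens = array('viagra', 'hello', 'meeting', 'free');
echo sketch_rating($tokens, $known, 40, 2), "\n";
```

With these made-up numbers, 'hello' is dropped by the drift test, and with $max = 2 only the two most extreme values (99 and 5) are averaged.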