[drupal-devel] spam.module suggested filter improvement
Vladimir Zlatanov
vlado at dikini.net
Mon Jan 24 14:13:25 UTC 2005
> Jeremy, I have a suggestion to change the code of the Bayesian filter
> a bit. Do you want me to post it as a patch/feature, or send you an
> email? It is not ready as a patch at the moment - it is part of
> that classifier I was mumbling about a month ago, but it might(tm)
> speed up the evaluation of the spam probability. I can't benchmark it
> properly at the moment.
What the hell, I think it is better to just paste the code. It might
need more work.
====================================
function _naiveBayes($tokens) {
$probs = 0; //running sum of the token probabilities
$drift = variable_get('spam_min_drift', 40);
$max = variable_get('spam_max_tokens', 40);
$num=0;
//A rewrite of the original. It should not reduce the quality of the
//predictions, and it has the potential to increase speed. The speed
//difference depends on how the cost of asort() over the interesting
//tokens compares with the cost of the added comparison and sort in SQL.
//$drift - minimal distance of a token's probability from the neutral
// value of 50; a soft-ish shoulder of the filter
//$max - maximum number of tokens to evaluate; a hard shoulder of the
// filter
foreach($tokens as $token){
//1. may be beneficial not to include $max or make it larger
//2. the sql syntax should be relatively portable to postgres,
// but haven't checked the details
//3. an "AND last >= %d" condition can be added once a cutoff argument
// is passed for it
$result = db_query("SELECT probability FROM {spam_tokens} WHERE
token = '%s' AND ABS(50 - probability) >= %d ORDER BY
ABS(50 - probability) DESC LIMIT %d", $token, $drift, $max);
while($p=db_fetch_object($result)){
//fall back to the "unknown" value only when no probability is stored
if (!$p->probability) {
$p->probability = variable_get('spam_unknown_probability',40);
}
$probs += $p->probability;
$num++;
}
}
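//$weight is not set in this function; default it to 0 so the sum is
//well defined
$weight = 0;
//no token qualified: return the "unknown" value (an assumption) rather
//than divide by zero
if ($num == 0) {
return variable_get('spam_unknown_probability', 40);
}
$rating = ($probs + $weight) / $num;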
if ($rating > 99)
$rating = 99;
else if ($rating < 1)
$rating = 1;
return $rating;
}
=======================================
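To try the selection logic without a database, here is a minimal standalone sketch of the same idea: keep only tokens whose probability is at least $drift away from 50, take the $max most extreme ones, average them, and clamp the result to 1..99. The function name naive_bayes_sketch, the $known lookup array, and the choice to skip unseen tokens are my assumptions for illustration. They are not part of spam.module.

```php
<?php
// Standalone sketch, no Drupal or database needed.
// $known maps token => stored probability (1..99). It stands in for the
// {spam_tokens} table and is a hypothetical structure.
function naive_bayes_sketch($tokens, $known, $drift = 40, $max = 40, $unknown = 40) {
  $interesting = array();
  foreach ($tokens as $token) {
    if (!isset($known[$token])) {
      continue; // assumption: unseen tokens carry no evidence here
    }
    $distance = abs(50 - $known[$token]);
    if ($distance >= $drift) {
      // Same test as ABS(50 - probability) >= $drift in the query.
      $interesting[] = array($distance, $known[$token]);
    }
  }
  if (count($interesting) == 0) {
    return $unknown; // nothing qualified, fall back to the neutral guess
  }
  // Most extreme first, then keep at most $max entries
  // (the ORDER BY ... DESC LIMIT part, applied across all tokens).
  rsort($interesting);
  $interesting = array_slice($interesting, 0, $max);
  $sum = 0;
  foreach ($interesting as $pair) {
    $sum += $pair[1];
  }
  $rating = $sum / count($interesting);
  // Clamp to 1..99, as in the function above.
  return max(1, min(99, $rating));
}

// Usage with invented probabilities:
$known = array('viagra' => 99, 'casino' => 95, 'hello' => 5, 'the' => 50);
$tokens = array('viagra', 'hello', 'the', 'casino');
$rating = naive_bayes_sketch($tokens, $known);
```

Note that here the limit is applied once across all tokens, so $max really does cap how many tokens contribute. In the per-token query each lookup matches at most one row if tokens are unique, so the LIMIT has little effect there.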
--
Vladimir Zlatanov <vlado at dikini.net>