[drupal-devel] Dealing with spam (was rel=nofollow)

Mon Jan 24 15:22:05 UTC 2005

On Mon, 24 Jan 2005, Vladimir Zlatanov wrote:

[...]

> facts:
> Custom filters are immediate - equivalent to true or false.
> Bayesian filter accumulates evidence

Not quite.  A custom filter has four possible settings: "always spam" 
(true), "usually spam" (probably true), "usually not spam" (probably 
false), and "never spam" (false).

> What I propose is to amend the learner by using something like:
>  if the custom filter says BAD and me not this is an error, so I need
>  to learn it. This way the learner will change its behaviour to
>  accommodate the new BAD thing into its statistics.

This would be simple to do.  Certain events could force the filter into 
TEFT mode for a given message, i.e. when it matches "always spam" or 
"never spam".  In particular, I like this because it would auto-train the 
URL filter.  I've given this some thought already, and it's one of 
several improvements I have planned.
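
A minimal sketch of that forced training, in Python rather than the 
module's PHP.  The names TokenCounts and maybe_force_train, and the 
in-memory token store, are hypothetical and only illustrate the idea; 
they are not the spam module's code:

```python
from collections import Counter

# The two definite custom-filter settings named in the thread.
ALWAYS_SPAM = "always spam"
NEVER_SPAM = "never spam"


class TokenCounts:
    """Toy token statistics standing in for the Bayesian filter's store."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Add every token to the spam or non-spam counts.
        (self.spam if is_spam else self.ham).update(tokens)


def maybe_force_train(counts, tokens, custom_verdict):
    """Train the learner only when a custom filter gave a definite verdict.

    Returns True if training happened, False otherwise.
    """
    if custom_verdict == ALWAYS_SPAM:
        counts.train(tokens, is_spam=True)
        return True
    if custom_verdict == NEVER_SPAM:
        counts.train(tokens, is_spam=False)
        return True
    # "usually spam" / "usually not spam" are only hints: no forced training.
    return False
```

With this shape, a match against an "always spam" URL rule would feed the 
message's tokens into the spam counts without anyone marking it by hand.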

> There is a second possible addition to use a simple 'meta'-evaluator,
> which uses the results of all filters - Bayesian and others - to judge
> the content. This way it can change the weight of individual filters
> with time, so certain filters might expire. Such an evaluator
> 'theoretically' has the potential to improve the overall model, without
> adding a significant performance cost.

Currently the module simply combines the total of all filter methods, and 
decides whether or not a given message is spam based on that total.  I 
understand your proposal to be adjusting a given filter method when it 
disagrees with the overall average of all filter methods.  However, I'm 
not sure if this is realistic.  The filter methods aren't really 
compatible -- it's perfectly normal for one filter method to suggest a 
message is not spam, and for another filter method to suggest it is 
spam.

(For example, if a message doesn't have any URLs, the URL filter and the 
URL counter will always say this is probably not spam.  That is correct 
for these filter methods.)
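
As a rough illustration of combining totals, here is a small Python 
sketch.  The filter functions, their scores, and the threshold are all 
made-up assumptions for the example, not values from the spam module:

```python
import re

URL_RE = re.compile(r"https?://\S+")


def url_counter(text, limit=3):
    """Hypothetical URL counter: many URLs look spammy, few do not."""
    n = len(URL_RE.findall(text))
    return 40 if n > limit else -10  # made-up weights


def keyword_filter(text, words=("casino", "viagra")):
    """Hypothetical custom filter: flags messages containing listed words."""
    lowered = text.lower()
    return 60 if any(w in lowered for w in words) else 0


def is_spam(text, filters, threshold=50):
    """Sum every filter's score and compare the total to a threshold."""
    total = sum(f(text) for f in filters)
    return total >= threshold, total
```

For "Win big at the casino tonight" the URL counter votes "probably not 
spam" (-10) because there are no URLs, the keyword filter votes spam (60), 
and the total of 50 still reaches the threshold.  The two methods 
disagree, yet neither is wrong for what it measures.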

> Jeremy, I have a suggestion to change a bit the code of the Bayesian
> filter, do you want me to post it as a patch/feature or send you an
> email? It is not ready as a patch at the moment - it is part of
> that classifier I was mumbling about a month ago, but it might(tm)
> speed up the evaluation of the spam probability. Can't benchmark it
> properly at the moment.

The best thing to do is to open an issue in the spam project.  I know you 
have already emailed the idea, but please still open an issue.  I will be 
busy the next few days, and by opening an issue you can be sure I don't 
forget to look at this...

Thanks,
  -Jeremy