On Mon, 24 Jan 2005, Vladimir Zlatanov wrote: [...]
facts: Custom filters are immediate - equivalent to true or false. The Bayesian filter accumulates evidence.
Not a complete fact. There are four possible settings "always spam == true", "usually spam == probably true", "usually not spam == probably false", "never spam == false".
What I propose is to amend the learner with something like: if the custom filter says BAD and the learner does not, this is an error, so the learner needs to learn it. This way the learner will change its behaviour to accommodate the new BAD thing in its statistics.
This would be simple to do. Certain events could force the filter into TEFT mode for a given message, i.e., when it matches "always spam" or "never spam". In particular, I like this as it would auto-train the URL filter. I've given this thought already, and it's one of several improvements I have planned.
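As a rough illustration of the idea, here is a minimal Python sketch, not the module's actual code. It assumes a hypothetical token-count store (`TokenStats`) and a hypothetical helper (`maybe_force_train`) that trains the learner only when a custom filter returns one of the two definitive verdicts.

```python
from collections import Counter

# Hypothetical mapping: only the two definitive verdicts force training.
FORCE_TRAIN = {"always spam": True, "never spam": False}


class TokenStats:
    """Hypothetical stand-in for the Bayesian filter's token statistics."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Add the tokens to the spam or non-spam counts.
        target = self.spam if is_spam else self.ham
        target.update(tokens)


def maybe_force_train(stats, tokens, custom_verdict):
    """Train on the message if the custom filter verdict is definitive.

    Returns True if training happened, False otherwise.
    """
    if custom_verdict in FORCE_TRAIN:
        stats.train(tokens, FORCE_TRAIN[custom_verdict])
        return True
    return False
```

With this shape, a message matching "usually spam" or "usually not spam" leaves the statistics untouched, while "always spam" and "never spam" matches feed the learner automatically.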
A second possible addition is a simple 'meta'-evaluator, which uses the results of all filters, Bayesian and others, to judge the content. This way it can change the weight of individual filters over time, so certain filters might expire. Such an evaluator 'theoretically' has the potential to improve the overall model without adding a significant performance cost.
Currently the module simply combines the total of all filter methods, and decides whether or not a given message is spam based on that total. I understand your proposal to be adjusting a given filter method if it's in disagreement with the overall average of all filter methods. However, I'm not sure if this is realistic. The filter methods aren't really compatible -- it's perfectly normal for one filter method to suggest a message is not spam, and for another filter method to suggest it is spam. (For example, if a message doesn't have any URLs, the URL filter and the URL counter will always say this is probably not spam. That is correct for these filter methods.)
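To make the "combine the total" behaviour concrete, here is a small Python sketch under stated assumptions. The filter names, score values, and threshold are all hypothetical, and the real module may weight or scale its filters differently.

```python
def is_spam(scores, threshold):
    """Sum the per-filter scores and compare the total to a threshold.

    `scores` maps a filter name to a signed score: positive leans spam,
    negative leans not spam. Both the scale and the threshold are
    assumptions made for this illustration.
    """
    total = sum(scores.values())
    return total >= threshold


# Hypothetical message with no URLs: the URL-based filters lean
# "not spam", which is correct for them, while the Bayesian filter
# leans "spam". The verdict comes from the total only.
example_scores = {"bayesian": 60, "url filter": -10, "url counter": -10}
verdict = is_spam(example_scores, threshold=30)
```

In this example the URL filters disagree with the final verdict without being wrong, which is why penalising a filter for disagreeing with the overall average may not be meaningful.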
Jeremy, I have a suggestion to change the code of the Bayesian filter a bit. Do you want me to post it as a patch/feature, or send you an email? It is not ready as a patch at the moment - it is part of that classifier I was mumbling about a month ago, but it might(tm) speed up the evaluation of the spam probability. I can't benchmark it properly at the moment.
The best thing to do is to open an issue in the spam project. I know you have already emailed the idea, but please still open an issue. I will be busy the next few days, and by opening an issue you can be sure I don't forget to look at this... Thanks, -Jeremy