That was exactly my concern and idea. I found that on all my sites that were hit by spam (the five sites I maintain), the spam was exactly the same. The most popular of the five was hit first, but the other four each had to learn to handle the *exact same spam messages*.
I believe sites like sfx can be considered "master brains", since they learn through a much bigger educating audience, but also because they are much more interesting spam targets. I did not think of spreading things like regexps, merely things like tokens.
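To make the token-sharing idea a bit more concrete, here is a minimal sketch of how a smaller site might fold spam-token counts learned by a "master brain" into its own table. The function name `merge_token_counts`, the `trust` weight, and the plain token-to-count mapping are all assumptions for illustration; this is not how spam.module actually stores or exchanges its tokens.

```python
from collections import Counter


def merge_token_counts(local, remote, trust=0.5):
    """Fold a trusted peer's spam-token counts into the local table.

    `local` and `remote` map token -> number of spam messages it was seen in.
    `trust` (hypothetical parameter) scales down the remote evidence so that
    locally learned counts still dominate.
    """
    merged = Counter(local)
    for token, count in remote.items():
        merged[token] += count * trust
    return merged


# Counts learned locally on a small site, and counts published by a busy one.
local_counts = {"viagra": 2}
master_counts = {"viagra": 10, "casino": 4}

merged_counts = merge_token_counts(local_counts, master_counts, trust=0.5)
```

With a weight below 1, a token the small site has never seen still becomes known to it, but carries less evidence than it would if the site had learned it first-hand.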
I was doing some 'related' work over the last couple of days. Granted, it is not spam filtering, but it is about learning to classify text.

Having a network of trusted RSS/RDF feeds would be great for propagating 'learnable' things, in this particular case spam. It could be even better if we had source tracking and some content_id that is unique across the whole network. Think of the chain site 1 RSS feed -> site 2 RSS feed -> site 3 RSS feed -> site 1. Such a network would surely be diverse enough to confuse a generator, but interconnected enough to be efficient.

I was also doing some thinking on the Bayesian + regex/custom filter strategy that spam.module uses. It would be really good to modify the learning and rating algorithms to account for the immediate, short-term effect of the regex filters, and the effect that has on the learned spam. Does anyone have any statistics on this?

Cheers,
Vlado

-- 
Vladimir Zlatanov <vlado@dikini.net>
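As a rough illustration of the source tracking and content_id idea above, here is a small sketch of how a site could avoid re-learning an item that has travelled around the site 1 -> site 2 -> site 3 -> site 1 chain. The `FeedItem` and `Site` classes, the `path` field, and the method names are hypothetical; no existing feed format or module is implied, and actual RSS parsing is left out.

```python
from dataclasses import dataclass, field


@dataclass
class FeedItem:
    content_id: str                            # assumed unique across the whole network
    tokens: list                               # the 'learnable' payload, e.g. spam tokens
    path: list = field(default_factory=list)   # source tracking: sites already visited


class Site:
    def __init__(self, name):
        self.name = name
        self.seen = set()      # content_ids this site has already handled
        self.learned = []      # items this site actually learned from

    def publish(self, content_id, tokens):
        """Originate a new item and remember it so it is not re-learned later."""
        self.seen.add(content_id)
        return FeedItem(content_id, tokens, [self.name])

    def receive(self, item):
        """Learn from an item once; return the copy to republish, or None."""
        if item.content_id in self.seen or self.name in item.path:
            return None        # loop or duplicate: stop propagation here
        self.seen.add(item.content_id)
        self.learned.append(item)
        return FeedItem(item.content_id, item.tokens, item.path + [self.name])


site1, site2, site3 = Site("site1"), Site("site2"), Site("site3")

original = site1.publish("abc123", ["cheap", "pills"])
via_site2 = site2.receive(original)     # site 2 learns and republishes
via_site3 = site3.receive(via_site2)    # site 3 learns and republishes
back_home = site1.receive(via_site3)    # site 1 recognises its own item
```

Either check alone would break the loop here; keeping both means a site can also tell which route an item took, which is the part that would matter for deciding how much to trust it.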