All that is good, but it introduces another problem: if a spammer knows a filter, or a subset of it, it is trivial to write a spam generator based on that filter.
All that is being shared with the MT blacklist and the custom filters is regexes for domains.
Speed is the key with this, not privacy. Indeed.
If spammers learn that when they spam a new URL, a Bayesian filter will pick up their spam message, generate a regex, notify a core, which will notify all other sites, which will then block it for the future, and remove it in the past... then URL spamming of this kind will eventually die. Yes, that is the scenario. My point is that in Drupal and WordPress you already have those Bayesian filters, so exchanging the actual spammy content, rather than just the regex for the URL, should give an advantage in catching this particular spam, and spam similar to it.
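The "generate a regex" step in that scenario could look roughly like the sketch below. It is only an illustration in Python: the function names `domain_rules` and `is_blocked` are made up here, the example domain is fictional, and how rules are sent to a core and on to other sites is left out entirely.

```python
import re
from urllib.parse import urlparse

# Loose URL matcher; good enough for a sketch, not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def domain_rules(spam_text):
    """Return one regex source string per distinct host linked in spam_text."""
    rules = []
    for url in URL_RE.findall(spam_text):
        host = (urlparse(url).hostname or "").lower().rstrip(".")
        if host.startswith("www."):
            host = host[4:]
        if not host:
            continue
        # Match the host itself or any subdomain of it, but not a longer
        # host that merely contains it (e.g. notspam.test or spam.test.evil).
        rule = r"(?:^|[/.@])" + re.escape(host) + r"(?![\w-])(?!\.\w)"
        if rule not in rules:
            rules.append(rule)
    return rules

def is_blocked(comment, rules):
    """True if any shared rule matches the comment text."""
    return any(re.search(rule, comment, re.IGNORECASE) for rule in rules)

# Hypothetical spam message caught by a local filter.
rules = domain_rules("Cheap pills at http://www.pills-example.test/buy now")
```

A receiving site would then call `is_blocked(comment, rules)` on new and stored comments. Note that this shares only the domain pattern, which is exactly the limitation discussed above: the spammy content itself never leaves the site that caught it.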
What I don't like about exchanging hard rules is that you will end up with a unified database of URLs, regexes, whatever - there are the other methods as well. Hard rules mean clever people will find a way around them - just look at the progress of the email spam fight. Spam's real aim is PR, but not always in a direct way. A lot of spammers are spamming in order to downgrade the PR of their competitors. They are fighting a war of their own.
What I suggest is a diversity of trained agents, regardless of implementation: hardcoded regexes, your typical statistical filter (Bayes, Fisher-Robinson, CRM114, whatever), or some yet unknown filter (I'm working on some algorithms as well, a bit of shameless self promotion). This diversity creates uncertainty about what to target and about what will not be picked up quickly by the community. Fighting it on your own is pointless: you will lose in the long run for lack of resources. If you use a communal exchange, then the community overall stands a very good chance. A kind of FUD measure, but this time aimed at the spammers.
In the short run the regex exchange is going to work. It is a quick, radical measure. Its downside is the downside of every overtrained learning algorithm: it loses precision badly, especially in an evolving system, and it does not adapt well. A brief comparison: regex exchange - surgery; content exchange between learning systems - holistic medicine. I would prefer the second in most cases, but surgery still has its valid place for rapid intervention, life saving, etc...
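To make the "diversity of agents" idea a bit more concrete, here is a rough Python sketch, not a real implementation of any of the filters named above. `TinyBayes` is a toy naive Bayes filter with add-one smoothing, `regex_agent` wraps a list of hard rules, and `community_verdict` flags a comment if any agent objects. All of these names, the training messages, and the 0.5 threshold are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokens(text):
    return TOKEN_RE.findall(text.lower())

class TinyBayes:
    """Toy naive Bayes text filter with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Exchanged spammy content from other sites would be fed in here.
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.counts[label].values())
            for tok in tokens(text):
                p = (self.counts[label][tok] + 1) / (total + vocab)
                log_odds += sign * math.log(p)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))

def regex_agent(patterns):
    """Hard-rule agent: scores 1.0 if any pattern matches, else 0.0."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return lambda text: 1.0 if any(c.search(text) for c in compiled) else 0.0

def community_verdict(text, agents, threshold=0.5):
    """Flag text as spam if any single agent scores at or above threshold."""
    return any(agent(text) >= threshold for agent in agents)

# Hypothetical training data standing in for exchanged content.
bayes = TinyBayes()
bayes.train("cheap pills online casino buy now", "spam")
bayes.train("buy cheap casino chips online", "spam")
bayes.train("great post about drupal modules", "ham")
bayes.train("thanks for the wordpress filter tips", "ham")

agents = [regex_agent([r"pills-example\.test"]), bayes.spam_probability]
```

Flagging on any single agent is the aggressive choice; a majority vote or weighted score would trade recall for fewer false positives. Either way, a spammer who works around one agent still has to guess what the others have learned.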
Pipe dream? Yes, it's a war and the spammers are as resourceful and organised as the CMS/Blog engine developers. They'll find a new way to spam. But with a bit of luck, it won't involve comment/http-referer attacks. And they are just as smart, and have been fighting this for longer.
Cheers, Vlado -- Vladimir Zlatanov <vlado@dikini.net>