Re: [drupal-devel] Dealing with spam (was rel=nofollow)

21 Jan 2005

      ...
...
All that is good, but introduces another problem:
If a spammer knows a filter or a subset of it, it is trivial to write
a generator for spam based on the filter.
All that is being shared with the MT blacklist and the custom filters is
regexes for domains.
...
Speed is the key with this, not privacy.
Indeed.
...
If spammers learn that when they spam a new URL, a Bayesian filter will pick
up their spam message, generate a regex, notify a core, which will notify
all other sites, which will then block it for the future, and remove it in
the past... then URL spamming of this kind will eventually die.
Yes, that is the scenario. My point is that in Drupal and WordPress you
already have those Bayesian filters, so exchanging the actual spammy content,
rather than just the regex for the URL, should give an advantage in
catching this particular spam, and spam similar to it.
What I don't like about exchanging hard rules is that you will end up
with a unified database of URLs, regexes, whatever - there are the other
methods as well. Hard rules mean clever people will find a way around
them - just look at the progress of the email spam fight.
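
To make the difference concrete, here is a rough sketch, not anything taken
from the Drupal or WordPress code. The blacklisted domain, the sample
comments and the TinyBayes class are all made up for illustration. It shows
a shared domain regex missing the same spam posted under a new domain, while
a toy naive Bayes filter trained on exchanged content still scores it as spam.

```python
import math
import re
from collections import Counter

# Hypothetical shared hard rule: a regex for one known spam domain.
BLACKLIST = [re.compile(r"cheap-pills\.example")]

def blocked_by_regex(text):
    return any(rx.search(text) for rx in BLACKLIST)

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

class TinyBayes:
    """Toy naive Bayes with add-one smoothing; illustration only."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_score(self, text):
        # Log-odds of spam versus ham; a value above 0 leans spam.
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        all_docs = sum(self.docs.values())
        logp = {}
        for label in ("spam", "ham"):
            total = sum(self.counts[label].values())
            lp = math.log((self.docs[label] + 1) / (all_docs + 2))
            for t in tokens(text):
                lp += math.log((self.counts[label][t] + 1) / (total + len(vocab) + 1))
            logp[label] = lp
        return logp["spam"] - logp["ham"]

# Made-up training content, standing in for what sites would exchange.
f = TinyBayes()
f.train("buy cheap pills online at cheap-pills.example", "spam")
f.train("discount pills, best price, buy now", "spam")
f.train("thanks for the patch, the menu module works now", "ham")
f.train("I think the node access code needs a review", "ham")

# Same spam, new domain: the hard rule misses it, the trained filter does not.
evasive = "buy cheap pills online at new-domain.example"
print(blocked_by_regex(evasive))   # the regex does not match the new domain
print(f.spam_score(evasive) > 0)   # the content still looks like spam
```

The toy filter is of course far weaker than a real one; the point is only
that the exchanged content carries more signal than the URL rule alone.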

Spam's real aim is PR, but not always in a direct way. A lot of spammers
are spamming in order to downgrade the PR of their competitors. They
are fighting a war of their own.

What I suggest is having a diversity of trained agents, regardless of
implementation - hardcoded regexes, your typical statistical filter
(Bayes, Fisher-Robinson, CRM114, whatever), some yet unknown filter
(I'm working on some algorithms as well, a bit of shameless self
promotion). This diversity creates uncertainty about what to target and
what will not be picked up quickly by the community. If you fight it
on your own there is no point - you will lose in the long run due to
lack of resources. If you use a communal exchange, then the community
overall stands a very good chance. A kind of FUD measure, but this time
aimed at the spammers.
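
A minimal sketch of what I mean by diversity, under heavy assumptions: the
three agents, the site names and the thresholds below are invented for
illustration, and the keyword agent is only a stand-in for a real trained
statistical filter. Each site runs a different agent, and a message that is
tuned to slip past one of them can still be flagged by another and reported
to the community.

```python
import re

def regex_agent(text):
    # Hard rule: a hypothetical blacklisted domain.
    return re.search(r"cheap-pills\.example", text) is not None

def link_density_agent(text):
    # Heuristic: more than two links in one comment looks spammy.
    return len(re.findall(r"https?://", text)) > 2

def keyword_agent(text):
    # Stand-in for a trained statistical filter.
    words = re.findall(r"[a-z]+", text.lower())
    spammy = {"pills", "casino", "cheap", "viagra"}
    hits = sum(w in spammy for w in words)
    return bool(words) and hits / len(words) > 0.2

# Hypothetical sites, each running a different kind of agent.
SITES = {
    "site_a": [regex_agent],
    "site_b": [keyword_agent],
    "site_c": [link_density_agent],
}

def flagged_by(text, sites=SITES):
    """Return the names of the sites whose agents flag the text."""
    return [name for name, agents in sites.items()
            if any(agent(text) for agent in agents)]

# The spammer dodges the known regex by switching domains,
# but cannot know which other agents are out there.
evasive = "cheap pills at http://new-domain.example"
print(flagged_by(evasive))
```

With a communal exchange, one site flagging the message is enough for the
others to learn about it, whatever filter they happen to run themselves.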

In the short run the regex exchange is going to work. It is a quick,
radical measure. Its downside is the downside of every overtrained
learning algorithm: it loses precision big time, especially in an
evolving system. It does not adapt well.

A brief comparison: regex exchange - surgery; content exchange between
learning systems - holistic medicine. I would prefer the second in most
of the cases, but surgery still has its valid place for rapid
intervention, life saving, etc...
...
Pipe dream? Yes, it's a war and the spammers are as resourceful and
organised as the CMS/Blog engine developers. They'll find a new way to
spam. But with a bit of luck, it won't involve comment/http-referer attacks.
And they are just as smart, and have been fighting this for a longer time.
Cheers,
Vlado
-- 
Vladimir Zlatanov <vlado@dikini.net>