[drupal-devel] Dealing with spam (was rel=nofollow)

Wed Jan 19 09:16:42 UTC 2005

It appears that WordPress has a pretty killer spam killer plugin (ha!)
that we might take some hints (code?) from... Frankly, I think that we
open source CMS/blog apps should put our heads together and root this
problem out collectively. Why duplicate effort when we all need the
same tools?

From http://unknowngenius.com/blog/wordpress/spam-karma/

*Spam Karma*

Layman's Explanation

Spam Karma works by running every new comment through a battery of
filters and checks, each of which increases or decreases the comment's
'Karma' value. Depending on the final score, the comment is either:

    * Approved
    * Discarded silently as spam (no email is sent to you unless you
specifically request it, but a digest is sent to you every X spams
deleted).
    * Placed in Moderation mode, with the possibility for the
commenter to auto-moderate his own comment by proving he's not a
spammer (by filling in a Captcha or checking a confirmation email).
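
The three outcomes above amount to comparing the final score against two
cut-offs. A minimal sketch in Python, assuming made-up threshold values
(APPROVE_AT and DISCARD_BELOW are hypothetical names; the description does
not say what numbers Spam Karma actually uses):

```python
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    MODERATE = "moderate"
    DISCARD = "discard"


# Hypothetical thresholds, chosen only for illustration.
APPROVE_AT = 0.0
DISCARD_BELOW = -10.0


def classify(karma: float) -> Verdict:
    """Map a final karma score to one of the three outcomes."""
    if karma >= APPROVE_AT:
        return Verdict.APPROVE
    if karma < DISCARD_BELOW:
        return Verdict.DISCARD
    # Everything in between is ambiguous and goes to moderation.
    return Verdict.MODERATE
```

Keeping the discard cut-off well below the approve cut-off is what leaves
room for a moderation band in between.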

This whole process ensures (in order of priority):

    * No deleted false positives (bad bad bad).
    * Extremely few moderated false positives (annoying): uses Captcha
and email auto-moderation to keep these to a minimum.
    * No published spam.
    * Very little spam held in moderation (it must be destroyed directly:
it is really annoying to have to moderate it).

Furthermore, Spam Karma works in an intelligent way to automatically
update its filtering database and grow stronger with each spam it
catches…

In short: blocks spam with no unnecessary annoyance, for you or your
visitors. The way it should be.

The Detailed Explanation

For our more tech-oriented friends, here are a few more insights on
the rather complex process used by Spam Karma to decide what's spam
and what's not. Each of the following filters is given a weight that
depends on many factors, ranging from user-controlled values (e.g. after
how many days is a post "old"?) to the credibility that can be given to
a test (e.g. a missing header is less important than a blacklisted IP).
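
One way to picture the weighting is a list of filters, each with its own
weight, whose results are summed into the Karma value. The sketch below is
an assumption about the general shape, not Spam Karma's code: the filter
names, the weights, the OLD_POST_DAYS setting and the comment fields are
all invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Comment = Dict[str, Any]

# Hypothetical user-controlled setting: after how many days a post is "old".
OLD_POST_DAYS = 30
# Hypothetical blacklist (address taken from a documentation-only range).
BLACKLISTED_IPS = {"203.0.113.7"}


@dataclass
class Filter:
    name: str
    weight: float                      # how much credibility the test gets
    check: Callable[[Comment], float]  # -1.0 = looks like spam, 0.0 = neutral


def missing_header(comment: Comment) -> float:
    return -1.0 if not comment.get("user_agent") else 0.0


def blacklisted_ip(comment: Comment) -> float:
    return -1.0 if comment.get("ip") in BLACKLISTED_IPS else 0.0


def old_post(comment: Comment) -> float:
    return -1.0 if comment.get("post_age_days", 0) > OLD_POST_DAYS else 0.0


# A blacklisted IP counts for more than a missing header.
FILTERS: List[Filter] = [
    Filter("missing_header", 1.0, missing_header),
    Filter("blacklisted_ip", 5.0, blacklisted_ip),
    Filter("old_post", 2.0, old_post),
]


def total_karma(comment: Comment, filters: List[Filter] = FILTERS) -> float:
    """Sum each filter's result scaled by its weight."""
    return sum(f.weight * f.check(comment) for f in filters)
```

A module-based system could let each module contribute its own Filter
entries to the list.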

Mostly, Spam Karma looks at the following things:

    * Whether the poster is logged in to the current blog, and what his
user level is (e.g. automatically approve Admin posts).
    * Presence of HTML entities (e.g. &#123;, &#666; etc).
    * Presence of an HTTP_VIA header.
    * Proper use of the posting form (hash value must be present).
    * Time taken to fill in the comment (e.g. if it's less than a few
seconds, most likely spam).
    * Posting granularity: first-time posters posting many comments at
once vs. old-timers (with comments previously approved by the admin).
    * Previous diagnostic from WP's built-in comment check (set on the
'Discussion' panel).
    * IP and regex match for URLs contained inside the comment (small
weight only for non-URL text matching a URL regex).
    * Realtime Blacklist (RBL) server check for IP and URLs.
    * Comment's age (e.g. penalize comments on very old posts).
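
The form-hash and time-taken checks in the list above could share one
mechanism: embed a signed timestamp in the comment form and verify it on
submission. This is only a guess at how such a check might be built, not
how Spam Karma does it; SECRET, MIN_SECONDS, issue_token and check_token
are hypothetical names.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"change-me"  # hypothetical per-site secret
MIN_SECONDS = 5        # hypothetical "too fast to be human" limit


def _sign(post_id: int, ts_text: str) -> str:
    message = f"{post_id}:{ts_text}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def issue_token(post_id: int, now: Optional[float] = None) -> str:
    """Build the hidden form value when the comment form is rendered."""
    ts_text = str(int(time.time() if now is None else now))
    return f"{ts_text}:{_sign(post_id, ts_text)}"


def check_token(post_id: int, token: str, now: Optional[float] = None) -> str:
    """Return 'invalid', 'too_fast' or 'ok' for a submitted form value."""
    ts_text, _, signature = token.partition(":")
    expected = _sign(post_id, ts_text)
    # Constant-time comparison; a missing or forged hash fails here.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return "invalid"
    elapsed = (time.time() if now is None else now) - int(ts_text)
    return "too_fast" if elapsed < MIN_SECONDS else "ok"
```

A bot that posts straight to the comment handler without loading the form
has no valid token, and one that replays the form instantly trips the
timing check. Each result would then feed into the score rather than
reject the comment outright.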

In addition to these filters, Spam Karma uses different treatments and
backup checks to ensure it becomes better at stopping further spam and
that it never mistakenly deletes a legitimate comment:

    * Ambiguous comments (that can be neither deleted nor approved) are
given a second check: the commenter is asked to solve a Captcha or use
the email auto-moderation (an email containing a hash to unlock the
comment is sent to the commenter's email address). If confirmed, the
comment's Karma is bumped up and the comment is either published or
held for further review. If not confirmed within a certain period, its
Karma is lowered and it is either deleted or kept in moderation (if
its Karma was sufficiently high to begin with).
    * When a comment is struck as spam, its IP and URL(s) are
harvested and submitted to the Admin for inclusion in the blacklist.
In the meantime, they are used as "auto-added" values, with a lesser
weight than permanent blacklist entries.
    * When destroying a spam comment, it checks for recently posted
comments that match similar values and retroactively moderates them
(e.g. a spammer could manage to slip X spams onto a blog,
but upon reaching a certain suspicious threshold, all the comments
would get retroactively moderated, then deleted).
    * Spam Karma uses a central DB to retrieve IP and URL updates. By
default, it will query the DB automatically every 2 days (can be
disabled). Central DB can be configured. Each install of Spam Karma
can work as a sort of P2P relay in the update process (both fetching
updates and publishing its own updated list for others to grab).
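
The harvesting and retroactive moderation steps described above might
look roughly like this. It is a simplified sketch under assumptions: the
Comment fields, the URL regex and the "any shared IP or URL" matching rule
are all made up here, and the suspicion threshold mentioned in the
description is left out.

```python
import re
from dataclasses import dataclass
from typing import List, Set

# Deliberately loose URL pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


@dataclass
class Comment:
    ip: str
    text: str
    status: str = "approved"


def harvest(comment: Comment) -> Set[str]:
    """Collect the IP and every URL found in a comment."""
    return {comment.ip, *URL_RE.findall(comment.text)}


def retro_moderate(spam: Comment, recent: List[Comment]) -> int:
    """Pull back recently approved comments that share an IP or URL
    with a comment just destroyed as spam. Returns how many were hit."""
    markers = harvest(spam)
    hits = 0
    for comment in recent:
        if comment.status == "approved" and markers & harvest(comment):
            comment.status = "moderated"
            hits += 1
    return hits
```

The same harvested markers could be stored as "auto-added" blacklist
entries with a lower weight until the admin confirms them.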