It appears that WordPress has a pretty killer spam killer plugin (ha!) that we might take some hints (code?) from... Frankly, I think that we open source CMS/blog developers should put our heads together and root this problem out collectively. Why duplicate effort when we all need the same tools?
*Spam Karma*

Layman's Explanation

Spam Karma works by running every new comment through a battery of filters and checks, each of which increases or decreases the comment's 'Karma' value. Depending on the final score, the comment is either:

* Approved.
* Discarded silently as spam (no email is sent to you unless you specifically request it, but a digest is sent to you every X spams deleted).
* Placed in Moderation mode, with the possibility for the commenter to auto-moderate his own comment by proving he's not a spammer (by filling in a Captcha or checking a confirmation email).

This whole process ensures (in order of priority):

* No deleted false positives (bad bad bad).
* Extremely few moderated false positives (annoying): uses Captcha and email auto-moderation to keep these to a minimum.
* No published spam.
* Very little spam held in moderation (must be destroyed directly: really annoying to have to moderate it).

Furthermore, Spam Karma works in an intelligent way to automatically update its filtering database and grow stronger with each spam it catches…

In short: blocks spam with no unnecessary annoyance, for you or your visitors. The way it should be.

The Detailed Explanation

For our more tech-oriented friends, here are a few more insights into the rather complex process Spam Karma uses to decide what's spam and what's not. Each of the following filters is given a weight that depends on many factors, ranging from user-controlled values (e.g.: after how many days is a post "old"?) to the credibility that can be given to a test (e.g.: a missing header is less important than a blacklisted IP).

Mostly, Spam Karma looks at the following things:

* Whether the poster is logged in to the current blog, and what his user level is (e.g. automatically approve Admin posts).
* Presence of HTML entities (e.g. &#123;, &#666;, etc.).
* Presence of an HTTP_VIA header.
* Proper use of the posting form (hash value must be present).
* Time taken to fill in the comment (e.g.: if it's less than a few seconds, most likely spam).
* Posting granularity: first-time posters posting many comments at once vs. old-timers (with comments previously approved by the admin).
* Previous diagnostic from WP's built-in comment check (set on the 'Discussion' panel).
* IP and regex match for URLs contained inside the comment (small weight only for non-URL text matching a URL regex).
* Realtime Blacklist (RBL) Server check for IP and URLs.
* Comment's age (e.g. penalize comments on very old posts).

In addition to these filters, Spam Karma uses different treatments and backup checks to ensure that it becomes better at stopping further spam and that it never mistakenly deletes a legitimate comment:

* Ambiguous comments (that can be neither deleted nor approved) are given a second check: the commenter is asked to solve a Captcha or use the email auto-moderation (an email containing a hash to unlock the comment is sent to the commenter's email address). If confirmed, the comment's Karma is bumped up and the comment is either published or held for further review. If not confirmed within a certain period, its Karma is lowered and it is either deleted or kept in moderation (if it was sufficiently high to begin with).
* When a comment is marked as spam, its IP and URL(s) are harvested and submitted to the Admin for inclusion in the blacklist. In the meantime, they are used as "auto-added" values, with a lesser weight than permanent blacklist entries.
* When destroying a spam comment, it checks for recently posted comments that match similar values and retroactively moderates them (e.g.: a spammer could manage to slip X spam comments onto a blog, but upon reaching a certain suspicious threshold, all of those comments would get retroactively moderated, then deleted).
* Spam Karma uses a central DB to retrieve IP and URL updates. By default, it will query the DB automatically every 2 days (can be disabled). The central DB can be configured. Each install of Spam Karma can work as a sort of P2P relay in the update process (both fetching updates and publishing its own updated list for others to grab).
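
For anyone wondering what the weighted-filter idea described above might look like in code, here is a rough sketch in Python (not the plugin's actual PHP). It only shows the general shape: a handful of filters each nudge a Karma value up or down, thresholds turn the total into approve / moderate / discard, and a second check bumps or lowers the Karma of ambiguous comments. Every name, field, weight and threshold here is made up for illustration; none of it is taken from Spam Karma itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    APPROVE = "approve"
    MODERATE = "moderate"
    DISCARD = "discard"


@dataclass
class Comment:
    # Hypothetical fields standing in for what a plugin would inspect.
    author_is_admin: bool = False
    has_via_header: bool = False      # request carried an HTTP_VIA header
    form_hash_present: bool = True    # hidden hash from the posting form
    seconds_to_fill: float = 60.0     # time between form load and submit
    ip_blacklisted: bool = False
    post_age_days: int = 0


# Invented weights: a missing header counts for less than a blacklisted IP.
ADMIN_BONUS = 10.0
VIA_PENALTY = -1.0
MISSING_HASH_PENALTY = -3.0
FAST_FILL_PENALTY = -2.0
BLACKLIST_PENALTY = -5.0
OLD_POST_PENALTY = -1.0

# Invented user-controlled settings.
MIN_FILL_SECONDS = 3.0
OLD_POST_DAYS = 30

# Each filter maps a comment to a Karma adjustment (positive = more trusted).
FILTERS: List[Callable[[Comment], float]] = [
    lambda c: ADMIN_BONUS if c.author_is_admin else 0.0,
    lambda c: VIA_PENALTY if c.has_via_header else 0.0,
    lambda c: 0.0 if c.form_hash_present else MISSING_HASH_PENALTY,
    lambda c: FAST_FILL_PENALTY if c.seconds_to_fill < MIN_FILL_SECONDS else 0.0,
    lambda c: BLACKLIST_PENALTY if c.ip_blacklisted else 0.0,
    lambda c: OLD_POST_PENALTY if c.post_age_days > OLD_POST_DAYS else 0.0,
]

# Invented thresholds for the three-way decision.
APPROVE_AT = 0.0
DISCARD_AT = -5.0


def karma(comment: Comment) -> float:
    """Sum the adjustments from every filter."""
    return sum(f(comment) for f in FILTERS)


def verdict_for(score: float) -> Verdict:
    """Turn a Karma score into approve / moderate / discard."""
    if score >= APPROVE_AT:
        return Verdict.APPROVE
    if score <= DISCARD_AT:
        return Verdict.DISCARD
    return Verdict.MODERATE


def second_check(score: float, confirmed: bool, bump: float = 3.0) -> Verdict:
    """Re-judge an ambiguous comment after the Captcha/email step.

    Confirmation raises the Karma; no confirmation lowers it.
    """
    return verdict_for(score + bump if confirmed else score - bump)
```

With these made-up numbers, a comment that only trips the HTTP_VIA check lands in moderation, while one with a blacklisted IP and a missing form hash is discarded outright. The point of keeping filters as a plain list is that each app could add or re-weight checks without touching the decision logic.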
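
The email auto-moderation step mentioned above (an email containing a hash that unlocks the comment) could be done in several ways. The description doesn't say how Spam Karma builds its hash, so the following is just one plausible approach, swapped in for illustration: an HMAC over the comment ID and email address, using Python's standard `hmac` module. The secret key, function names and message layout are all assumptions.

```python
import hashlib
import hmac

# Hypothetical per-site secret; a real install would generate and store its own.
SECRET_KEY = b"replace-with-a-per-site-secret"


def make_unlock_token(comment_id: int, email: str, secret: bytes = SECRET_KEY) -> str:
    """Build the hash that would be emailed to the commenter."""
    message = f"{comment_id}:{email.strip().lower()}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_unlock_token(comment_id: int, email: str, token: str,
                        secret: bytes = SECRET_KEY) -> bool:
    """Check a token coming back from the confirmation link."""
    expected = make_unlock_token(comment_id, email, secret)
    # Compare as bytes, in constant time, so arbitrary user input is safe to pass.
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

Because the token is derived from the comment ID and address, nothing extra needs to be stored per comment: the site recomputes the hash when the link is clicked and only then bumps the comment's Karma.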