[drupal-devel] Dealing with spam (was rel=nofollow)
It appears that WordPress has a pretty killer spam killer plugin (ha!) that we might take some hints (code?) from... Frankly, I think that us open source CMS/blog apps should put our heads together and root this problem out collectively. Why duplicate effort when we all need the same tools?
*Spam Karma*

Layman's Explanation

Spam Karma works by running every new comment through a battery of filters and checks, each of which increases or decreases the comment's 'Karma' value. Depending on the final score, the comment is either:

* Approved
* Discarded silently as spam (no email is sent to you, unless you specifically require it, but a digest is sent to you every X spams deleted).
* Placed in Moderation mode, with the possibility for the commenter to auto-moderate his own comment by proving he's not a spammer (by filling in a Captcha or checking a confirmation email).

This whole process ensures (by order of priority):

* No deleted false positive (bad bad bad).
* Extremely few moderated false positives (annoying): uses Captcha and email auto-moderation to keep these at a minimum.
* No published spam.
* Very little spam held in moderation (must be destroyed directly: really annoying to have to moderate it).

Furthermore, Spam Karma works in an intelligent way to automatically update its filtering database and grow stronger with each spam it catches…

In short: blocks spam with no unnecessary annoyance, for you or your visitors. The way it should be.

The Detailed Explanation

For our more tech oriented friends, here are a few more insights on the rather complex process used by Spam Karma to decide what's spam and what's not. Each of the following filters is given a weight that depends on many factors, ranging from user-controlled values (e.g.: after how many days is a post "old"?) to the credibility that can be given to a test (e.g.: a missing header is less important than a blacklisted IP).

Mostly, Spam Karma looks at the following things:

* If the poster is logged in to the current blog, and what his user level is (e.g. automatically approve Admin posts).
* Presence of HTML entities (e.g. &#123;, &#666; etc).
* Presence of a HTTP_VIA header.
* Proper use of the posting form (hash value must be present).
* Time taken to fill the comment (e.g.: if it's less than a few seconds, most likely spam).
* Posting granularity. First time posters posting many comments at once vs. old-timers (with comments previously approved by the admin).
* Previous diagnostic from WP's built in comment check (set on the 'Discussion' panel).
* IP and regex match for URLs contained inside the comment (small weight only for non-URL text matching a URL regex).
* Realtime Blacklist (RBL) Server check for IP and URLs.
* Comment's age (e.g. penalize comments on very old post).

In addition to these filters, Spam Karma uses different treatments and backup checks to ensure it becomes better at stopping further spam and that it never mistakenly deletes a legitimate comment:

* Ambiguous comments (that can neither be deleted nor approved) are given a second check: the commenter is asked to solve a Captcha or use the email auto-moderation (an email containing a hash to unlock the comment is sent to the commenter's email address). If confirmed, the comment's Karma is bumped up and the comment is either published or held for further review. If not confirmed within a certain period, its Karma is lowered and it is either deleted or kept in moderation (if it was sufficiently high to begin with).
* When a comment is struck as spam, its IP and URL(s) are harvested and submitted to the Admin for inclusion in the blacklist. In the meantime, they are used as "auto-added" values, with a lesser weight than permanent blacklist entries.
* When destroying a spam comment, it checks for recently posted comments that match similar values and retroactively moderates them (e.g.: a spammer could manage to slip X numbers of spams onto a blog, but upon reaching a certain suspicious threshold, all the comments would get retroactively moderated, then deleted).
* Spam Karma uses a central DB to retrieve IP and URL updates. By default, it will query the DB automatically every 2 days (can be disabled). Central DB can be configured. Each install of Spam Karma can work as a sort of P2P relay in the update process (both fetching updates and publishing its own updated list for others to grab).
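The scoring scheme described above can be sketched roughly as follows. This is only an illustration in Python: the three checks, their weights, and both thresholds are hypothetical and are not taken from Spam Karma's code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Comment:
    body: str
    seconds_to_fill: float          # time between form display and submit
    author_is_admin: bool = False


# A check returns a karma delta: positive means "more trustworthy".
Check = Callable[[Comment], float]


def admin_check(c: Comment) -> float:
    # Hypothetical weight: logged-in admins get a large bonus.
    return 10.0 if c.author_is_admin else 0.0


def speed_check(c: Comment) -> float:
    # Hypothetical weight: a form filled in under 3 seconds looks automated.
    return -3.0 if c.seconds_to_fill < 3 else 0.0


def link_check(c: Comment) -> float:
    # Hypothetical weight: each link costs 2 points of karma.
    return -2.0 * c.body.lower().count("http://")


CHECKS: List[Check] = [admin_check, speed_check, link_check]


def karma(comment: Comment, checks: List[Check] = CHECKS) -> float:
    """Sum the deltas of every check into one score."""
    return sum(check(comment) for check in checks)


def classify(comment: Comment, approve_at: float = 0.0,
             discard_below: float = -5.0) -> str:
    """Map the final score onto the three outcomes described above."""
    score = karma(comment)
    if score >= approve_at:
        return "approve"
    if score < discard_below:
        return "discard"
    return "moderate"
```

With these made-up weights, a slow comment without links is approved, a single link lands in moderation, and three links posted within a second are discarded.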
I like this one. What's more, I'm willing to adapt (or help adapt) this for Drupal.
You're welcome to contribute patches to the spam module [1]. I recommend starting with code cleanup / optimization patches, rather than new feature patches. And if so inclined, please keep each of your patches small(ish) and focused on one logical area, to ease review and therefore ultimately to ease merging. [1] http://drupal.org/node/11104 -Jeremy
Hi, On Wednesday 19 January 2005 10:16, Chris Messina wrote:
It appears that WordPress has a pretty killer spam killer plugin (ha!) that we might take some hints (code?) from... Frankly, I think that us open source CMS/blog apps should put our heads together and root this problem out collectively. Why duplicate effort when we all need the same tools?
Spam.module does nearly all you quoted, minus the IP filtering and the whitelisting and blacklisting of IPs, which is listed in the TODOs of spam.module. -- Regards, Bèr -- [ Bèr Kessels | Drupal services www.webschuur.com ]
Spam.module does nearly all you quoted, minus the IP filtering and the whitelisting and blacklisting of IPs, which is listed in the TODOs of spam.module.
At this time, I no longer have an intention to implement white/black listing of IP addresses. I analyzed the IPs from the thousands of spams I'm now getting on KernelTrap, and found that there was no useful pattern. Rather disappointing, I might add, as I loved the idea of minimizing the frequent spam floods. Alternative ideas are brewing, fortunately. ;) -Jeremy
Spam Karma works by running every new comment through a battery of filters and checks, each of which increases or decreases the comment's 'Karma' value. Depending on the final score, the comment is either:

* Approved
* Discarded silently as spam (no email is sent to you, unless you specifically require it, but a digest is sent to you every X spams deleted).
* Placed in Moderation mode, with the possibility for the commenter to auto-moderate his own comment by proving he's not a spammer (by filling in a Captcha or checking a confirmation email).
The concern I have is that when a comment looks like spam, a comment looks like spam. There's generally not a grey area in between, where a comment kinda looks like spam. Rarely, perhaps. (For example, the spam module just blocked a string of comment and forum postings by the same user of my site because he was including ~8 links back to his blog peppered through his posting. This is the first non-spam I've seen that did this, but had I been auto-deleting, I'd never have been the wiser.)

Anyway, the auto/silent discarding is the only thing that is not currently supported by the spam module. That, and the phrase "spam karma": internally it's called "weight" instead. Call it what you like. ;)

Due to the excessive amount of spam I'm getting, I intend to get back to spam module development soon, and to add some mass-deletion and even auto-mass-deletion functionality very soon. Specifically, I'll provide the option to relate one or more actions with each filter type. So, for example, any posting that matches the regex /online poker/i could be linked to "auto-delete", while any just matching /poker/i could be simply moved into the moderation queue.
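A rough sketch of the filter-to-action idea from the poker example, written in Python rather than as spam module code. The rule table, the action names, and the "first match wins" ordering are assumptions made for illustration; the proposal itself allows more than one action per filter.

```python
import re

# Hypothetical rule table: the most specific pattern comes first,
# because the first matching rule decides the action.
RULES = [
    (re.compile(r"online poker", re.IGNORECASE), "auto-delete"),
    (re.compile(r"poker", re.IGNORECASE), "moderate"),
]


def action_for(text, rules=RULES, default="publish"):
    """Return the action of the first rule whose regex matches the text."""
    for pattern, action in rules:
        if pattern.search(text):
            return action
    return default
```

Ordering matters here: if the /poker/i rule came first, it would shadow the stricter /online poker/i rule and nothing would ever be auto-deleted.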
This whole process ensures (by order of priority):
* No deleted false positive (bad bad bad).
I disagree. It does not assure this. But by silently deleting, it could potentially make you believe that it does.
* Extremely few moderated false positives (annoying): uses Captcha and email auto-moderation to keep these at a minimum.
This has been true for me with the spam module. False positives happen, but they are very infrequent. And as the Bayesian logic gets trained, they seem to happen less and less often.
* No published spam.
True with the spam module, in my own testing of late.
* very little spam held in moderation (must be destroyed directly: really annoying to have to moderate it).
This will be dealt with in my next round of spam module development. (post-deleting spam will also be an option, currently not a possibility)
Furthermore, Spam Karma works in an intelligent way to automatically update its filtering database and grow stronger with each spam it catches…
I don't know what this means. But the spam module has a Bayesian filter built into it, so it learns from its mistakes.
* Time taken to fill the comment (e.g.: if it's less than a few seconds, most likely spam).
Interesting. I think I'll look into adding something like this to the spam module.
* IP and regex match for URLs contained inside the comment (small weight only for non-URL text matching a URL regex).
The spam filter does this automatically, and provides statistics as to how well each of the URLs works. It's referred to simply as the "URL filter". URLs are located by the Bayesian filter.
* Realtime Blacklist (RBL) Server check for IP and URLs.
As near as I can tell from the spam traffic I've seen on KernelTrap, this is more or less worthless. All the spam floods I've gotten are from unbelievably well randomized IP addresses. I've enhanced the spam module on my site to collect session information, etc., and I found nothing that could be effectively blacklisted (except for offsite links, which are already handled automatically). I will keep analyzing the data, and should I find a trend, I will look into this again. (My idea is simply to auto-block any IP that posts spam for n minutes.)
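The closing idea, reading it as "block an IP for n minutes after it posts spam", could look something like the sketch below. The class and method names are hypothetical, and the in-memory dictionary stands in for whatever storage a real module would use.

```python
import time


class TemporaryBlocklist:
    """Block an IP for a fixed number of minutes after it posts spam."""

    def __init__(self, block_minutes, clock=time.time):
        self.block_seconds = block_minutes * 60
        self.clock = clock          # injectable so the expiry can be tested
        self._expires = {}          # ip -> timestamp when the block ends

    def record_spam(self, ip):
        # Each new spam posting restarts the block period for that IP.
        self._expires[ip] = self.clock() + self.block_seconds

    def is_blocked(self, ip):
        expires = self._expires.get(ip)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expires[ip]   # block has run out, forget the IP
            return False
        return True
```

Because the block expires on its own, a randomized or shared address is only penalized briefly, which fits the observation above that permanent IP blacklists found nothing useful to hold on to.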
* Comment's age (e.g. penalize comments on very old post).
Interesting. I think I'll look into adding something like this to the spam module.
* Ambiguous comments (that can neither be deleted nor approved) are given a second check: the commenter is asked to solve a Captcha or use the email auto-moderation (an email containing a hash to unlock the comment is sent to the commenter's email address). If confirmed, the comment's Karma is bumped up and the comment is either published or held for further review. If not confirmed within a certain period, its Karma is lowered and it is either deleted or kept in moderation (if it was sufficiently high to begin with).
Yes, I like the idea of tying spam filtering into a captcha system, only displaying the captcha if the posting appears to be spam. The only negative feedback I heard on this idea is potential accessibility issues. Note again that "ambiguous comments" in my experience have been rare. When the spam module thinks something is spam, it tends to really think that something is spam.
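For the email auto-moderation quoted above ("an email containing a hash to unlock the comment"), one plausible way to build such a hash is a keyed HMAC over the comment id and the commenter's address. This is an assumption about how it could be done, not a description of Spam Karma's or the spam module's implementation; the function names and the site secret are hypothetical.

```python
import hashlib
import hmac


def unlock_token(secret: bytes, comment_id: int, email: str) -> str:
    """Derive the hash that would be mailed to the commenter."""
    message = f"{comment_id}:{email.lower()}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_unlock(secret: bytes, comment_id: int, email: str,
                  token: str) -> bool:
    """Check a token submitted through the confirmation link."""
    expected = unlock_token(secret, comment_id, email)
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected, token)
```

Since the token is derived from a secret only the site knows, nothing has to be stored per comment, and a spammer cannot forge the unlock link for a comment held in moderation.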
* When a comment is struck as spam, its IP and URL(s) are harvested and submitted to the Admin for inclusion in the blacklist. In the meantime, they are used as "auto-added" values, with a lesser weight than permanent blacklist entries.
As noted, this would be worthless in dealing with the spam I've seen on my websites. Otherwise it would already be implemented in the spam module. ;)
* When destroying a spam comment, it checks for recently posted comments that match similar values and retroactively moderates them (e.g.: a spammer could manage to slip X numbers of spams onto a blog, but upon reaching a certain suspicious threshold, all the comments would get retroactively moderated, then deleted).
This doesn't appear to do much good if they are posting anonymously, and each posting is from a unique IP address, which is how all spam floods appear on my website. The only thing tying them together is content - certainly something that could be auto-analyzed.
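One simple way to tie postings together by content is to compare the URLs they advertise. The sketch below is a hypothetical illustration of that idea; the URL regex is deliberately crude and the function names are made up.

```python
import re

# Crude URL matcher, good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def urls_in(text):
    """Return the set of URLs in a posting, lowercased for comparison."""
    return {u.rstrip(".,)").lower() for u in URL_RE.findall(text)}


def related_comments(spam_text, recent):
    """Given a confirmed spam and a dict of {comment_id: text} for recent
    postings, return the ids of postings that share at least one URL."""
    spam_urls = urls_in(spam_text)
    return [cid for cid, text in recent.items()
            if spam_urls & urls_in(text)]
```

The returned ids would be candidates for retroactive moderation, regardless of which IP address each posting came from.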
* Spam Karma uses a central DB to retrieve IP and URL updates. By default, it will query the DB automatically every 2 days (can be disabled). Central DB can be configured. Each install of Spam Karma can work as a sort of P2P relay in the update process (both fetching updates and publishing its own updated list for others to grab).
As soon as someone creates a generic module for handling the centralization of such data, it will be implemented in the spam module... ;) Perhaps some day I'll have the time. Cheers, -Jeremy
On Wednesday 19 January 2005 13:57, Jeremy Andrews wrote:
Specifically, I'll provide the option to relate one or more actions with each filter type.
Have a look at actions.module. It is quite new, and made for cases like this: adding actions through a frontend. -- Regards, Bèr -- [ Bèr Kessels | Drupal services www.webschuur.com ]
I would like to go out on a limb and propose that we coordinate an effort to create an open source anti-spam solution that can easily be ported to the various OS CMS/Blog platforms, is centrally coordinated (possibly by Drupal, but it really doesn't matter who takes it) and provides best practices and avoidance techniques for website administrators.

I know that this is going to be perceived as just talk or a pipe dream (I have no idea how this would practically get done programmatically) but I think that we need to consider how the open source community can, as a whole, combat the spam issue. If we can somehow innovate on this, we would gain both political clout and a reputation for ourselves as far as working together to solve big problems (Google did it, but with a focus on protecting its Blogger and search businesses).

Here are the projects that I think we should try and get on board:

* WordPress (I'll email Matt about this)
* TextPattern
* Xaraya
* phpBB

The goal of this effort is multifold: cut down on spam (duh); create a unified effort to combat this problem; develop consistent UIs and module behaviors (email alerts, etc) for open source anti-spam solutions; show the power of the open source community and its ability to work together to solve shared problems.

Anyway, I know some folks who would be interested in participating in this, so I just need to know what you guys think and if there's any support for this within the Drupal community.

Looking forward to your feedback.

Chris

On Wed, 19 Jan 2005 14:21:23 +0100, Bèr Kessels <berdrupal@tiscali.be> wrote:
On Wednesday 19 January 2005 13:57, Jeremy Andrews wrote:
Specifically, I'll provide the option to relate one or more actions with each filter type.
Have a look at actions.module. It is quite new, and made for cases like this: adding actions through a frontend.
-- Regards, Bèr -- [ Bèr Kessels | Drupal services www.webschuur.com ]
On Wed, 19 Jan 2005 23:46:09 -0800, Chris Messina <chris.messina@gmail.com> wrote:
I would like to go out on a limb and propose that we coordinate an effort to create an open source anti-spam solution that can easily be ported to the various OS CMS/Blog platforms, is centrally coordinated (possibly by Drupal, but it really doesn't matter who takes it) and provides best practices and avoidance techniques for website administrators.
Sounds like a great idea. One possible downside is that it might become easier for spammers to target all participating projects if we're using nearly identical solutions. But I think the benefits outweigh that. -- Tim Altman
(cc'ing Matt Mullenweg)

Well, the idea in coordinating our efforts to fight spam comes down to:

1) leveraging the huge open source community: the spammers are a large community in and of themselves -- surely with our collective efforts, we can stay one step ahead of them? perhaps we could even develop new spamming techniques just to eradicate them!
2) show that open source processes work for solving real world problems
3) show that OS is "growing up"
4) protect our websites from spammers (we're all vulnerable after all)
5) we all stand to benefit collectively by obsoleting spammers' tactics -- if only Drupal or WordPress has spam protection, then there are still other OS platforms out there that are vulnerable
6) the anti-spam project could serve as a model for future collaboration

There are plenty more benefits, but those strike me off the top of my head.

Chris

On Thu, 20 Jan 2005 08:59:29 +0100, Tim Altman <web@timaltman.com> wrote:
On Wed, 19 Jan 2005 23:46:09 -0800, Chris Messina <chris.messina@gmail.com> wrote:
I would like to go out on a limb and propose that we coordinate an effort to create an open source anti-spam solution that can easily be ported to the various OS CMS/Blog platforms, is centrally coordinated (possibly by Drupal, but it really doesn't matter who takes it) and provides best practices and avoidance techniques for website administrators.
Sounds like a great idea. One possible downside is that it might become easier for spammers to target all participating projects if we're using nearly identical solutions. But I think the benefits outweigh that.
-- Tim Altman
Chris Messina wrote:
I would like to go out on a limb and propose that we coordinate an effort to create an open source anti-spam solution that can easily be ported to the various OS CMS/Blog platforms, is centrally coordinated (possibly by Drupal, but it really doesn't matter who takes it) and provides best practices and avoidance techniques for website administrators.
Aren't "bayesian filtering", "black listing" and "captcha" examples of solutions that are _already_ available in most CMS platforms?

Programmatically it is nearly impossible to cooperate: we use different database abstraction layers, a different comment system, different coding standards, different permission schemes, etc, etc.

What is of interest to us are algorithmic descriptions of anti-spam solutions and experience reports. However these are _intensively_ being researched by governments, academia and the industry. Their research results are often published for all to implement and use. Random example: http://spamconference.org/. In fact, Drupal's spam module and WordPress' spam plugin exist because we have access to such information/results. Most of the time, it simply boils down to implementing published algorithms/techniques.

Furthermore, Drupal's and WordPress' modules, being released under the terms of the GPL, are available for other OSS projects to copy, change, improve and distribute. If other projects are interested in our anti-spam code, they can and will use it. If developers want to cooperate or learn from each other, they can and will. With or without establishing a /formal/ entity.

That said, the more we can co-operate to fight spam, the better. I'm just not sure how it would work, or whether it would be successful.

-- Dries Buytaert :: http://www.buytaert.net/
On Thursday 20 January 2005 11:18, Dries Buytaert wrote:
That said, the more we can co-operate to fight spam, the better. I'm just not sure how it would work, or whether it would be successful.
I do not believe in a central effort coding-wise. As Dries points out, there's too much to do when centralising such an effort. What I /do/ believe will be a great improvement is p2p sharing of the filtered tokens, regexps et al. I discussed this on IRC lately, and Morbus had some great ideas on this. First of all it should be shared over a closed XML feed. We can use drupalIds and a special role to secure the sharing (we don't want spammers to learn from our tokens). Both peers need to confirm sharing. If I remove something on my side, the XML feed must dictate (or propose) deletion on the other sides too. Otherwise it would be an ever growing blob. Regards, Bèr -- [ Bèr Kessels | Drupal services www.webschuur.com ]
What I /do/ believe will be a great improvement is p2p sharing of the filtered tokens, regexps et al. I discussed this on IRC lately, and Morbus had some great ideas on this.
First of all it should be shared over a closed XML feed. We can use drupalIds and a special role to secure the sharing (we don't want spammers to learn from our tokens). Both peers need to confirm sharing. If I remove something on my side, the XML feed must dictate (or propose) deletion on the other sides too. Otherwise it would be an ever growing blob.
Exactly. Restating:

a) SiteA and SiteB both want to share their spam filters.
b) user@SiteA signs up at SiteB; user@SiteB does the opposite.
c) both users configure their site to accept "trusted" spam filters from the user of the other site. If the users hate each other one day, they can just remove this trusted link, and their site will no longer accept (new/deleted) spam filters from that user.
d) user@SiteA configures his spam filters to be "uploaded to SiteB under the login user@SiteA"; user@SiteB does the same.
e) whenever a spam filter changes, Drupal's distributed auth kicks in: SiteA distributed-auths to SiteB as user@SiteA, and uploads the new spam filters. Since SiteB has trusted user@SiteA (in step c, above), the spam filters are processed. If SiteB does not trust this user, the spam filters are ignored.

The distributed auth part is important and removes the need for cryptographic keys like in PGP (it doesn't, however, remove the man in the middle attack - someone could still intercept the distributed auth and delete/add their own malicious filters. However, this possibility ALREADY exists in Drupal's distributed auth system, so it's not anything new).

The distributed auth is also important because it is blatantly obvious spammers do their research: they know about Bayesian filters, so they send out garbage text to try and confuse them. They know about Spamassassin rulesets, so they tweak their mailers specifically to circumvent them. They know about RBLs, so they piggyback off of installed trojans. With distributed auth, the rulesets are ONLY known between the trusted sites - there's no public sharing of them for spammers to learn from and adapt. Likewise, each group of trusted sites will have different rulesets.

-- Morbus Iff ( you are nothing without your robot car, NOTHING! )
Culture: http://www.disobey.com/ and http://www.gamegrene.com/
Spidering Hacks: http://amazon.com/exec/obidos/ASIN/0596005776/disobeycom
icq: 2927491 / aim: akaMorbus / yahoo: morbus_iff / jabber.org: morbus
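Steps c) to e) above boil down to "apply a filter update only if its sender is on the local trust list". A minimal sketch of that receiving side, assuming the sender has already been authenticated (the distributed auth step is not modelled here) and using hypothetical class and method names:

```python
class SpamFilterStore:
    """Receiving side of the trusted filter exchange (steps c to e)."""

    def __init__(self):
        self.trusted = set()    # remote users whose updates we accept
        self.filters = set()    # the spam filters currently in effect

    def trust(self, user):
        self.trusted.add(user)

    def revoke(self, user):
        # Step c): removing the link stops all future updates.
        self.trusted.discard(user)

    def receive(self, sender, added=(), deleted=()):
        """Apply an update from an authenticated sender.

        Returns True if the update was processed, False if ignored."""
        if sender not in self.trusted:
            return False
        self.filters |= set(added)
        self.filters -= set(deleted)    # deletions propagate too
        return True
```

Carrying deletions along with additions addresses the "ever growing blob" concern raised earlier in the thread: a filter removed on one site disappears from its trusted peers as well.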
On Thu, 20 Jan 2005 14:32:44 +0100 Bèr Kessels <berdrupal@tiscali.be> wrote:
What I /do/ believe will be a great improvement is p2p sharing of the filtered tokens, regexps et al. I discussed this on IRC lately, and Morbus had some great ideas on this.
Out of curiosity, does this suggest that you're currently having a lot of spam slip through the filter? (My development efforts on the spam module have tended to be dictated by what's required to keep my website spam free -- it has been of late.) -Jeremy
Out of curiosity, does this suggest that you're currently having a lot of spam slip through the filter?
Of late, I've only had two slip by. Most of the unexpected spam is caught in the "no more than 3 URLs per message" filter. 1584 of the 1584 (100%) automatically detected spam postings were correctly marked as spam. 116 of the 120 (96.67%) automatically detected non-spam postings were correctly marked as non-spam. This is an overall filter accuracy of 99.77%. -- Morbus Iff
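The percentages reported above are consistent with simple ratios of correct to total classifications, with the overall figure pooling both classes. A short Python check, using only the counts given in the message (the helper name is made up):

```python
def percent(correct, total):
    """Share of correct classifications, as a percentage with 2 decimals."""
    return round(100.0 * correct / total, 2)


spam_correct, spam_total = 1584, 1584
ham_correct, ham_total = 116, 120

spam_accuracy = percent(spam_correct, spam_total)
ham_accuracy = percent(ham_correct, ham_total)
# Overall accuracy pools both classes: 1700 correct out of 1704 postings.
overall = percent(spam_correct + ham_correct, spam_total + ham_total)
```

Because spam postings outnumber non-spam ones by more than ten to one, the overall figure sits much closer to the spam accuracy than to the non-spam accuracy.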
On Thu, 20 Jan 2005 09:15:33 -0500 Morbus Iff <morbus@disobey.com> wrote:
Out of curiosity, does this suggest that you're currently having a lot of spam slip through the filter?
Of late, I've only had two slip by. Most of the unexpected spam is caught in the "no more than 3 URLs per message" filter.
Which makes me wonder: what problem are we solving by adding the additional complexity of shared databases? In any case, I'd only recommend sharing URL filter data, and possibly custom filter data. I wouldn't recommend trying to synchronize the Bayesian data. -Jeremy
Which makes me wonder: what problem are we solving by adding the additional complexity of shared databases?
From my standpoint, I don't really /need/ the feature right now - my exploration of it was equivalent to "hey, i had a funny idea", "oh? how would that work?", and then furious scribbling on a napkin. With that in mind, along with the evolution of spamfight techniques, (procmail filters, bayesian, razor / spamassassin), Drupal (or the spam.module, at least) still has one step of (inevitable?) evolution to go.
I wouldn't recommend trying to synchronize the Bayesian data.
Agreed. -- Morbus Iff
On Thursday 20 January 2005 15:12, Jeremy Andrews wrote:
On Thu, 20 Jan 2005 14:32:44 +0100 Bèr Kessels <berdrupal@tiscali.be> wrote:
What I /do/ believe will be a great improvement is p2p sharing of the filtered tokens, regexps et al. I discussed this on IRC lately, and Morbus had some great ideas on this.
Out of curiosity, does this suggest that you're currently having a lot of spam slip through the filter?
No, not really. But on my sites it took some time for them to learn. So I had site1 learn for a week. Then I imported the spam tables into site2, 3, 4, 5 and 6. Saved me a lot of learning work. Once your spam filter has been running for a while, it works very well. But sometimes spam /does/ slip by. And when it does, I find it did on nearly all my sites. I then have to log into all of them and teach them the same thing. I would love some "central brain".
(My development efforts on the spam module have tended to be dictated by what's required to keep my website spam free -- it has been of late.)
And they are greatly appreciated. The spam module does a wonderful job, really! -- Regards, Bèr -- [ Bèr Kessels | Drupal services www.webschuur.com ]
First of all it should be shared over a closed XML feed. We can use drupalIds and a special role to secure the sharing (we don't want spammers to learn from our tokens). Both peers need to confirm sharing. If I remove something on my side, the XML feed must dictate (or propose) deletion on the other sides too. Otherwise it would be an ever growing blob.
It sounds like a great idea, but what prevents a spammer from setting up a Drupal site, asking nicely for trust, and getting first hand updates of our spam tokens? And if the effort is shared between CMSes, then all a spammer needs to do is set up one fake site with any of the participating engines. In my opinion, any widespread anti-spam tool will get into the hands of spammers, there's no way around that. However, as the spam tokens become more refined, it will be harder for a spammer to circumvent them (although I don't doubt the smart ones craft their messages programmatically). So I don't see why we should go through all this trouble of trust and authentication, when simply making our spam tokens available for everyone will have the same net effect... Steven Wittens
participants (8)
- Bèr Kessels
- Chris Messina
- Dries Buytaert
- Jeremy Andrews
- Morbus Iff
- Negyesi Karoly
- Steven Wittens
- Tim Altman