Re: [drupal-devel] Dealing with spam (was rel=nofollow)
Jeremy pointed out earlier that black- and whitelisting made no sense, because of the huge number of domains (open proxies) spammers use.
Black- and whitelisting of IPs/hosts isn't sensible, no, but the MT blacklist is actually a content list. It contains lots of regular expressions, primarily for target URLs.
The best method still is to filter on content.
Which is exactly what these lists do: if a post contains a link to a known site, then drop it. Essentially, it's exactly the same thing being discussed, a list of places to block. Only the MT approach is to share it publicly, not to have small webs of trust.
What I /do/ believe will be a great improvement is p2p sharing of the filtered tokens, regexps et al.
As I said, MT already do this. The master list is here:
http://www.jayallen.org/comment_spam/blacklist.txt
Latest changes: http://www.jayallen.org/comment_spam/blacklist_changes.txt
Also as RSS 1.0: http://www.jayallen.org/comment_spam/feeds/blacklist-changes.rdf
And RSS 2.0: http://www.jayallen.org/comment_spam/feeds/blacklist-changes.xml

I, and apparently MT and GL (Geeklog), think it's better to have this information public. Yes, if spammers are blocked everywhere they may move their freeviagra.com to freeviagra.org. But if, each time they do, /all/ sites block the new URL as soon as /one/ site blocks it, then the technique is going to be much more efficient than having every newly advertised site added separately to the lists on hundreds of small webs of trust. (OK, with the manual submission to the MT master list there is a small lag, but GL implements list sharing via XML amongst GL sites, I think, or at least has the first stages of this, and with Drupal, cron and what you are suggesting...) Yeah, the spammers can know they are blocked and buy a new URL for their viagra and animal porn emporium, but they will be blocked faster and more efficiently. A better trade-off, I think.

Now, what you are suggesting /also/ has its place in this. If you set trust between sites, then those sites could automatically inject their content into each other with no intervention.

So, you use the MT public master, and maintain it with cron'd daily fetches on Drupal instances. You also have a private personal blacklist; when items are added to it, they are written DIRECTLY to the other Drupal instances in your web of trust via XML-RPC. (This solves Bèr's issue.) You also publish your merged list, and your personal list on its own, so other instances NOT in your web of trust can pull from you anonymously. You also periodically submit your new items to the MT master to help maintain everyone in the world.
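To make the "content list" idea concrete, here is a minimal sketch of matching a post against blacklist entries. It assumes one regular expression per line with anything after a `#` treated as a comment, which is how the Geeklog import code quoted later in this thread reads the file. The function names `parse_blacklist` and `find_blacklisted` and the sample entries are hypothetical, not part of MT-Blacklist, Geeklog or Drupal.

```php
<?php
// Hypothetical helper: turn the raw text of a blacklist file into a list of patterns.
// Assumption: one pattern per line, anything after '#' is a comment.
function parse_blacklist($text) {
    $patterns = array();
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $parts = explode('#', $line);
        $entry = trim($parts[0]);
        if ($entry !== '') {
            $patterns[] = $entry;
        }
    }
    return $patterns;
}

// Hypothetical helper: return the first pattern that matches the post body, or null.
function find_blacklisted($body, array $patterns) {
    foreach ($patterns as $pattern) {
        // '#' is safe as the regex delimiter because parse_blacklist() cut
        // every entry at the first '#'. A malformed pattern makes preg_match()
        // return false, which is treated here as "no match".
        if (@preg_match('#' . $pattern . '#i', $body) === 1) {
            return $pattern;
        }
    }
    return null;
}

// Made-up sample list, not real MT-Blacklist content.
$list = "freeviagra\\.com # example entry\n\n# comment-only line\ncheap-pills\\.example\n";
$patterns = parse_blacklist($list);
$hit = find_blacklisted('Visit http://www.freeviagra.com/ now', $patterns);
echo $hit === null ? "clean\n" : "blocked by: $hit\n";
```

A real module would load the patterns from the database rather than re-parsing the file on every post; the sketch only shows the matching step.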
On Thu, 20 Jan 2005 09:53:45 -0500 "mike@fuckingbrit.com" <mike@fuckingbrit.com> wrote:
As I said, MT already do this, the master list is here: http://www.jayallen.org/comment_spam/blacklist.txt
If this is already being done, perhaps the easiest thing would be for someone to write a parser that goes out and grabs this list every so often, parses it, and updates the custom filters and URLfilters as necessary... This begins to meet the earlier demands for sharing among projects. -Jeremy
------
If this is already being done, perhaps the easiest thing would be for someone to write a parser that goes out and grabs this list every so often, parses it, and updates the custom filters and URLfilters as necessary... This begins to meet the earlier demands for sharing among projects.
------

Sounds like a good plan to me. I'll take a look at this once I've migrated, if no-one else is interested?

I've been thinking: I intend to implement this as a new module that requires the spam module and works with it, to keep the code modular and separate. I think this is the right way to go, as it allows the functionality to be removed entirely if people don't want it.

I've had a swift first look at the spam schema. I'm thinking these need to go into the custom table as regexes, but also need marking as MT items for management. I'd add an itemtype column or some such, and insert MT for Movable Type items. This should make it all transparent to the spam plugin, but allow the MT Blacklist plugin to know which items it shouldn't touch.

I think the first iteration should provide:
1) Manually triggered one shot, load the master list.
2) Cron managed periodic update of additions and deletions.
3) Manual trigger of 2.

After this, I find the idea of the web of trust push system interesting, so I'd like to add hooks for when items are added, or perhaps on cron, pushing out any new items to any trusted sites. This helps people with multiple instances do their admin once, so I think it's worthwhile for that, if not for the original reasons for inventing it. Or perhaps WOT pull...

FYI: MT masterlist import (GPL, from Geeklog's spamx):

    /**
     * Import the blacklist
     *
     * @param array $lines The blacklist
     * @return int number of lines imported
     */
    function _do_import ($lines)
    {
        global $_TABLES;

        $count = 0;
        foreach ($lines as $line) {
            $l = explode ('#', $line);
            $entry = trim ($l[0]);
            if (!empty ($entry)) {
                DB_query ('INSERT INTO ' . $_TABLES['spamx'] . ' VALUES ("MTBlacklist","' . addslashes ($entry) . '")');
                $count++;
            }
        }

        return $count;
    }

Likewise, importing updates:

    /**
     * Update MT Blacklist from RSS feed
     */
    function _update_blacklist ()
    {
        global $_CONF, $_TABLES, $LANG_SX00, $_SPX_CONF;

        require_once($_CONF['path'] . 'plugins/spamx/magpierss/rss_fetch.inc');
        require_once($_CONF['path'] . 'plugins/spamx/magpierss/rss_utils.inc');

        $rss = fetch_rss($_SPX_CONF['rss_url']);

        // entries to add and delete, according to the blacklist changes feed
        $to_add = array();
        $to_delete = array();

        foreach($rss->items as $item) {
            // time this entry was published (currently unused)
            // $published_time = parse_w3cdtf( $item['dc']['date'] );
            $entry = substr($item['description'], 0, -3); // blacklist entry
            $subject = $item['dc']['subject']; // indicates addition or deletion

            // is this an addition or a deletion?
            if (strpos($subject, 'addition') !== false) {
                // save it to database
                $result = DB_query('SELECT * FROM ' . $_TABLES['spamx'] . ' WHERE name="MTBlacklist" AND value="' . $entry . '"');
                $nrows = DB_numRows($result);
                if ($nrows < 1) {
                    $result = DB_query('INSERT INTO ' . $_TABLES['spamx'] . ' VALUES ("MTBlacklist","' . $entry . '")');
                    $to_add[] = $entry;
                }
            } else if (strpos($subject, 'deletion') !== false) {
                // delete it from database
                $result = DB_query('SELECT * FROM ' . $_TABLES['spamx'] . ' where name="MTBlacklist" AND value="' . $entry . '"');
                $nrows = DB_numRows($result);
                if ($nrows >= 1) {
                    $result = DB_query('DELETE FROM ' . $_TABLES['spamx'] . ' where name="MTBlacklist" AND value="' . $entry . '"');
                    $to_delete[] = $entry;
                }
            }
        }

        $display = '<hr><p><b>' . $LANG_SX00['entriesadded'] . '</b></p><ul>';
        foreach ($to_add as $e) {
            $display .= "<li>$e</li>";
        }
        $display .= '</ul><p><b>' . $LANG_SX00['entriesdeleted'] . '</b></p><ul>';
        foreach ($to_delete as $e) {
            $display .= "<li>$e</li>";
        }
        $display .= '</ul>';

        SPAMX_log($LANG_SX00['uMTlist'] . $LANG_SX00['uMTlist2'] . count($to_add) . $LANG_SX00['uMTlist3'] . count($to_delete) . $LANG_SX00['entries']);

        return $display;
    }

Just in case anyone gets to it before me, to avoid re-invention of wheels.

Any input?

Mike
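The itemtype idea above can be illustrated with a small sketch: feed additions and deletions are applied only to rows marked as MT, so hand-made local filters are never touched by the updater. This is an assumption-laden illustration, not the spam module's schema. The array stands in for the custom table, and the names `mt_apply_changes`, `itemtype`, `value`, `op` and `entry` are hypothetical.

```php
<?php
// Hypothetical in-memory stand-in for the spam module's custom filter table.
// 'itemtype' is the proposed marker column: 'MT' for imported rows,
// 'local' for filters an administrator added by hand.
function mt_apply_changes(array $rows, array $changes) {
    foreach ($changes as $change) {
        $entry = $change['entry'];
        if ($change['op'] === 'addition') {
            // Add the entry only if no MT row already holds it.
            $exists = false;
            foreach ($rows as $row) {
                if ($row['itemtype'] === 'MT' && $row['value'] === $entry) {
                    $exists = true;
                    break;
                }
            }
            if (!$exists) {
                $rows[] = array('itemtype' => 'MT', 'value' => $entry);
            }
        } elseif ($change['op'] === 'deletion') {
            // Drop matching MT rows; rows of any other itemtype are kept.
            $kept = array();
            foreach ($rows as $row) {
                if (!($row['itemtype'] === 'MT' && $row['value'] === $entry)) {
                    $kept[] = $row;
                }
            }
            $rows = $kept;
        }
    }
    return $rows;
}

// Made-up sample data.
$rows = array(
    array('itemtype' => 'local', 'value' => 'casino\.example'),
    array('itemtype' => 'MT',    'value' => 'old-spam\.example'),
);
$changes = array(
    array('op' => 'addition', 'entry' => 'new-spam\.example'),
    array('op' => 'deletion', 'entry' => 'old-spam\.example'),
    array('op' => 'deletion', 'entry' => 'casino\.example'), // local row, must survive
);
$rows = mt_apply_changes($rows, $changes);
print_r($rows);
```

Keeping the marker in the same table means the spam module can keep reading every row as an ordinary regex filter, while the updater restricts itself to rows it owns.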
What I /do/ believe will be a great improvement is p2p sharing of the filtered tokens, regexps et al. ..... So, you use the MT public master, and maintain this with cron'd daily fetches on drupal instances. You also have a private personal blacklist, when items are added to this, they are web-of-trust written DIRECTLY to other drupal instances via XML-RPC. (this solves Bèr's issue)
You also publish your merged list, and just your personal lists so other instances NOT in your web of trust can anon pull from you. You also periodically submit your new items to the MT master to help maintain everyone in the world.
All that is good, but it introduces another problem: if a spammer knows a filter, or a subset of it, it is trivial to write a spam generator designed around that filter. This means that there should be a web of trust, and the various spam filters out there should be different enough that it is very hard for the spammer to create such a generator. Sending, pushing, pulling, whatever, regexes and other rules securely so that they are not eavesdropped on is difficult enough, but there will always be leaks.

What I suggest is sending not rules but actual spam content. Having different learning approaches should help to produce very different spam filters. The spammer will be able to create one of their own, but there will be no certainty that their new content will not be caught immediately. The spam content can be distributed freely, it just has to be kept unavailable to the search engines, etc. This way you will have enough data to start your new learner, and you can have immediate updates via a p2p-like exchange between servers. You need only a minimum level of trust: 'identify yourself'.

I thought I'd add my 2p.

Cheers,
Vlado
--
Vladimir Zlatanov <vlado@dikini.net>
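As a rough illustration of sharing samples instead of rules, the sketch below trains a crude token-frequency scorer on spam and non-spam samples. It is not the learner Vlado has in mind (none is specified in this thread), and it is far simpler than a real Bayesian filter. All function names and sample texts are hypothetical. The point is only that the filter is derived locally from content, so two sites with different samples or learners end up with different filters.

```php
<?php
// Hypothetical helper: lower-case the text and split it into word tokens.
function learner_tokenize($text) {
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Count, for each token, how many samples contain it.
function learner_count(array $samples) {
    $counts = array();
    foreach ($samples as $sample) {
        foreach (array_unique(learner_tokenize($sample)) as $token) {
            $counts[$token] = isset($counts[$token]) ? $counts[$token] + 1 : 1;
        }
    }
    return $counts;
}

// Build a model: token => estimated "spamminess" between 0 and 1,
// using add-one smoothing so unseen combinations never give 0 or 1.
function learner_train(array $spam, array $ham) {
    $spam_counts = learner_count($spam);
    $ham_counts = learner_count($ham);
    $model = array();
    foreach (array_keys($spam_counts + $ham_counts) as $token) {
        $s = (isset($spam_counts[$token]) ? $spam_counts[$token] : 0) + 1;
        $h = (isset($ham_counts[$token]) ? $ham_counts[$token] : 0) + 1;
        $p_spam = $s / (count($spam) + 2);
        $p_ham = $h / (count($ham) + 2);
        $model[$token] = $p_spam / ($p_spam + $p_ham);
    }
    return $model;
}

// Score a text as the mean spamminess of its known tokens (0.5 if none are known).
function learner_score($text, array $model) {
    $sum = 0.0;
    $n = 0;
    foreach (array_unique(learner_tokenize($text)) as $token) {
        if (isset($model[$token])) {
            $sum += $model[$token];
            $n++;
        }
    }
    return $n > 0 ? $sum / $n : 0.5;
}

// Made-up samples: spam received from peers, ham taken from the local site.
$spam = array('cheap viagra online', 'buy viagra cheap pills');
$ham = array('drupal module update released', 'patch for the spam module');
$model = learner_train($spam, $ham);
printf("%.2f\n", learner_score('cheap viagra here', $model));
printf("%.2f\n", learner_score('new drupal patch', $model));
```

Scores above 0.5 lean towards spam and scores below it towards legitimate content; where to put the cut-off would be a per-site decision.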
participants (4)
- Jeremy Andrews
- Michael Jervis
- mike@fuckingbrit.com
- Vladimir Zlatanov