Possible DDoS attack on Drupal user creation
Good morning!

I'm sorry if this is the wrong venue for this, but I am not sure where else to post it. As of this morning, I have good reason to suspect that one of my Drupal sites is the victim of a zombie-based DDoS attack, and I felt I should warn the Drupal development community that a new Drupal-specific bot may be out there in the wild.

The site allows anyone to create a user account, with no approval needed but of course with no special privileges (all it really gains them is the ability to queue comments for approval, to subscribe to node comments, and to customize their timezone).

What happened is that last night a large number of new user accounts were all created with garbage-looking and undeliverable Yahoo! addresses as the email target, e.g., sdfuhgfdhghu@yahoo.com. My site is a rather narrowly focused site related to historical reenactment, and we typically average only 1 or 2 new users per day, yet there have been 57 since midnight last night. That's not a huge number, but it's *way* outside our normal statistical range.

I initially thought this might be one script kiddie with a Perl bot, but I checked my logs, and there were 57 requests since midnight spread over 21 different IPs. Only one of the IP addresses has a valid reverse DNS, and it points to a dialup pool. In the Apache logs, all of the browser ID strings are identical:

    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

which suggests either a bot emulating this browser, or a coordinated attack by a couple dozen individuals. I consider the latter to be unlikely, as the site is neither politically controversial nor commercial in nature, so I doubt anyone would have enough motive to work hard enough to do an attack by manual means.

I post this message here for three reasons:

1. I wanted to warn others that if there is a bot attacking my site, it may attack other Drupal sites in the near future.

2. I wanted to see if anyone has a suggestion of a module -- including one that I might create -- that could block bogus user account requests like this but not legitimate ones. Will the "Captcha" module do what I need?

3. I wanted to find out if anyone else has seen similar behavior, to see if this is part of a larger pattern that may need to be addressed in user.module. For example, if this is commonplace, should "Captcha" become part of core?

The attacks aren't doing any real harm -- my server can easily cope with the load, and I'll eventually just purge the accounts that are never activated. They're a nuisance, but I still want to make this go away if I can.

I'll be glad to share more details upon request -- my site is not business-related, so I have no reason to conceal logs or other pertinent data that could help the Drupal development community guard against things like this.

Scott

--
Syscrusher (Scott Courtney)    Drupal page: http://drupal.org/user/9184
syscrusher at 4th dot com      Home page: http://4th.com/
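The log check described in this message (how many registration requests, from how many IPs, with which browser ID strings) can be sketched roughly as below. This is a minimal Python sketch, not what was actually run on the site: it assumes the Apache "combined" log format and assumes that registration submissions are POST requests whose path contains `user/register`. The function name `summarize_registrations` and the `marker` parameter are made up for illustration.

```python
import re
from collections import Counter

# Assumed Apache "combined" format:
# IP ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def summarize_registrations(lines, marker="user/register"):
    """Count registration POSTs per client IP and per user-agent string.

    `marker` is an assumption about how the registration URL looks;
    adjust it to match the site's actual path.
    """
    by_ip, by_agent = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        if m.group("method") == "POST" and marker in m.group("path"):
            by_ip[m.group("ip")] += 1
            by_agent[m.group("agent")] += 1
    return by_ip, by_agent
```

Feeding it the lines of an access log yields two counters. Many distinct IPs sharing one identical user-agent string is the pattern reported above.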
On Wednesday 08 February 2006 11:11, Syscrusher wrote:
2. I wanted to see if anyone has a suggestion of a module -- including one that I might create -- that could block bogus user account requests like this but not legitimate ones. Will the "Captcha" module do what I need?
Captcha does accomplish the 'bot blocking. I also think I found a bug in Firefox that prevents Captcha from working right, and will report this on Captcha's forum.

The interesting tidbit in all this is that I had to back-port Captcha to Drupal 4.4 to make it work with my site. Guess it's time I updated Drupal on that site.

Scott
On Wednesday 08 February 2006 21:05, Syscrusher wrote:
On Wednesday 08 February 2006 11:11, Syscrusher wrote:
2. I wanted to see if anyone has a suggestion of a module -- including one that I might create -- that could block bogus user account requests like this but not legitimate ones. Will the "Captcha" module do what I need?
Captcha does accomplish the 'bot blocking.
I also think I found a bug in Firefox that prevents Captcha from working right, and will report this on Captcha's forum.
The interesting tidbit in all this is that I had to back-port Captcha to Drupal 4.4 to make it work with my site. Guess it's time I updated Drupal on that site.
Scott
Captcha is bad. Evil. ;) http://drupal.org/node/46666 is a faaaar better approach (imo). It first validates the email address over SMTP against the remote server, and then reports back.

Bèr
On Thursday 09 February 2006 06:57, Bèr Kessels wrote:
Captcha is bad. Evil. ;) http://drupal.org/node/46666 is a faaaar better approach. (imo). It validated the email first over SMTP on the remote server, and then reports back.
Nice module...thanks for the tip. I will probably install it when I (soon!) upgrade my site to a newer Drupal version.

Unfortunately, it doesn't solve my immediate problem. The specific domain being spoofed (yahoo.com) is one of those that returns a 250 (accepted) for any recipient, whether valid or not, so the module you suggest won't block my DoS attack. That being said, since Email Verify is user-transparent, there's no reason not to use it in addition to other precautions.

I agree with you that Captcha is not an optimum solution. I'm thinking about adapting the algorithm so that instead of displaying an image and asking people to type the text, I would create a module that asks a simple, randomly-generated question that any human being could answer. I could actually draw the questions from content on my site. What I have in mind is to pull text from a random article and then ask a question that is permutated from sentence structure and possibly from some keywords found in the text. The possibilities are finite, but with careful algorithm design it could be made so that there are a very large number of possibilities.

Example questions, assuming the paragraph above is the random text:

* What is the last word before the first period?
* What is the first hyphenated phrase in the text?
* How many times does the word "create" appear?
* What is 3 times 5? (who says we have to just ask about the text?)
* Retype the user name you have chosen but put an extra Q on the end.
* What is the third word of the second sentence?
* What is the second-to-last word of the text?
* What word appears right after the first occurrence of "not"?

I've come up with these eight qualitatively different questions without really thinking about it. Now, in the module, there would be intentionally different ways of phrasing each one using quirks of human language, such as "Type the word that appears before the first hyphen" rather than the second question. Add into that the use of random mathematical questions and common knowledge ("What is the name of this planet?", "What ocean is between Africa and South America?", "What continent includes the country of Germany?" and so on... carefully chosen questions that are culturally neutral).

One could also allow the site admin to add in a list of Q&A that any person registering at their site would know based on the topic of the site. For Drupal.org, questions might be "CMS stands for ____ management system:" or "What word means to obtain a copy of Drupal from our web site to your computer so you can install it?" or "How many eyes does our logo have?" or "How many menu tabs are at the top-right corner of the Drupal home page?"

Site admins could add general-knowledge questions that are not culturally neutral, based on the audience for their site. For example, if you are building a site about Canadian politics, it is not unreasonable to expect visitors to be able to name the Prime Minister of Canada, state how many provinces there are, or name the province just west of Manitoba.

The biggest challenge I can see is to make the questions patterned enough to use the t() function with replaceable parameters to allow translations, yet still have enough patterns to make a spammer's job difficult. One approach is to have multiple patterns for each question, e.g.,

    $patterns[0] = array(
      "What word appears %location the first %punc?",
      "The first occurrence of %punc appears just %location what word?",
      "What's the word right %location %punc, the first time %punc occurs?",
      "There is a word immediately %location %punc in this text. What is it?",
      "Look for %punc in the text and type the word %location it:",
    );

    $patterns[3] = array(
      "What is %expression?",
      "Compute the value of %expression:",
      "In mathematics, %expression equals what?",
      "Tell me the numeric value for %expression.",
      "%expression is how many?"
    );

The neat thing about this is that if we have multiple translations of the questions, the spammers have to follow us. And linguistic syntax variance will cause the replaceable parameters to appear at different points in different languages. Nine questions permutated five ways each is only 45 pregenerated questions, but this is improved by the fact that some of them (like the math one) can have %expression also written in multiple language-neutral ways:

    3x5, 3 * 5, (2+1)x(2+3)

so that the spammer now also has to have an arithmetic expression evaluator. Make it tougher by adding simple algebra:

    Y * (2+1) = X, and Y is 5. What is X?

or grade-school-level "word problems" from math class:

    Johnny has 12 candies and shares 4 with his sister. How many are left?
    Sally shares 16 coins equally among 8 people. How many does each receive?

There are lots of ways to word these, as any teacher can tell you.

The system is made stronger if the same replaceable parameters are used in the same word positions in different questions that have different answers, e.g., "What word appears [after] the first [.]?" versus "First letter appearing [in] the first [sentence]?" -- two questions that parse similarly but have non-overlapping answer domains.

Would it be possible to build an AI that outsmarted this system? Of course. But it would be nontrivial, and it would have to be Drupal-specific, and if the site admins are diligent in adding their own questions, it would have to be at least partially site-specific.

It's not foolproof, but it would at least make things a little more challenging for the spammers, and it doesn't rely on images.

Comments?

--
Syscrusher (Scott Courtney)
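A few of the text-based questions proposed above can be sketched as small functions that each return a question together with its expected answer. This is only an illustrative Python sketch under stated assumptions, not a Drupal module: every function name here is hypothetical, the word-splitting rule is a guess at what "word" should mean, and the arithmetic templates are a shortened stand-in for the `$patterns[3]` idea.

```python
import random
import re

def words(text):
    """Split text into words, keeping internal hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)

def q_last_word_before_first_period(text):
    """Question about the last word of the first sentence."""
    first_sentence = text.split(".", 1)[0]
    return "What is the last word before the first period?", words(first_sentence)[-1]

def q_count_word(text, target):
    """Question about how often a given word occurs (case-insensitive)."""
    count = sum(1 for w in words(text) if w.lower() == target.lower())
    return f'How many times does the word "{target}" appear?', str(count)

def q_word_after(text, target):
    """Question about the word following the first occurrence of target.

    Raises ValueError/IndexError if target is absent or is the last word;
    a caller would need to pick targets that avoid both cases.
    """
    ws = words(text)
    i = [w.lower() for w in ws].index(target.lower())
    return f'What word appears right after the first occurrence of "{target}"?', ws[i + 1]

def q_arithmetic(rng):
    """Arithmetic question phrased with one of several templates."""
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    templates = [
        "What is {a} times {b}?",
        "Compute the value of {a} * {b}:",
        "{a} x {b} is how many?",
    ]
    return rng.choice(templates).format(a=a, b=b), str(a * b)

def check_answer(expected, given):
    """Compare answers ignoring case and surrounding whitespace."""
    return given.strip().lower() == expected.strip().lower()
```

A caller would pick one generator at random, show the question, keep the expected answer on the server, and compare it with `check_answer` when the form is submitted.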
Hi,

All the 8+ challenges that you mention are things that it's easy to write a script for. Plus, since captcha.module is a publicly available script, script writers can use it to come up with code that solves the challenges. For every set of N problems that you pose to the user, a determined enemy will come up with something that solves the N problems. And like I said, since the same module is installed on so many servers, there's quite a benefit to writing such code. Then again, regarding your site-specific challenges, if kerneltrap or the onion asks me questions, I don't mind writing custom code to crack them as well.

I'm not fond of typing in strange words from images. It's rather demeaning, actually. However, an image captcha is the ONLY challenge mechanism that you CAN NOT write a feasible script for. By feasible, I mean something that can execute in limited time and memory for DDoS attacks. (Image recognition requires CPU and memory). Hence, a good challenge test is something that a human requires very little effort to do, and a computer requires a lot of CPU and memory.

So how do we get rid of the stupid image checks? I've been looking into trapdoor functions for a while, which are very easy to pose and check, but take time to solve. Factoring the product of two primes is a possible challenge. For example, if I take 31 and 13, multiply them, and tell you "403", how long does it take you to divide and tell me the solution? A while, if I use large prime numbers instead of 13 and 31. But if you do give me the answer, it's easy to check if you're right. This is the basis of much of modern public-key cryptography, btw.

Hence, if we write some code that poses this challenge on the server side, and asks a small bit of javascript to do this on the client side, any DDoS attacker will give up, because his CPU will die. However, if you're just writing casual comments on a web page, a little CPU spike is ignorable. Your computer solves the problem by the time you type things, and we validate everything on the server side, and the human is not bothered at all.

But there are some problems to this approach, too lengthy to explain here :)

Another possible, and simple, solution is to restrict form submissions from one IP address to, for example, 3 per second. However, this fails if you have a group of people behind a proxy.

CONCLUSION: If you ARE interested in writing up a function like this, OR a function amongst the ones you had suggested, no problem! Simply pick up the captcha.module in cvs, and start coding! It has an API that allows you to write simple _challenge() and _response parts, and it does the rest. For example, the default challenge in captcha.module cvs is the math problem you had posed (3 times 5). Easy!

Cheers,
Arnab
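The asymmetry behind the factoring challenge in Arnab's message (cheap to pose and check, costly to solve) can be illustrated with a small Python sketch. It is an illustration only, under loud assumptions: the function names are hypothetical, trial division stands in for whatever the client-side script would really do, and the thread itself notes that this approach has unresolved problems.

```python
def make_challenge(p, q):
    """Server side: publish only the product of two primes it chose."""
    return p * q

def solve_challenge(n):
    """Client side: find a factor pair by trial division.

    The loop runs up to roughly sqrt(n) / 2 times, so the cost grows
    quickly as the primes get larger.
    """
    if n % 2 == 0:
        return 2, n // 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 2
    return None  # n is prime (or 1): there is no nontrivial factor pair

def verify(n, answer):
    """Server side: a single multiplication checks the client's work."""
    if answer is None:
        return False
    a, b = answer
    return a > 1 and b > 1 and a * b == n
```

With the numbers from the message, `make_challenge(13, 31)` gives 403, the client recovers the pair 13 and 31, and the server accepts it after one multiplication.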
On Thursday 09 February 2006 13:08, Arnab Nandi wrote:
Hi,
All the 8+ challenges that you mention are things that it's easy to write a script for. Plus, since captcha.module is a publicly available script, script writers can use it to come up with code that solves the challenges. For every set of N problems that you pose to the user, a determined enemy will come up with something that solves the N problems.
All true. I don't think I'm dealing with a "determined" enemy, though. But I'll concede your point that what I proposed is scriptable. [...]
I'm not fond of typing in strange words from images. It's rather demeaning, actually. However, an image captcha is the ONLY challenge mechanism that you CAN NOT write a feasible script for. By
Here I disagree. According to my research on the 'net, there are plenty of existing scripts, and...
feasible, I mean something that can execute in limited time and memory for DDoS attacks. (Image recognition requires CPU and memory). Hence, a good challenge test is something that a human requires very little effort to do, and a computer requires a lot of CPU and memory.
...Moore's Law will soon change that, if it hasn't already. The image is small, after all -- not that many pixels.
So how do we get rid of the stupid image checks? I've been looking into trapdoor functions for a while, which are very easy to pose and check, but take time to solve. Factoring the product of two primes is a possible challenge. For example, if I take 31 and 13, multiply them, and tell you "403", how long does it take you to divide and tell me the solution? A while, if I use large prime numbers instead of 13 and 31. But if you do give me the answer, it's easy to check if you're right. This is the basis of much of modern public-key cryptography, btw.
I knew that last bit, but hadn't thought of using JavaScript to let the *browser* do the work of solving the challenge. That's a very cool idea!
Hence, if we write some code that poses this challenge on the server side, and asks a small bit of javascript to do this on the client side, any DDoS attacker will give up, because his CPU will die. However, if you're just writing casual comments on a web page, a little CPU spike is ignorable. Your computer solves the problem by the time you type things, and we validate everything on the server side, and the human is not bothered at all.
Especially since in this situation the C/R mechanism only happens when the human wants to create an account or post something. Most page views are read-only, requiring no activation of the trapdoor.
But there are some problems to this approach, too lengthy to explain here :)
How about this (pseudocode):

    // Make a random 40-character string using A-Z and a-z chars.
    $random_string = make_random_chars(40, 'ABCDE.....XYZabcde.....xyz');
    $n = 4;  // Measure of complexity of challenge; admin-adjustable
    $hash = md5($random_string);
    $challenge = substr($random_string, 0, 40 - $n) . ":" . $hash . ":" . $n;

Now, the client has something like this:

    "AxEvBw......rUs:51cae0322f00f123.....93c:4"

It knows the MD5 of the full string, and it knows the first 36 characters of that string. Now it needs to guess the rest of the characters by brute force (52**$n combinations), concatenating each candidate with the known 36-character random string until it finds the one whose MD5 matches the MD5 supplied by the server. The correct answer is the last four characters of $random_string.

A lookup table in a bot-client won't help, because the challenge is random each time -- unless the client has enough disk to store 52**40 precomputed MD5 sums.

Since 52**4 is about 7.3 million, the client will have to compute an average of roughly 3.7 million MD5s per challenge if $n is 4. If $n is 3, that number drops to about 70 thousand MD5 computations on average. You can adjust the steepness of the complexity by changing the number of characters in the allowed set (e.g., using only uppercase would cause $n=3 to average about 8,800 computations and $n=4 to average about 228 thousand). This, and the ability to change $n to a higher number, allow the site admin to keep up with computational speed advances over time.

Just sending the MD5 of $n random characters won't help, because it just might be feasible to store 7.3 million pregenerated MD5s in a bot. :-) That's the purpose of the one-time random prefix added to the string. Even though we disclose it in the clear, it's virtually unique to this transaction.

The PHP session ID could be used instead to save some compute time on the server generating all those random characters. You would use the PHPSESSID as the openly-disclosed part of $random_string, and the server would generate only the $n characters to append to PHPSESSID. (Remember that the client can easily obtain PHPSESSID from the GET variable or cookie, so the concealed "answer" part of $random_string can't be from PHPSESSID.)
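The scheme above can be tried out with a short Python sketch. It follows the same challenge layout (`prefix:md5:n`), but it is only a sketch under stated assumptions: the function names are made up, Python's `random` stands in for the `make_random_chars` helper from the pseudocode (a real deployment would want a cryptographically strong source such as the `secrets` module), and the server is assumed to keep the secret suffix somewhere like the session so it can compare it with the submitted answer.

```python
import hashlib
import itertools
import random
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase  # 52 characters

def make_challenge(n, length=40, rng=random):
    """Server side: return (challenge_string, secret_suffix).

    The challenge discloses all but the last n characters, plus the MD5
    of the full string and n itself.
    """
    full = "".join(rng.choice(ALPHABET) for _ in range(length))
    digest = hashlib.md5(full.encode("ascii")).hexdigest()
    return f"{full[:length - n]}:{digest}:{n}", full[length - n:]

def solve_challenge(challenge):
    """Client side: brute-force the missing suffix (up to 52**n MD5 calls)."""
    prefix, digest, n = challenge.split(":")
    for combo in itertools.product(ALPHABET, repeat=int(n)):
        suffix = "".join(combo)
        candidate = hashlib.md5((prefix + suffix).encode("ascii")).hexdigest()
        if candidate == digest:
            return suffix
    return None  # no suffix matched: the challenge was malformed
```

With `n` set to 2 the search space is only 2,704 strings, which is handy for trying the sketch. With `n` set to 4 it is 52**4, or 7,311,616, matching the "about 7.3 million" figure in the message.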
Another possible, and simple solution is to restrict form submits by an IP address to 3 in a second, for example. However, this fails if you have a group of people behind a proxy.
Won't help in this case. They are submitting at a relatively slow rate, relying on the nuisance of the extra accounts rather than actually trying to bring down my server. (I suspect they may be probing for Drupal sites that allow authenticated users to post without moderation, so they can post spam. My sites all are fully moderated, so that tactic fails.)
CONCLUSION: If you ARE interested in writing up a function like this, OR a function amongst the ones you had suggested, no problem! Simply pick up the captcha.module in cvs, and start coding! It has an API that allows you to write simple _challenge() and _response parts, and it does the rest. For example, the default challenge in captcha.module cvs is the math problem you had posed (3 times 5).
Been there, done that...My site still runs Drupal 4.4 (and a BETA at that!), and I actually had to back-port captcha to get it working with that old Drupal. I was operating in urgent-mode to get the site at least minimally protected against this 'bot. Captcha as it stands seems to be working in that regard.

I'll look into coding some enhanced C/R capabilities in a newer version. First, though, I need to get this site updated at least to Drupal 4.6. :-)

Thanks for the comments!

Scott
Captcha is bad. Evil. ;) http://drupal.org/node/46666 is a faaaar better approach. (imo). It validated the email first over SMTP on the remote server, and then reports back.
While this is an improvement, please note that it is not foolproof either. My feedback module has an option to validate email addresses using this same approach (contact the MX server for the domain supplied, send the user ID, get back a response, etc.). I had to turn this off on my sites, since a lot of people would not be validated correctly by it (if I remember correctly, it was MSN and AOL or something like that). So, just beware that this is not a 100% solution either, despite being better than captcha.
participants (4): Arnab Nandi, Bèr Kessels, Khalid B, Syscrusher