[development] Captcha module -- possible alternative approach

Thu Feb 9 18:08:07 UTC 2006

Hi,

All the 8+ challenges that you mention are things that it's easy to
write a script for. Plus, since captcha.module is a publicly available
script, script writers can use it to come up with code that solves the
challenges. For every set of N problems that you pose to the user, a
determined enemy will come up with something that solves the N
problems. And like I said, since the same module is installed on so
many servers, there's quite a benefit to writing such code. Then
again, regarding your site-specific challenges, if kerneltrap or the
onion asks me questions, I don't mind writing custom code to crack
them as well.

I'm not fond of typing in strange words from images. It's rather
demeaning, actually. However, an image captcha is the ONLY
challenge mechanism that you CANNOT write a feasible script for. By
feasible, I mean something that can execute in limited time and memory
for DDoS attacks. (Image recognition requires CPU and memory). Hence,
a good challenge test is something that a human requires very little
effort to do, and a computer requires a lot of CPU and memory.

So how do we get rid of the stupid image checks? I've been looking
into trapdoor functions for a while, which are very easy to pose and
check, but take time to solve. Factoring the product of two primes is
a possible challenge. For example, if I take 31 and 13, multiply them,
and tell you "403", how long does it take you to find the two factors?
A while, if I use large prime numbers instead of 13 and 31. But if you
do give me the answer, it's easy to check if you're right: I just
multiply your two numbers. This is the basis of RSA public-key
cryptography, btw.
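
To make the asymmetry concrete, here is a minimal sketch in Python
(not code from captcha.module; the function names smallest_factor and
check_answer are made up for illustration). Solving is done by plain
trial division, which is the naive approach and only an assumption
about what a client would run; checking is a single multiplication.

```python
def smallest_factor(n):
    """Solve the challenge the slow way: trial division up to sqrt(n)."""
    if n % 2 == 0:
        return 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d
        d += 2
    return n  # no divisor found, so n itself is prime


def check_answer(n, p, q):
    """Check a claimed answer the fast way: one multiplication."""
    return p > 1 and q > 1 and p * q == n


n = 13 * 31                 # the server only tells you "403"
p = smallest_factor(n)      # work grows with the size of the primes
q = n // p
assert check_answer(n, p, q)
```

The cost of smallest_factor grows with the size of the smaller prime,
while check_answer stays essentially constant for numbers of this
size.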

Hence, if we write some code that poses this challenge on the server
side, and asks a small bit of JavaScript to solve it on the client
side, any DDoS attacker will give up, because his CPU will die.
However, if you're just writing casual comments on a web page, a
little CPU spike is ignorable. Your computer solves the problem by the
time you finish typing, we validate everything on the server side, and
the human is not bothered at all. But there are some problems with
this approach, too lengthy to explain here :)
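
The server half of that flow might look roughly like the sketch
below. It is Python rather than PHP, it is not part of
captcha.module, and every name in it (ChallengeStore, issue, verify,
the low/high bounds) is hypothetical. The prime sizes are toy values;
how large they would need to be in practice is exactly the kind of
tuning I'm leaving out.

```python
import random
import secrets


def is_prime(n):
    """Trial-division primality test, fine for small toy primes."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def random_prime(rng, low, high):
    """Pick random candidates in [low, high) until one is prime."""
    while True:
        candidate = rng.randrange(low, high)
        if is_prime(candidate):
            return candidate


class ChallengeStore:
    """Hypothetical server-side store of outstanding challenges."""

    def __init__(self, rng=None, low=10_000, high=100_000):
        self._rng = rng if rng is not None else random.SystemRandom()
        self._low = low
        self._high = high
        self._pending = {}  # token -> product the client must factor

    def issue(self):
        """Create a challenge; only the token and product leave the server."""
        p = random_prime(self._rng, self._low, self._high)
        q = random_prime(self._rng, self._low, self._high)
        token = secrets.token_hex(8)
        self._pending[token] = p * q
        return token, p * q

    def verify(self, token, p, q):
        """Accept each token once, and only with a non-trivial factorization."""
        n = self._pending.pop(token, None)
        return n is not None and p > 1 and q > 1 and p * q == n
```

The form would carry the token and the product; the client script
sends back the token and the two factors along with the comment.
Popping the token on verify means a solved challenge cannot be
replayed for a second submission.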

Another possible, and simpler, solution is to restrict form submits
from a single IP address to, for example, 3 per second. However, this
fails if you have a group of people behind a proxy, since they all
appear to come from the same address.
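
A rough sketch of such a limiter, again in Python and with made-up
names (IpRateLimiter, allow); the 3-per-second figure is just the
example number from above, and a sliding window is one assumption
about how to count submits:

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most max_submits form submits per IP per time window."""

    def __init__(self, max_submits=3, window_seconds=1.0):
        self.max_submits = max_submits
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> timestamps of accepted submits

    def allow(self, ip, now=None):
        """Return True and record the submit if this IP is under the limit."""
        if now is None:
            now = time.monotonic()
        recent = self._recent[ip]
        # Forget submits that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_submits:
            return False
        recent.append(now)
        return True
```

Everyone behind the same proxy shares one deque here, which is the
failure case: a fourth person submitting in the same second is
rejected even though no one is misbehaving.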

CONCLUSION: If you ARE interested in writing up a function like this,
OR a function amongst the ones you had suggested, no problem! Simply
pick up the captcha.module in CVS, and start coding! It has an API that
allows you to write simple _challenge() and _response parts, and it
does the rest.
For example, the default challenge in captcha.module CVS is the math
problem you had posed (3 times 5).

Easy!

Cheers,
Arnab

On 2/9/06, Syscrusher <syscrusher at 4th.com> wrote:
> On Thursday 09 February 2006 06:57, Bèr Kessels wrote:
> > Captcha is bad. Evil. ;)
> > http://drupal.org/node/46666 is a faaaar better approach. (imo). It validated
> > the email first over SMTP on the remote server, and then reports back.
>
> Nice module...thanks for the tip. I will probably install it when I (soon!)
> upgrade my site to a newer Drupal version.
>
> Unfortunately, it doesn't solve my immediate problem. The specific domain
> being spoofed (yahoo.com) is one of those that returns a 250 (accepted)
> for any recipient, whether valid or not, so the module you suggest won't
> block my DoS attack. That being said, since Email Verify is user-transparent,
> there's no reason not to use it in addition to other precautions.
>
> I agree with you that Captcha is not an optimum solution. I'm thinking
> about adapting the algorithm so that instead of displaying an image and
> asking people to type the text, I would create a module that asks a
> simple, randomly-generated question that any human being could answer.
> I could actually draw the questions from content on my site. What I have
> in mind is to pull text from a random article and then ask a question
> that is permutated from sentence structure and possibly from some keywords
> found in the text. The possibilities are finite, but with careful algorithm
> design it could be made so that there are a very large number of
> possibilities.
>
> Example questions, assuming the paragraph above is the random text:
>
> * What is the last word before the first period?
> * What is the first hyphenated phrase in the text?
> * How many times does the word "create" appear?
> * What is 3 times 5?         (who says we have to just ask about the text?)
> * Retype the user name you have chosen but put an extra Q on the end.
> * What is the third word of the second sentence?
> * What is the second-to-last word of the text?
> * What word appears right after the first occurrence of "not"?
>
> I've come up with these eight qualitatively different questions without really
> thinking about it. Now, in the module, there would be intentionally different
> ways of phrasing each one using quirks of human language, such as "Type the
> word that appears before the first hyphen" rather than the second question.
> Add into that the use of random mathematical questions, common knowledge
> ("What is the name of this planet?", "What ocean is between Africa and
> South America?", "What continent includes the country of Germany?" and so on...
> carefully chosen questions that are culturally neutral.)
>
> One could also allow the site admin to add in a list of Q&A that any person
> registering at their site would know based on the topic of the site. For
> Drupal.org, questions might be "CMS stands for ____ management system:" or
> "What word means to obtain a copy of Drupal from our web site to your computer
> so you can install it?" or "How many eyes does our logo have?" or "How many
> menu tabs are at the top-right corner of the Drupal home page?"
>
> Site admins could add general-knowledge questions that are not culturally-neutral
> based on the audience for their site. For example, if you are building a site
> about Canadian politics, it is not unreasonable to expect visitors to be able
> to name the Prime Minister of Canada or state how many provinces there are
> or name the province just west of Manitoba.
>
> The biggest challenge I can see is to make the questions patterned enough
> to use the t() function with replaceable parameters to allow translations,
> yet still have enough patterns to make a spammer's job difficult. One approach
> is to have multiple patterns for each question, e.g.,
>      $patterns[0] = array(
>          "What word appears %location the first %punc?",
>          "The first occurrence of %punc appears just %location what word?",
>          "What's the word right %location %punc, the first time %punc occurs?",
>          "There is a word immediately %location %punc in this text. What is it?",
>          "Look for %punc in the text and type the word %location it:",
>      );
>      $patterns[3] = array(
>          "What is %expression?",
>          "Compute the value of %expression:",
>          "In mathematics, %expression equals what?",
>          "Tell me the numeric value for %expression.",
>          "%expression is how many?"
>      );
>
> The neat thing about this is that if we have multiple translations of the
> questions, the spammers have to follow us. And linguistic syntax variance will
> cause the replaceable parameters to appear at different points in different
> languages. Nine questions permutated five ways each is only 45 pregenerated
> questions, but this is improved by the fact that some of them (like the
> math one) can have %expression also written in multiple language-neutral
> ways:  3x5, 3 * 5, (2+1)x(2+3)   so that the spammer now also has to have
> an arithmetic expression evaluator. Make it tougher by adding simple algebra:
>
>     Y * (2+1) = X, and Y is 5. What is X?
>
> or grade-school-level "word problems" from math class:
>
>     Johnny has 12 candies and shares 4 with his sister. How many are left?
>     Sally shares 16 coins equally among 8 people. How many does each receive?
>
> There are lots of ways to word these, as any teacher can tell you.
>
> The system is made stronger if the same replaceable parameters are used in the
> same word positions in different questions that have different answers, e.g.,
> "What word appears [after] the first [.]?" versus "First letter appearing
> [in] the first [sentence]?" -- two questions that parse similarly but have
> non-overlapping answer domains.
>
> Would it be possible to build an AI that outsmarted this system? Of course.
> But it would be nontrivial, and it would have to be Drupal-specific, and if
> the site admins are diligent in adding their own questions, it would have to
> be at least partially site-specific.
>
> It's not foolproof, but it would at least make things a little more challenging
> for the spammers, and it doesn't rely on images.
>
> Comments?
>
> --
> -------------------------------------------------------------------------------
> Syscrusher (Scott Courtney)          Drupal page:   http://drupal.org/user/9184
> syscrusher at 4th dot com            Home page:     http://4th.com/
>

--
http://www.arnab.org