Re: [development] Captcha module -- possible alternative approach

9 Feb 2006

      On Thursday 09 February 2006 06:57, Bèr Kessels wrote:
...
Captcha is bad. Evil. ;)
 http://drupal.org/node/46666 is a faaaar better approach. (imo). It validates
the email first over SMTP on the remote server, and then reports back.

Nice module...thanks for the tip. I will probably install it when I (soon!)
upgrade my site to a newer Drupal version.

Unfortunately, it doesn't solve my immediate problem. The specific domain
being spoofed (yahoo.com) is one of those that returns a 250 (accepted)
for any recipient, whether valid or not, so the module you suggest won't
stop the DoS attack on my site. That said, since Email Verify is transparent
to the user, there's no reason not to use it in addition to other precautions.

I agree with you that Captcha is not an optimal solution. I'm thinking
about adapting the algorithm so that instead of displaying an image and
asking people to type the text, I would create a module that asks a
simple, randomly-generated question that any human being could answer.
I could actually draw the questions from content on my site. What I have
in mind is to pull text from a random article and then ask a question
derived from its sentence structure and possibly from some keywords
found in the text. The possibilities are finite, but with careful algorithm
design the number of possible questions could be made very large.

Example questions, assuming the paragraph above is the random text:

* What is the last word before the first period?
* What is the first hyphenated phrase in the text?
* How many times does the word "create" appear?
* What is 3 times 5?         (who says we have to just ask about the text?)
* Retype the user name you have chosen but put an extra Q on the end.
* What is the third word of the second sentence?
* What is the second-to-last word of the text?
* What word appears right after the first occurrence of "not"?
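
A rough sketch of how a few of these answers might be computed from the
source text. All of the tq_* function names are hypothetical (they are not
from any existing Drupal module), and the sentence and word splitting rules
are deliberately naive assumptions.

```php
<?php
// Hypothetical helpers for text-based questions; not part of Drupal.

// Split text into sentences on ., ! or ? followed by whitespace (naive rule).
function tq_sentences($text) {
  return preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Split text into words, keeping internal hyphens and apostrophes together.
function tq_words($text) {
  preg_match_all("/[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/", $text, $m);
  return $m[0];
}

// "What is the last word before the first period?"
function tq_last_word_before_first_period($text) {
  $pos = strpos($text, '.');
  if ($pos === FALSE) {
    return NULL;
  }
  $words = tq_words(substr($text, 0, $pos));
  return $words ? end($words) : NULL;
}

// "How many times does the word X appear?" (whole words, case-insensitive)
function tq_count_word($text, $word) {
  return preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $text);
}

// "What is the Nth word of the Mth sentence?" (both counted from 1)
function tq_nth_word_of_sentence($text, $sentence_no, $word_no) {
  $sentences = tq_sentences($text);
  if (!isset($sentences[$sentence_no - 1])) {
    return NULL;
  }
  $words = tq_words($sentences[$sentence_no - 1]);
  return isset($words[$word_no - 1]) ? $words[$word_no - 1] : NULL;
}

// "What word appears right after the first occurrence of X?"
function tq_word_after($text, $needle) {
  $words = tq_words($text);
  foreach ($words as $i => $w) {
    if (strcasecmp($w, $needle) === 0) {
      return isset($words[$i + 1]) ? $words[$i + 1] : NULL;
    }
  }
  return NULL;
}

$sample = "Captcha is not an optimum solution. I would create a module "
  . "that asks a simple, randomly-generated question.";

print tq_last_word_before_first_period($sample) . "\n";  // solution
print tq_count_word($sample, 'create') . "\n";           // 1
print tq_nth_word_of_sentence($sample, 2, 3) . "\n";     // create
print tq_word_after($sample, 'not') . "\n";              // an
```

The submitted answer would presumably be compared case-insensitively, and
the module would want to skip any question whose helper returns NULL for
the chosen text.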

I've come up with these eight qualitatively different questions without really
thinking about it. Now, in the module, there would be intentionally different
ways of phrasing each one using quirks of human language, such as "Type the
word that appears before the first hyphen" rather than the second question.
Add to that random mathematical questions and common-knowledge questions
("What is the name of this planet?", "What ocean is between Africa and
South America?", "What continent includes the country of Germany?" and so on),
carefully chosen to be culturally neutral.

One could also allow the site admin to add in a list of Q&A that any person
registering at their site would know based on the topic of the site. For
Drupal.org, questions might be "CMS stands for ____ management system:" or
"What word means to obtain a copy of Drupal from our web site to your computer
so you can install it?" or "How many eyes does our logo have?" or "How many
menu tabs are at the top-right corner of the Drupal home page?"

Site admins could add general-knowledge questions that are not culturally neutral
based on the audience for their site. For example, if you are building a site
about Canadian politics, it is not unreasonable to expect visitors to be able
to name the Prime Minister of Canada or state how many provinces there are
or name the province just west of Manitoba.
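
To illustrate, an admin-supplied list could be as simple as a map from each
question to its accepted answers, checked after some light normalization.
This is only a sketch: the $site_questions structure and the qa_* function
names are my own assumptions, not an existing Drupal API.

```php
<?php
// Hypothetical admin-supplied list: question => accepted answers (lowercase).
$site_questions = array(
  'CMS stands for ____ management system:' => array('content'),
  'How many provinces does Canada have?' => array('10', 'ten'),
  'What province is just west of Manitoba?' => array('saskatchewan'),
);

// Normalize an answer: trim, lowercase, collapse runs of whitespace, and
// drop a trailing period so that "Ten." and " ten" compare equal.
function qa_normalize($answer) {
  $answer = strtolower(trim($answer));
  $answer = preg_replace('/\s+/', ' ', $answer);
  return rtrim($answer, '.');
}

// Return TRUE if $answer is one of the accepted answers for $question.
function qa_check($questions, $question, $answer) {
  if (!isset($questions[$question])) {
    return FALSE;
  }
  return in_array(qa_normalize($answer), $questions[$question], TRUE);
}

var_dump(qa_check($site_questions, 'How many provinces does Canada have?', ' Ten '));
// bool(true)
```

Listing several accepted answers per question ("10" and "ten") keeps the
check forgiving for humans without making the answer space any larger for
a bot.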

The biggest challenge I can see is to make the questions regular enough
to use the t() function with replaceable parameters to allow translations,
yet still have enough distinct patterns to make a spammer's job difficult.
One approach is to have multiple patterns for each question, e.g.,
     $patterns[0] = array(
         "What word appears %location the first %punc?",
         "The first occurrence of %punc appears just %location what word?",
         "What's the word right %location %punc, the first time %punc occurs?",
         "There is a word immediately %location %punc in this text. What is it?",
         "Look for %punc in the text and type the word %location it:",
     );
     $patterns[3] = array(
         "What is %expression?",
         "Compute the value of %expression:",
         "In mathematics, %expression equals what?",
         "Tell me the numeric value for %expression.",
         "%expression is how many?"
     );
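
A minimal sketch of how one phrasing might be picked and filled in. Note
that demo_t() is only a stand-in I made up for this example: it does plain
placeholder substitution with strtr() and no translation, so it is not
Drupal's real t(). demo_build_question() is likewise hypothetical.

```php
<?php
// Stand-in for Drupal's t(): substitutes placeholders, does NOT translate.
function demo_t($string, $args = array()) {
  return strtr($string, $args);
}

// Two of the phrasings from the first pattern group.
$patterns = array(
  "What word appears %location the first %punc?",
  "Look for %punc in the text and type the word %location it:",
);

// Pick one phrasing (at random unless $index is given) and fill it in.
function demo_build_question($patterns, $args, $index = NULL) {
  if ($index === NULL) {
    $index = mt_rand(0, count($patterns) - 1);
  }
  return demo_t($patterns[$index], $args);
}

$args = array('%location' => 'after', '%punc' => 'comma');

print demo_build_question($patterns, $args, 0) . "\n";
// What word appears after the first comma?
print demo_build_question($patterns, $args, 1) . "\n";
// Look for comma in the text and type the word after it:
```

In a real module the values substituted for %location and %punc would
themselves need to be translatable strings.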

The neat thing about this is that if we have multiple translations of the
questions, the spammers have to follow us. And linguistic syntax variance will
cause the replaceable parameters to appear at different points in different
languages. Nine questions phrased five ways each is only 45 pregenerated
questions, but this is improved by the fact that some of them (like the
math one) can have %expression also written in multiple language-neutral
ways:  3x5, 3 * 5, (2+1)x(2+3)   so that the spammer now also has to have
an arithmetic expression evaluator. Make it tougher by adding simple algebra:

    Y * (2+1) = X, and Y is 5. What is X?

or grade-school-level "word problems" from math class:

    Johnny has 12 candies and shares 4 with his sister. How many are left?
    Sally shares 16 coins equally among 8 people. How many does each receive?

There are lots of ways to word these, as any teacher can tell you.
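
Here is a sketch of how such questions might be generated together with
their expected answers. The math_* function names and the array layout are
hypothetical, and the generator assumes small positive integers (at least
2) so that the "split into a sum" notation stays sensible.

```php
<?php
// Hypothetical generators; each returns array('question' => ..., 'answer' => ...).

// Render $a times $b in one of three notations. Assumes $a >= 2 and $b >= 3.
function math_expression_question($a, $b, $style) {
  switch ($style) {
    case 0:
      $expr = "{$a}x{$b}";                    // e.g. 3x5
      break;
    case 1:
      $expr = "$a * $b";                      // e.g. 3 * 5
      break;
    default:
      // Split each factor into a sum, e.g. (2+1)x(2+3)
      $expr = sprintf('(%d+1)x(2+%d)', $a - 1, $b - 2);
  }
  return array('question' => "What is $expr?", 'answer' => $a * $b);
}

// Simple algebra: Y * (m) = X with the multiplier written as a sum.
function math_algebra_question($y, $m) {
  $text = sprintf('Y * (%d+1) = X, and Y is %d. What is X?', $m - 1, $y);
  return array('question' => $text, 'answer' => $y * $m);
}

// Word problem: the total is built from the answer so it always divides evenly.
function math_sharing_question($people, $each) {
  $text = sprintf(
    'Sally shares %d coins equally among %d people. How many does each receive?',
    $people * $each, $people);
  return array('question' => $text, 'answer' => $each);
}

$q = math_expression_question(3, 5, 2);
print $q['question'] . ' => ' . $q['answer'] . "\n";
// What is (2+1)x(2+3)? => 15

$q = math_sharing_question(8, 2);
print $q['question'] . ' => ' . $q['answer'] . "\n";
// Sally shares 16 coins equally among 8 people. How many does each receive? => 2
```

Building each question from its answer, rather than parsing the rendered
text, means the module never needs an expression evaluator of its own,
while a spammer's bot does.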

The system is made stronger if the same replaceable parameters are used in the
same word positions in different questions that have different answers, e.g.,
"What word appears [after] the first [.]?" versus "First letter appearing
[in] the first [sentence]?" -- two questions that parse similarly but have
non-overlapping answer domains. 

Would it be possible to build an AI that outsmarted this system? Of course.
But it would be nontrivial, and it would have to be Drupal-specific, and if
the site admins are diligent in adding their own questions, it would have to
be at least partially site-specific.

It's not foolproof, but it would at least make things a little more challenging
for the spammers, and it doesn't rely on images.

Comments?

-- 
-------------------------------------------------------------------------------
Syscrusher (Scott Courtney)          Drupal page:   http://drupal.org/user/9184
syscrusher at 4th dot com            Home page:     http://4th.com/