[drupal-devel] [bug] Clean up plain-text checking

Bart Jansens drupal-devel at drupal.org
Tue Mar 15 13:11:01 UTC 2005


Issue status update for http://drupal.org/node/18817

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Steven
 Reported by:  Steven
 Updated by:   Bart Jansens
 Status:       patch

Looks good.
I spent some time trying to find places where the check_plain() call
was missing and I found only one; the URL of an aggregator feed isn't
checked. 
One thing that isn't entirely clear is who is responsible for calling
check_plain() or check_output(). In some modules this is done by a
theme function, in others the data passed to the theme functions is
already checked. I'm not saying this patch should fix all this (it's
already big enough), but just something to keep in mind for the future.
Currently some themes might forget a check.


Bart Jansens



Previous comments:
------------------------------------------------------------------------

March 13, 2005 - 03:00 : Steven

Attachment: http://drupal.org/files/issues/check.patch (91.83 KB)

Back story
Bart Jansens brought up some possible XSS issues recently which
prompted me to reevaluate the text checking in Drupal.
The big issue is "plain text": what a user typically enters in
single-line boxes. When this text is put into HTML, special characters
like < > and & need to be escaped. When it is put into an HTML
attribute, " needs to be escaped too. Note that it is not about
filtering (what check_output() does), but about handling plain-text
(commonly associated with single-line editboxes).
The current 'standard' in Drupal is to do escaping when outputting the
titles in HTML, and keep the plain-text in the database. We should keep
this as it is in line with "keep exactly what the user typed and process
it later". Furthermore it is hard to undo this escaping, which is needed
when communicating with other services or when exporting data.
This brings up the problem: where should text be escaped? There are in
fact many places where this escaping is not done yet, which means there
are problems if the user enters e.g. < or & in a title. Usually these
bugs are in admin areas, so not critical for security, but they should
still be fixed in the spirit of XHTML validation and usability.
So this means that before the call to check_plain(), the data is
plain-text and no HTML can be used in it. Once the text has been run
through check_plain(), it is HTML, and tags can be added. Logically, it
is easy: data should be escaped as soon as it becomes HTML, but exactly
deciding where data becomes HTML is tricky.
Any parameter which is passed as plain-text and only checked later in
the code can not contain HTML, so I've had to decide where to add
missing escapings based on context and common sense.
This issue also affects URLs, but there the XSS issues are more
important. Plus all of this should not be confused with urlencode()
which is simply used to put arbitrary data into an URL without it
disrupting normal URL semantics (directories, anchors, query arguments,
etc).
Concrete changes

drupal_specialchars() and check_form() were almost the same, except
that one would escape " while the other wouldn't. This difference is
negligable and it does no harm to escape too much. In the interest of
cleaning up text handling, I merged these into check_plain().
check_url() had a toss at extra XSS protection in it, but it was in
fact broken and wholly incorrect. I replaced it with what was probably
the original intention.
Many URL handling functions in fact worked with HTML-escaped URLs
(using &amp; instead of & for example). I got rid of this, and did the
escaping in a more logical place (e.g. drupal_goto no longer has to
reconvert the &amp;).
Node/comment titles used to be run through strip_tags(), which means
you couldn't use HTML but were still forced to escape angle brackets
and amps. I changed them to be consistent with other plain-text fields
in Drupal.
Profile checkbox items were handled inconsistently as well and were
fixed.
The l() function now escapes the title by default, in the interest of
keeping the code cleaner. But the old behaviour is still available (and
needed in some places) with an extra parameter.

Text handling in Drupal after this patch

Any piece of user submitted text should be run through one of the
check_ functions before being put into HTML. Use check_plain() for
plain-text, check_output() for rich text.
Dynamic data in URLs should be urlencode()d, regardless of where that
URL ends up.
URLs in HTML attributes should be check_url()'d.

If you are unsure if you're doing the right thing, the easiest test is
to try out putting HTML tags in your module's fields where they are not
intended. They should be displayed on-screen and not interpreted.





More information about the drupal-devel mailing list