[drupal-devel] [bug] Clean up plain-text checking

Mon Mar 28 17:04:47 UTC 2005

Issue status update for http://drupal.org/node/18817

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Steven
 Reported by:  Steven
 Updated by:   Dries
 Status:       patch

The approach taken in this patch looks good.  I second that this should
be committed ASAP (1) such that it gets tested before the Drupal 4.6
release and (2) such that contributed modules/themes can update their
code accordingly.  (Note that I haven't actually tested the patch -- I
just looked at the diff.)
I'm a bit worried about people not using check_plain() -- it is easy
enough to forget wrapping $node->title in check_plain().  I wonder what
we can do about that, or what the implications could be like.  The least
we could do is add a test to the code-checker script.  Either way, it
improves the current behavior so +1 from me.

Dries

Previous comments:
------------------------------------------------------------------------

March 13, 2005 - 04:00 : Steven

Attachment: http://drupal.org/files/issues/check.patch (91.83 KB)

Back story
Bart Jansens brought up some possible XSS issues recently which
prompted me to reevaluate the text checking in Drupal.
The big issue is "plain text": what a user typically enters in
single-line boxes. When this text is put into HTML, special characters
like < > and & need to be escaped. When it is put into an HTML
attribute, " needs to be escaped too. Note that it is not about
filtering (what check_output() does), but about handling plain-text
(commonly associated with single-line editboxes).
The current 'standard' in Drupal is to do escaping when outputting the
titles in HTML, and keep the plain-text in the database. We should keep
this as it is in line with "keep exactly what the user typed and process
it later". Furthermore it is hard to undo this escaping, which is needed
when communicating with other services or when exporting data.
This brings up the problem: where should text be escaped? There are in
fact many places where this escaping is not done yet, which means there
are problems if the user enters e.g. < or & in a title. Usually these
bugs are in admin areas, so not critical for security, but they should
still be fixed in the spirit of XHTML validation and usability.
So this means that before the call to check_plain(), the data is
plain-text and no HTML can be used in it. Once the text has been run
through check_plain(), it is HTML, and tags can be added. Logically, it
is easy: data should be escaped as soon as it becomes HTML, but exactly
deciding where data becomes HTML is tricky.
Any parameter which is passed as plain-text and only checked later in
the code can not contain HTML, so I've had to decide where to add
missing escapings based on context and common sense.
This issue also affects URLs, but there the XSS issues are more
important. Plus all of this should not be confused with urlencode()
which is simply used to put arbitrary data into an URL without it
disrupting normal URL semantics (directories, anchors, query arguments,
etc).
Concrete changes

drupal_specialchars() and check_form() were almost the same, except
that one would escape " while the other wouldn't. This difference is
negligable and it does no harm to escape too much. In the interest of
cleaning up text handling, I merged these into check_plain().
check_url() had a toss at extra XSS protection in it, but it was in
fact broken and wholly incorrect. I replaced it with what was probably
the original intention.
Many URL handling functions in fact worked with HTML-escaped URLs
(using &amp; instead of & for example). I got rid of this, and did the
escaping in a more logical place (e.g. drupal_goto no longer has to
reconvert the &amp;).
Node/comment titles used to be run through strip_tags(), which means
you couldn't use HTML but were still forced to escape angle brackets
and amps. I changed them to be consistent with other plain-text fields
in Drupal.
Profile checkbox items were handled inconsistently as well and were
fixed.
The l() function now escapes the title by default, in the interest of
keeping the code cleaner. But the old behaviour is still available (and
needed in some places) with an extra parameter.

Text handling in Drupal after this patch

Any piece of user submitted text should be run through one of the
check_ functions before being put into HTML. Use check_plain() for
plain-text, check_output() for rich text.
Dynamic data in URLs should be urlencode()d, regardless of where that
URL ends up.
URLs in HTML attributes should be check_url()'d.

If you are unsure if you're doing the right thing, the easiest test is
to try out putting HTML tags in your module's fields where they are not
intended. They should be displayed on-screen and not interpreted.

------------------------------------------------------------------------

March 15, 2005 - 15:10 : Bart Jansens

Looks good.
I spent some time trying to find places where the check_plain() call
was missing and I found only one; the URL of an aggregator feed isn't
checked. 
One thing that isn't entirely clear is who is responsible for calling
check_plain() or check_output(). In some modules this is done by a
theme function, in others the data passed to the theme functions is
already checked. I'm not saying this patch should fix all this (it's
already big enough), but just something to keep in mind for the future.
Currently some themes might forget a check.

------------------------------------------------------------------------

March 16, 2005 - 03:29 : Steven

For nodes/comments the responsibility lies with the theme engine anyway,
not the template. The other cases are not themed as often.

------------------------------------------------------------------------

March 28, 2005 - 12:49 : TDobes

Steven, these are excellent and important changes, but the patch is
quite large. If this is intended to land before 4.6 is released, I
recommend that it land sooner rather than later so that it receives
more widespread testing to iron out any bugs that may appear.
In the patches to theme.inc, themes/chameleon.theme, and
themes/engines/xtemplate/xtemplate.engine, you pass the page title
through strip_tags (fixing this bug [1]) but not through check_plain. 
After we run it through strip_tags, is there any reason why we wouldn't
want to run it through check_plain to escape text just like anywhere
else?  In fact, it might be useful to add a second parameter to
check_plain that would do this for us (strip_tags first, then
htmlspecialchars)  We need to update the theme documentation after this
patch lands so that contrib themes display titles in a manner consistent
with core.
In the patch for modules/comment.inc, the strip_tags function is
removed from the comment subject auto-generation code.  Because comment
bodies can contain HTML, there should probably be a strip_tags in there
somewhere to prevent undesirable results. I see that the comment
subjects are passed through a check_plain which should render the HTML
inert, but it would still be more pleasant to see something like "What
I think is" rather than "<p>What I think"
In modules/path.module, there is a typo: check_plani should be
check_plain
Note that this patch also includes a fix for this bug [2] as well.
(Invalid search results link in the watchdog.)
[1] http://drupal.org/node/19554
[2] http://drupal.org/node/19535

------------------------------------------------------------------------

March 28, 2005 - 12:52 : TDobes

I forgot to say:
After the minor issues I noted are addressed, +1.  This is very helpful
for usability as it removes the burden of escaping text from users. 
It's also a major plus for consistency; before this patch, I could name
at least one field that was escaped on some pages and not on others. 
Thanks for addressing this!

------------------------------------------------------------------------

March 28, 2005 - 18:41 : Uwe Hermann

+1. Looks good.