[drupal-devel] [bug] Clean up plain-text checking

Tue Mar 29 09:23:08 UTC 2005

Issue status update for http://drupal.org/node/18817

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Steven
 Reported by:  Steven
 Updated by:   TDobes
 Status:       patch

Steven: After reading your comments, I agree with you and retract my
earlier complaints.  +1 for applying this soon.
However, I still think leaving the strip_tags on the comment subject
autogeneration code will do no harm and may help in some cases.  Some
sites may actually operate with the comment subject field turned off so
all comment subjects are auto-generated.  If those sites allow HTML,
HTML in comment subjects is nearly guaranteed.  For sites that don't
allow HTML, although it will not be helpful, I can't see how it will do
any harm.
Also, your patches do not include a strip_tags for the <title> element
in chameleon.theme.  You'll notice that the $title variable actually
isn't yet defined when it's used... there's a patch waiting to fix that
[1].
[1] http://drupal.org/node/19585

TDobes

Previous comments:
------------------------------------------------------------------------

March 12, 2005 - 20:00 : Steven

Attachment: http://drupal.org/files/issues/check.patch (91.83 KB)

Back story
Bart Jansens brought up some possible XSS issues recently which
prompted me to reevaluate the text checking in Drupal.
The big issue is "plain text": what a user typically enters in
single-line boxes. When this text is put into HTML, special characters
like < > and & need to be escaped. When it is put into an HTML
attribute, " needs to be escaped too. Note that it is not about
filtering (what check_output() does), but about handling plain-text
(commonly associated with single-line editboxes).
The current 'standard' in Drupal is to do escaping when outputting the
titles in HTML, and keep the plain-text in the database. We should keep
this as it is in line with "keep exactly what the user typed and process
it later". Furthermore it is hard to undo this escaping, which is needed
when communicating with other services or when exporting data.
This brings up the problem: where should text be escaped? There are in
fact many places where this escaping is not done yet, which means there
are problems if the user enters e.g. < or & in a title. Usually these
bugs are in admin areas, so not critical for security, but they should
still be fixed in the spirit of XHTML validation and usability.
So this means that before the call to check_plain(), the data is
plain-text and no HTML can be used in it. Once the text has been run
through check_plain(), it is HTML, and tags can be added. Logically, it
is easy: data should be escaped as soon as it becomes HTML, but exactly
deciding where data becomes HTML is tricky.
Any parameter which is passed as plain-text and only checked later in
the code can not contain HTML, so I've had to decide where to add
missing escapings based on context and common sense.
This issue also affects URLs, but there the XSS issues are more
important. Plus all of this should not be confused with urlencode()
which is simply used to put arbitrary data into an URL without it
disrupting normal URL semantics (directories, anchors, query arguments,
etc).
Concrete changes

drupal_specialchars() and check_form() were almost the same, except
that one would escape " while the other wouldn't. This difference is
negligable and it does no harm to escape too much. In the interest of
cleaning up text handling, I merged these into check_plain().
check_url() had a toss at extra XSS protection in it, but it was in
fact broken and wholly incorrect. I replaced it with what was probably
the original intention.
Many URL handling functions in fact worked with HTML-escaped URLs
(using &amp; instead of & for example). I got rid of this, and did the
escaping in a more logical place (e.g. drupal_goto no longer has to
reconvert the &amp;).
Node/comment titles used to be run through strip_tags(), which means
you couldn't use HTML but were still forced to escape angle brackets
and amps. I changed them to be consistent with other plain-text fields
in Drupal.
Profile checkbox items were handled inconsistently as well and were
fixed.
The l() function now escapes the title by default, in the interest of
keeping the code cleaner. But the old behaviour is still available (and
needed in some places) with an extra parameter.

Text handling in Drupal after this patch

Any piece of user submitted text should be run through one of the
check_ functions before being put into HTML. Use check_plain() for
plain-text, check_output() for rich text.
Dynamic data in URLs should be urlencode()d, regardless of where that
URL ends up.
URLs in HTML attributes should be check_url()'d.

If you are unsure if you're doing the right thing, the easiest test is
to try out putting HTML tags in your module's fields where they are not
intended. They should be displayed on-screen and not interpreted.

------------------------------------------------------------------------

March 15, 2005 - 07:10 : Bart Jansens

Looks good.
I spent some time trying to find places where the check_plain() call
was missing and I found only one; the URL of an aggregator feed isn't
checked. 
One thing that isn't entirely clear is who is responsible for calling
check_plain() or check_output(). In some modules this is done by a
theme function, in others the data passed to the theme functions is
already checked. I'm not saying this patch should fix all this (it's
already big enough), but just something to keep in mind for the future.
Currently some themes might forget a check.

------------------------------------------------------------------------

March 15, 2005 - 19:29 : Steven

For nodes/comments the responsibility lies with the theme engine anyway,
not the template. The other cases are not themed as often.

------------------------------------------------------------------------

March 28, 2005 - 04:49 : TDobes

Steven, these are excellent and important changes, but the patch is
quite large. If this is intended to land before 4.6 is released, I
recommend that it land sooner rather than later so that it receives
more widespread testing to iron out any bugs that may appear.
In the patches to theme.inc, themes/chameleon.theme, and
themes/engines/xtemplate/xtemplate.engine, you pass the page title
through strip_tags (fixing this bug [2]) but not through check_plain. 
After we run it through strip_tags, is there any reason why we wouldn't
want to run it through check_plain to escape text just like anywhere
else?  In fact, it might be useful to add a second parameter to
check_plain that would do this for us (strip_tags first, then
htmlspecialchars)  We need to update the theme documentation after this
patch lands so that contrib themes display titles in a manner consistent
with core.
In the patch for modules/comment.inc, the strip_tags function is
removed from the comment subject auto-generation code.  Because comment
bodies can contain HTML, there should probably be a strip_tags in there
somewhere to prevent undesirable results. I see that the comment
subjects are passed through a check_plain which should render the HTML
inert, but it would still be more pleasant to see something like "What
I think is" rather than "<p>What I think"
In modules/path.module, there is a typo: check_plani should be
check_plain
Note that this patch also includes a fix for this bug [3] as well.
(Invalid search results link in the watchdog.)
[2] http://drupal.org/node/19554
[3] http://drupal.org/node/19535

------------------------------------------------------------------------

March 28, 2005 - 04:52 : TDobes

I forgot to say:
After the minor issues I noted are addressed, +1.  This is very helpful
for usability as it removes the burden of escaping text from users. 
It's also a major plus for consistency; before this patch, I could name
at least one field that was escaped on some pages and not on others. 
Thanks for addressing this!

------------------------------------------------------------------------

March 28, 2005 - 10:41 : Uwe Hermann

+1. Looks good.

------------------------------------------------------------------------

March 28, 2005 - 11:04 : Dries

The approach taken in this patch looks good.  I second that this should
be committed ASAP (1) such that it gets tested before the Drupal 4.6
release and (2) such that contributed modules/themes can update their
code accordingly.  (Note that I haven't actually tested the patch -- I
just looked at the diff.)
I'm a bit worried about people not using check_plain() -- it is easy
enough to forget wrapping $node->title in check_plain().  I wonder what
we can do about that, or what the implications could be like.  The least
we could do is add a test to the code-checker script.  Either way, it
improves the current behavior so +1 from me.

------------------------------------------------------------------------

March 28, 2005 - 17:24 : Steven

Attachment: http://drupal.org/files/issues/check_0.patch (91.92 KB)

The forgetting to use check_plain() argument is IMO not valid, because
such checking needs to be done already. Because of the very
inconsistent way that node/comment titles were handled, this was not
necessary before for just this case, leading people to believe that it
was allowed to be omitted everywhere else too. Some modules (like
aggregator) had no escaping in them whatsoever.
In any case, I will write a short text explaining how to deal with text
processing in Drupal. It is an important topic, and I must say I'm quite
disappointed by how the size of this patch has scared away most people.
It's an important issue that needs to be adressed and taken into
account by everyone.
But the best way to check is to try to put HTML into plain-text fields
and making sure it comes out literally. E.g. "<u>this text should not
be underlined</u>".
"In the patches to theme.inc, themes/chameleon.theme, and
themes/engines/xtemplate/xtemplate.engine, you pass the page title
through strip_tags (fixing this bug) but not through check_plain. After
we run it through strip_tags, is there any reason why we wouldn't want
to run it through check_plain to escape text just like anywhere else?
In fact, it might be useful to add a second parameter to check_plain
that would do this for us (strip_tags first, then htmlspecialchars)
"
First of all, a plain strip_tags() should never be used to validate or
check user input, as it prevents you from using any HTML, but still
forces you to escape special characters manually. Hence you get all the
downsides of HTML but no advantages. This is a very rare case, so I
would not modify check_plain() for this.
The reason we do strip_tags() on the page title is different: the page
title, at this point, has already been converted into HTML. It may
contains HTML tags (as the theme_confirm issue shows). And it is
inserted without any extra processing into the <h1> tag of the page.
However, the <title> tag in the HTML <head> may not contain any tags,
just HTML-escaped text. So, we strip out any tags, and present only the
text. Because the title was HTML before this point, we do not need to do
any further processing. Remember, check_plain() converts from plain-text
to HTML. Strip_tags() still outputs HTML, albeit without any tags in it.
If the source of the page title was dynamic plain-text user input, then
at one point or the other it should've been check_plain()ed before being
passed to drupal_set_title().
"In the patch for modules/comment.inc, the strip_tags function is
removed from the comment subject auto-generation code. Because comment
bodies can contain HTML, there should probably be a strip_tags in there
somewhere to prevent undesirable results. I see that the comment
subjects are passed through a check_plain which should render the HTML
inert, but it would still be more pleasant to see something like "What
I think is" rather than "<p>What I think"
"
The reason I did this is because strip_tags() automatically assumes the
comment is formatted as HTML. This is not always the case due to
flexible input formats being used. On top of that, due to the comment
title extraction being run before filtering, the chance of there being
HTML inside the title is lower.
And finally, because now we auto-generate the comment on preview rather
than submit, users will see if HTML is included in the subject, and can
correct it if needed. I think this is an acceptable compromise.
I've attached an up-to-date patch.