[drupal-devel] Plain-text checking / text output in Drupal
Steven Wittens
steven at acko.net
Thu Mar 31 09:28:18 UTC 2005
I just committed the large check_plain() patch after a green light from
Dries:
http://drupal.org/node/18817
The main idea is to make sure plain text is handled as plain text... so
you can use <> or & in a taxonomy term name or comment subject, and not
mess up validation or your site. In fact, this should already have been
okay, but due to general confusion about what the right approach was,
many people didn't pay much attention to it. After my patch, every
single-line field should be plain-text in Drupal, including in
particular node/comment titles, which used to be stored as stripped HTML
instead (which was really bad for usability, see the issue for more
info). A few notable changes:
- drupal_specialchars() and check_form() have been merged into check_plain()
- node and comment titles are now plain-text and need to be
check_plain()'ed before output. Note that due to changes in l(), many
cases will be caught already.
- theme('placeholder', $text) was added for putting dynamic pieces of
text into a sentence ("are you sure you want to delete %block"). It
outputs '<em>'. check_plain($text) .'</em>' by default and should be
used when appropriate.
I've written up a short text on text output in Drupal after my patch...
once it's cleaned up a bit, it should go into the documentation
somewhere. I'll also add a short blurb to the module upgrading guide.
Steven
-------------
Text output in Drupal
When handling and outputting text to HTML, you need to be careful that
proper filtering or escaping is done. Otherwise there might be bugs when
users try to use angle brackets or ampersands, or worse, you could open
up XSS holes.
When handling data, the golden rule is to keep exactly what the user
typed. On the database and input side of things there isn't much to
worry about. Text remains text. Note that you should never use a plain
strip_tags() call to clean up user input. This would strip out all tags,
but still force you to use entities for angle brackets or ampersands.
You get all the disadvantages of HTML without any benefits.
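To illustrate the strip_tags() problem, here is a minimal sketch using only PHP built-ins. The sample input string is made up for the example.

```php
<?php
// Made-up sample input: what a user might type into a title field.
$input = 'AT&T <b>rocks</b>';

// strip_tags() silently drops part of what the user typed...
$stripped = strip_tags($input);
echo $stripped, "\n"; // prints: AT&T rocks

// ...and the result still contains a raw ampersand, so it is neither
// the original text nor properly escaped HTML.

// Keeping the text as typed and escaping it on output preserves it all.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // prints: AT&amp;T &lt;b&gt;rocks&lt;/b&gt;
```

The second form displays in a browser exactly as the user typed it, tags included.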
Conversion is done on output where the text has to be placed in another
format/context, e.g. HTML. There are two types of text in Drupal:
1. Plain text
_____________
This is simple text without any markup. What the user entered is
displayed exactly on screen as is, and is not interpreted in any form.
This is almost always the format used for single-line text fields. It is
good to keep this consistency in your own code.
When outputting plain-text, you need to pass it through check_plain()
before it can be put inside HTML. This will convert quotes, ampersands
and angle brackets into entities.
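As a rough sketch of the conversion described above: example_check_plain() below is a hypothetical stand-in built on PHP's htmlspecialchars(), so it runs outside Drupal. It is not the actual check_plain() implementation, which may differ in detail.

```php
<?php
// Hypothetical stand-in for Drupal's check_plain(), for illustration only.
function example_check_plain($text) {
  // ENT_QUOTES converts both double and single quotes, so the result can
  // also be placed inside a quoted HTML attribute.
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Made-up comment subject containing all the troublesome characters.
$subject = 'Tom & Jerry <3 "cartoons"';

echo '<h2>' . example_check_plain($subject) . "</h2>\n";
// prints: <h2>Tom &amp; Jerry &lt;3 &quot;cartoons&quot;</h2>
```

In a browser, the heading then reads exactly as the user typed it.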
Most themable functions and APIs take HTML arguments, but there are a
few which already call check_plain() for you, for convenience:
* l(): the link caption should be passed as plain-text (unless
overridden with the $html parameter).
* menus: the menu item titles are plain-text.
* theme('placeholder'): the placeholder text is plain-text.
Some places require HTML which might not be obvious:
* page titles set through drupal_set_title(). The page title is
displayed in the HTML, where it makes sense to use tags like <em> for
clarity. When the page title is displayed in the HTML <title> tag
however, all tags will be stripped out.
* block titles passed in through hook_block(). For the same reason as
the page title, using HTML here is commonly done.
Note that functions which logically take 'data' and not 'output' will
almost always take plain-text and require no escaping on your side. A
good example is the value passed to form_ functions, e.g. a plain-text
field's contents. What the user entered is exactly what you should pass
to form_textfield(). On the other hand, this does not apply to the form
item's title or description, which are passed as HTML. This is done so
that modules can format the item title as they want.
2. Rich text
____________
This is text which is marked up in some language (HTML, Textile, etc).
It is stored in the markup-specific format, and converted to HTML with
the various filters that are enabled. This is almost always used for
multi-line text fields. All you need to do is pass the rich text to
check_output() and you'll get HTML returned, safe for outputting. You
should also allow the user to choose the input format with a format
widget through filter_form() and should pass the chosen format along to
check_output().
URLs
____
A note about URLs. URLs require special handling in two ways:
- Putting dynamic data into URLs. If you wish to put any sort of dynamic
data into a URL, you need to urlencode() it. If you don't, characters
like # will disrupt the normal URL semantics. urlencode() will escape
them with %XX syntax.
- Putting URLs into HTML. URLs are a common attack vector for XSS
exploits. Though we have an XSS filter at the beginning of the page
request, it is still smart to be careful. When putting a URL inside an
HTML attribute (e.g. <a href="...">), you should pass it through
check_url(). check_url() is similar to check_plain(), but it contains
some extra XSS protection.
Note that all Drupal functions which return URLs (url(...),
request_uri(), etc.) output 'real' URLs which have not been HTML escaped
in any way. Remember to use check_url() to escape them when outputting
HTML (or XML). Don't use check_url() in situations where a real URL is
expected, e.g. in the HTTP 'Location: ...' header.
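The two steps can be sketched as follows. example_check_url() is a hypothetical, simplified stand-in that only HTML-escapes; it does not attempt the extra XSS protection that the real check_url() is said to add. The 'search' path and query parameter names are made up for the example.

```php
<?php
// Hypothetical, simplified stand-in for check_url(): HTML-escaping only.
function example_check_url($url) {
  return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

// Step 1: urlencode() dynamic data before putting it into a URL.
$keys = 'C# & .NET';
$url = 'search?keys=' . urlencode($keys) . '&page=2';
echo $url, "\n";
// prints: search?keys=C%23+%26+.NET&page=2

// Step 2: escape the 'real' URL when placing it in an HTML attribute.
echo '<a href="' . example_check_url($url) . '">results</a>' . "\n";
// prints: <a href="search?keys=C%23+%26+.NET&amp;page=2">results</a>

// In an HTTP header the real, unescaped URL is used as is.
$location_header = 'Location: ' . $url;
```

Note that only the HTML copy contains &amp;amp; while the header value keeps the plain ampersand.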
In practice
___________
If this all sounds confusing, keep in mind that there are only a limited
number of functions where this matters, and you will easily get to know them.
Usually you control your own output in your module so the output process
is quite transparent. Every piece of plain-text should be converted with
check_plain() once before going into HTML.
When in doubt, you can always put some test text like "<u>foo</u>" in
your text fields, and see how it comes out. For plain-text fields, the
underline tag should not be interpreted, but displayed as is.
When displaying a piece of user-submitted text in a message, you should
pass it through theme('placeholder', $text) to make it stand out. It
will also escape the text for you with check_plain().
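A minimal sketch of that behaviour, assuming the default output is an escaped string wrapped in <em> tags. Both functions below are hypothetical stand-ins so the example runs outside Drupal; in a real module you would call theme('placeholder', $text) instead.

```php
<?php
// Hypothetical stand-in for check_plain().
function example_check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical stand-in for the default theme('placeholder', $text)
// output: the escaped text wrapped in <em> tags.
function example_placeholder($text) {
  return '<em>' . example_check_plain($text) . '</em>';
}

// Made-up, user-submitted block title.
$block_title = 'News <latest>';

$message = 'Are you sure you want to delete '
  . example_placeholder($block_title) . '?';
echo $message, "\n";
// prints: Are you sure you want to delete <em>News &lt;latest&gt;</em>?
```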
Note that you cannot pass HTML entities to functions which accept
plain-text. If you need to use high Unicode characters in a plain-text
string, input them directly in the code with UTF-8 encoding. It's more
compact as well.
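The reason is easy to see with a small sketch. example_check_plain() is again a hypothetical stand-in for check_plain(), and the snippet assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical stand-in for check_plain().
function example_check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// An entity passed as plain text gets its ampersand escaped, so the
// browser shows the entity's source code instead of the character.
echo example_check_plain('Caf&eacute;'), "\n";
// prints: Caf&amp;eacute;

// The same character typed directly as UTF-8 passes through unchanged.
echo example_check_plain('Café'), "\n";
// prints: Café
```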