[drupal-devel] Plain-text checking / text output in Drupal

Steven Wittens steven at acko.net
Thu Mar 31 09:28:18 UTC 2005


I just committed the large check_plain() patch after a green light from 
Dries:
http://drupal.org/node/18817

The main idea is to make sure plain text is handled as plain text... so 
you can use <, > or & in a taxonomy term name or comment subject, and not 
mess up validation or your site. In fact this should already have 
worked, but there was general confusion about the right way to do it, so 
many people didn't pay much attention to it. After my patch, every 
single-line field should be plain-text in Drupal, including in 
particular node/comment titles, which used to be stored as stripped HTML 
instead (which was really bad for usability, see the issue for more 
info). A few notable changes:

- drupal_specialchars() and check_form() have been merged into check_plain()
- node and comment titles are now plain-text and need to be 
check_plain()'ed before output. Note that due to changes in l(), many 
cases will be caught already.
- theme('placeholder', $text) was added for putting dynamic pieces of 
text into a sentence ("are you sure you want to delete %block"). It 
outputs '<em>'. check_plain($text) .'</em>' by default and should be 
used when appropriate.

I've written up a short text on text output in Drupal after my patch... 
once it's cleaned up a bit, it should go into the documentation 
somewhere. I'll also add a short blurb to the module upgrading guide.

Steven

-------------
Text output in Drupal

When handling and outputting text to HTML, you need to be careful that 
proper filtering or escaping is done. Otherwise there might be bugs when 
users try to use angle brackets or ampersands, or worse you could open 
up XSS exploits.

When handling data, the golden rule is to keep exactly what the user 
typed. On the database and input side of things there isn't much to 
worry about. Text remains text. Note that you should never use a plain 
strip_tags() call to clean up user input. It strips out all tags, yet 
users still have to type entities to get a literal angle bracket or 
ampersand. You get all the disadvantages of HTML without any benefits.
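
To illustrate the strip_tags() point, here is a small sketch in plain 
PHP. It is not Drupal code, and the sample subject line is made up.

```php
<?php
// Illustration only: a made-up comment subject containing markup-like text.
$subject = 'Q&A: is <b>bold</b> allowed?';

// strip_tags() removes the tags, so we no longer have what the user typed.
$stripped = strip_tags($subject);

// The ampersand is still raw, so the result still has to be escaped
// before it can go into HTML.
$html = htmlspecialchars($stripped, ENT_QUOTES, 'UTF-8');
```

So the call loses part of the input and still doesn't save you the 
escaping step on output.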

Conversion is done on output where the text has to be placed in another 
format/context, e.g. HTML. There are two types of text in Drupal:

1. Plain text
_____________

This is simple text without any markup. What the user entered is 
displayed exactly on screen as is, and is not interpreted in any form. 
This is almost always the format used for single-line text fields. It is 
good to keep this consistency in your own code.

When outputting plain-text, you need to pass it through check_plain() 
before it can be put inside HTML. This will convert quotes, ampersands 
and angle brackets into entities.
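
As a rough sketch of what that conversion does, the snippet below uses 
a stand-in function built on PHP's htmlspecialchars(). The name 
sketch_check_plain() is hypothetical and is only there so the example 
runs outside Drupal; it is not claimed to be the actual check_plain() 
code.

```php
<?php
// Hypothetical stand-in for check_plain(), for illustration only.
function sketch_check_plain($text) {
  // Convert &, <, > and both kinds of quotes into entities.
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// A plain-text title exactly as a user might have typed it (made up).
$title = 'Tom & Jerry <3 "cheese"';

// Escape once, at the point where the text is placed inside HTML.
$html = '<h2>' . sketch_check_plain($title) . '</h2>';
```

The stored title stays exactly as typed; only the copy that goes into 
the page is escaped.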

Most themable functions and APIs take HTML arguments, but there are a 
few which already have check_plain() built in for convenience:
* l(): the link caption should be passed as plain-text (unless 
overridden with the $html parameter).
* menus: the menu item titles are plain-text.
* theme('placeholder'): the placeholder text is plain-text.

Some places require HTML which might not be obvious:
* page titles set through drupal_set_title(). The page title is 
displayed in the HTML, where it makes sense to use tags like <em> for 
clarity. When the page title is displayed in the HTML <title> tag 
however, all tags will be stripped out.
* block titles passed in through hook_block(). For the same reason as 
the page title, using HTML here is commonly done.

Note that functions which logically take 'data' and not 'output' will 
almost always take plain-text and require no escaping on your side. A 
good example is the value passed to form_ functions, e.g. a plain-text 
field's contents. What the user entered is exactly what you should pass 
to form_textfield. On the other hand, this does not apply to the form 
item's title or description, which are passed as HTML. This is done so 
that modules can format the item title as they want.

2. Rich text
____________

This is text which is marked up in some language (HTML, Textile, etc). 
It is stored in the markup-specific format, and converted to HTML with 
the various filters that are enabled. This is almost always used for 
multi-line text fields. All you need to do is pass the rich text to 
check_output() and you'll get HTML returned, safe for outputting. You 
should also allow the user to choose the input format with a format 
widget through filter_form() and should pass the chosen format along to 
check_output().

URLs
____

A note about URLs. URLs require special handling in two ways:

- Putting dynamic data into URLs. If you wish to put any sort of dynamic 
data into a URL, you need to urlencode() it. If you don't, characters 
like # will disrupt the normal URL semantics. urlencode() will escape 
them with %XX syntax.

- Putting URLs into HTML. URLs are a common attack vector for XSS 
exploits. Though we have an XSS filter at the beginning of the page 
request, it is still smart to be careful. When putting a URL inside an 
HTML attribute (e.g. <a href="...">), you should pass it through 
check_url(). check_url() is similar to check_plain(), but it contains 
some extra XSS protection.

Note that all Drupal functions which return URLs (url(...), 
request_uri(), etc.) output 'real' URLs which have not been HTML escaped 
in any way. Remember to use check_url() to escape them when outputting 
HTML (or XML). Don't use check_url() in situations where a real URL is 
expected, e.g. in the HTTP 'Location: ...' header.
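
The two steps can be sketched as follows. sketch_check_url() is a 
hypothetical stand-in and its scheme whitelist is only an assumption 
about what "extra XSS protection" could look like; the real check_url() 
may work differently. The example.com URL and the search term are made 
up.

```php
<?php
// Hypothetical stand-in for check_url(), for illustration only.
function sketch_check_url($url) {
  // Assumed form of the extra protection: refuse any URL whose scheme
  // is not on a short whitelist (e.g. javascript:).
  if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)
      && !in_array(strtolower($m[1]), array('http', 'https', 'ftp', 'mailto'))) {
    return '';
  }
  // Then escape for HTML, like the plain-text case.
  return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

// Step 1: urlencode() dynamic data going into the URL.
$term = 'C# & .NET';
$url = 'http://example.com/search?keys=' . urlencode($term) . '&page=2';

// Step 2: escape the finished URL when it goes into an HTML attribute.
$link = '<a href="' . sketch_check_url($url) . '">results</a>';

// For a 'Location: ...' header you would use the unescaped $url instead.
```

Note the & between the query arguments becomes &amp; only in the HTML 
copy, while the & inside the search term was already turned into %26 by 
urlencode().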

In practice
___________

If this all sounds confusing, keep in mind that there are only a limited 
number of functions where this is important, and you will easily get to 
know them. Usually you control your own output in your module so the 
output process is quite transparent. Every piece of plain-text should be 
converted with check_plain() once before going into HTML.

When in doubt, you can always put some test text like "<u>foo</u>" in 
your text fields, and see how it comes out. For plain-text fields, the 
underline tag should not be interpreted, but displayed as is.

When displaying a piece of user-submitted text in a message, you should 
pass it through theme('placeholder', $text) to make it stand out. It 
will also escape the text for you with check_plain().
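
A minimal sketch of the idea, runnable outside Drupal: 
sketch_placeholder() is a hypothetical stand-in that mirrors the 
default output described in the mail above (the text escaped and 
wrapped in <em>), and strtr() stands in for whatever does the %block 
substitution in your code. The block name is made up.

```php
<?php
// Hypothetical stand-in for theme('placeholder', $text).
function sketch_placeholder($text) {
  // Escape the plain text, then wrap it so it stands out in the sentence.
  return '<em>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</em>';
}

// User-submitted plain text containing characters that need escaping.
$block = 'News & <u>updates</u>';

// Substitute the placeholder into an HTML message.
$message = strtr(
  'Are you sure you want to delete %block?',
  array('%block' => sketch_placeholder($block))
);
```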

Note that you cannot pass HTML entities to functions which accept 
plain-text. If you need to use high Unicode characters in a plain-text 
string, input them directly in the code with UTF-8 encoding. It's more 
compact as well.
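
The reason is easy to see with a stand-in for the escaping step. 
sketch_plain_to_html() is a hypothetical name, used here in place of 
check_plain() so the snippet runs on its own.

```php
<?php
// Hypothetical stand-in for the escaping done on plain-text arguments.
function sketch_plain_to_html($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// An entity in a plain-text string gets escaped again, so the user
// would see the literal text "&eacute;" on screen.
$wrong = sketch_plain_to_html('Caf&eacute;');

// The same character as raw UTF-8 bytes (written as escapes here so
// the example does not depend on the file's encoding) passes through.
$right = sketch_plain_to_html("Caf\xC3\xA9");
```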



