Why you should NOT use strip_tags()
In the world of Web 2.0, it's becoming more and more common to allow users to submit rich text in comments, profile text etc. with the possibility of pimping it with HTML formatting.
This is of course a nice thing, because it makes it possible for the users to express themselves better, but you should be careful with how you implement such a feature in your applications. I have seen a lot of code (including some of my own old code) where the cleaning procedure was done something like this:
$text = trim(strip_tags($_POST['text'], "<b><strong><i><em><a>")); // Only allow bold and italic text and links...
This method may look adequate, as it strips out every tag that is not listed in the second parameter of strip_tags(), but it still leaves your site open to some serious havoc from users.
The problem
The problem with strip_tags() is that when allowing a tag, you're not only allowing the tag, but also any attribute on that tag. This means that the user can put arbitrary CSS styling on elements by using the style attribute, which could screw up your layout. Another way for the user to screw up your site is by writing an opening tag and simply omitting the closing tag. This would break the DOM structure and cause your site to render incorrectly and fail validation.
A more serious problem with using strip_tags() is that it doesn't do any kind of validation on the attribute values. If you allowed the img tag, for example, there would be nothing to stop the user from doing nasty things like this:
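To make these first two problems concrete, here is a minimal sketch. The sample strings are made up for illustration; the allowed-tag list is the one from the snippet above. It shows that strip_tags() hands an allowed tag back untouched, style attribute included, and does nothing about a missing closing tag:

```php
<?php
// Same allowed-tag list as in the cleaning snippet above.
$allowed = '<b><strong><i><em><a>';

// Hypothetical user input: an allowed tag carrying a style attribute...
$styled = '<b style="font-size:200px;position:absolute">Hello</b>';
// ...and an allowed tag that is never closed.
$unclosed = '<i>everything after this is italic';

$styledResult = strip_tags($styled, $allowed);
$unclosedResult = strip_tags($unclosed, $allowed);

// Both comparisons are true: strip_tags() returned the input unchanged.
var_dump($styledResult === $styled);
var_dump($unclosedResult === $unclosed);
```

Neither string contains a disallowed tag, so strip_tags() has nothing to remove and the styling and the dangling tag go straight into your page.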
<img src="javascript:alert('Cross Site Scripting (XSS) vulnerability');" />
Yes, the JavaScript inside the src attribute will actually execute in some browsers, particularly older ones. But this is not the only way for a malicious user to trigger JavaScript execution. They can also use any of the event handler attributes, like onmouseover, onmouseout and onclick, to execute arbitrary code, which could be a serious security issue and not just a silly alert box.
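The same thing can be checked against the allowed-tag list from the beginning of the post. The following is only a sketch with a made-up input string: it puts an event handler and a javascript: URL on an a tag, which is one of the allowed tags, and checks whether either survives strip_tags():

```php
<?php
// Same allowed-tag list as in the cleaning snippet at the top of the post.
$allowed = '<b><strong><i><em><a>';

// Hypothetical malicious input using only an allowed tag.
$input = '<a href="javascript:alert(1)" onmouseover="alert(2)">click me</a>';

$result = strip_tags($input, $allowed);

// Check whether the dangerous attributes are still in the "cleaned" text.
$hasHandler = strpos($result, 'onmouseover=') !== false;
$hasJsUrl = strpos($result, 'javascript:') !== false;

var_dump($hasHandler, $hasJsUrl);
```

Both checks come back true, so the "cleaned" text still carries script that runs when a visitor hovers over or clicks the link.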
A better alternative
HTML Purifier is a PHP5 library which can be used to purify and clean up HTML and ensure that it is standards-compliant. Have a look at their own description:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications. Tired of using BBCode due to the current landscape of deficient or insecure HTML filters? Have a WYSIWYG editor but never been able to use it? Looking for high-quality, standards-compliant, open-source components for that application you're building? HTML Purifier is for you!
HTML Purifier may look like a bit of a mouthful, but once you get the hang of the basic configuration settings it's a piece of cake to use. Another neat thing about HTML Purifier is that it follows the same file and class naming convention as the Zend Framework, and therefore it's just a matter of dumping it into your library folder and you're ready to go!
Let's see how we would clean user input with HTML Purifier:
require_once "HTMLPurifier/Bootstrap.php";

// Allowed tags; allowed attributes go in square brackets after the tag name.
$allowedTags = "p,em,i,strong,b,a[href],ul,ol,li,code,pre,blockquote";

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Strict');
$config->set('HTML.TidyLevel', 'heavy');
$config->set('HTML.Allowed', $allowedTags);
$config->set('AutoFormat.Linkify', true); // pass real booleans rather than the string 'true'
$config->set('AutoFormat.AutoParagraph', true);

$htmlPurifier = new HTMLPurifier($config);

$text = $_POST['text'];
$cleanText = $htmlPurifier->purify($text);
As you can see, cleaning user input with HTML Purifier requires a bit more than just making a simple function call, but it does a much better job at cleaning the input and making it safe for later display on your site. Let's have a look at the different configuration options I used:
HTML.Doctype: The HTML standard that HTML Purifier will make input compliant with.
HTML.TidyLevel: Defines how aggressive HTML Purifier will be when cleaning input.
HTML.Allowed: Defines the tags and attributes that are allowed.
AutoFormat.Linkify: This makes HTML Purifier automatically turn URLs into clickable links. Nice, eh?
AutoFormat.AutoParagraph: This makes HTML Purifier automatically wrap all text blocks in paragraph tags. Also a nice feature!
HTML Purifier is a very powerful and thoroughly tested library, and what I showed here in this post is just a small part of what it can do - so be sure to check out the documentation for all the rest.
Final words
I want to make it clear that I'm not saying that you should ditch the strip_tags() function completely. It's still an excellent function to use in situations where you want to ensure that no HTML is present in user input. Calling one function is a much cheaper operation than making a new instance of HTML Purifier and running the input through that.
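As a rough sketch of that plain-text use case: the helper name to_plain_text() and the sample input are hypothetical, not from any library. Escaping the result with htmlspecialchars() at output time is my own addition, since strip_tags() removes tags but does not escape characters such as quotes or ampersands:

```php
<?php
// Hypothetical helper: strip every tag (no allowed-tags argument) and trim.
function to_plain_text($input)
{
    return trim(strip_tags($input));
}

// Made-up sample input, standing in for something like $_POST['name'].
$raw = '  <b>Tom</b> & "Jerry" <script>alert(1)</script> ';

$plain = to_plain_text($raw);

// Escape at output time so leftover special characters are rendered as text.
$safe = htmlspecialchars($plain, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

Note that strip_tags() removes the script tags but keeps the text between them, which is harmless once it is displayed as escaped plain text.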
What do you do to clean up user input?