In PHP 5.2, filter_var() sanitizes text; in WordPress, esc_html() does the same job. The former copes with a high-bit character in the string, e.g. à, but the latter doesn't: esc_html() returns an empty string whenever the input contains such a character. Here's an example, written as a simple WP plugin:
<?php
/*
Plugin Name: bugz tester
*/
class bugz_tester {
    function __construct() {
        if ( ! is_admin() )
            return;
        add_action('admin_menu', array(&$this,'admin_page'));
    }
    function admin_page() {
        add_options_page('Bugz tester', 'bugz', 'edit_posts', 'bugz_sheet', array(&$this,'test_page'));
    }
    function test_page() {
        ?>
        <div class="wrap">
        <?php
        $ts="blah à blah";
        echo "original: " . $ts . "<br/>" ;
        echo "PHP sanitized: " . $this->sanitize_txt( $ts ) . "<br/>" ;
        echo "WP sanitized: " . esc_html( $ts ) . "<br/>";
        die();
        ?>
        </div>
        <?php
    }
    function sanitize_txt ( $text ) {
        $san_text = filter_var($text, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH | FILTER_FLAG_STRIP_LOW ) ;
        return $san_text;
    }
}
new bugz_tester();
?>
Here’s the output:
original: blah � blah
PHP sanitized: blah à blah
WP sanitized:
I'm not wedded to esc_html(). But if I use filter_var() instead, the string vanishes when I add it to a WP custom field. Somehow WP's sanitization is killing the string.
I’m mystified. Would be grateful for a clue.
2 Answers
Perhaps because the character isn't valid UTF-8?
Here's what esc_html() does:
function esc_html( $text ) {
    $safe_text = wp_check_invalid_utf8( $text );
    $safe_text = _wp_specialchars( $safe_text, ENT_QUOTES );
    return apply_filters( 'esc_html', $safe_text, $text );
}
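To test the invalid-UTF-8 theory outside WordPress, here is a minimal sketch. It assumes the plugin file was saved as ISO-8859-1, so that à is the single byte 0xE0; the replacement character in the question's "original" output hints at this but does not prove it. The helper bugz_is_valid_utf8() is a hypothetical stand-in for the kind of check wp_check_invalid_utf8() performs, not WordPress's actual code, and the iconv() call is just one possible way to convert the string.

```php
<?php
// Hypothetical stand-in for a UTF-8 validity check; not WordPress's own code.
function bugz_is_valid_utf8( $text ) {
    // With the /u modifier, preg_match() returns false when the subject
    // contains an invalid UTF-8 byte sequence.
    return 1 === @preg_match( '/^./us', $text );
}

$latin1 = "blah \xE0 blah";     // "à" as one ISO-8859-1 byte (assumed file encoding)
$utf8   = "blah \xC3\xA0 blah"; // "à" as the two-byte UTF-8 sequence

var_dump( bugz_is_valid_utf8( $latin1 ) ); // bool(false)
var_dump( bugz_is_valid_utf8( $utf8 ) );   // bool(true)

// One possible fix: convert to UTF-8 before escaping
// (or simply save the plugin file as UTF-8).
$converted = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );
var_dump( bugz_is_valid_utf8( $converted ) ); // bool(true)
```

If the first check fails for your string, the empty result from esc_html() would be explained before _wp_specialchars() is ever involved.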
If not that, then it's getting sanitized when filtered by _wp_specialchars(), which handles double-encoding (off by default) and all sorts of other things.
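To illustrate what the double-encoding switch means, here is a small sketch using PHP's built-in htmlspecialchars() as a stand-in. It is not the body of _wp_specialchars(), and the sample strings are made up for illustration. It also shows that htmlspecialchars() itself returns an empty string for invalid UTF-8 when neither ENT_IGNORE nor ENT_SUBSTITUTE is passed.

```php
<?php
$text = "Tom &amp; Jerry <b>";

// double_encode = false: the existing &amp; entity is left alone.
$single = htmlspecialchars( $text, ENT_QUOTES, 'UTF-8', false );
echo $single, "\n"; // Tom &amp; Jerry &lt;b&gt;

// double_encode = true: the ampersand of &amp; is escaped again.
$double = htmlspecialchars( $text, ENT_QUOTES, 'UTF-8', true );
echo $double, "\n"; // Tom &amp;amp; Jerry &lt;b&gt;

// Invalid UTF-8 input (a lone ISO-8859-1 byte) yields an empty string.
$broken = htmlspecialchars( "blah \xE0 blah", ENT_QUOTES, 'UTF-8' );
var_dump( $broken ); // string(0) ""
```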
For reference:
1) esc_html() in source
2) _wp_specialchars() in source