Marco.org

I’m a programmer, writer, podcaster, geek, and coffee enthusiast.

Tip: XML doesn’t like control characters (\x00-\x1F)

Most control characters (in the range \x00-\x1F) aren’t allowed in XML 1.0, and most parsers will complain or fail if they’re present. The only exceptions are tab (\x09), line feed (\x0A), and carriage return (\x0D).
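A quick way to see the failure, sketched with PHP’s DOM extension (the variable names and the sample strings are mine, not from any real feed):

```php
<?php
// Collect libxml errors quietly instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// \x07 (bell) is a control character that XML 1.0 forbids.
$bad = (new DOMDocument())->loadXML("<item>a\x07b</item>");

// \x09 (tab) is one of the few characters in that range that XML 1.0 permits.
$good = (new DOMDocument())->loadXML("<item>a\tb</item>");

var_dump($bad, $good);
```

The first call should be rejected by the parser and the second accepted.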

But they are valid in UTF-8. I had been assuming that this was enough to make sure text going into XML is valid (when the text is already supposed to be UTF-8):

$text = iconv('UTF-8', 'UTF-8//IGNORE', $text);
$node->appendChild($dom->createTextNode($text));

That’s not enough. Control characters have to be removed, too:

// Tab (\x09), LF (\x0A), and CR (\x0D) are legal in XML 1.0, so leave them alone.
$text = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]#', ' ', $text);
$text = iconv('UTF-8', 'UTF-8//IGNORE', $text);
$node->appendChild($dom->createTextNode($text));
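Here is the same idea wrapped up as a self-contained sketch. The function name xml_safe_text and the <item> element are hypothetical, chosen only for illustration, and I’m assuming the input is meant to be UTF-8. Note that iconv’s //IGNORE handling of invalid bytes varies by platform, so this version throws rather than guessing if the conversion fails:

```php
<?php
// Hypothetical helper; the name is illustrative, not from any library.
function xml_safe_text(string $text): string
{
    // Replace the control characters XML 1.0 forbids with a space.
    // Tab (\x09), line feed (\x0A), and carriage return (\x0D) are legal and kept.
    $text = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $text);

    // Drop byte sequences that are not valid UTF-8.
    // Some platforms return false here instead of ignoring bad bytes.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $text);
    if ($clean === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    return $clean;
}

$dom  = new DOMDocument('1.0', 'UTF-8');
$node = $dom->appendChild($dom->createElement('item'));
$node->appendChild($dom->createTextNode(xml_safe_text("bell:\x07 tab:\t done")));

$xml = $dom->saveXML();

// Re-parse the output to confirm that a parser accepts it.
libxml_use_internal_errors(true);
$reparsed = (new DOMDocument())->loadXML($xml);
var_dump($reparsed);
```

Replacing with a space rather than deleting keeps words from running together when a stray control character was acting as a separator.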

I learn some obscure new knowledge every day…