XML 1.0 does not allow every Unicode character. The spec defines the permitted set as:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
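To make the production concrete, here is a minimal sketch of it as a Python predicate. The function name `is_xml10_char` is my own invention for illustration, and the sketch assumes Python 3, where `ord()` returns the full code point:

```python
def is_xml10_char(ch):
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)            # TAB, LF, CR
        or 0x20 <= cp <= 0xD7FF          # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD        # rest of the BMP, minus U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF     # the supplementary planes
    )


print(is_xml10_char("\t"))    # True
print(is_xml10_char("\x1b"))  # False
```

Checking one character at a time is slow for large strings, which is why the rest of this post uses a regular expression instead.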
This restriction often trips up developers (like, today, me) who end up with perfectly valid Unicode containing characters such as VT (\x0B) or ESC (\x1B), and suddenly they are producing invalid XML. A decent way to deal with this is to strip out the invalid characters. For example, this Stack Overflow post shows how to do it with Perl:
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
Unfortunately the equivalent does not quite work in Python, since \x{10000}-\x{10FFFF} needs to be expressed as \U00010000-\U0010FFFF, which not all versions of Python seem to accept as part of a regular expression character class (narrow builds of Python 2, for instance, store those code points as surrogate pairs).
So people end up doing messy-looking things in Python. But I figured out that if I invert the character class, the biggest character I need to write is \uFFFF, which the Python regex engine does accept. Yay:
import re

# xml 1.0 valid characters:
#    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# so to invert that, not in Char ::
#       x0 - x8 | xB | xC | xE - x1F
#       (most control characters, though TAB, CR, LF allowed)
#       | #xD800 - #xDFFF
#       (unicode surrogate characters)
#       | #xFFFE | #xFFFF
#       (unicode end-of-plane non-characters)
#       | >= 110000
#       (that would be beyond unicode!!!)
_illegal_xml_chars_RE = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')

def escape_xml_illegal_chars(val, replacement='?'):
    r"""Filter out characters that are illegal in XML.

    Looks for any character in val that is not allowed in XML
    and replaces it with replacement ('?' by default).

    >>> escape_xml_illegal_chars("foo \x0c bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars("foo \x0c\x0c bar")
    'foo ?? bar'
    >>> escape_xml_illegal_chars("foo \x1b bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFF bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo bar")
    'foo bar'
    >>> escape_xml_illegal_chars(u"foo bar", "")
    'foo bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", "BLAH")
    'foo BLAH bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", " ")
    'foo   bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", "\x0c")
    'foo \x0c bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", replacement=" ")
    'foo   bar'
    """
    return _illegal_xml_chars_RE.sub(replacement, val)
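To see why the filter matters in practice, here is a small sketch using the standard library's `xml.etree.ElementTree`. It assumes Python 3 and repeats the regex and function from above so it runs on its own; the sample string and the variable names (`raw`, `bad_xml`, `good_xml`, `parsed_ok`, `round_tripped`) are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Same inverted character class as above, repeated so this sketch is standalone.
_illegal_xml_chars_RE = re.compile(
    "[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def escape_xml_illegal_chars(val, replacement="?"):
    return _illegal_xml_chars_RE.sub(replacement, val)


# Hypothetical input: log text containing ESC characters (ANSI colour codes).
raw = "status: \x1b[31mfailed\x1b[0m"

# ElementTree serializes the ESC characters without complaint...
elem = ET.Element("log")
elem.text = raw
bad_xml = ET.tostring(elem, encoding="unicode")

# ...but the output is not well-formed, so parsing it back fails.
try:
    ET.fromstring(bad_xml)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

# After filtering, the document round-trips through the parser.
elem.text = escape_xml_illegal_chars(raw)
good_xml = ET.tostring(elem, encoding="unicode")
round_tripped = ET.fromstring(good_xml).text
```

The point of the sketch is that the serializer will not necessarily warn you: the breakage only shows up when something downstream tries to parse the result, so filtering before serializing is the safer order.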