XML 1.0 does not allow every Unicode character. The spec defines the legal set as:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
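To make the production concrete, here is a minimal sketch of it as a Python 3 predicate. The name is_xml10_char is my own invention for illustration, not something from the spec or a library.

```python
def is_xml10_char(ch):
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)            # TAB, LF, CR
        or 0x20 <= cp <= 0xD7FF          # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD        # after the surrogates, minus FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF     # the supplementary planes
    )


print(is_xml10_char("\t"))      # True: TAB is one of the three allowed controls
print(is_xml10_char("\x1b"))    # False: ESC is a forbidden control character
print(is_xml10_char("\ufffe"))  # False: non-character
```

Checking one character at a time like this is slow for big strings, but it is handy for verifying a regex against the spec.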
This restriction often trips up developers (like, today, me) who have perfectly valid Unicode containing control characters such as VT (\x0B) or ESC (\x1B), and suddenly find they are producing invalid XML. A decent way to deal with this is to strip out the invalid characters. For example, this Stack Overflow post shows how to do it in Perl:
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
Unfortunately the equivalent does not quite work in Python, since \x{10000}-\x{10FFFF} has to be written as \U00010000-\U0010FFFF, which not all versions of Python seem to accept inside a regular expression character class. (The problem cases appear to be "narrow" builds of Python 2, which store those characters as surrogate pairs.)
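For comparison, here is a sketch of the direct translation of the Perl class. It assumes Python 3.3 or later, where strings are indexed by code point and the \U range should be accepted; the function name strip_xml_illegal_chars is made up for this example.

```python
import re

# Direct translation of the Perl character class: match anything that is
# NOT in the XML 1.0 Char production. Assumes Python 3.3+.
_not_xml_char_re = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal_chars(text):
    """Remove every character that XML 1.0 does not allow."""
    return _not_xml_char_re.sub("", text)


print(repr(strip_xml_illegal_chars("foo \x1b bar")))
```

Characters outside the BMP (emoji, for instance) are legal XML and pass through untouched.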
On the versions that reject it, people end up doing messy-looking things in Python. But I figured out that if I invert the character class, so that it lists the illegal characters instead of the legal ones, the biggest character I need to write is \uFFFF, which the Python regex engine does accept. Yay:
import re
# xml 1.0 valid characters:
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# so to invert that, NOT in Char ::=
#     #x0-#x8 | #xB | #xC | #xE-#x1F
#         (most control characters, though TAB, CR, LF are allowed)
#   | #xD800-#xDFFF
#         (unicode surrogate characters)
#   | #xFFFE | #xFFFF
#         (unicode end-of-plane non-characters)
#   | >= #x110000
#         (that would be beyond unicode, so it never needs matching!!!)
_illegal_xml_chars_RE = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
def escape_xml_illegal_chars(val, replacement='?'):
    r"""Filter out characters that are illegal in XML.

    Looks for any character in val that is not allowed in XML
    and replaces it with replacement ('?' by default).

    The u'...' reprs below are what Python 2 prints; Python 3 omits the u.

    >>> escape_xml_illegal_chars("foo \x0c bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars("foo \x0c\x0c bar")
    'foo ?? bar'
    >>> escape_xml_illegal_chars("foo \x1b bar")
    'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFF bar")
    u'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar")
    u'foo ? bar'
    >>> escape_xml_illegal_chars(u"foo bar")
    u'foo bar'
    >>> escape_xml_illegal_chars(u"foo bar", "")
    u'foo bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", "BLAH")
    u'foo BLAH bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", " ")
    u'foo   bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", "\x0c")
    u'foo \x0c bar'
    >>> escape_xml_illegal_chars(u"foo \uFFFE bar", replacement=" ")
    u'foo   bar'
    """
    return _illegal_xml_chars_RE.sub(replacement, val)
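To see why the filtering matters, here is a small Python 3 sketch using the standard library's xml.etree.ElementTree. The serializer happily writes an ESC character, but the parser refuses to read it back. The helper round_trips is a name I made up for the demonstration, and the regex and function are repeated only so the snippet runs on its own.

```python
import re
import xml.etree.ElementTree as ET

# Same inverted character class as above, repeated to keep this self-contained.
_illegal_xml_chars_RE = re.compile(
    "[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def escape_xml_illegal_chars(val, replacement="?"):
    return _illegal_xml_chars_RE.sub(replacement, val)


def round_trips(text):
    """Serialize text inside an element, then try to parse the result back."""
    elem = ET.Element("msg")
    elem.text = text
    data = ET.tostring(elem, encoding="unicode")  # no complaint about ESC here
    try:
        return ET.fromstring(data).text == text
    except ET.ParseError:
        return False


raw = "foo \x1b bar"
print(round_trips(raw))                            # False
print(round_trips(escape_xml_illegal_chars(raw)))  # True
```

So the invalid document only gets noticed by whoever tries to parse it, which is a good reason to clean the text before it goes in.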