About XML schemas and the myriad of Unicode characters

Written by: Eliot Kimber
Last Updated: 2006-08-31

=Question by Ed Benton=

One of the things holding us back from switching from an XML DTD to a schema is the question of how to keep the myriad of Unicode characters from being offered to our authors, and subsequently showing up in some outputs and causing errors, while still making the non-Unicode special characters available. We can do this fairly easily with character entities and DTDs, but using schema and Unicode is problematic.

=Eliot Kimber answered=

The quickest fix here is not to use a Unicode encoding as the storage encoding for your documents; use ISO-8859 (ASCII) instead. This ensures that at least the unparsed data won't contain any characters outside that encoding and therefore won't break any tools that can't handle Unicode. The trade-off is that you may see numeric character references (such as &#8212;) in strings taken from the XML, but they won't break non-Unicode-aware tools.
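As a rough illustration of this storage approach, the Python sketch below writes a document as pure ASCII bytes, turning every non-ASCII character into a numeric character reference, and then checks that a parser restores the original characters. The sample document and the helper name to_ascii_xml are invented for illustration. The sketch assumes that non-ASCII characters occur only in text and attribute values, since a character reference is not recognized inside element names, comments, or CDATA sections.

```python
import xml.etree.ElementTree as ET


def to_ascii_xml(xml_text: str) -> bytes:
    """Encode XML text as ASCII, escaping other characters as &#NNNN;.

    Hypothetical helper: only safe when all markup names are ASCII.
    """
    return xml_text.encode("ascii", errors="xmlcharrefreplace")


# Sample document containing e-acute (U+00E9) and an em dash (U+2014).
source = "<p>caf\u00e9 \u2014 done</p>"

stored = to_ascii_xml(source)
print(stored)  # every byte is in the 7-bit ASCII range

# A parser resolves the references, so the parsed text is Unicode again.
restored = ET.fromstring(stored).text
print(restored == "caf\u00e9 \u2014 done")
```

Tools that read the stored bytes without parsing see only ASCII, while anything that parses the document gets the full Unicode text back.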

That is, while all XML documents are, semantically, made up of Unicode characters, they don't have to be stored in a Unicode encoding -- they can be stored in any encoding as long as the parser you're using can read it. After parsing, all XML data is in Unicode. If you are feeding a non-Unicode-capable system with XML data through a parsing-based process, then you have to do a Unicode-to-non-Unicode transform. That transform can be problematic because some Unicode characters may have no match in your target encoding, so you either have to map them to some fallback or escape them in some way (the details will depend on the application the data is going to). This is certainly the case when you map to ASCII and have used non-ASCII characters in your XML.
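The two halves of that argument can be sketched in Python. The example below parses a document stored in ISO-8859-1 (the parser hands back ordinary Unicode text) and then re-encodes that text for an ASCII-only consumer, substituting a placeholder when a character has no match. The sample document and the function name encode_with_fallback are assumptions made for illustration; a real pipeline would choose a fallback suited to the receiving application.

```python
import xml.etree.ElementTree as ET

# A document stored in a non-Unicode encoding, declared in the XML declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")

# After parsing, the text is a Unicode string regardless of storage encoding.
text = ET.fromstring(latin1_doc).text


def encode_with_fallback(value: str, encoding: str = "ascii") -> bytes:
    """Encode for a non-Unicode target; unmappable characters become '?'."""
    try:
        return value.encode(encoding)  # strict: fails on any unmappable character
    except UnicodeEncodeError:
        return value.encode(encoding, errors="replace")


print(text == "caf\u00e9")
print(encode_with_fallback(text))
```

The "replace" handler is the bluntest possible fallback; "backslashreplace" or "xmlcharrefreplace" are alternatives when the target can make sense of an escaped form.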

If you are getting the data from the XML without using a parser (for example, just doing regex matching on well-formed XML documents) then you can extract the data in its original encoding.
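A minimal sketch of that parser-free route, in Python: a regular expression is run over the raw bytes, so the extracted value stays in the document's original encoding. The sample document and the title element are invented for illustration, and the pattern assumes simple, non-nested markup without attributes on the element.

```python
import re

# Raw bytes as they might sit on disk, stored in ISO-8859-1.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<doc><title>Caf\u00e9 menu</title></doc>"
).encode("iso-8859-1")

# A bytes pattern matches against bytes, so no decoding takes place.
TITLE = re.compile(rb"<title>(.*?)</title>", re.DOTALL)

match = TITLE.search(raw)
title_bytes = match.group(1) if match else b""
print(title_bytes)  # still ISO-8859-1 bytes, not Unicode text

# Decoding is a separate, explicit step if Unicode is ever needed.
print(title_bytes.decode("iso-8859-1") == "Caf\u00e9 menu")
```

Any numeric character references or entity references in the matched text are left unresolved by this approach, because nothing is interpreting the XML.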

Note that XSLT 2.0 provides an explicit character map mechanism (xsl:character-map) that lets you handle character set to character set mappings on output to at least provide fallbacks (for example, mapping Unicode U+2014 (em dash) to "--" when going to ASCII, which has no standard em-dash character). For example, if you set the output encoding for a text output to "ISO-8859" (ASCII) you also need to set up a character map to handle any unmappable characters that will occur in your data.
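The idea behind a character map can be approximated outside XSLT. The Python sketch below is an analogue of the technique, not the XSLT mechanism itself: a lookup table supplies a readable fallback for selected characters before the text is encoded as ASCII. The table contents and the names CHARACTER_MAP and to_ascii are assumptions chosen for illustration.

```python
# Fallbacks for characters that ASCII cannot represent (illustrative choices).
CHARACTER_MAP = str.maketrans({
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def to_ascii(text: str) -> bytes:
    """Apply the character map, then encode; anything still unmapped becomes '?'."""
    return text.translate(CHARACTER_MAP).encode("ascii", errors="replace")


print(to_ascii("wait \u2014 \u201cno\u201d\u2026"))
print(to_ascii("caf\u00e9"))  # no entry in the map, so the generic fallback applies
```

As with an XSLT character map, the table only needs entries for characters that actually occur in the data; the final "replace" step is a safety net for anything that was missed.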

For rendering there are Unicode fonts that provide at least some form of glyph for all the printing Unicode characters, so apart from the pain of font configuration, there should be no problem rendering Unicode characters with Unicode-aware rendering software.

If you are using a rendition system that does not support Unicode then you have a much bigger problem and should probably not be using XML for the very reasons stated--you simply cannot easily limit the set of characters used.

[Editor's comment by Karl Johan Kleist: Eliot must have been writing this before having had the day's first cup of coffee. ISO-8859 (a family of eight-bit encodings) is of course not ASCII (seven bits).]