Do not use entities: Part I

Answered by: Eliot Kimber
Last Updated: 2006-09-14

=Question by N.N.=

Adepters,

Given the cross-section of SGML/XML users represented on this list, I'm hoping many of you will be willing to share your experience/opinion on this subject.

Specifically, there is an ongoing debate here as to which is "better" and/or which would be preferred by more customers/end-users: character references, e.g. "&#x1234;", or the potentially more readable entity references, e.g. "&mdash;".

Any feedback as to which seems to be used most, which seems to be preferred most, which drives more code maintenance, etc. ad nauseam will be most appreciated.

=Eliot Kimber answered=

Entities (other than the five predefined ones: &amp;, &lt;, &gt;, &apos;, and &quot;) require a DTD. Therefore they should be used *only* if you intend always to use a DTD. You should not plan to always use DTDs (DTDs are a dead end and should be abandoned as quickly as practical).
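To make the DTD dependency concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The choice of parser, the sample documents, and the helper name `parses` are assumptions made for illustration; they are not from the original thread.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Hypothetical helper: True if the document parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# No DTD: the named entity is undeclared, so parsing fails.
with_entity = "<p>wait&mdash;what</p>"

# No DTD: a numeric character reference needs no declaration.
with_char_ref = "<p>wait&#x2014;what</p>"

# The same named entity works once a DTD (here an internal subset) declares it.
with_dtd = '<!DOCTYPE p [<!ENTITY mdash "&#x2014;">]><p>wait&mdash;what</p>'

for label, doc in [("entity, no DTD", with_entity),
                   ("char ref, no DTD", with_char_ref),
                   ("entity, with DTD", with_dtd)]:
    print(label, "->", "ok" if parses(doc) else "parse error")
```

Other parsers may report the undeclared entity differently, but a conforming XML parser should not accept "&mdash;" without a declaration.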

Note that, given Unicode-aware editors and appropriate fonts (which are always available under Windows), you don't need numeric character references or entities at all--you can just put in the character, either because your keyboard supports it or by copying it from some source (the Windows CharMap utility, Unipad, etc.).

The Arbortext editor will certainly display any Unicode characters for which a font is available.

Therefore, at least in theory, there should be no reason to choose--just put in the character you want.

But I would say that as a matter of preferred practice, you should not use entities at all.

The use of numeric character references or literal characters should be transparent to authors (at least in a fully XML-aware editor like Arbortext). It is purely a function of how the XML data file is encoded, not of what it means, and the processing result should be the same: once parsed, a numeric character reference becomes a character.
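A small check of that equivalence, sketched in Python with the standard library's xml.etree.ElementTree (the sample documents and variable names are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# Three spellings of the same content: a literal character, a hexadecimal
# character reference, and a decimal character reference.
literal = "<p>a\u2014b</p>"
numeric_hex = "<p>a&#x2014;b</p>"
numeric_dec = "<p>a&#8212;b</p>"

# After parsing, each document yields the same text content.
texts = [ET.fromstring(doc).text for doc in (literal, numeric_hex, numeric_dec)]
print(len(set(texts)) == 1)
```

Downstream processing that works on the parsed tree has no way of telling which spelling the file used.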

If the question is one of editor user interface, for example if you want something like Arbortext's special character selector, that's different--the output of that widget should be just Unicode characters (either directly as characters or as numeric character references). But that should be transparent to authors. They just want a list of symbols to select from.

Note that if you are required by other tools in your environment to use non-Unicode encodings, such as ISO 8859 (ASCII), then you must use numeric character references for non-ASCII characters. This is often true for content management systems, older code control tools like CVS, and some legacy editors that are not fully Unicode-aware. But again, this is just a matter of the encoding of the data file and should not affect authoring as long as you're not using a text editor.
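As a sketch of how a serializer can handle this without involving the author, the following Python example writes the same element once as ASCII and once as UTF-8, using the standard library's xml.etree.ElementTree. The element name and sample text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Build a tiny document whose text contains non-ASCII characters.
root = ET.Element("p")
root.text = "caf\u00e9 \u2014"

# Serialized as ASCII, the non-ASCII characters are written as numeric
# character references; serialized as UTF-8, they are written directly.
ascii_bytes = ET.tostring(root, encoding="us-ascii")
utf8_bytes = ET.tostring(root, encoding="utf-8")

print(ascii_bytes)
print(utf8_bytes)

# Both files parse back to the same text, so the choice of encoding
# does not change the content.
same = ET.fromstring(ascii_bytes).text == ET.fromstring(utf8_bytes).text
print(same)
```

The author edits characters in both cases; only the bytes on disk differ.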

(Editor's comment by Karl Johan Kleist: Eliot hadn't had his first cup of coffee... ISO 8859 is eight bits, ASCII seven.)