Character Encodings and Languages in XML

This overview is intended primarily for those who might be interested in migrating to XHTML, and for anyone who wants some light shed on murky questions such as what an XML document is, what characters are in XML, and what Unicode is.

What Is Unicode?

Unicode is the standard for representing text in information systems. It has also been adopted by ISO, as ISO/IEC 10646. It is the standard used by Java, XML, and virtually all emerging information technologies that deal with text. For the complete scoop, visit www.unicode.org. Note that Unicode does not deal with the appearance of a character, or its presentation (visual or otherwise). Visual agents generally perform this function by mapping characters to font specifications, and following implicit or explicit directions on text layout, including text direction.

In XML, and therefore in XHTML as well, everything begins with the root document entity, a text document composed of characters from the Unicode Universal Character Set. The mapping between bit patterns and Unicode characters is known as the character encoding.

The root document may also refer to other documents, generally with a URL. These may be parsed entities, such as another XML file or an external DTD, or unparsed entities, such as an image file or an external script. Each parsed entity can declare its own encoding.

ASCII, UTF-8, and UTF-16

By far the most popular character encoding is 7-bit ASCII. The core vocabulary of XML, essential punctuation, and the basic white space characters and line delimiters are all members of this 7-bit ASCII character set, which I will refer to as ASCII. ASCII is also a subset of UTF-8, and ISO 8859-1, also known as Latin-1. ASCII and Latin-1 also map directly to the first code pages of UTF-16 and the Unicode Universal Character Set.

The names used to refer to languages and encodings use only members of the ASCII character set. If no encoding is specified, either in the document or through external means, the default for XML is UTF-8, of which ASCII is a subset, or UTF-16 when a Byte Order Mark is detected. If a document begins with the Byte Order Mark, hexadecimal FEFF (stored as the bytes FE FF in big-endian order, or FF FE in little-endian order), it will be parsed according to the UTF-16 encoding, in 16-bit codes, with Latin-1 in the first 256 code points (decimal 0-255), and a wealth of other characters and symbols beyond them. In UTF-16, some less frequently used characters are represented by a pair of 16-bit codes, known as a surrogate pair. Those characters lie outside the primary plane of Unicode, known as the Basic Multilingual Plane (BMP), which holds 64K (65,536) code points.
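The Byte Order Mark check and the paired 16-bit codes can be sketched in a few lines of Python. This is only an illustration: the function names sniff_bom and utf16_units are made up for this example, and the sniffing is deliberately simplified (it ignores the UTF-8 and UTF-32 Byte Order Marks and any encoding named in the XML declaration).

```python
import codecs

def sniff_bom(data):
    """Guess an encoding from a leading Byte Order Mark, defaulting to UTF-8.

    Simplified on purpose: UTF-8 and UTF-32 BOMs are not considered.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return "utf-8"

def utf16_units(ch):
    """Return the 16-bit codes that represent one character in UTF-16."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# A character inside the BMP needs a single 16-bit code.
print([hex(u) for u in utf16_units("A")])
# A character outside the BMP (U+1D11E, a musical symbol) needs a pair.
print([hex(u) for u in utf16_units("\U0001D11E")])
```

A character from the BMP yields one 16-bit code, while U+1D11E yields two, which is why the length of a UTF-16 string in code units is not always its length in characters.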

UTF-8 is a variable-length byte encoding of Unicode. If the high bit is zero, the byte is interpreted as an ASCII character. If the high bit is set, the byte is part of a multibyte sequence of two to four bytes. The leading byte begins with the bits 110, 1110, or 11110, which indicates the length of the sequence, and each following byte begins with the bits 10. The remaining bits of the leading byte and the 6 lowest bits of each following byte together identify which Unicode character is represented.
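To make the bit patterns visible, the following Python sketch prints each byte of a character's UTF-8 form in binary. The helper name utf8_bits is invented for this example.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of a character as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

# ASCII: one byte, high bit zero.
print(utf8_bits("A"))
# Latin-1 letter e-acute (U+00E9): a leading byte and one following byte.
print(utf8_bits("\u00e9"))
# Euro sign (U+20AC): a leading byte and two following bytes.
print(utf8_bits("\u20ac"))
```

In the output, every byte after the first in a multibyte sequence starts with 10, so a reader can find character boundaries even when starting in the middle of a stream.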

ISO 8859-1, Latin-1, UTF-32, etc.

ISO-8859-1 (Latin-1) is an 8-bit encoding suitable for many documents using the Latin alphabet, as it includes the commonly used accented letters, and additional punctuation marks, such as ¿. However, the euro sign must still be represented by a character entity, &euro; (€).
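As a rough illustration of the same idea, Python's built-in xmlcharrefreplace error handler writes any character missing from the target encoding as a numeric character reference. The function name to_latin1_xml is hypothetical, and the output uses the numeric form &#8364; rather than the named entity.

```python
def to_latin1_xml(text):
    """Encode text as ISO-8859-1, replacing characters the encoding lacks
    with numeric character references."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")

# The inverted question mark fits in one Latin-1 byte; the euro sign does not.
print(to_latin1_xml("\u00bf5 \u20ac?"))
```

This is safe for character data and attribute values, but not for element or attribute names, where character references are not allowed.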

UTF-32 is a full 32-bit representation of the Unicode character set, useful where space is of no concern. ISO-8859-2 and the other ISO 8859 variants cover character sets beyond Latin-1. EUC-JP, Shift_JIS, and ISO-2022-JP are encodings of the character set defined by the Japanese standard JIS X 0208:1997. ISO-10646-UCS-2 and ISO-10646-UCS-4 are 2- and 4-byte encodings defined by ISO 10646, corresponding to UTF-16 (without its paired 16-bit codes) and UTF-32 respectively.

Language Names

XML and HTML can also identify an element as content in a particular language. This could be simply informational, or it might affect the presentation of the document. There is no requirement that language and character encoding be compatible.

Language names are defined by the standards ISO 639 (language codes) and ISO 3166 (country codes). The language name can be simply the language code, typically a 2-letter lowercase ASCII name like "en", or a language code followed by a hyphen (-) and an ASCII qualifier, for example "en-GB". This identifies the British flavour of the English language. This syntax has been extended to allow a further subtag, another hyphen followed by a secondary qualifier, as in "zh-min-nan". This is used for certain dialects of Chinese (zh), including Taiwanese and Southern Fujian.

Language names can be registered with the Internet Assigned Numbers Authority (IANA). Registered names that are not based on an ISO 639 language code begin with an 'i', followed by a hyphen (-), and the remainder of the name. Private names guaranteed never to be assigned begin with 'x', followed by a hyphen (-). The prefix "sgn-" indicates a sign language.

For more information about language names, visit the IETF and see RFC 3066.
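The naming rules above can be sketched as a small Python helper that splits a language name at its hyphens and classifies the first part. The function name split_language_name and the labels it returns are made up for this example; it does no validation against ISO 639, ISO 3166, or the IANA registry.

```python
def split_language_name(name):
    """Split a language name into its primary part and qualifiers.

    Only the prefixes described in the text are recognised; nothing is
    checked against the actual standards or registries.
    """
    parts = name.split("-")
    primary = parts[0].lower()
    kinds = {"i": "IANA-registered", "x": "private", "sgn": "sign language"}
    return {
        "kind": kinds.get(primary, "language"),
        "primary": primary,
        "qualifiers": parts[1:],
    }

for name in ("en", "en-GB", "zh-min-nan", "x-klingon"):
    print(name, split_language_name(name))
```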

Declaring Character Encoding and Language

For most languages, ASCII encoding will not suffice. Portuguese and French use many diacritics, and many Asian languages use thousands of symbols, none of which are Latin characters. I would suggest that any document requiring more than the ASCII characters start by declaring its encoding with an XML declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>

ISO 8859-1 contains all the characters required for Portuguese, but lacks a few used in French (Œ œ). ISO-8859-15 contains all the characters used in modern French. For a wealth of information on character sets and languages, see www.eki.ee/letter (Eesti Keele Instituut - Institute of the Estonian Language).
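One way to check whether an encoding covers the text you intend to write is simply to try encoding it. The sketch below uses Python's standard codecs; the helper name can_encode is hypothetical.

```python
def can_encode(text, encoding):
    """Report whether every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Portuguese fits in ISO 8859-1.
print(can_encode("cora\u00e7\u00e3o", "iso-8859-1"))
# The French ligature in "coeur" written with U+0153 does not ...
print(can_encode("c\u0153ur", "iso-8859-1"))
# ... but it does fit in ISO-8859-15.
print(can_encode("c\u0153ur", "iso-8859-15"))
```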

Strictly speaking, XML parsers need only support the UTF-8 and UTF-16 encodings. I would be surprised if one did not support ISO 8859-1. As always, test your documents with the tools you will be using, and the browsers, user agents, and platforms you intend to support. Tools are available to convert from one encoding to another if need be.

XHTML documents can define the language in various elements. This is done with the lang attribute. Also include an xml:lang attribute for XML compatibility. To specify the language for the entire document:

<html lang="en" xml:lang="en">

For HTML compatibility, an XHTML document should once again declare its encoding with a meta http-equiv declaration:

<head>
<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1" />
...
</head>

The character encoding might be converted by the time it reaches the end user. In a process known as transcoding, a document is authored in one encoding, and then served in another. The encoding might also be converted by an XML database, or any number of conversion utilities along the way.
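A conversion of this kind can be sketched in Python as decoding the bytes, rewriting the encoding named in the XML declaration, and encoding again. This is a minimal sketch under stated assumptions: the function name transcode is invented, the declaration is matched with a simple regular expression, and a meta http-equiv declaration, if present, is left untouched.

```python
import re

def transcode(data, source, target):
    """Convert an XML document from one encoding to another and update
    the encoding named in its XML declaration, if there is one."""
    text = data.decode(source)
    # Replace only the encoding name inside the first <?xml ... ?> declaration.
    text = re.sub(
        r'(<\?xml[^>]*encoding=["\'])[^"\']+',
        lambda m: m.group(1) + target,
        text,
        count=1,
    )
    # Raises UnicodeEncodeError if the target lacks a character in the text.
    return text.encode(target)

latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<p>caf\u00e9</p>'.encode("iso-8859-1")
print(transcode(latin1_doc, "iso-8859-1", "utf-8"))
```

Forgetting to update the declaration is a common source of trouble: the bytes are then in one encoding while the document claims another.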

White Space and Line Breaks

In markup expressions (tags, DTD declarations, PIs) horizontal or vertical spacing is often necessary to delimit tokens. One could simply use the ASCII space character, but one of the niceties of XML is that it is more or less human readable. Breaking documents into reasonably short lines aids readability. Horizontal tabs are preferred by some to "beautify" source files, but I prefer not to indent XML or HTML, as the depth of nesting can result in unwieldy line widths.

The characters for spacing markup come from the ASCII character set: hexadecimal 09 for horizontal tab, hex 20 for space, hex 0A for line feed, and hex 0D for carriage return. A carriage return, a line feed, or a carriage return followed by a line feed is considered a line break. Any sequence of tabs, spaces, or line breaks is considered white space. Tabs and carriage returns can present interoperability problems, so I suggest avoiding them altogether. Do not use a line break in an attribute value. You can use a character reference such as &#10; to represent one.
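The line break and white space rules can be illustrated with two small Python helpers. The names normalize_line_breaks and collapse_white_space are invented for this example; this is a sketch of the rules as described here, not a substitute for an XML parser.

```python
import re

def normalize_line_breaks(text):
    """Turn CR LF pairs and lone CRs into single line feeds."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def collapse_white_space(text):
    """Replace each run of tab, LF, CR, or space with one space and trim the ends."""
    return re.sub(r"[\x09\x0A\x0D\x20]+", " ", text).strip(" ")

print(repr(normalize_line_breaks("a\r\nb\rc\nd")))
print(repr(collapse_white_space("  a\t\tb\r\n c ")))
```

The CR LF replacement has to run first, otherwise each pair would become two line feeds.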

Character Data (CDATA) or Parsed Character Data (PCDATA) can draw from other spacing characters available in its encoding. In XHTML, most attribute values are declared as CDATA, and text content outside of markup tags is PCDATA. Some encodings have spacing that has no visual effect. Partial spaces, non-breaking spaces, and zero width spaces may be available. The Zero Width Non-Breaking Space shares its code point, hex FEFF, with the Byte Order Mark.

The effect of space characters is dependent on the implementation. It would be quite different for aural and somatic devices. A zero width space normally has no impact on presentation, visual or otherwise. It might affect a search, sort, or index, however.
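The effect on searching is easy to demonstrate in Python. In this sketch the zero width space is U+200B, and the helper strip_zero_width (a made-up name) removes it along with U+FEFF before comparing; whether stripping is appropriate depends on the application.

```python
ZERO_WIDTH_SPACE = "\u200b"

def strip_zero_width(text):
    """Remove zero width spaces (U+200B) and U+FEFF from text."""
    return text.translate({0x200B: None, 0xFEFF: None})

# Looks like "encoding" on screen, but contains an invisible character.
visible = "en" + ZERO_WIDTH_SPACE + "coding"

print(visible == "encoding")                    # the strings differ
print("encoding" in visible)                    # a plain search misses it
print(strip_zero_width(visible) == "encoding")  # equal once stripped
```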
