XML Encoding

XML encoding common mistakes

An XML file transferred over HTTP can be served with the MIME type application/xml, where a charset parameter SHOULD be present stating the XML file's character encoding, for example:

HTTP/1.1 200 OK
Date: Thu, 01 Jan 1970 00:00:00 GMT
Content-Length: 1234
Content-Type: application/xml; charset="utf-8"

<?xml version="1.0" encoding="utf-8"?>
[...]

Since no MIME type or encoding information is available when transferring XML files over FTP or other transfer methods, the MIME type text/xml with charset us-ascii MUST be assumed, as per RFC 2376.
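The fallback rule above can be sketched in Python. This is a minimal illustration rather than a complete HTTP client: the helper name charset_for is hypothetical, and the standard-library email.message.Message class is used here only as a convenient parser for the Content-Type header value.

```python
from email.message import Message
from typing import Optional


def charset_for(content_type: Optional[str]) -> str:
    """Return the declared charset, or us-ascii when none is available."""
    if content_type is None:
        # e.g. FTP: no MIME type or charset information at all
        return "us-ascii"
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the fallback when no charset is present
    return msg.get_content_charset("us-ascii")


print(charset_for('application/xml; charset="utf-8"'))
print(charset_for("text/xml"))
print(charset_for(None))
```

Only the first call reports utf-8; the other two fall back to us-ascii because no charset is declared.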

Charset versus Encoding

  • Charset or Character Set: Defines the set of characters that can be used within the XML file;
  • Encoding: Defines how characters are stored and represented as bytes in the XML file.

The example below shows that the same TEST123 string uses twice as many bytes in UTF-16 as it does in UTF-8:

Hexdump of "TEST123" encoded as UTF-8
54 45 53 54 31 32 33 TEST123

Hexdump of "TEST123" encoded as UTF-16LE
5400 4500 5300 5400 3100 3200 3300 T.E.S.T.1.2.3.
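The two hexdumps can be reproduced with a few lines of Python (3.8 or later for the separator argument of bytes.hex). The variable names are illustrative only.

```python
text = "TEST123"

utf8 = text.encode("utf-8")         # one byte per ASCII character
utf16le = text.encode("utf-16-le")  # two bytes per character, low byte first

print(utf8.hex(" "))
print(utf16le.hex(" "))
print(len(utf8), len(utf16le))
```

The character set is identical in both cases; only the byte representation, and with it the size, differs.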

As stated above, for XML files where the charset parameter is unknown, us-ascii MUST be assumed; the actual content of the XML file SHOULD therefore be encoded as ASCII. While most XML processors do use the encoding attribute to read the XML file's content, this is not standard behaviour and therefore MUST NOT be relied upon. Even if the actual content and the encoding attribute match, which SHOULD be the case, there is no way to know whether the target system supports the encoding you provided.
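Whether a file's content really is plain ASCII can be checked before sending it. The sketch below assumes the XML is already available as bytes; the function name first_non_ascii is hypothetical.

```python
def first_non_ascii(data: bytes) -> int:
    """Return the offset of the first byte above 0x7F, or -1 if all are ASCII."""
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            return offset
    return -1


ascii_only = b"<name>test</name>"
with_accent = "<a>\u00e9</a>".encode("utf-8")  # contains an e-acute

print(first_non_ascii(ascii_only))
print(first_non_ascii(with_accent))
```

A result of -1 means the content is safe under the us-ascii assumption; any other value points at the first byte that needs attention.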

If an XML file specifies UTF-8 while the target system only supports Windows-1252, all characters outside of the Windows-1252 charset are either represented incorrectly or break processing entirely. Therefore, it is best practice to default to ASCII encoding and to use XHTML entities to encode / escape characters outside of the ASCII range.
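Both failure modes can be simulated in Python by decoding UTF-8 bytes as Windows-1252. This only mimics what a Windows-1252-only target system might do; Python's codec is used as a stand-in, and real systems may behave differently on undefined bytes.

```python
# "Represented wrong": the two UTF-8 bytes of e-acute become two characters.
garbled = "\u00e9".encode("utf-8").decode("windows-1252")
print(garbled)

# "Break processing": A-acute encodes to C3 81, and byte 0x81 is not
# defined in Python's Windows-1252 codec, so decoding raises an error.
try:
    "\u00c1".encode("utf-8").decode("windows-1252")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("processing failed:", exc.reason)
```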

Proper use of XML encoding

As stated, neither the transfer encoding nor the XML encoding attribute can be relied upon. Therefore, any special character outside of the ASCII range, yet inside the XHTML Character Entities range (e.g. é, ö, etc.) SHOULD be replaced with the corresponding entity (e.g. &eacute;, &ouml;, respectively). All other characters SHOULD be avoided.
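The replacement rule could be automated along the following lines. This is a sketch under two assumptions: that Python's html.entities.codepoint2name table (the HTML 4 named entities) is an acceptable stand-in for the XHTML Character Entities range, and that the target system resolves these named entities. The function name to_ascii_with_entities is hypothetical.

```python
from html.entities import codepoint2name


def to_ascii_with_entities(text: str) -> str:
    """Replace non-ASCII characters with named entities; reject the rest."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            # outside the entity range: the text says to avoid these
            raise ValueError(f"no named entity for U+{cp:04X}")
    return "".join(out)


print(to_ascii_with_entities("\u00e9 \u00f6"))
```

Raising an error for characters without a named entity is a design choice made for this sketch; it forces the sender to deal with such characters explicitly instead of passing them through.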

What should be avoided

CDATA sections or Character Data Sections (i.e. <![CDATA[ ... ]]>) are strictly intended for using the characters behind the 5 XML Predefined Entities (&, <, >, ' and ") as is. In other words, a CDATA section only removes the need for escaping characters that would otherwise be interpreted as XML markup. The text content inside a CDATA section is not parsed but read as is; wrapping text in CDATA therefore has no effect on special characters and their encoding.
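The behaviour of CDATA can be observed with Python's standard xml.etree.ElementTree parser, used here purely as an example XML processor. The element name a is arbitrary.

```python
import xml.etree.ElementTree as ET

# Escaped text and CDATA text yield the same character data.
plain = ET.fromstring("<a>1 &lt; 2 &amp; 3</a>")
cdata = ET.fromstring("<a><![CDATA[1 < 2 & 3]]></a>")
print(plain.text == cdata.text)

# Inside CDATA nothing is parsed: an entity reference stays literal text.
literal = ET.fromstring("<a><![CDATA[&amp;]]></a>")
print(literal.text)
```

CDATA changes how markup characters are written, not how bytes are decoded, so it offers no protection against encoding problems.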

Any special characters outside of the XHTML Character Entities range (e.g. ő) SHOULD be avoided.

Any unsupported character found in your XML will be decoded using the local system's encoding (Windows-1252 or ANSI Latin-1). This might result in unexpected characters, the character might be removed entirely, or processing of the XML file may fail.
