XML encoding common mistakes
An XML file transferred over HTTP can be assigned the MIME type application/xml, where a charset field SHOULD be present containing the XML file's character encoding, for example:
Content-Type: application/xml; charset=utf-8
Since no MIME type or encoding information is available when transferring XML files over FTP or other transfer methods, the MIME type text/xml with charset us-ascii MUST be assumed, as per RFC 2376.
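The rule above can be sketched in Python. This is only an illustration: the helper name declared_charset is hypothetical, and it assumes the Content-Type value (if any) has already been obtained from the transfer layer. When no header or no charset field is available, it falls back to us-ascii as described.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: Optional[str], default: str = "us-ascii") -> str:
    """Return the charset named in a Content-Type value, or the default.

    `content_type` is None when the transfer method (e.g. FTP) carries
    no MIME information at all.
    """
    if content_type is None:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or default


http_charset = declared_charset("application/xml; charset=UTF-8")
ftp_charset = declared_charset(None)
```

Here http_charset is the charset announced by the sender, while ftp_charset is the assumed default.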
Charset versus Encoding
- Charset or Character Set: Defines the set of characters that can be used within the XML file;
- Encoding: Defines how characters are stored and represented as bytes in the XML file.
The same TEST123 string uses twice as many bytes in UTF-16 as it does in UTF-8: 14 bytes versus 7, not counting the optional byte order mark.
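A minimal Python sketch of this comparison follows. It uses the utf-16-be codec so that no byte order mark is added; with the plain utf-16 codec Python would prepend two extra bytes.

```python
text = "TEST123"

utf8_bytes = text.encode("utf-8")       # one byte per character here
utf16_bytes = text.encode("utf-16-be")  # two bytes per character, no BOM

print(len(utf8_bytes), utf8_bytes.hex(" "))
print(len(utf16_bytes), utf16_bytes.hex(" "))
```

The character set is the same in both cases. Only the byte representation differs.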
As stated above, for XML files where the charset field is unknown, us-ascii MUST be assumed, so the actual content of the XML file SHOULD be encoded as ASCII. While most XML processors do use the encoding attribute to read the XML file's content, this is not standard behaviour and therefore MUST NOT be relied upon. Even if the actual content encoding and the encoding attribute match, which SHOULD be the case, there is no way to know whether the target system supports the encoding you provided.
In case an XML file specifies UTF-8 while the target system only supports Windows-1252, all characters outside of the Windows-1252 charset are either represented incorrectly or break processing entirely. Therefore, it is best practice to default to ASCII encoding and to use XHTML entities to encode / escape characters outside of the ASCII range.
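As a rough illustration of why ASCII output is the safe default, the sketch below serialises an element with Python's xml.etree.ElementTree using the us-ascii encoding. The element name city and the sample text are made up for the example. ElementTree writes non-ASCII characters as numeric character references, so the resulting bytes read the same under every ASCII-compatible encoding.

```python
import xml.etree.ElementTree as ET

root = ET.Element("city")   # hypothetical element name
root.text = "K\u00f6ln"     # "Köln", contains one non-ASCII character

# Serialise as pure ASCII; the o-umlaut becomes a numeric character reference.
ascii_bytes = ET.tostring(root, encoding="us-ascii")

# Every byte is below 128, so these decoders all produce the same text.
decoded = {enc: ascii_bytes.decode(enc)
           for enc in ("ascii", "utf-8", "cp1252", "latin-1")}

# A conforming parser restores the original character from the reference.
round_trip = ET.fromstring(ascii_bytes).text
```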
Proper use of XML encoding
As stated, neither the transfer encoding nor the XML charset header can be relied upon. Therefore, any special character outside of the ASCII range, yet inside the XHTML Character Entities range (e.g. ö, etc.) SHOULD be replaced with the representing entity (e.g. &#246;, resp.). All other characters SHOULD be avoided.
- XML Predefined Entities: &lt; &gt; &amp; &apos; &quot;
- XHTML Character Entities (for special characters)
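The replacement rule can be sketched as follows. The function name to_ascii_xml_text is hypothetical, and two assumptions are made: Python's html.entities.codepoint2name table stands in for the XHTML Character Entities range, and the numeric form of each reference is emitted because it needs no DTD. Characters outside that range are rejected rather than silently passed through.

```python
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return `text` as ASCII-only XML character data."""
    # Escape the five predefined entities first (& < > by default, quotes added).
    escaped = escape(text, {'"': "&quot;", "'": "&apos;"})
    out = []
    for ch in escaped:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            # Inside the XHTML entity range: emit a numeric character reference.
            out.append(f"&#{cp};")
        else:
            raise ValueError(f"character U+{cp:04X} has no XHTML entity")
    return "".join(out)


safe = to_ascii_xml_text('K\u00f6ln & "Co"')
```

Raising an error for unsupported characters is a design choice for this sketch; a real pipeline might prefer to log and substitute instead.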
What should be avoided
CDATA sections or Character Data Sections (e.g. <![CDATA[ ... ]]>) are strictly intended for using the 5 XML Predefined Entities (<, >, &, ', ") as is. In other words, a CDATA section only removes the need to escape characters that would otherwise be interpreted as XML markup. The text content inside a CDATA section is not parsed but read as is, so a CDATA section has no effect on special characters and their encoding.
Any special characters outside of the XHTML Character Entities range (e.g. ß) SHOULD be avoided.
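The CDATA behaviour described in this section can be checked with a short Python sketch using xml.etree.ElementTree. The element name note is made up for the example. It shows that markup characters inside CDATA are taken literally, and that a character reference inside CDATA is not expanded, so CDATA offers no help with encoding.

```python
import xml.etree.ElementTree as ET

# Markup characters inside CDATA need no escaping.
cdata_text = ET.fromstring("<note><![CDATA[1 < 2 & 3 > 2]]></note>").text

# Inside CDATA a character reference stays literal text...
ref_in_cdata = ET.fromstring("<note><![CDATA[&#246;]]></note>").text

# ...whereas in normal character data it is expanded to the character.
ref_in_text = ET.fromstring("<note>&#246;</note>").text
```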
Any unsupported character found in your XML will be decoded using the local system's encoding (ANSI Latin-1). This might result in unexpected characters, the character might be removed entirely, or processing of the XML file may fail.
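The three outcomes can be reproduced in Python. This is only a simulation of what a receiving system might do, using Python's own codecs; the sample string is made up for the example.

```python
original = "K\u00f6ln"  # "Köln"

# 1. Unexpected characters: UTF-8 bytes decoded as Latin-1.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("latin-1")  # two wrong characters replace the o-umlaut

# 2. Character removed: a lenient ASCII decoder drops the unknown byte.
latin1_bytes = original.encode("latin-1")
dropped = latin1_bytes.decode("ascii", errors="ignore")

# 3. Processing fails: a strict UTF-8 decoder rejects the Latin-1 byte.
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```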