ColdFusion is not UTF-8 encoded

August 10th, 2009

Yes, it's true: despite what you may have assumed, not all text within a ColdFusion application is UTF-8 encoded Unicode. While I was aware of this myself, I'd frequently forget the details, so I wrote them down and I've pasted them here so you can read them too (aren't you lucky!). If you had assumed that you were always dealing with UTF-8 encoded Unicode, or you're not even quite sure what UTF-8 encoded Unicode means, you might just find this post interesting.

Character Sets and Encodings

Let's start by explaining what it means to refer to some text as UTF-8 encoded Unicode. Without getting too pedantic about it, Unicode usually refers to the set of characters defined by the Unicode Standard. The same set of characters is also referred to as the Universal Character Set (UCS), and has been standardized by the International Organization for Standardization (ISO) as part of the ISO-10646 standard.

A character set is a set of distinct, named characters (technically referred to as a character repertoire) together with a set of numeric codes used to refer to those characters. For example, the ASCII character set includes the Latin letter “lowercase a”, which has a numeric code of 97. The numeric codes are usually referred to as either character codes or code points. A character encoding specifies an algorithm for storing the characters in a particular character set as a sequence of bytes (octets really, but we'll avoid getting into too much detail here). So, when we say some text is UTF-8 encoded Unicode, we mean the text contains only characters in the Unicode character set and it has been encoded into a sequence of bytes using the UTF-8 encoding scheme.
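To make the distinction concrete, here's a rough sketch in Python (used purely for illustration, since nothing here is ColdFusion-specific). It keeps the code point of a character separate from the bytes that two different encodings produce for it. The variable names are my own.

```python
# A character's code point is just a number assigned by the character set.
letter = "a"
code_point = ord(letter)

# An encoding turns that character into bytes, and different encodings
# can produce different bytes for the very same character.
ascii_bytes = letter.encode("ascii")        # one byte
utf16_bytes = letter.encode("utf-16-be")    # two bytes, big-endian, no BOM

print(code_point, ascii_bytes.hex(), utf16_bytes.hex())
```

The code point stays the same whichever encoding is used; only the stored bytes change.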

While character sets and character encodings aren't the same thing, the terms have been used interchangeably over the years, and the MIME standard even uses the term charset to refer to the combination of a character set and encoding scheme. Though ISO-10646 defines a number of possible encodings for the Universal Character Set, many earlier standards, such as ISO-8859-1 (aka Latin-1) and Windows-1252, only defined a single encoding for a character set, blurring the distinction between character sets and character encodings. It's all very confusing, but the important thing to remember is that the character encoding defines how characters are encoded into a sequence of bytes, and encoding schemes are not always compatible with one another. For example, lowercase e acute, i.e. é, is not stored using the same sequence of bytes by UTF-8 and ISO-8859-1. In fact, UTF-8 uses two bytes whereas ISO-8859-1 uses just one.
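The é example is easy to check for yourself. This is a small Python sketch, not anything ColdFusion does; the helper name decodes_as_utf8 is my own invention for the demonstration.

```python
# "é" is code point U+00E9.
e_acute = "\u00e9"

utf8_bytes = e_acute.encode("utf-8")          # two bytes
latin1_bytes = e_acute.encode("iso-8859-1")   # one byte

def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(utf8_bytes.hex(), latin1_bytes.hex())
# The single ISO-8859-1 byte is not valid UTF-8 on its own.
print(decodes_as_utf8(utf8_bytes), decodes_as_utf8(latin1_bytes))
```

So the two encodings are not just different in length: bytes written with one can be outright invalid when read with the other.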

ColdFusion uses UTF-16 internally

This isn't particularly important, but it may clear up some confusion over the role of the pageEncoding attribute of the cfprocessingdirective tag. ColdFusion runs on Java, and Java uses UTF-16 to represent text internally, so ColdFusion presumably also uses UTF-16. The pageEncoding attribute of the cfprocessingdirective tag does not tell ColdFusion how to represent data internally; it just tells ColdFusion which encoding to use when processing/compiling the source file.

Although I'm pretty sure ColdFusion does use UTF-16, the ColdFusion 8 documentation states that ColdFusion uses UCS-2. Java did use UCS-2 in the distant past, so I'm guessing this is just a mistake in the docs, but even if it is correct it doesn't really matter for most people. UCS-2 is an obsolete predecessor to UTF-16 that uses 2 bytes to store each character and can represent all of the characters in the Basic Multilingual Plane of the Unicode standard. Although the Unicode standard now includes over 100,000 characters, the Basic Multilingual Plane has 65,536 code points and covers the vast majority of characters in common use around the world. The UCS-2 and UTF-16 encodings of the characters in the Basic Multilingual Plane are identical, so for most purposes it really doesn't matter whether ColdFusion uses UCS-2 or UTF-16 internally.

If the reference to UCS-2 in the documentation is not a mistake, it might be that ColdFusion does use UTF-16, but that some operations aren't safe with supplementary characters outside the Basic Multilingual Plane, which are encoded using surrogate pairs of 2-byte code units. Again, it really doesn't matter as long as you're only using characters in the Basic Multilingual Plane, which you probably are.
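If surrogate pairs sound mysterious, the arithmetic is short. Below is a sketch in Python (again just for illustration, and not a claim about how ColdFusion or Java implement it) that computes the UTF-16 code units for a character and cross-checks them against Python's own encoder. The function name utf16_code_units and the sample characters are my choices.

```python
def utf16_code_units(ch: str) -> list:
    """Return the UTF-16 code units (as integers) for a single character."""
    cp = ord(ch)
    if cp < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit, the same in UCS-2 and UTF-16.
        return [cp]
    # Supplementary character: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return [high, low]

bmp_char = "\u00e9"             # é, inside the Basic Multilingual Plane
supplementary = "\U0001D11E"    # musical G clef, outside it

for ch in (bmp_char, supplementary):
    units = utf16_code_units(ch)
    # Cross-check against the built-in big-endian UTF-16 encoder.
    expected = b"".join(u.to_bytes(2, "big") for u in units)
    assert ch.encode("utf-16-be") == expected
    print(f"U+{ord(ch):04X}", [f"{u:04X}" for u in units])
```

Code that treats every 2-byte unit as a whole character would count the G clef as two characters, which is the kind of operation that isn't safe outside the Basic Multilingual Plane.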

Default Encoding for IO is Platform Dependent

While ColdFusion defaults to using UTF-8 when sending output to the browser or via email, the default encoding used for other kinds of input and output can vary depending on the operating system and configuration of the Java virtual machine (JVM). This matters when you are reading and writing files, and can also matter when processing text using Java libraries.

Each JVM instance has a default encoding scheme (referred to in the Java documentation as the platform's default encoding), which can be set by passing an argument to the JVM when starting the instance, but by default comes from the operating system. For example, on a Windows server set to use a Western European locale, the default encoding scheme might be Windows-1252. This is the default encoding used by CFFILE when reading and writing text files (in the absence of a charset attribute or a byte order mark (BOM) at the start of the file), and will also be the default encoding used by methods of various Java objects when converting data from strings to streams and vice-versa.

So, for example, if you use CFFILE to read a UTF-8 encoded configuration file, you can't just assume that CFFILE will read the file as UTF-8, because there are circumstances in which it won't and some characters won't be decoded properly. Similarly, if you are converting Java Strings to Streams in order to pass them to processing libraries (as you might do if you are using JTidy), you can't assume that those conversions will use the UTF-8 encoding scheme. The same issue applies to your source files - ColdFusion will fall back to using the platform's default encoding when compiling the source if the encoding is not set using cfprocessingdirective and the file does not include a byte order mark.
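Here's what that failure looks like in practice. This is a Python sketch rather than CFFILE itself: it writes a UTF-8 file and then reads it back twice, once with Windows-1252 (standing in for a platform default on a Western European Windows server) and once with the encoding stated explicitly. The file name settings.txt is hypothetical.

```python
import os
import tempfile

text = "caf\u00e9"   # "café"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "settings.txt")   # hypothetical config file

    # Write the file as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read it as Windows-1252, simulating a wrong platform default.
    with open(path, "r", encoding="cp1252") as f:
        misread = f.read()

    # Read it with the encoding stated explicitly.
    with open(path, "r", encoding="utf-8") as f:
        correct = f.read()

print(repr(misread), repr(correct))
```

No error is raised in the wrong-encoding case; the é is silently replaced by two other characters, which is why this kind of bug tends to go unnoticed until someone looks at the data.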

Note that you can attempt to detect the encoding of a file using either command line tools (e.g. the file command on UNIX systems, eh, kind of) or Java libraries (e.g. jdchardet), but CFFILE doesn't seem to do any character encoding detection apart from looking for a byte order mark, and character encoding detection is not 100% reliable anyway.
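Checking for a byte order mark is about the simplest detection there is. I don't know how CFFILE implements it, so treat the following Python as a sketch of the general idea only; sniff_bom and decode_with_fallback are names I made up, and the fallback argument plays the part of the platform's default encoding.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_fallback(data: bytes, fallback: str) -> str:
    """Decode using the BOM if present, otherwise the fallback encoding."""
    encoding, bom_length = sniff_bom(data)
    if encoding is None:
        return data.decode(fallback)
    # Skip the mark itself so it does not end up in the text.
    return data[bom_length:].decode(encoding)

with_bom = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
without_bom = "caf\u00e9".encode("cp1252")

print(repr(decode_with_fallback(with_bom, "cp1252")))
print(repr(decode_with_fallback(without_bom, "cp1252")))
```

A file with no mark tells you nothing, so everything then rests on the fallback being right, which is the situation described above.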

Content-type of Response Does Not Always Default to UTF-8

By default, ColdFusion sends HTTP responses UTF-8 encoded, and also sends a Content-Type header set to text/html; charset=UTF-8. When you use the CFCONTENT tag and set the content type without setting a charset, ColdFusion usually figures out what the encoding should be and appends the charset, but not always. With the file attribute, this seems to work well, and CFCONTENT does actually seem to do some encoding detection that goes beyond simply checking for a byte order mark and falling back to the platform's default encoding. If you don't use the file attribute, it only seems to work when you set the content type to either text/html or text/plain. Set it to anything else, and ColdFusion sends the Content-Type header exactly as you entered it, without a charset parameter, leaving the browser/client to decide how to interpret the response. The HTTP/1.1 specification names ISO-8859-1 as the default encoding for text content when the Content-Type header has no charset parameter and, lo and behold, that's exactly how some clients interpret it. This can lead to mangled data, because CF does actually send the response UTF-8 encoded but the client interprets it as ISO-8859-1.
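The safe habit is to always put the charset in the content type yourself. The Python below is only a sketch of that habit and of what the client sees when it's missing; it is not ColdFusion code, with_charset is a helper name I invented, and text/csv is just an example of a type other than text/html or text/plain.

```python
def with_charset(content_type: str, charset: str = "utf-8") -> str:
    """Append a charset parameter unless the header value already has one."""
    params = [p.strip().lower() for p in content_type.split(";")[1:]]
    if any(p.startswith("charset=") for p in params):
        return content_type
    return f"{content_type}; charset={charset}"

# What the server sends: UTF-8 bytes.
body = "caf\u00e9".encode("utf-8")

# What a client sees if it assumes ISO-8859-1 because no charset was given.
seen_by_client = body.decode("iso-8859-1")

print(with_charset("text/csv"))
print(with_charset("text/html; charset=ISO-8859-1"))   # left untouched
print(repr(seen_by_client))
```

The bytes on the wire are fine in both cases; the only thing that differs is whether the client was told how to read them.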

Verity is a Law unto Itself

The character set and encoding used by the Verity indexing engine are not directly related to the character set and encoding used by the ColdFusion server. The encoding used when indexing and searching depends on the language selected (which defaults to English), the version of Verity/ColdFusion and, if you are indexing documents, the default encoding for the operating system.

The version of Verity included with CF6 didn't support Unicode, so Verity could only deal with a subset of the characters that can be represented by ColdFusion, the documents on the server and probably the database. Later versions do support Unicode, but don't create multilingual indexes by default, and use either the operating system's default encoding or UTF-8 when indexing files. I'm unsure of the exact mechanism used to decide which encoding to use when indexing files, as I've only used Verity for indexing database queries, but to create multilingual indexes you need to install the separate Verity multi language pack and set the language to "uni".

And, I think that's it. You're probably more confused than ever now, but at least you know that you can't assume that text is UTF-8 encoded. The only advice I can give is this: where you know the encoding, be explicit and tell ColdFusion what to do. Don't rely on the default behaviour.

Posted by thickpaddy

thickpaddy.com