In an attempt to move into the field of enterprise application development I started refreshing my Java recently. I was going through a well known book when I stumbled upon the implementation of Strings in Java. I have high esteem for the developers at Sun, but I really could not digest the fact that Sun engineers thought 2 bytes would be enough for characters. It was kind of Y2kish. Now that the UTF has grown above the usual number 16 bits can represent I was eager to find how Sun tackled this problem. The book touched the matter vaguely but since that didn’t completely clear my doubts I decided to investigate. And these are my findings.
Before we dive into the Java implementation of the standard, we should understand what UTF is. At least some of us might have seen it somewhere. May be those of us who have the creepy behavior of going through the source of an HTML page might have seen the following,
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />.
And we may have a vague idea on what it does. Don’t worry let’s understand what it really is.
Unicode is an internationally accepted standard to define character sets and corresponding encoding. And the above piece of code is informing the browser that the document should be interpreted using the utf-8 character set. So what exactly is UTF or Uniform Transformation Format?
In order to completely know about Unicode, we have to traverse a few decades back when people thought the earth was the center of the universe and it was indeed flat. Oh sorry, we have to traverse a few decades back when the majority of the software was written by English speaking people. It was only quite natural they thought the only set of characters ever to be encountered in the realm of programming would be from English alphabets, numerical digits and a couple of other prominent characters. So it was logical to use 28 or 256 bits to represent the set of characters. It was enough for the normally used characters back then and also space were left for the inclusion of characters in the future. The problem started when different people started encoding different characters in the free space. What evolved was chaos. Also when the internet happened, all the people around the globe started using technology and started tweaking programs to their likes and in their own native languages. It became impossible to fit all the characters into the tiny space of 256 bits. Soon the encoding system known as ASCII ran out of space and a need for a better encoding system evolved. To make a long story short, thus evolved UTF. More on this can be read from here and here.
In UTF characters are represented by code points. It is usually a hex number preceded by U+. For example U+2122 means TM, the trademark symbol. The prominent UTF encodings are UTF-8, UTF-16 and the latest one being UTF-32. These three are methods to represent the UTF character set using 8 bit, 16 bit and 32 bit respectively. To get a more detailed and exclusive idea on Unicode please read Joel Spolsky’s post.
In Java from beginning characters were represented by 16 bits and for some time it was enough to represent all the characters. But since the characters included into UTF overgrew the 16 bit realm, Java was faced with a dilemma, either to change the char representation into 32 bit or use some other methods.. It is not of much issue since most of the characters outside the 16 bit representation is rarely used. But since Java is a language which believes a in portability very much and the engineers in Sun are much more intelligent than the average developers like us, they found a way to circumvent this issue. Java is now equipped to represent all the characters in the 32 bit realm also. So how does Java tackle the supplementary characters out side the 16 bits? What Sun employed to get out of this mess was UTF-16 encoding. So what is UTF-16?
To quote Wikipedia UTF- 18 is a variable length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding maps character to a sequence of 16 bit words. Characters are known as code points and 16 bit words are known as code units. The basic characters from the Basic Multilingual Plane’ can be represented using the 16 bits. For characters outside this we need to use a pair of 16 bit words called as the surrogate pair. Thus all the characters that can be encoded by 232 or U+0000 through U+10FFFF , except for U+D800–U+DFFF (These are not assigned any characters) can be specified using UTF-16. Why are these numbers not assigned any characters? It is an intelligent choice made by the Unicode community to design the UTF-16 encoding scheme.
The characters outside the BMP (those from U+10000 through U+10FFFF) are represented using a pair of 16 bit words as I said before. These pair is known as surrogate pair. Now 1000016 is subtracted from the original code point to make it a 20 bit number. Now it is divided to two 10 bit numbers each of which is loaded into a surrogate with the higher order bits in the first. the two surrogates will be in the range 0xD800–0xDBFF and 0xDC00-0xDFFF. Thus since we have left out those region unassigned we can be sure it isn’t a character but need processing before the original code point is found out. You can read the UTF-16 specification from Sun here.