Nicolas Richeton's blog

Unicode: the reason you need ICU for Java


There are several ways to encode accented characters in Unicode.

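For instance, a word like “Rivière” can be stored as:

R i v i è r e

(precomposed characters: è is the single code point U+00E8) or

R i v i e ` r e

(combining characters: a plain e followed by the combining grave accent U+0300, shown here as `)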
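To make the two forms visible, here is a minimal sketch that writes both spellings with Unicode escapes and prints their code points. The class, constant and method names are only there for illustration.

```java
final class UnicodeForms {
    // "Riviere" with a grave accent, using the precomposed character U+00E8.
    static final String COMPOSED = "Rivi\u00E8re";
    // Same word, using a plain 'e' followed by the combining grave accent U+0300.
    static final String DECOMPOSED = "Rivie\u0300re";

    // Formats each char of the string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 7 chars for the composed form, 8 for the decomposed one.
        System.out.println(COMPOSED.length() + " chars: " + codePoints(COMPOSED));
        System.out.println(DECOMPOSED.length() + " chars: " + codePoints(DECOMPOSED));
    }
}
```

Both strings render identically on screen, which is why the difference is easy to miss in a debugger.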

At first, this looks like a mere technical detail. Most Java developers use Strings without paying much attention to how they are stored internally. Conversion problems happen sometimes but can be easily fixed in most cases.

But things are more complicated here. A Java String can use either form and both are valid: the Eclipse debugger displays the same value for each one. However, if you call Object#equals on two equivalent strings that do not use the same form, the result is false.
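The behaviour is easy to reproduce. The sketch below uses illustrative names and spells the two forms of “Rivière” with Unicode escapes so the difference survives copy and paste.

```java
final class EqualsDemo {
    // Precomposed form: U+00E8 for the accented letter.
    static final String COMPOSED = "Rivi\u00E8re";
    // Combining form: 'e' followed by U+0300.
    static final String DECOMPOSED = "Rivie\u0300re";

    // Returns what String#equals reports for the two forms.
    static boolean sameUnderEquals() {
        return COMPOSED.equals(DECOMPOSED);
    }

    public static void main(String[] args) {
        // equals compares the chars one by one, so the forms differ.
        System.out.println(sameUnderEquals());                    // false
        // compareTo is also a char-by-char comparison.
        System.out.println(COMPOSED.compareTo(DECOMPOSED) == 0);  // false
    }
}
```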

This means that equals alone cannot be relied on to compare Strings that may arrive in different forms. Other comparators that take Unicode forms into account are available in the Java API, but you can only choose to use them in your own code. If the comparison is done by a third-party library, you have no choice but to convert the String from one form to the other.
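The post does not name a specific comparator. One candidate from the standard library is java.text.Collator with canonical decomposition enabled. The sketch below assumes that choice; the class and method names, the root locale and the IDENTICAL strength are illustrative settings, not a recommendation.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorDemo {
    // Compares two strings after letting the collator decompose canonical variants.
    static boolean canonicallyEqual(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // Treat precomposed and combining spellings as the same text.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        // Strictest strength: accents and case still count as differences.
        collator.setStrength(Collator.IDENTICAL);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String composed = "Rivi\u00E8re";
        String decomposed = "Rivie\u0300re";
        System.out.println(composed.equals(decomposed));           // false
        System.out.println(canonicallyEqual(composed, decomposed));
        System.out.println(canonicallyEqual(composed, "Riviere"));
    }
}
```

This only helps where you control the comparison, which is exactly the limitation described above.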

This can be done using java.text.Normalizer, but this API was only added in Java 6. If you have to ensure compatibility with Java 1.4 or 5, one way is to use IBM’s ICU4J, which is available in the default Eclipse distribution as a plugin:

com.ibm.icu.text.Normalizer.compose(String str, boolean compat)

and

com.ibm.icu.text.Normalizer.decompose(String str, boolean compat)
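As a rough illustration of the conversion, here is a sketch using the Java 6 java.text.Normalizer, since it needs no extra dependency. The wrapper names compose and decompose are mine. With ICU4J, the two methods listed above would play the same role, and passing false for compat should select the canonical forms (NFC and NFD), but check the ICU4J documentation for the version you ship.

```java
import java.text.Normalizer;

final class NormalizerDemo {
    // Converts to the composed form (NFC): 'e' + U+0300 becomes U+00E8.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Converts to the decomposed form (NFD): U+00E8 becomes 'e' + U+0300.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "Rivi\u00E8re";
        String decomposed = "Rivie\u0300re";
        // Once both sides use the same form, equals behaves as expected.
        System.out.println(compose(decomposed).equals(composed));    // true
        System.out.println(decompose(composed).equals(decomposed));  // true
    }
}
```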

So when do you need to use these methods? ALWAYS! You never know when a library will return a String in one form or the other, because it is often unspecified.
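In practice, “always” can mean normalizing every string at the point where it enters your code. The sketch below shows one way to do that. It assumes Java 6 or later and picks NFC arbitrarily; the helper names nfc and sameText are hypothetical.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

final class NormalizedKeys {
    // Hypothetical helper: brings a string to NFC, leaving null untouched.
    static String nfc(String s) {
        if (s == null) {
            return null;
        }
        return Normalizer.isNormalized(s, Normalizer.Form.NFC)
                ? s
                : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Null-safe comparison that ignores which Unicode form each side uses.
    static boolean sameText(String a, String b) {
        return a == null ? b == null : nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<String>();
        known.add(nfc("Rivi\u00E8re"));

        String fromLibrary = "Rivie\u0300re"; // e.g. a value returned in the other form
        System.out.println(known.contains(fromLibrary));       // false
        System.out.println(known.contains(nfc(fromLibrary)));  // true
        System.out.println(sameText("Rivi\u00E8re", fromLibrary)); // true
    }
}
```

The same concern applies to map keys and set members, since they rely on equals and hashCode.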

This is exactly what happens with SWT: depending on the OS and the method you call, you can get two Strings that are equivalent but for which equals returns false.

Here is an example on OS X:

  • Text#getText() returns the first form
  • a FileTransfer Drag and Drop returns paths in the second form. See Bug 141282

In my opinion, a library should use a single form only, and returning a string in the other form should be considered a bug. Otherwise developers have to scatter conversions all over the code to ensure that strings can be compared correctly.