Saturday, March 31, 2007

Java and UTF-8 encoding

On Friday I discovered an encoding problem in my Java application. First I figured out how to send UTF-8 encoded data from an HTML form through dojo.bind (djConfig = { bindEncoding: "utf-8" };). Then I checked the PostgreSQL database, and the tables were set to UTF-8. Despite all that, data was still being saved in the database in a garbled form: each two-byte UTF-8 character was stored as two separate characters. I tried almost everything to solve the problem, including different ways of getting the parameters from the HTTP request:

1. request.setCharacterEncoding("UTF8");
2. ... new String(request.getParameter("myparam").getBytes(), "UTF8");
3. BufferedReader reader = new BufferedReader(
       new InputStreamReader(new StringBufferInputStream(text), "UTF8"));
   text = reader.readLine();
and many other combinations. Actually, the third method with StringBufferInputStream worked, but StringBufferInputStream is deprecated because it uses only the low eight bits of each character in the string, so I thought there had to be a better solution. At last I found this FAQ at jGuru, with a comment by Jonathan Asbell:

"When a browser sends a parameter in some encoding, such as UTF-8, it encodes each character byte value as a hexadecimal string using the encoding for the page (e.g. UTF-8). At the server, however, the part of the container that interprets these character values always assumes they are 8859-1 byte values. So it created a Unicode string based on the byte values interpreted as 8859-1. Since the 8859-1 assumption is made by the container, this hack (read "fix") works independently from which platform you run it on.

In the Servlet 2.2 API, the methods that parse parameter input always assume that it's sent as ISO 8859-1 (i.e. getParameter() et al). so they create a String containing the correct bytes but incorrect charset.

If you know what the charset is, you can convert the bytes to a string using the correct charset:

new String(value.getBytes("8859_1"), "utf-8")

8859-1 is the default encoding of HTTP."
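To make the effect easier to see, here is a minimal standalone sketch of what Jonathan describes. It is not servlet code: the class EncodingFix and the methods misdecode and recover are hypothetical names made up for illustration, and misdecode only simulates a container that reads UTF-8 bytes as ISO 8859-1. It also assumes Java 7 or later for StandardCharsets; on older versions the charset names can be passed as strings instead.

```java
import java.nio.charset.StandardCharsets;

public class EncodingFix {

    // Simulates the container (an assumption for this demo): the browser
    // sent UTF-8 bytes, but they are decoded as if they were ISO 8859-1.
    public static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The fix from the quote: turn the garbled string back into the raw
    // bytes using ISO 8859-1, then decode those bytes as UTF-8.
    public static String recover(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with acute accent, two bytes in UTF-8
        String garbled = misdecode(original);
        String fixed = recover(garbled);

        System.out.println(original.length());      // prints 1
        System.out.println(garbled.length());       // prints 2
        System.out.println(fixed.equals(original)); // prints true
    }
}
```

This would also explain why my second attempt was unreliable: getBytes() without an argument uses the platform default charset, so it only recovers the original bytes when that default happens to be ISO 8859-1.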

Thanks to Jonathan I can move on now and find whatever other pitfalls await me.
But I still don't understand why request.setCharacterEncoding("UTF8") didn't work in the first place.

1 comment:

Anonymous said...

Hey foo,

It was really useful; I had spent some hours trying to figure out how to do that ... :D