I recently spent a lot of time learning about character encoding in browsers and web applications while trying to debug problems with characters not displaying correctly. I found a lot of information scattered all over the web, but no single place that explained what I needed to know to get a grasp of what was going on. So I wrote up the points below, hoping that someone Googling these terms and looking for answers, as I was, will find them useful!
Each displayable character has a numeric representation. ASCII defines the common characters in the range 0-127, where A=65 and z=122 for example. Most single-byte character sets use values 128-255 to store their own characters, such as non-English letters and extra punctuation. These character sets are still compatible with ASCII because they keep the same mappings for values 0-127 as those defined by ASCII.
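As a small illustration (Python is used here purely for convenience, it is not part of the original problem), the built-in `ord` function exposes these numeric values, and encoding an ASCII-only string gives the same bytes in every ASCII-compatible character set:

```python
# Every character maps to a number; ASCII covers the values 0-127.
codes = {ch: ord(ch) for ch in "Az"}
print(codes)  # {'A': 65, 'z': 122}

# An ASCII-only string produces identical bytes in ASCII, ISO-8859-1,
# windows-1252 and UTF-8. That is what "compatible with ASCII" means.
text = "Az"
encoded = {name: text.encode(name) for name in ("ascii", "latin-1", "cp1252", "utf-8")}
print(encoded)
```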
Unicode is a standard that defines a huge range of printable characters across many languages. UTF-8 is a variable-length character encoding for Unicode that is backwards-compatible with ASCII. All ASCII characters are stored as their single-byte representation. All other characters are stored in 2-4 bytes. UTF-8 is the preferred encoding internationally because it supports all Unicode characters and because it is backwards-compatible with ASCII.
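A minimal sketch of the variable-length behaviour, using four sample characters chosen only for illustration:

```python
# Sample characters: ASCII letter, accented letter, Euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# The ASCII character keeps its single ASCII byte.
ascii_unchanged = "A".encode("utf-8") == "A".encode("ascii")
```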
ISO-8859-1 (Latin-1) is a single-byte non-Unicode character set that defines characters 128-159 as undisplayable control characters, and lacks important characters such as the Euro sign.
windows-1252 is identical to ISO-8859-1, except that it takes the range of control characters (128-159) and replaces it with useful, displayable special characters like the Euro sign (128) and smart quotes (often used in apps like Microsoft Word). A single closing smart quote, for example, is character 146.
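The difference between the two character sets can be checked directly. This sketch uses Python's `latin-1` and `cp1252` codec names for ISO-8859-1 and windows-1252:

```python
import unicodedata

EURO = "\u20ac"           # Euro sign
CLOSING_QUOTE = "\u2019"  # single closing smart quote

# windows-1252 places both characters in the 128-159 range.
cp1252_euro = EURO.encode("cp1252")[0]            # 128
cp1252_quote = CLOSING_QUOTE.encode("cp1252")[0]  # 146

# ISO-8859-1 has no Euro sign at all, so encoding it fails.
try:
    EURO.encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# Byte 146 read as ISO-8859-1 is a control character, not a quote.
latin1_146 = bytes([146]).decode("latin-1")
latin1_146_category = unicodedata.category(latin1_146)  # "Cc" = control

print(cp1252_euro, cp1252_quote, latin1_has_euro, latin1_146_category)
```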
When browsers receive a page that says it is encoded in ISO-8859-1, they treat it as windows-1252 because in most cases it really is. This way they can display these extended characters correctly, even when the encoding is incorrectly identified as ISO-8859-1.
This special handling is so useful and common that it is actually required in draft versions of HTML5.
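To make the browser behaviour concrete, here is a rough sketch of the same rule. The helper `decode_like_browser` and the override table are hypothetical names invented for this example, and the sketch is a simplification: the few byte values that Python's `cp1252` codec leaves undefined are simply replaced with a placeholder character here.

```python
# Hypothetical override table: labels that get treated as windows-1252.
LABEL_OVERRIDES = {"iso-8859-1": "cp1252", "latin1": "cp1252"}


def decode_like_browser(raw: bytes, declared: str) -> str:
    """Decode raw bytes, treating an ISO-8859-1 label as windows-1252."""
    label = declared.strip().lower()
    codec = LABEL_OVERRIDES.get(label, label)
    # errors="replace" covers bytes the chosen codec has no mapping for.
    return raw.decode(codec, errors="replace")


# Bytes as Word-style text would be saved in windows-1252: It’s 5€
raw = b"It\x92s 5\x80"

lenient = decode_like_browser(raw, "ISO-8859-1")  # smart quote and Euro sign
strict = raw.decode("latin-1")                    # control characters instead
print(repr(lenient), repr(strict))
```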
Browsers submit content in the same encoding as the page is displayed in. If content is pasted into the browser from an external app, the browser converts it to that character set before submitting. It is important to always send the correct character encoding to the browser, so that it knows how to display the page as well as how to convert pasted text when submitting it.
When sending an AJAX request, IE will always ignore the encoding on the page and send in UTF-8 instead. There is no way to override this. When handling AJAX responses, it will default to treating the content as UTF-8 unless it is explicitly stated otherwise.
Web servers treat URLs themselves and request bodies as ISO-8859-1 by default if no encoding is specified by the browser.
URLs are defined to be ASCII only. If non-ASCII characters are to be sent in a URL, they must be percent-encoded using %XX hex syntax, one escape per byte. A windows-1252 closing smart quote character with code 146 will be sent in an encoded URL as %92 from a normal submit on a page encoded in windows-1252, but will be sent as %E2%80%99 (its UTF-8 bytes) if the URL is constructed via encodeURIComponent and sent via AJAX. The URL itself does not say which encoding these bytes represent, so it is important for the browser to send the correct encoding type in the request headers.
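The two escaped forms of the closing smart quote can be reproduced with Python's `urllib.parse.quote`, which stands in here for the browser's form submit (windows-1252) and for encodeURIComponent (UTF-8):

```python
from urllib.parse import quote

CLOSING_QUOTE = "\u2019"  # single closing smart quote

# What a normal submit from a windows-1252 page puts in the URL.
as_cp1252 = quote(CLOSING_QUOTE, safe="", encoding="cp1252")

# What encodeURIComponent produces: the UTF-8 bytes, one escape per byte.
as_utf8 = quote(CLOSING_QUOTE, safe="", encoding="utf-8")

print(as_cp1252, as_utf8)  # %92 %E2%80%99
```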
If a GET request is received via AJAX, the parameter values will actually be encoded as UTF-8 (when encodeURIComponent is used), but the web server treats the URL as ISO-8859-1, so the encoded characters do not get mapped correctly. The web server needs to be told explicitly to treat incoming URLs as UTF-8. The Content-type header in the request does not apply to the characters in the URL itself.
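The mismatch described above can be simulated as follows. This is only a sketch of the decoding step, not of any particular server's internals:

```python
from urllib.parse import unquote

# Parameter value as sent by encodeURIComponent for a closing smart quote.
sent = "%E2%80%99"

# Server assumes ISO-8859-1: three bytes become three wrong characters.
wrong = unquote(sent, encoding="latin-1")

# Server told to use UTF-8: the three bytes become the one intended character.
right = unquote(sent, encoding="utf-8")

# The damage is reversible as long as the wrong string is still intact:
# turn it back into the original bytes, then decode those as UTF-8.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong), repr(right), repaired == right)
```

Repairing after the fact works, but it is fragile; telling the server the right encoding up front avoids the problem entirely.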
In Tomcat’s server.xml, for example:
<Connector port="8080" URIEncoding="UTF-8"/>
This way, Java will not attempt to incorrectly translate URL parameters from ISO-8859-1 to UTF-8, since they are treated as already being in UTF-8.
You can also make the encoding dynamic by using:
<Connector port="8080" useBodyEncodingForURI="true"/>
This will cause Tomcat to decode URL parameters using the encoding declared for the request body (the charset in the Content-Type header sent by the browser) instead.
It is highly recommended that, when possible, UTF-8 be used in every layer. This avoids transcoding problems and confusion, and it is supported correctly in IE whether you’re using AJAX or not.
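Character sets go by different names in different environments, such as Sybase. Aliases:
ISO-8859-1 = iso_1 (cp850 is a separate DOS code page with different byte assignments above 127, so it is not interchangeable with ISO-8859-1)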
windows-1252 = cp1252
UTF-8 = utf8
Depending on your environment, you may need to use a specific version of the character set name.
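Python's codec registry is one example of such an environment. This sketch resolves a few common spellings to whatever canonical name Python uses internally (the list of spellings is just a sample chosen for illustration):

```python
import codecs

# A sample of spellings seen in HTTP headers, HTML and config files.
spellings = ["ISO-8859-1", "latin-1", "windows-1252", "cp1252", "UTF-8", "utf8"]

# codecs.lookup resolves each spelling to Python's own canonical name.
canonical = {name: codecs.lookup(name).name for name in spellings}

for name, resolved in canonical.items():
    print(f"{name} -> {resolved}")
```

Spellings that resolve to the same canonical name are aliases of one another in Python; other environments keep their own alias lists, so a name accepted in one place may be rejected in another.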
Useful References:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/Windows-1252
http://en.wikipedia.org/wiki/UTF-8
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
http://groups.google.com/group/comp.lang.javascript/browse_frm/thread/fd96823cbcd0c126/ca1d033bf8c8aea2