Character Encoding 101
August 12th, 2008 Amir Posted in Guides, Website tips
127 Characters or less
European characters can also be encoded with a single byte (8 bits) per character. Older versions of Windows used to do that. The mapping between the character code and its graphical representation is called the code page. If your application doesn’t support Unicode characters, it can still be localized, but it would run properly in only one language at a time. That’s because the code page on a PC can only be set to one value at a time.
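As a small illustration of what a code page does, the sketch below decodes the same single byte under two Windows code pages. The choice of byte and of code pages (cp1252 for Western European, cp1255 for Hebrew) is an assumption made for the example.

```python
# One stored byte, two code pages: the byte value never changes,
# only the character it is mapped to.
raw = bytes([0xE9])

as_western = raw.decode("cp1252")  # Windows Western European code page
as_hebrew = raw.decode("cp1255")   # Windows Hebrew code page

# Print the resulting code points rather than the glyphs themselves.
print(f"cp1252: U+{ord(as_western):04X}")  # Latin small letter e with acute
print(f"cp1255: U+{ord(as_hebrew):04X}")   # Hebrew letter yod
```

Because only one code page is active at a time, only one of these two interpretations can be correct for a given run.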
More than 127 characters
Supposing you want your application to speak several languages, you’d need more than 127 different characters. This could be the case if you want a single build to support several languages. It could also happen if you just want to support an Asian language that has more than 127 characters to start with (Japanese, Chinese, Korean, etc).
So, what do you do?
The best alternative would be to use Unicode characters. The remaining question is how to encode Unicode strings in the application.
Just a small reminder: your source files are stored byte by byte. If you’ve got Unicode strings, they must be encoded as bytes. Then, during run-time, they would be decoded back to Unicode.
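The round trip described above can be sketched in a few lines of Python. The sample string and the choice of UTF-8 are assumptions for illustration.

```python
# Unicode text as the program sees it at run time.
text = "Hello, señor, שלום"

# What actually gets stored on disk: a sequence of bytes.
stored = text.encode("utf-8")

# What the program gets back after decoding those bytes.
restored = stored.decode("utf-8")

print(type(stored).__name__, len(stored), "bytes for", len(text), "characters")
print("round trip ok:", restored == text)
```

The byte count is larger than the character count because the non-English characters need more than one byte each in UTF-8.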
Different encodings for Unicode characters
Try this text document with non-English characters. Open it with Notepad (which comes standard in Windows). You’ll see some text in different languages (English, Spanish and Hebrew). Now, click on File->Save as…
The save dialog asks you which encoding to use. The default is UTF-8, which is what this file was saved in. If you try switching it to ANSI and saving, Notepad will warn you that some characters will be lost. That’s because ANSI uses a single-byte code page, and no single code page covers all the characters in this file.
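The same limitation can be reproduced outside Notepad. This sketch checks whether a mixed English, Spanish and Hebrew string fits into a few encodings; the sample string and the helper name `fits` are made up for the example.

```python
text = "English, Español, עברית"

def fits(text: str, encoding: str) -> bool:
    """Return True if every character of text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ASCII, two single-byte Windows code pages, and UTF-8.
for name in ("ascii", "cp1252", "cp1255", "utf-8"):
    print(name, fits(text, name))
```

The Western European code page has no Hebrew letters and the Hebrew code page has no ñ, so only UTF-8 can hold the whole string.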
Notepad can encode the file as ANSI, UTF-8, Big Endian Unicode (Motorola PowerPC) and Little Endian Unicode (Intel Pentium). In Notepad, "Unicode" means UTF-16. Try saving the file using different encodings and open each one in a binary or byte editor. You’ll see the same text encoded in different ways.
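If you don’t have a byte editor at hand, a rough equivalent is to print the bytes directly. The sketch assumes that Notepad’s two "Unicode" options correspond to UTF-16 little endian and big endian, and it leaves out the byte order mark that Notepad writes at the start of the file.

```python
text = "é"  # a single non-English character (U+00E9)

# Encode the same text three ways and keep the raw bytes.
encoded = {enc: text.encode(enc) for enc in ("utf-8", "utf-16-le", "utf-16-be")}

for enc, raw in encoded.items():
    # bytes.hex(" ") needs Python 3.8 or later.
    print(f"{enc:10} {raw.hex(' ')}")
```

The two UTF-16 variants contain the same two bytes in opposite order, which is all that "big endian" and "little endian" mean here.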
UTF-8 stands for Unicode Transformation Format, 8-bit. It uses byte sequences of different lengths (one to four bytes) to encode different characters. The nice thing is that ASCII characters, which include the English alphabet, map to themselves. This means that plain English text is valid UTF-8.
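A quick way to see the variable lengths is to count the bytes per character. The sample characters below are chosen for illustration only.

```python
# One character each from English, Spanish, Hebrew and Japanese.
samples = ["A", "ñ", "ש", "日"]

# Number of bytes each character takes in UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")

# Plain English (ASCII) text produces identical bytes either way.
english = "plain English text"
same_bytes = english.encode("utf-8") == english.encode("ascii")
print("ASCII and UTF-8 bytes identical:", same_bytes)
```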
Bonus round - character encoding and markup languages
Suppose you’ve got some Unicode text, encoded as bytes, that needs to be included in a markup language (like HTML, XHTML or XML). What’s the right order of things?
Here’s the answer:
Unicode text <-> Byte encoding <-> Markup escape <-> Markup
So, if you’re going to place Unicode texts in an HTML file, this is what you’ll do:
- Encode the text as UTF-8.
- Apply the HTML escape (for example, replace < with &lt; and & with &amp;).
- Wrap inside the HTML tags
You’ll also place the encoding meta data in the HTML head section, so that the parser (browser) later knows how to decode it:
<meta http-equiv=content-type content="text/html; charset=UTF-8">
When the browser gets the HTML, it will do the same thing, in reverse:
- Parse the HTML and extract texts
- Undo the HTML escape
- Decode the byte string as UTF-8 and produce Unicode text
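The two lists above can be sketched as a pair of functions. This is only an illustration: the function names and the fixed `<p>` wrapper are hypothetical, and the escaping is done with simple byte replacements rather than a real HTML library.

```python
import html

def to_html_bytes(text: str) -> bytes:
    encoded = text.encode("utf-8")                 # 1. encode as UTF-8
    escaped = (encoded.replace(b"&", b"&amp;")     # 2. HTML escape, & first
                      .replace(b"<", b"&lt;")
                      .replace(b">", b"&gt;"))
    return b"<p>" + escaped + b"</p>"              # 3. wrap inside the tags

def from_html_bytes(markup: bytes) -> str:
    inner = markup[len(b"<p>"):-len(b"</p>")]      # 1. extract the text
    unescaped = (inner.replace(b"&lt;", b"<")      # 2. undo the escape, & last
                      .replace(b"&gt;", b">")
                      .replace(b"&amp;", b"&"))
    return unescaped.decode("utf-8")               # 3. decode as UTF-8

text = "1 < 2 & café שלום"
markup = to_html_bytes(text)
print(markup)
print("round trip ok:", from_html_bytes(markup) == text)

# Python's html.escape works on text rather than bytes. With UTF-8 the
# result is the same, because the markup characters are plain ASCII.
same = markup == b"<p>" + html.escape(text, quote=False).encode("utf-8") + b"</p>"
print("same as escaping first:", same)
```

The ampersand is escaped first and unescaped last, so text that already looks like an entity survives the round trip.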
The same idea applies for other escape techniques. If you’re encoding a URL, URL encoding will be required.
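For URLs, Python’s standard library does the byte encoding and the escape in one call. A minimal sketch, with a made-up query string:

```python
from urllib.parse import quote, unquote

query = "café שלום"

# quote() encodes the text as UTF-8 by default, then percent-escapes the bytes.
encoded = quote(query)

# unquote() reverses both steps.
decoded = unquote(encoded)

print(encoded)
print("round trip ok:", decoded == query)
```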
Let’s talk about it
What’s your experience with different character encoding? Any issues you had to tackle?





August 12th, 2008 at 7:35 am
Hi,
You are right: code page problems are very common.
With software localization - in contrast to web site localization - you cannot always use UNICODE. For example, the popular Delphi programming language does not support UNICODE in its GUI (or you need third-party tools). For anyone who is interested, there is additional info on how to handle that here:
http://www.sisulizer.com/support/codepages.shtml
Btw: .NET does a much better job with UNICODE.
No wonder it is the choice of large enterprises that operate in global markets.
Markus
August 12th, 2008 at 11:08 am
Hi Markus,
Your page about controlling the code page is very helpful.
What do you tell your customers who are using Delphi and want a single build that supports multiple languages?
In Visual Studio, can you control the encoding of characters in resource files, or does the IDE force its own choice?
Thanks,
Amir
August 12th, 2008 at 2:54 pm
Hi Amir,
Delphi users have the help of the VCL. It has pretty good features and with some simple routines it is even possible to switch language at runtime. A good software localization tool comes with samples for that!
Windows resource files should always be UNICODE. This makes it easy to store all languages in them. The conversion happens mainly before the strings are fed into the GUI.
There is the possibility to have a multi-language resource. This type of resource has the downside that Windows chooses the language itself. With later Windows versions this is always the language of the Windows GUI. That makes it hard for developers to test localized software. It means they need to have Windows installed in all languages they want to support! It is better to use many single-language resources and switch them at runtime.
Markus