KG's 11 dimensional retrospection

Keiji's blog about Software Engineering, Computer Science, Astronomy, etc


If you have ever dealt with character codes and encodings (ASCII, Unicode, code pages, UTF, and so on), you know it is a total pain.
Handling character strings is also quite a mess when you work cross-platform in C/C++.

I found a very nice article regarding this subject: UTF-8 everywhere (http://www.utf8everywhere.org/)

Essentially: use UTF-8 as the internal string encoding.

I totally agree with this, and I am going to use it for my cross-platform library.
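
One way to apply this in C++ is to keep every internal string as UTF-8 in a std::string and convert only at the point where a platform API insists on UTF-16. Below is a minimal sketch of such a boundary conversion. The name utf8_to_utf16 is hypothetical (it is not from the article or from my library), and the validation is deliberately basic: it rejects malformed input by throwing instead of substituting a replacement character.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert an internal UTF-8 string to UTF-16
// right before handing it to an API that expects UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point allowed for each sequence length (rejects overlong forms).
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;

        // The lead byte tells us how many bytes this character occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);                      // one code unit
        } else {
            cp -= 0x10000;                                         // surrogate pair
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}
```

The rest of the code never sees anything but UTF-8, so only the thin platform layer needs to know that a different encoding exists.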

My personal reasons:

UTF-16 doesn't solve the variable-length problem. It is still a variable-length encoding: characters outside the Basic Multilingual Plane need 4 bytes (a surrogate pair).
Each platform uses a different standard, and it is a mess: Windows uses UTF-16 with a 2-byte wchar_t; OS X/iOS use UTF-16 (NSString uses unichar) with a 4-byte wchar_t; Android uses UTF-8 with a 4-byte wchar_t (and still requires "codepage"-style switches for NDK/Skia).
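
To make the first point concrete, here is a small sketch that encodes a single Unicode code point both ways. The function names encode_utf8 and encode_utf16 are my own for illustration. It assumes the input is a valid code point (at most 0x10FFFF and not a surrogate) and only checks that with an assert.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one Unicode code point as UTF-16 (1 or 2 16-bit code units).
std::u16string encode_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        // Outside the Basic Multilingual Plane: split into a surrogate pair.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

For U+0041 ('A'), encode_utf16 returns one code unit (2 bytes); for U+1F600 it returns two (0xD83D 0xDE00, 4 bytes), the same byte count encode_utf8 needs for that character. So code that walks UTF-16 has to handle multi-unit characters just like code that walks UTF-8.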

I once developed my own encoding mechanism for my library to deal with these issues, but now simply using UTF-8 makes sense given today's machine power (the overhead is just no longer an issue).
