If you have ever dealt with character codes and encodings (ASCII, Unicode, code pages, UTF, and so on), you know it is a total pain.
Handling character strings is also quite a mess when working cross-platform in C/C++.
I found a very nice article on this subject: UTF-8 Everywhere (http://www.utf8everywhere.org/).
Its essential advice is to use UTF-8 as the internal string encoding.
I totally agree with this, and I am going to use it for my cross-platform library.
My personal reasons:
- UTF-16 doesn't solve the variable-length problem. It is still a variable-length encoding: some characters need 4 bytes (a surrogate pair of two 16-bit units).
- Each platform uses a different standard, and it is a mess: Windows = UTF-16 with a 2-byte wchar_t; OS X/iOS = UTF-16 with a 4-byte wchar_t (NSString uses unichar); Android = UTF-8 with a 4-byte wchar_t (and it still requires "code page"-style switches for the NDK/Skia).
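To make the first point concrete, here is a minimal sketch that counts how many code units a single Unicode code point needs in each encoding. The function names utf8_units and utf16_units are my own, made up for illustration, and the code assumes it is given a valid code point (at most U+10FFFF, and not a surrogate).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of 8-bit code units (bytes) that UTF-8
// needs for the code point `cp`. Assumes `cp` is a valid Unicode
// scalar value (<= U+10FFFF and not in the surrogate range).
int utf8_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Hypothetical helper: number of 16-bit code units that UTF-16 needs
// for the same code point. Anything above U+FFFF takes a surrogate
// pair, i.e. 2 units = 4 bytes.
int utf16_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}
```

For example, U+1F600 (an emoji) takes 4 bytes in UTF-8 and also 4 bytes (two 16-bit units) in UTF-16, so code that assumes "one wchar_t = one character" on a UTF-16 platform is still wrong.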
I once developed my own encoding mechanism for my library to deal with all this. Now, just using UTF-8 makes sense given today's machine power (the overhead is simply no longer an issue).
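The "UTF-8 inside, convert only at the platform boundary" idea can be sketched roughly as follows. This is not code from my library or from the article; utf8_to_utf16 is a hypothetical helper name, and throwing std::invalid_argument on malformed input is just one possible policy (substituting U+FFFD would be another). On Windows you would more likely call the platform's own MultiByteToWideChar at the boundary; this portable version only shows what such a conversion involves.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a UTF-8 std::string (the internal
// representation) to UTF-16 for a platform API that wants 16-bit units.
// Throws std::invalid_argument on malformed UTF-8.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;

        // The lead byte tells us the sequence length and carries the top bits.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary plane: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

With something like this, the rest of the library can pass plain std::string around everywhere, and only the thin layer that talks to a UTF-16 API needs to know about 16-bit units at all.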