We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string type in Windows is UTF-16, and UTF-8 in Linux.
We can't decide which is the best approach:
- Standardise on one of the two in all our application logic (and persistent data), and make the other platform do the appropriate conversions
- Use the natural format of the OS for application logic (and thus for calls into the OS), and convert only at the point of IPC and persistence.
To me they seem like they are both about as good as each other.
and UTF-8 in Linux.
That is mostly true for modern Linux. In practice the encoding depends on which API or library is used. Some are hardcoded to use UTF-8, but others (the Qt library, for example) read the LC_ALL, LC_CTYPE or LANG environment variables to detect which encoding to use. So be careful.
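To make the environment-variable point concrete, here is a minimal sketch of how an encoding could be derived from those variables. The helper names `codeset_of` and `effective_ctype_locale` are made up for illustration, and this is not how Qt or any particular library implements it; it only assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and the usual `language_TERRITORY.codeset@modifier` naming.

```cpp
#include <cstddef>
#include <cstdlib>
#include <initializer_list>
#include <string>

// Hypothetical helper: extracts the codeset part of a POSIX locale name
// such as "en_US.UTF-8" or "de_DE.ISO-8859-1@euro".
// Returns an empty string if the name carries no codeset.
std::string codeset_of(const std::string& locale_name) {
    const std::size_t dot = locale_name.find('.');
    if (dot == std::string::npos) return "";
    const std::size_t at = locale_name.find('@', dot);
    const std::size_t len =
        (at == std::string::npos) ? std::string::npos : at - dot - 1;
    return locale_name.substr(dot + 1, len);
}

// Hypothetical helper: the first non-empty value among LC_ALL, LC_CTYPE
// and LANG, falling back to the "C" locale if none is set.
std::string effective_ctype_locale() {
    for (const char* name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        const char* value = std::getenv(name);
        if (value != nullptr && *value != '\0') return value;
    }
    return "C";
}
```

Calling `codeset_of(effective_ctype_locale())` then gives the encoding the user's environment asks for, or an empty string when the locale name does not specify one.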
We can't decide which is the best approach
As usual, it depends.
If 90% of the code deals with platform-specific APIs in a platform-specific way, it is obviously better to use platform-specific strings. Examples: a device driver or a native iOS application.
If 90% of the code is complex business logic that is shared across platforms, it is obviously better to use the same encoding on all platforms. Examples: a chat client or a browser.
In the second case you have a choice:
- Use a cross-platform library that provides string support (Qt or ICU, for example)
- Use bare pointers (I consider std::string a “bare pointer” too)
If working with strings is a significant part of your application, choosing a good string library is a sound move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no experience with ICU, but it also looks very nice.
When you use a string library, you need to care about encoding only when working with external libraries, calling platform APIs, or sending strings over the network (or writing them to disk). For example, many Cocoa, C# or Qt programmers (all three have solid string support) know very little about encoding details, and that is good, since they can focus on their main task.
My experience in working with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (though one probably needs some experience and a Unicode background to appreciate that).
I agree that the bare-pointers approach is not for everyone. It is good when:
- You work with entire strings, and splitting, searching and comparing are rare tasks
- You can use the same encoding in all components and need a conversion only when calling a platform API
- All your supported platforms have APIs to:
  - Convert from your encoding to the one used by the API
  - Convert from the API's encoding to the one used in your code
- Pointers are not a problem for your team
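The boundary conversion mentioned in the list would normally be done by the platform itself (on Windows, for instance, `MultiByteToWideChar` converts UTF-8 to UTF-16). As a portable illustration of what such a call does, here is a hand-written sketch. The name `utf8_to_utf16` is hypothetical, and the validation is deliberately simplified: malformed or truncated sequences become U+FFFD, but overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Simplified validation only; see the note above the code.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0xFFFD;   // stays U+FFFD if the lead byte is invalid
        std::size_t extra = 0;       // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF) cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

In a real project a thin wrapper around the platform call would live in one place, so the rest of the code never sees anything but the project-wide encoding.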
In my (admittedly a little specific) experience, this is actually a very common case.
When working with bare pointers, it is good to choose one encoding that will be used throughout the entire project (or across all projects).
From my point of view, UTF-8 is the ultimate winner. If you can’t use UTF-8, use a string library or the platform's string API; it will save you a lot of time.
Advantages of UTF-8:
- Fully ASCII compatible. Any ASCII string is a valid UTF-8 string.
- The C standard library works great with UTF-8 strings. (*)
- The C++ standard library works great with UTF-8 (std::string and friends). (*)
- Legacy code works great with UTF-8.
- Almost every platform supports UTF-8.
- Debugging is MUCH easier with UTF-8 (since it is ASCII compatible).
- No Little-Endian/Big-Endian mess.
- You will not hit the classic bug “Oh, UTF-16 is not always 2 bytes?”.
(*) Until you need to compare strings lexically, change case (toUpper/toLower), change the normalization form or do something similar. If you do, use a string library or the platform API.
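One reason std::string copes well with UTF-8 is that every byte of a multi-byte sequence is 0x80 or above, so byte-wise searching for an ASCII character can never match in the middle of a non-ASCII one. A minimal sketch of that property, using a hypothetical `split_utf8` helper (the name and signature are made up for illustration):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits UTF-8 text on an ASCII delimiter using
// plain byte operations. The delimiter must be ASCII (< 0x80); it then
// cannot occur inside a multi-byte UTF-8 sequence.
std::vector<std::string> split_utf8(const std::string& text, char ascii_delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = text.find(ascii_delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));   // last (or only) piece
            break;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

Note that `size()` on each piece still counts bytes, not characters, which is exactly where the footnote's caveats begin.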
The disadvantages are debatable:
- Less compact than UTF-16 for Chinese (and other characters with large code point numbers).
- Slightly harder to iterate over characters.
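To show how little extra work the iteration takes, here is a sketch that counts code points by skipping continuation bytes (those of the form 10xxxxxx). The function name `utf8_code_points` is made up for illustration, and the code assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by
// counting every byte that is NOT a continuation byte (10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

This counts code points, not user-perceived characters: a letter followed by a combining mark is still two code points.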
So, I recommend using UTF-8 as the common encoding for projects that don’t use a string library.
But encoding is not the only question you need to answer.
There is such a thing as normalization. To put it simply, some letters can be represented in several ways: as a single code point or as a combination of several code points. The common problem is that most string comparison functions treat these representations as different strings. If you are working on a cross-platform project, choosing one normalization form as the standard is the right move. It will save you time.
For example, if a user's password contains “йёжиг”, it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly prefers Normalization Form C). So if the user registered under Windows with such a password, logging in from a Mac will be a problem.
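The password example can be made concrete with the first letter alone. The sketch below hardcodes the two UTF-8 byte sequences for “й” and shows that a byte-wise comparison, which is what std::string's `operator==` performs, treats them as different. The constant and function names are made up for illustration; the code only demonstrates the mismatch and does not normalize anything, since that needs a library such as ICU or a platform API.

```cpp
#include <string>

// "й" in Normalization Form C: the single code point U+0439.
const std::string kShortI_NFC = "\xD0\xB9";

// "й" in Normalization Form D: U+0438 followed by the combining
// breve U+0306.
const std::string kShortI_NFD = "\xD0\xB8\xCC\x86";

// Hypothetical helper: byte-wise equality, i.e. what a naive password
// check would do before any normalization is applied.
bool bytes_equal(const std::string& a, const std::string& b) {
    return a == b;
}
```

Normalizing both sides to the same form before comparing or hashing is what makes the two inputs match.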
In addition, I would not recommend using wchar_t (or would use it only in Windows code as a UCS-2/UTF-16 char type). The problem with wchar_t is that there is no encoding associated with it. It’s just an abstract wide char that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).
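The size difference is easy to check on any given compiler. A minimal sketch, with helper names made up for illustration; it contrasts wchar_t with char16_t, whose width is at least 16 bits on every platform:

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: width of wchar_t in bits on the current platform.
// Typically 16 on Windows and 32 on most Unix-like systems.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: width of char16_t in bits. The standard
// guarantees at least 16, regardless of platform.
constexpr std::size_t char16_bits() {
    return sizeof(char16_t) * CHAR_BIT;
}
```

Code that stores UTF-16 in wchar_t therefore works on Windows but silently changes meaning elsewhere, which is why a fixed-width type is the safer choice for portable code.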