Technical Deep Dive: Unicode Literals

Now that JUCE officially supports Unicode 🎉, you may find yourself needing to declare string literals containing Unicode characters. This article dives down the rabbit hole, exploring what that means for a cross-platform JUCE project. We'll look at some of the challenges and the options available to you.

Definitions

Before getting into the details, here are some definitions used throughout this article:

  • Character: A minimal unit of text that has semantic value.
  • Character set: A collection of characters used to represent text. For example, the Latin alphabet and Greek alphabet are both character sets.
  • Coded character set / Code page / Encoding: A character set mapped to a set of unique numbers (code points). Many sources will confusingly shorten this to "character set".
  • Code point: A value or position of a character in a coded character set.
  • Code unit: The minimum bit combination that can represent a character in a character set. In some encodings, some characters are encoded using multiple code units.
  • Variable-width encoding: An encoding whose code points consist of a variable number of code units.
  • Fixed-width encoding: An encoding whose code points always consist of the same number of code units.
  • ASCII: A standard consisting of a single coded character set of 128 characters. Characters include upper-case and lower-case versions of the Latin alphabet, numbers, symbols, and various non-printable control characters (used for things like a new line or a null terminator). ASCII is a fixed-width encoding in which every character in the set can be represented using just 7 bits. However, most commonly each character is stored using a single-byte (8-bit) code unit, meaning in ASCII the most significant bit (bit 7) is always 0.
  • Unicode: A standard consisting of a character set and three different encodings (UTF-8, UTF-16, and UTF-32). At the time of writing the latest version of Unicode (15.1) contains 149,813 characters. More characters are frequently added in new versions of the standard, with the ability to expand to more than 1.1 million characters. The first 128 characters in the Unicode character set exactly match those in the ASCII character set.
  • UTF-32: A fixed width encoding for the Unicode character set. Each code point is represented using a single 32-bit code unit.
  • UTF-16: A variable width encoding for the Unicode character set. Each code point is represented using 1 or 2 16-bit code units.
  • UTF-8: A variable width encoding for the Unicode character set. Each code point is represented using 1, 2, 3, or 4 8-bit code units. Any ASCII encoded text is also valid UTF-8 encoded text. However, not all UTF-8 encoded text is valid ASCII encoded text. UTF-8 is the most popular Unicode encoding, according to W3Tech at the time of writing "UTF-8 is used by 98.2% of all the websites whose character encoding we know."
  • Byte-order mark (BOM): An optional predefined set of bytes at the beginning of a file. It can indicate that a file is using a Unicode encoding, which encoding it's using, and, in the case of UTF-16 and UTF-32, whether the file is saved big- or little-endian. For UTF-8 the required bytes are 0xEF, 0xBB, 0xBF.
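The code-unit widths in the definitions above can be checked directly in C++. This sketch (assuming a C++17 or later compiler) uses the euro sign U+20AC, which needs one UTF-32 code unit, one UTF-16 code unit, but three UTF-8 code units:

```cpp
// Each literal includes a null terminator, hence the "+ 1" in the comments.
static_assert (sizeof (U"\u20ac") == 2 * sizeof (char32_t)); // 1 code unit + 1
static_assert (sizeof (u"\u20ac") == 2 * sizeof (char16_t)); // 1 code unit + 1
static_assert (sizeof (u8"\u20ac") == 4);                    // 3 code units + 1
```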

The naive approach

Imagine that you want to include an emoji in a string that will be displayed in your JUCE application. Although it's tempting to include the emoji directly in a normal string literal, this approach will not work as expected:

juce::String greeting ("Hello JUCE 8 😊"); // bad, don't do this

It compiles successfully, but when this is run in a Debug build you'll be confronted with the following runtime assertion from the String constructor:

/*  If you get an assertion here, then you're trying to create a string from 8-bit data
    that contains values greater than 127. These can NOT be correctly converted to unicode
    because there's no way for the String class to know what encoding was used to
    create them. The source data could be UTF-8, ASCII or one of many local code-pages.
    To get around this problem, you must be more explicit when you pass an ambiguous 8-bit
    string to the String class - so for example if your source data is actually UTF-8,
    you'd call String (CharPointer_UTF8 ("my utf8 string..")), and it would be able to
    correctly convert the multi-byte characters to unicode. It's *highly* recommended that
    you use UTF-8 with escape characters in your source code to represent extended characters,
    because there's no other way to represent these strings in a way that isn't dependent on
    the compiler, source code editor and platform.
    Note that the Projucer has a handy string literal generator utility that will convert
    any unicode string to a valid C++ string literal, creating ascii escape sequences that will
    work in any compiler.
*/
jassert (t == nullptr || CharPointer_ASCII::isValidString (t, std::numeric_limits<int>::max()));

As the comment states, the String constructor expects only ASCII characters in its argument, but it encountered one or more non-ASCII characters. As explained above, ASCII characters are encoded using the least significant 7 bits of a series of 8-bit code units. Since no ASCII code unit ever has its most significant bit set, we can confidently say the given text is NOT ASCII encoded. Unfortunately, once a non-ASCII character is encountered, it's not always possible to know for certain which encoding was intended.
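The validity check behind that assertion can be approximated in a few lines: a buffer can only be valid ASCII if every byte has its most significant bit clear. This is a simplified sketch of the idea, not JUCE's actual implementation:

```cpp
// Returns true only if every byte in the null-terminated buffer has its
// most significant bit (bit 7) clear, i.e. the text could be ASCII.
bool looksLikeAscii (const char* text)
{
    for (; *text != '\0'; ++text)
        if ((static_cast<unsigned char> (*text) & 0x80) != 0)
            return false; // a set high bit means some non-ASCII encoding

    return true;
}
```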

For this reason we need some way of signaling to the String class what encoding the text is in. After reading the comment above, the next attempt might be:

juce::String greeting (CharPointer_UTF8 ("Hello JUCE 8 😊"));

This prevents the runtime assertion, but unfortunately there's still an issue that hasn't been addressed.

To understand the problem, consider the encoding used for the string literal "Hello JUCE 8 😊". When the C++ code is compiled, what series of bytes will be produced in the compiled object file to represent this string?

All we can say for sure is that a string literal like this will have the type const char*, but the standard doesn't explicitly state what encoding should be used for a raw string literal like this.

The u8 literal

Unicode string literals were introduced in C++11. By adding the u8 prefix to our literal we can indicate to the compiler that the string should be UTF-8 encoded in the resulting binary.

juce::String greeting (CharPointer_UTF8 (u8"Hello JUCE 8 😊"));

If you are using C++17 (the minimum version required by JUCE 8) this will compile and run without any assertions. However, if you're following along with C++20 or greater you may have noticed this fails to compile.

In C++20 the type of a UTF-8 string literal changed from const char* to const char8_t*. char8_t was introduced specifically for storing UTF-8 encoded strings. char and char8_t are distinct types, and there is no implicit conversion between char* and char8_t*. While the encoding of a char* string is implementation-defined, a char8_t* string can be assumed to have UTF-8 encoding. JUCE 8 has added specific support for this new type. There are two ways to construct a juce::String from a const char8_t*:

auto greeting = juce::String::fromUTF8 (u8"Hello JUCE 8 😊"); // C++17, literal has type const char*
juce::String greeting (u8"Hello JUCE 8 😊");                  // C++20 onwards, literal has type const char8_t*
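If you need a single spelling that compiles under both standards, a small overload set is one option. This helper is an illustrative sketch, not part of the JUCE API:

```cpp
#include <string>

// Accepts a u8 literal whether it has type const char* (C++17)
// or const char8_t* (C++20 onwards).
inline std::string fromU8 (const char* text) { return text; }

#if defined (__cpp_char8_t)
inline std::string fromU8 (const char8_t* text)
{
    // char8_t has the same size and representation as char,
    // so this cast is well-defined.
    return reinterpret_cast<const char*> (text);
}
#endif
```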

Now we have a UTF-8 encoded string literal, and the String class knows the text is UTF-8 encoded, but there's still a remaining problem.
To illustrate, consider this code:

int main()
{
    constexpr auto str = u8"ABC 😊";
    static_assert (str[0] == (decltype (*str)) 0x41); // A
    static_assert (str[1] == (decltype (*str)) 0x42); // B
    static_assert (str[2] == (decltype (*str)) 0x43); // C
    static_assert (str[3] == (decltype (*str)) 0x20); // [space]
    static_assert (str[4] == (decltype (*str)) 0xf0); // [smiley-face] byte-1
    static_assert (str[5] == (decltype (*str)) 0x9f); // [smiley-face] byte-2
    static_assert (str[6] == (decltype (*str)) 0x98); // [smiley-face] byte-3
    static_assert (str[7] == (decltype (*str)) 0x8a); // [smiley-face] byte-4
    static_assert (str[8] == (decltype (*str)) 0x00); // [null-terminator]
    return 0;
}

As static_assert is evaluated during compilation, we can run this test on Compiler Explorer for GCC, Clang, and MSVC, which will save us some time switching between different platforms. Here are some results:

x86-64 clang 18.1.0

ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 0

https://godbolt.org/z/vdzYEdh5M

x86-64 gcc 13.2

ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 0

https://godbolt.org/z/M893bh43n

x64 msvc v19.38

<source>(8): error C2607: static assertion failed
<source>(9): error C2607: static assertion failed
<source>(10): error C2607: static assertion failed
<source>(11): error C2607: static assertion failed
<source>(12): error C2607: static assertion failed
Compiler returned: 2

https://godbolt.org/z/ETxGe7nxc

The results are the same, whether using C++17 or C++20. Clearly MSVC has not encoded the string the way we might have expected.

It turns out that Unicode string literals only refer to what the standard calls the "execution character set". That is the coded character set (or encoding) that will be used to store the string in the binary. However, that isn't the only encoding we need to consider. The standard also refers to a "source character set", which is the encoding the compiler uses while parsing a given source file.

Source file encoding

Below are some extracts from the documentation for the compilers supported by JUCE, regarding source-file encoding:

Clang (and by extension AppleClang)

/source-charset:Source encoding, supports only UTF-8

GCC

Use of any encoding other than plain ASCII or UTF-8, except in comments, will cause errors.

MSVC

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes that the source file is encoded in the current user code page, unless you use the /source-charset or /utf-8 option to specify a character set name or code page. Visual Studio allows you to save your C++ source code in any of several character encodings. For more information about source and execution character sets, see Character sets in the language documentation.

Now we can see why our test file worked flawlessly on both clang and GCC: they used the same UTF-8 encoding for the source character set as for the execution character set. Even if we remove the u8 string literal prefix, GCC and clang would produce the same results, as they both assume UTF-8 as the default for both the source and execution character sets. However, MSVC has a more involved set of rules to determine the source character set.

BOMs

As suggested in the MSVC docs, we could add a BOM to indicate the encoding. If we do this, MSVC will indeed compile the file correctly and all our tests pass.

Unfortunately, many text editors will remove the BOM when the file is saved, while others might trip up reading it. As the BOM is not a printable character, changes to it are not always immediately apparent in a diff. For these reasons the use of a BOM is generally not recommended, especially if you're working as part of a team.
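Detecting a UTF-8 BOM in a buffer is straightforward, which can be handy when loading text files of unknown origin. A minimal sketch:

```cpp
#include <cstddef>

// Returns true if the buffer starts with the UTF-8 byte-order mark
// 0xEF, 0xBB, 0xBF.
bool hasUtf8Bom (const unsigned char* data, std::size_t size)
{
    return size >= 3
        && data[0] == 0xEF
        && data[1] == 0xBB
        && data[2] == 0xBF;
}
```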

How to handle MSVC

Thankfully the MSVC docs suggest two alternative options to handle this issue.

  1. Add the compiler flag /source-charset:utf-8.
  2. Add the compiler flag /utf-8.

Not only is the second option more concise, but it also sets both the source and execution character sets to UTF-8. This ensures MSVC mirrors the default behaviour of both GCC and clang.
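If your project uses CMake, one way to apply the flag only when compiling with MSVC is a generator expression (the target name MyPlugin here is just a placeholder):

```cmake
# Set both the source and execution character sets to UTF-8 on MSVC.
# GCC and Clang already default to UTF-8, so no flag is needed there.
target_compile_options (MyPlugin PRIVATE
    $<$<CXX_COMPILER_ID:MSVC>:/utf-8>)
```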

Now we know how to enforce consistent source and execution character sets on all platforms. However, there are some more points to consider:

  • When deploying a library, especially a header-only library, it may not be possible to control the compiler flags used to compile your code.
  • Apart from the source character set used by the compiler, text editors used to display the source file may assume a different encoding. For example, in my experience, Visual Studio 2022 appears to open and display files as UTF-8 regardless of the compiler settings. It can also behave differently again when hovering over string literals. Other editors may have their own quirks to consider too.
  • When copying and pasting text it may be necessary to consider the encoding in which the original text was created. As we've seen, there is rarely any guarantee as to which encoding was used.

Universal character escape sequences

Universal character escape sequences are a portable alternative that avoids relying on compiler flags to force a source encoding.

It's possible to represent any non-ASCII Unicode character using an escape sequence which itself consists of only ASCII characters. A universal character escape sequence takes the form \uNNNN (4 digits) or \UNNNNNNNN (8 digits), where N represents a single hexadecimal digit, for example:

auto greeting = juce::String::fromUTF8 (u8"Hello JUCE 8 \U0001f60a");

To understand how this works, consider the following code, which we'll again test in Compiler Explorer.

int main()
{
    const auto a = u8"Hello JUCE 8 😊";
    const auto b =   "Hello JUCE 8 \U0001f60a";
    const auto c = u8"Hello JUCE 8 \U0001f60a";
    return 0;
}

Here we can see that when we pass the /utf-8 compiler flag to MSVC, all three strings are encoded identically in the binary:

$SG2779 DB        'Hello JUCE 8 ', 0f0H, 09fH, 098H, 08aH, 00H
        ORG $+6
$SG2780 DB        'Hello JUCE 8 ', 0f0H, 09fH, 098H, 08aH, 00H
        ORG $+6
$SG2781 DB        'Hello JUCE 8 ', 0f0H, 09fH, 098H, 08aH, 00H

https://godbolt.org/z/orMnjEsEn

In contrast, when we remove the /utf-8 compiler flag we end up with three different byte sequences:

$SG2779 DB        'Hello JUCE 8 ', 0c3H, 0b0H, 0c5H, 0b8H, 0cbH, 09cH, 0c5H, 0a0H, 00H
        ORG $+6
$SG2781 DB        'Hello JUCE 8  ?? ?? ', 00H
        ORG $+2
$SG2780 DB        'Hello JUCE 8 ', 0f0H, 09fH, 098H, 08aH, 00H

https://godbolt.org/z/rqdq9nxs6

  • As we already know from previous tests, the first string does not end up containing the expected bytes, because the source file encoding is not UTF-8.
  • The second string also produces unexpected results. Although the universal character escape sequence avoids any source-encoding ambiguity, the compiler then proceeds to store the string using a different (non-UTF-8) encoding in the binary.
  • The third string correctly preserves both the source encoding and the same UTF-8 encoding in the binary.

Running the same experiment with GCC and clang reveals that all three strings are always correctly preserved. This is because, by default, they both read the string as UTF-8 from the source and store it as UTF-8 in the binary.

Numeric escape sequences

Similar to universal character escape sequences, numeric escape sequences are another alternative that allows us to specify string literals in our code in a cross-platform way. They can be specified using either hexadecimal values in the form \xNN (2 digits) or octal values in the form \NNN (1 to 3 digits), where N represents a single digit. The advantage is that they allow you to specify the exact byte sequence to store in the binary, for example:

auto greeting = juce::String::fromUTF8 ("Hello JUCE 8 \xf0\x9f\x98\x8a");

Now consider the following code, which we'll again test in Compiler Explorer.

int main()
{
    const auto a = u8"Hello JUCE 8 \U0001f60a";
    const auto b = u8"Hello JUCE 8 \xf0\x9f\x98\x8a";
    const auto c =   "Hello JUCE 8 \xf0\x9f\x98\x8a";
    return 0;
}

Given that a numeric escape sequence allows us to specify the precise byte sequence to store in the binary, you would be forgiven for thinking all three of these examples would be equal. Unfortunately due to a known issue in MSVC, when a numeric escape sequence is embedded inside a UTF-8 string literal it is incorrectly encoded (this happens irrespective of the /utf-8 flag). Even once this bug is fixed, it will be a good idea to avoid writing code like this so that your code behaves as expected when built with older compiler versions.

MSVC

$SG2779 DB        'Hello JUCE 8 ', 0f0H, 09fH, 098H, 08aH, 00H
        ORG $+6
$SG2780 DB        'Hello JUCE 8 ', 0c3H, 0b0H, 0c2H, 09fH, 0c2H, 098H, 0c2H, 08aH, 00H
        ORG $+2
$SG2781 DB        'Hello JUCE 8 ', 0f0H, 09fH, 098H, 08aH, 00H

https://godbolt.org/z/hvsdvhWvn

Once again GCC and clang treat all three strings exactly the same.

Generating numeric escape sequences

There is a helper tool embedded in the Projucer that can be used to convert a Unicode string into a portable C++ string literal that contains any necessary numeric escape sequences.

  • Open the Projucer.
  • In the menubar select Tools > UTF-8 String-Literal Helper.
  • In the top half of the window paste your UTF-8 string.
  • In the bottom half of the screen select and copy the text into your source file.
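To see what such a tool does under the hood, here's a sketch that encodes a single code point as UTF-8 and formats the bytes as \xNN escapes. This is an illustration of the algorithm, not the Projucer's actual implementation:

```cpp
#include <cstdio>
#include <string>

// Encodes one Unicode code point as UTF-8 and returns the bytes
// formatted as \xNN numeric escape sequences.
std::string toUtf8Escapes (char32_t codePoint)
{
    unsigned char bytes[4] = {};
    int numBytes = 0;

    if (codePoint < 0x80) // 1 byte: plain ASCII
    {
        bytes[0] = (unsigned char) codePoint;
        numBytes = 1;
    }
    else if (codePoint < 0x800) // 2 bytes
    {
        bytes[0] = (unsigned char) (0xC0 | (codePoint >> 6));
        bytes[1] = (unsigned char) (0x80 | (codePoint & 0x3F));
        numBytes = 2;
    }
    else if (codePoint < 0x10000) // 3 bytes
    {
        bytes[0] = (unsigned char) (0xE0 | (codePoint >> 12));
        bytes[1] = (unsigned char) (0x80 | ((codePoint >> 6) & 0x3F));
        bytes[2] = (unsigned char) (0x80 | (codePoint & 0x3F));
        numBytes = 3;
    }
    else // 4 bytes
    {
        bytes[0] = (unsigned char) (0xF0 | (codePoint >> 18));
        bytes[1] = (unsigned char) (0x80 | ((codePoint >> 12) & 0x3F));
        bytes[2] = (unsigned char) (0x80 | ((codePoint >> 6) & 0x3F));
        bytes[3] = (unsigned char) (0x80 | (codePoint & 0x3F));
        numBytes = 4;
    }

    std::string result;

    for (int i = 0; i < numBytes; ++i)
    {
        char escaped[8];
        std::snprintf (escaped, sizeof (escaped), "\\x%02x", bytes[i]);
        result += escaped;
    }

    return result;
}
```

For example, toUtf8Escapes (0x1f60a) produces "\xf0\x9f\x98\x8a", the same byte sequence seen in the assembly listings above.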

Binary resources

There's one last option that we haven't discussed. JUCE makes it very easy to embed any file as a binary resource. By using a binary resource you can save and load a file using whichever encoding you prefer.

For example here I've created a file named strings.json with the following contents:

{
    "greeting": "Hello JUCE 8 😊",
    "cats": "🐱😸😹😻😼"
}

To add a binary resource to a Projucer project:

  1. Select the plus symbol in the File Explorer panel.
  2. Select "Add Existing Files…" from the popup menu and navigate to the file you want to add as a resource in the file chooser dialog (the file will automatically be added as a binary resource if it's not a source file).
  3. Save and reopen your IDE project.

Alternatively, if you're using CMake see the juce_add_binary_data function for details on how to add a binary resource.

Once the binary resource has been added to the project, you can load it like so (in this case we don't need to pass the BinaryData size as the file is properly null terminated).

juce::var strings = juce::JSON::parse (juce::String::fromUTF8 (BinaryData::strings_json));

Now, any UTF-8 string can easily be loaded, but please only use ASCII characters for your keys!

strings["greeting"];
strings["cats"];

As the file will also be added to your IDE project, it's easy to read and edit strings directly in your project.

You may be wondering about the encoding of the JSON file. The JSON format is specified by RFC 7159, and it has this to say regarding "Character Encoding":

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Due to this note in the specification, most editors should load and save JSON as UTF-8.

Summary

There are many potential pitfalls when embedding Unicode string literals, but JUCE provides utilities to make it easier to embed Unicode strings in a portable, maintainable way.

Most importantly, it's a good idea to use the Projucer's string-literal helper to generate portable Unicode strings that you can include in your source code, or load Unicode strings from an embedded binary resource.

Remember that when specifying string literals in a source file, there are at least two encodings to consider: The encoding of the string in the source file (source character set), and the encoding of the string in the binary (execution character set). On top of this, you should also consider how your IDE, tools, and editors (and your team's tools and editors) behave too!

Let's go through a quick recap of what we've learned.

// Use at your own risk!!
// If you're setting the MSVC /utf-8 compiler flag this will *probably* work
// in most cases.
auto greeting1 = juce::String::fromUTF8 ("Hello JUCE 8 😊");
// For C++20 or greater this is a more concise version of the same thing
juce::String greeting2 (u8"Hello JUCE 8 😊");
// Universal character escape sequences are better, but remember to use a UTF-8
// string literal to prevent source encoding issues
auto greeting3 = juce::String::fromUTF8 (u8"Hello JUCE 8 \U0001f60a");
// The safest option however is numeric escape sequences
auto greeting4 = juce::String::fromUTF8 ("Hello JUCE 8 \xf0\x9f\x98\x8a");
// This is potentially safer still: in C++20 and greater it won't compile if
// it's ever accidentally embedded in a UTF-8 string literal.
// This code is also easily generated using the Projucer, reducing the risk of
// an encoding issue when converting to the numeric escape sequence.
juce::String greeting5 (CharPointer_UTF8 ("Hello JUCE 8 \xf0\x9f\x98\x8a"));
// If you want safety without universal character names or escape sequences
// consider trying a JSON file embedded as a binary resource
juce::var strings (juce::JSON::parse (juce::String::fromUTF8 (BinaryData::strings_json)));
auto greeting6 = strings["greeting"];

I hope you've found this article useful. If you have any comments to add please continue the discussion on the JUCE forums.
