Unicode is harder than you think

Reading the excellent article by JeanHeyd Meneide on how broken string encoding in C/C++ is made me realise that Unicode is a topic that is often overlooked by a large number of developers. In my experience, there’s a lot of confusion and wrong expectations on what Unicode is, and what best practices to follow when dealing with strings that may contain characters outside of the ASCII range.

This article attempts to briefly summarise and clarify some of the most common misconceptions I’ve seen people struggle with, and some of the pitfalls that tend to recur in codebases that have to deal with non-ASCII text.

The convenience of ASCII

Text is usually represented and stored as a sequence of numerical values in binary form. Whatever its source, it must be decoded from that binary representation, as specified by a given character encoding, before it can be shown in a form the user can understand.

One such example is ASCII, the US-centric standard which has been for decades the de facto way to represent characters and symbols in C and UNIX. ASCII is a 7-bit encoding, which means that it can represent up to 128 different characters. The first 32 characters, plus DEL at the very end, are control characters, which are not printable; the remaining 95 are printable characters, which include the space, the 26 letters of the English alphabet (in both cases), the 10 digits, and a few symbols:

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex  
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

This table defines a two-way transformation, in jargon a charset, which maps a certain sequence of bits (representing a number) to a given character, and vice versa. This can be easily seen by dumping some text as binary:

$ echo -n Cat! | xxd
00000000: 4361 7421                                Cat!

The hexadecimal column in the middle (right after the 00000000: offset) shows the binary representation of the input string “Cat!”. Each character is mapped to a single byte (shown here as two hexadecimal digits):

  • 43 is the hexadecimal representation of the ASCII character C;
  • 61 is the hexadecimal representation of the ASCII character a;
  • 74 is the hexadecimal representation of the ASCII character t;
  • 21 is the hexadecimal representation of the ASCII character !.
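The same two-way mapping can be checked from Python, whose built-in ord and chr functions convert between a character and its number. This is only a minimal sketch; the variable names are made up for illustration:

```python
text = "Cat!"

# character -> number: encode the string as ASCII and show each byte in hex
codes = [f"{byte:02x}" for byte in text.encode("ascii")]
print(codes)  # ['43', '61', '74', '21']

# number -> character: the table works in the other direction too
decoded = "".join(chr(int(code, 16)) for code in codes)
print(decoded)  # Cat!
```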

This simple set of characters was for decades considered more than enough by most of the English-speaking world, which was where the vast majority of early computer users and pioneers came from.

An added benefit of ASCII is that it is a fixed-width encoding: each character is always represented univocally by the same number of bits, that in turn always represent the same number.

This leads to some very convenient ergonomics when handling strings in C:

#include <ctype.h>
#include <stdio.h>

int main(const int argc, const char *argv[const]) {
    // converts all arguments to uppercase
    for (const char *const *arg = argv + 1; *arg; ++arg) {
        // iterate over each character in the string, and print its uppercase
        for (const char *it = *arg; *it; ++it) {
            putchar(toupper(*it));
        }

        if (*(arg + 1)) {
            putchar(' ');
        }
    }

    if (argc > 1) {
        putchar('\n');
    }

    return 0;
}

The example above assumes, like a large amount of code written in the last few decades, that the C basic type char represents a byte-sized ASCII character. This assumption minimises the mental and runtime overhead of handling text, as strings can be treated as arrays of characters belonging to a very minimal set. Because of this, ASCII strings can be iterated on, addressed individually and transformed or inspected using simple, cheap operations such as isalpha or toupper.

The world outside

However, as computers started to spread worldwide it became clear that it was necessary to devise character sets capable of representing all the characters required in a given locale. For instance, Spanish needs the letter ñ, Japanese needs the ¥ symbol and support for Kana and Kanji, and so on.

All of this led to a massive proliferation of different character encodings, usually tied to a given language, area or locale. These ranged from single-byte encodings, which either extended ASCII by using its unused eighth bit (like ISO-8859-1) or completely replaced its character set (like KOI-7), to multi-byte encodings for Asian languages with thousands of characters like Shift-JIS and Big5.

This turned into a huge headache for both developers and users, as it was necessary to know (or deduce via hacky heuristics) which encoding was used for a given piece of text, for instance when receiving a file from the Internet, which was becoming more and more common thanks to email, IRC and the World Wide Web.

Most crucially, multibyte encodings (a necessity for Asian characters) meant that the assumption “one char = one byte” didn’t hold anymore, with the small side effect of breaking all code in existence at the time.

For a while, the most common solution was to use a single encoding for each language, and then hope for the best. This often led to garbled text (who hasn’t seen the infamous � character at least once?), so much so that a specific term was coined to describe it: “mojibake”, from the Japanese “文字化け” (“character transformation”).
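Mojibake is easy to reproduce on purpose. The sketch below assumes a short Russian word as sample text and uses Python's built-in koi8_r and latin-1 codecs to decode bytes with the wrong encoding; the variable names are illustrative only:

```python
original = "Привет"  # "hello" in Russian

# the sender stores the text as KOI8-R bytes (one byte per letter)...
raw = original.encode("koi8_r")

# ...but the receiver wrongly assumes ISO-8859-1 (Latin-1)
garbled = raw.decode("latin-1")
print(garbled)  # accented Latin letters instead of Cyrillic

# decoding with the encoding that was actually used recovers the text
recovered = raw.decode("koi8_r")
print(recovered == original)  # True
```

Nothing in the bytes themselves says which interpretation is the right one, which is why the receiver had to know or guess the encoding.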

KOI8-R text mistakenly written on an envelope as ISO-8859-1 text

In general, for a long time using a non-English locale meant that you had to contend with broken third (often first) party software, patchy support for certain characters, and switching encodings on the fly depending on the context. The inconvenience was such that it was common for non-Latin Internet users to converse in their native languages with the Latin alphabet, using impromptu transliterations if necessary. A prime example of this was the Arabic chat alphabet widespread among Arabic-speaking netizens in the ’90s and ’00s 1.

Unicode

It was clear to most people back then that the situation was untenable, so much so that as early as the late ’80s people started proposing a universal character encoding capable of covering all modern scripts and symbols in use.

This led to the creation of Unicode, whose first version was standardised in 1991 after a few years of joint development led by Xerox and Apple (among others). Unicode’s main design goal was, and still is, to define a universal character set capable of representing all the aforementioned characters, alongside a character encoding capable of uniformly representing them all.

In Unicode, every character, or more properly code point, is represented by a unique number, belonging to a specific Unicode block. Crucially, the first block of Unicode (“Basic Latin”) corresponds point per point to ASCII, so that all ASCII characters correspond to equivalent Unicode codepoints.

Code points are usually represented with the syntax U+XXXX, where XXXX is the hexadecimal representation of the code point. For instance, the code point for the A character is U+0041, while the code point for the ñ character is U+00F1.
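As a quick illustration, the U+XXXX notation can be produced from any character with Python's ord. The helper name code_point_label is hypothetical and only exists for this sketch:

```python
def code_point_label(char: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(char):04X}"

print(code_point_label("A"))           # U+0041
print(code_point_label("\u00f1"))      # U+00F1 (ñ)
print(code_point_label("\U0001F600"))  # U+1F600 (😀)
```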

Unicode 1.0 covered 26 scripts and 7,161 characters, covering most of the world’s languages and lots of commonplace symbols and glyphs.

UCS-2, or “how Unicode made everything worse”

Alongside the first Unicode specification, which defined the character set, two2 new character encodings, called UCS-2 and UCS-4 (which came a bit later), were also introduced. UCS-2 was the original Unicode encoding, and it’s an extension of ASCII to 16 bits, representing what Unicode called the Basic Multilingual Plane (“BMP”); UCS-4 is the same but with 32-bit values. Both were fixed-width encodings, using multiple bytes to represent each single character in a string.

In particular, UCS-2’s maximum range of 65,536 possible values was good enough to cover the entire Unicode 1.0 set of characters. The storage savings compared with UCS-4 were quite enticing, also - while ’90s machines weren’t as constrained as the ones that came before, representing basic Latin characters with 4 bytes was still seen as an egregious waste.3

Thus, 16 bits quickly became the standard size for the wchar_t type recently added by the C89 standard to support wide characters for encodings like Shift-JIS. Sure, switching from char to wchar_t required developers to rewrite all code to use wide characters and wide functions, but a bit of sed was a small price to pay for the ability to resolve internationalization, right?

The C library had also introduced, alongside the new wide char type, a set of functions and types to handle wchar_t and wide strings, plus (poorly designed) functions for locale support, including support for multibyte encodings. Some vendors, like Microsoft, even devised tricks to make it possible to optionally switch from legacy 8-bit codepages to UCS-2 by using ad-hoc types like TCHAR and LPTSTR in place of specific character types.

All of that said, the code snippet above could be rewritten on Win32 as the following:

#include <ctype.h>
#include <tchar.h>

#if !defined(_UNICODE) && !defined(UNICODE)
#   include <stdio.h>
#endif

int _tmain(const int argc, const TCHAR *argv[const]) {
    // converts all arguments to uppercase
    for (const TCHAR *const *arg = argv + 1; *arg; ++arg) {
        // iterate over each character in the string, and print its uppercase
        for (const TCHAR *it = *arg; *it; ++it) {
            _puttchar(_totupper(*it));
        }

        if (*(arg + 1)) {
            _puttchar(_T(' '));
        }
    }

    _puttchar(_T('\n'));
    return 0;
}

Neat, right? This was indeed considered so convenient that developers jumped on the UCS-2 bandwagon in droves, finally glad the encoding mess was over.

16-bit Unicode was indeed a huge success, as attested by the number of applications and libraries that adopted it during the ’90s:

  • Windows NT, 2000 and XP used UCS-2 as their internal character encoding, and exposed it to developers via the Win32 API;
  • Apple’s Cocoa, too, used UCS-2 as its internal character encoding for NSString and unichar;
  • Sun’s Java used UCS-2 as its internal character encoding for all strings, even going as far as to define its String type as an array of 16-bit characters;
  • JavaScript, too, didn’t want to be left behind, and basically defined its String type the same way Java did;
  • Qt, the popular C++ GUI framework, used UCS-2 as its internal character encoding, and exposed it to developers via the QString class;
  • Unreal Engine just copied the WinAPI approach and used UCS-2 as its internal character encoding 4

and many more. Every once in a while, I still find out that some piece of code I frequently use is still using UCS-2 (or UTF-16, see later) internally. In general, every time you read something along the lines of “Unicode support” without any reference to UTF, there’s an almost 100% chance that it actually means “UCS-2”, or some borked variant of it.

Combining characters

Since its first release, Unicode has supported combining characters: code points meant to be attached to a preceding base character, so that text processing tools treat the resulting sequence (later formalised as a grapheme cluster) as a single unit.

In Unicode jargon, these are called composite sequences and were designed to allow Unicode to represent scripts like Arabic, which uses a lot of diacritics and other combining characters, without having to define a separate code point for each possible combination.

This could have been in principle a neat idea - grapheme clusters allow Unicode to save a massive amount of code points from being pointlessly wasted for easily combinable characters (just think about South Asian languages or Hangul). The real issue was that the Consortium, anxious to help with the transition to Unicode, did not want to drop support for dedicated codepoints for “preassembled” characters such as è and ñ, which were historically supported by the various extended ASCII encodings.

This led to Unicode supporting precomposed characters, which are codepoints that stand for a glyph that can also be represented using a grapheme cluster. Examples of this are the Extended Latin characters with accents or diacritics, which can all be represented either by combining the base Latin character with the corresponding modifier, or by using a single code point.

For instance, let’s try testing out a few things with Python’s unicodedata and two seemingly identical strings, “caña” and “caña” (notice how they look the same):

>>> import unicodedata
>>> a, b = "caña", "caña"
>>> a == b
False

Uh?

>>> a, b
('caña', 'caña')
>>> len(a), len(b)
(4, 5)

The two strings are visually identical (they are rendered the same by our Unicode-enabled terminal) and yet they do not evaluate as equal, and the len() function returns different lengths. This is because the ñ in the second string is a grapheme cluster composed of the U+006E LATIN SMALL LETTER N and U+0303 COMBINING TILDE characters, combined by the terminal into a single glyph.

>>> list(a), list(b)
(['c', 'a', 'ñ', 'a'], ['c', 'a', 'n', '̃', 'a'])
>>> [unicodedata.name(c) for c in a]
['LATIN SMALL LETTER C', 'LATIN SMALL LETTER A', 'LATIN SMALL LETTER N WITH TILDE', 'LATIN SMALL LETTER A']
>>> [unicodedata.name(c) for c in b]
['LATIN SMALL LETTER C', 'LATIN SMALL LETTER A', 'LATIN SMALL LETTER N', 'COMBINING TILDE', 'LATIN SMALL LETTER A']

This is obviously a big departure from the “strings are just arrays of characters” model the average developer is used to:

  1. Trivial comparisons like a == b or strcmp(a, b) are no longer trivial. A Unicode-aware algorithm must be implemented in order to actually compare the strings as they are rendered or printed;
  2. Random access to characters is no longer safe, because a single glyph can span multiple code points, and thus multiple array elements.
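To make the first point concrete, here is one possible Unicode-aware comparison, sketched with Python's unicodedata.normalize (normalization is covered in more detail later in this article). The helper name same_text is hypothetical, and picking NFC is an assumption of this sketch rather than the only valid choice:

```python
import unicodedata

def same_text(left: str, right: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

precomposed = "ca\u00f1a"  # ñ as a single code point
decomposed = "can\u0303a"  # n followed by COMBINING TILDE

print(precomposed == decomposed)           # False
print(same_text(precomposed, decomposed))  # True
```

Writing the two strings with explicit escapes avoids depending on how an editor or terminal happens to store the ñ.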

“640k 16 bits ought to be enough for everyone”

Anyone with any degree of familiarity with Asian languages will have noticed that 7,161 characters are way too small a number to include the tens of thousands of Chinese characters in existence. This is without counting minor and historical scripts, and the thousands of symbols and glyphs used in mathematics, music, and other fields.

In the years following 1991, the Unicode character set was thus expanded with tens of thousands of new characters, and it quickly became apparent that UCS-2 was soon going to run out of 16-bit code points.5

To circumvent this issue, the Unicode Consortium decided to expand the character set from 16 to 21 bits. This was a huge breaking change that basically meant obsoleting UCS-2 (and thus breaking most software designed in the ’90s), just a few years after its introduction and widespread adoption.

While UCS-2 was still capable of representing anything inside the BMP, it became clear a new encoding was needed to support the growing set of characters in the UCS.

UTF

The acronym “UTF” stands for “Unicode Transformation Format”, and represents a family of (mostly variable-width) encodings capable of representing the whole Unicode character set, up to its maximum of 1,114,112 code points (U+0000 to U+10FFFF, which fit in 21 bits). Compared to UCS, UTF encodings specify how a given stream of bytes can be converted into a sequence of Unicode code points, and vice versa (i.e., “transformed”).

Compared to a fixed-width encoding like UCS-2, a variable-width character encoding can employ a variable number of code units to encode each character. This bypasses the “one code unit per character” limitation of fixed-width encodings, and allows the representation of a much larger number of characters—potentially, an infinite number, depending on how many “lead units” are reserved as markers for multi-unit sequences.

Excluding the dead-on-arrival UTF-1, there are 4 UTF encodings in use today:

  • UTF-8, a variable-width encoding that uses 1-byte code units
  • UTF-16, a variable-width encoding that uses 2-byte code units
  • UTF-32, a fixed-width encoding that uses 4-byte code units
  • UTF-EBCDIC, a variable-width encoding that uses 1-byte code units, designed for IBM’s EBCDIC systems (note: I think it’s safe to argue that using EBCDIC in 2023 edges very close to being a felony)
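The difference between these encodings is easiest to see by counting how many code units each one needs for the same character. The following sketch uses Python's built-in codecs; the helper name code_units is made up for this example, and the explicit little-endian codec names are used only to keep a byte order mark out of the count:

```python
def code_units(char: str) -> tuple:
    """Return how many UTF-8, UTF-16 and UTF-32 code units encode `char`."""
    utf8 = len(char.encode("utf-8"))            # 1-byte code units
    utf16 = len(char.encode("utf-16-le")) // 2  # 2-byte code units
    utf32 = len(char.encode("utf-32-le")) // 4  # 4-byte code units
    return utf8, utf16, utf32

for char in ("a", "\u00f1", "\u20ac", "\U0001F600"):  # a, ñ, €, 😀
    print(f"U+{ord(char):04X}", code_units(char))
# U+0041 (1, 1, 1)
# U+00F1 (2, 1, 1)
# U+20AC (3, 1, 1)
# U+1F600 (4, 2, 1)
```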

UTF-16

To salvage the considerable investments made to support UCS-2, the Unicode Consortium created UTF-16 as a backward-compatible extension of UCS-2. When some piece of software advertises “support for UNICODE”, it almost always means that it supported UCS-2 and switched to UTF-16 some time later. 6

Like UCS-2, UTF-16 can represent the entirety of the BMP using a single 16-bit value. Every codepoint above U+FFFF is represented using a pair of 16-bit values, called a surrogate pair. The first value (the “high surrogate”) is always a value in the range U+D800 to U+DBFF, while the second value (the “low surrogate”) is always a value in the range U+DC00 to U+DFFF.

This, in practice, means that the range reserved for BMP characters never overlaps with surrogates, making it trivial to distinguish between a single 16-bit codepoint and a surrogate pair, which makes UTF-16 self-synchronizing over 16-bit values.

Emojis are an example of characters that lie outside of the BMP; as such, they are always represented using surrogate pairs. For instance, the character U+1F600 (😀) is represented in UTF-16 by the surrogate pair [0xD83D, 0xDE00]:

>>> # pack the surrogate pair into bytes by hand, and then decode it as UTF-16
>>> bys = [b for cp in (0xD83D, 0xDE00) for b in list(cp.to_bytes(2,'little'))]
>>> bys
[61, 216, 0, 222]
>>> bytes(bys).decode('utf-16le')
'😀'
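The pair itself can be derived with a little arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits between the two surrogates. A minimal sketch follows (the function name to_surrogate_pair is hypothetical):

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000    # a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

With ten payload bits in each surrogate, the pairs cover exactly 2²⁰ code points above the BMP, which is why Unicode stops at U+10FFFF.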

The BOM

Notice that in the example above I had to specify an endianness for the bytes (little-endian in this case) by writing "utf-16le" instead of just "utf-16". This is due to the fact that UTF-16 is actually two different (incompatible) encodings, UTF-16LE and UTF-16BE, which differ in the byte order of each 16-bit code unit. 7

The standard calls for UTF-16 streams to start with a Byte Order Mark (BOM), represented by the special codepoint U+FEFF. Reading 0xFEFF indicates that the endianness of a text block is the same as the endianness of the decoding system; reading those bytes flipped, as 0xFFFE, indicates opposite endianness instead.

As an example, let’s assume a big-endian system has generated the sequence [0xFE, 0xFF, 0x00, 0x61].
Every system, LE or BE, will read the first two bytes as a single 16-bit code unit, according to its own endianness. Then:

  • A big-endian system will decode U+FEFF, which is the BOM, and thus will assume the text is in UTF-16 in its same byte endianness (BE);
  • A little-endian system will instead read 0xFFFE, which is the BOM with its bytes flipped, so it will assume the text is in the opposite endianness (BE in the case of an LE system).

In both cases, the BOM allows the following character to be correctly parsed as U+0061 (a.k.a. a).

If no BOM is detected, then most decoders will do as they please (despite the standard recommending to assume UTF-16BE), which most of the time means assuming the endianness of the system:

>>> import sys
>>> sys.byteorder
'little'
>>> # BOM read as 0xFEFF and system is LE -> will assume UTF-16LE
>>> bytes([0xFF, 0xFE, 0x61, 0x00, 0x62, 0x00, 0x63, 0x00]).decode('utf-16') 
'abc'
>>> # BOM read as 0xFFFE and system is LE -> will assume UTF-16BE
>>> bytes([0xFE, 0xFF, 0x00, 0x61, 0x00, 0x62, 0x00, 0x63]).decode('utf-16')
'abc'
>>> # no BOM, text is BE and system is LE -> will assume UTF-16LE and read garbage
>>> bytes([0x00, 0x61, 0x00, 0x62, 0x00, 0x63]).decode('utf-16')
'愀戀挀'
>>> # no BOM, text is BE and UTF-16BE is explicitly specified -> will read the text correctly
>>> bytes([0x00, 0x61, 0x00, 0x62, 0x00, 0x63]).decode('utf-16be')
'abc'

Some decoders may probe the first few codepoints for zeroes to detect the endianness of the stream, which is in general not an amazing idea. As a rule of thumb, UTF-16 text should never rely on automated endianness detection, and thus either always start with a BOM or assume a fixed endianness value (which in the vast majority of cases is UTF-16LE, which is what Windows does).
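That rule of thumb could be sketched as follows: honour a BOM when one is present, and otherwise fall back to an endianness the caller states explicitly instead of guessing. The function name decode_utf16 and its fallback parameter are assumptions made up for this example; codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE are constants from Python's standard library:

```python
import codecs

def decode_utf16(data: bytes, fallback: str = "utf-16-le") -> str:
    """Decode UTF-16 bytes, honouring a BOM if present, else using `fallback`."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    # no BOM: never probe the content, use the agreed-upon default
    return data.decode(fallback)

print(decode_utf16(bytes([0xFE, 0xFF, 0x00, 0x61])))             # a
print(decode_utf16(bytes([0xFF, 0xFE, 0x61, 0x00])))             # a
print(decode_utf16(bytes([0x00, 0x61]), fallback="utf-16-be"))   # a
```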

UTF-32

Just as UTF-16 is an extension of UCS-2, UTF-32 is an evolution of UCS-4. Compared to all other UTF encodings, UTF-32 is by far the simplest, because like its predecessor, it is a fixed-width encoding.

The major difference between UCS-4 and UTF-32 is that the latter has been limited to 21 bits, down from its maximum of 31 bits (UCS-4 was signed). This has been done to maintain compatibility with UTF-16, which is constrained by its design to only represent codepoints up to U+10FFFF.

While UTF-32 seems convenient at first, it is not in practice all that useful, for quite a few reasons:

  1. UTF-32 is outrageously wasteful because all characters, including those belonging to the ASCII plane, are represented using 4 bytes. Given that the vast majority of text uses ASCII characters for markup, content or both, UTF-32 encoded text tends to be mostly comprised of just a few significant bytes scattered in between a sea of zeroes:

     >>> # UTF-32BE encoded text with BOM
     >>> bytes([0x00, 0x00, 0xFE, 0xFF, 0x00, 0x00, 0x00, 0x61, 0x00, 0x00, 0x00, 0x62, 0x00, 0x00, 0x00, 0x63]).decode('utf-32')
     'abc'
     >>> # The same, but in UTF-16BE
     >>> bytes([0xFE, 0xFF, 0x00, 0x61, 0x00, 0x62, 0x00, 0x63]).decode('utf-16')
     'abc'
     >>> # The same, but in ASCII
     >>> bytes([0x61, 0x62, 0x63]).decode('ascii')
     'abc'
    
  2. No major OS or software uses UTF-32 as its internal encoding, as far as I’m aware. While locales in modern UNIX systems usually define wchar_t as representing UTF-32 codepoints, wide characters are seldom used there, because most software in existence assumes that wchar_t is 16 bits wide.

    On Linux, for instance:

     #include <locale.h>
     #include <stdio.h>
     #include <wchar.h>
    
     int main(void) {
         // one of the bajillion ways to set a Unicode locale - we'll talk UTF-8 later
         setlocale(LC_ALL, "en_US.UTF-8"); 
         const wchar_t s[] = L"abc";
    
         printf("sizeof(wchar_t) == %zu\n", sizeof *s); // 4
         printf("wcslen(s) == %zu\n", wcslen(s)); // 3
         printf("bytes in s == %zu\n", sizeof s); // 16 (12 + 4, due to the null terminator)
    
         return 0;    
     }
    
  3. The fact UTF-32 is a fixed-width encoding is only marginally useful, due to grapheme clusters still being a thing. This means that the equivalence between codepoints and rendered glyphs is still not 1:1, just like in UCS-4:

     // GNU/Linux, x86_64
    
     #include <locale.h>
     #include <stdio.h>
     #include <wchar.h>
    
     int main(void) {
         setlocale(LC_ALL, "en_US.UTF-8");
    
         // "caña", with 'ñ' written as the grapheme cluster "n" + "combining tilde"
         const wchar_t string[] = L"can\u0303a";
    
         wprintf(L"`%ls`\n", string); // prints "caña" as 4 glyphs
         wprintf(L"`%ls` is %zu codepoints long\n", string, wcslen(string)); // 5 codepoints
         wprintf(L"`%ls` is %zu bytes long\n", string, sizeof string); // 24 bytes (5 UCS-4 codepoints + null)
    
         // this other string is the same as the previous one, but with the precomposed "ñ" character
         const wchar_t probe[] = L"ca\u00F1a";
    
         const _Bool different = wcscmp(string, probe);
    
         // this will always print "different", because the two strings are not the same despite looking identical
         wprintf(L"`%ls` and `%ls` are %s\n", string, probe, different ? "different" : "equal");
    
         return 0;
     }
    
     $ cc -o widestr_test widestr_test.c -std=c11
     $ ./widestr_test
     `caña`
     `caña` is 5 codepoints long
     `caña` is 24 bytes long
     `caña` and `caña` are different
    

    This is by far the biggest letdown about UTF-32: it is not the ultimate “extended ASCII” encoding most people wished for, because it is still incorrect to iterate over it character by character, and it still requires normalization (see below) in order to be safely operated on.

UTF-8

I left UTF-8 for last because it is by far the best among the crop of Unicode encodings 8. UTF-8 is a variable-width encoding, just like UTF-16, but with the crucial advantage that UTF-8 uses byte-sized (8-bit) code units, just like ASCII.

This is a major advantage, for a series of reasons:

  1. All ASCII text is valid UTF-8, and ASCII itself is simply UTF-8 limited to the codepoints between U+0000 and U+007F.
    • This also implies that UTF-8 can encode ASCII text with one byte per character, even when mixed up with non-Latin characters;
    • Editors, terminals and other software can just support UTF-8 without having to support a separate ASCII mode;
  2. UTF-8 doesn’t require bothering with endianness, because bytes are just that - bytes. This means that UTF-8 does not require a BOM, even though poorly designed software may still add one (see below);

  3. UTF-8 doesn’t need a wide char type, like wchar_t or char16_t. Old APIs can use classic byte-sized chars, and simply pass through any byte above 0x7F without interpreting it.

The following is an arguably poorly designed C program that parses a basic key-value file format defined as follows:

key1:value1
key2:value2
key\:3:value3

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_SIZE 1024

int main(const int argc, const char* const argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return EXIT_FAILURE;
    }

    FILE* const file = fopen(argv[1], "r");

    if (!file) {
        fprintf(stderr, "error: could not open file `%s`\n", argv[1]);
        return EXIT_FAILURE;
    }

    int retval = EXIT_FAILURE; // assume failure until the whole file has been parsed

    char* line = malloc(BUFFER_SIZE);
    if (!line) {
        fprintf(stderr, "error: could not allocate memory\n");

        goto end;
    }

    size_t line_size = BUFFER_SIZE;
    ptrdiff_t key_offs = -1, pos = 0;
    _Bool escape = 0;

    for (;;) {
        const int c = fgetc(file);

        switch (c) {
        case EOF:
            // reaching the end of the input is the only successful exit path
            retval = ferror(file) ? EXIT_FAILURE : EXIT_SUCCESS;
            goto end;

        case '\\':
            if (!escape) {
                escape = 1;
                continue;
            }

            break;

        case ':':
            if (!escape) {
                if (key_offs >= 0) {
                    fprintf(stderr, "error: extra `:` at position %td\n", pos);

                    goto end;
                }

                // remember where the key ends; the `:` itself is not stored
                key_offs = pos;

                continue;
            }

            break;

        case '\n':
            if (escape) {
                break;
            }

            if (key_offs < 0) {
                fprintf(stderr, "error: missing `:`\n");

                goto end;
            }

            printf("key: `%.*s`, value: `%.*s`\n", (int)key_offs, line, (int)(pos - key_offs), line + key_offs);

            key_offs = -1;
            pos = 0;

            continue;
        }

        if ((size_t) pos >= line_size) {
            // grow through a temporary, so the old buffer is still freed on failure
            const size_t new_size = line_size * 3 / 2;
            char* const grown = realloc(line, new_size);

            if (!grown) {
                fprintf(stderr, "error: could not allocate memory\n");

                goto end;
            }

            line = grown;
            line_size = new_size;
        }

        line[pos++] = (char) c;
        escape = 0;
    }

end:
    free(line);
    fclose(file);

    return retval;
}

$ cc -o kv kv.c -std=c11
$ cat kv_test.txt
key1:value1
key2:value2
key\:3:value3
$ ./kv kv_test.txt
key: `key1`, value: `value1`
key: `key2`, value: `value2`
key: `key:3`, value: `value3`

This program operates on files char by char (or rather, int by int—that’s a long story), using whatever the “native” 8-bit (“narrow”) execution character set is to match for basic ASCII characters such as :, \ and \n.

The beauty of UTF-8 is that code that splits, searches, or synchronises using ASCII symbols9 will work fine as-is, with little to no modification, even with Unicode text.

Standard C character literals will still be valid Unicode codepoints, as long as the encoding of the source file is UTF-8. In the file above, ':' and other ASCII literals will fit in a char (int, really) as long as they are encoded as ASCII (: is U+003A).

Like UTF-16, UTF-8 is self-synchronizing: the splitting logic above will never match in the middle of a multi-byte UTF-8 sequence, because the byte values 0x00 to 0x7F are reserved for the ASCII codepoints U+0000 to U+007F and never appear inside the encoding of any other codepoint. The text can then be handed back to a UTF-8 compliant system as it is, and the Unicode text will be correctly rendered.

$ cat kv_test_utf8.txt
tcp:127.0.0.1
Affet, affet:Yalvarıyorum
Why? 😒:blåbær
Spla\:too:3u33
$ ./kv kv_test_utf8.txt
key: `tcp`, value: `127.0.0.1`
key: `Affet, affet`, value: `Yalvarıyorum`
key: `Why? 😒`, value: `blåbær`
key: `Spla:too`, value: `3u33`
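The reason this works can be seen by looking at the top bits of each byte: in UTF-8 every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII delimiter. The sketch below illustrates this; the classify helper is hypothetical and deliberately ignores the handful of byte values that never occur in valid UTF-8:

```python
def classify(byte: int) -> str:
    """Name the role of a single byte inside a UTF-8 stream."""
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: never starts a character
    return "lead"              # 11xxxxxx: starts a multi-byte sequence

emoji = "\U0001F612".encode("utf-8")  # 😒
print([classify(b) for b in emoji])
# ['lead', 'continuation', 'continuation', 'continuation']

# splitting raw bytes on an ASCII delimiter cannot cut a character in half
line = "Why? \U0001F612:bl\u00e5b\u00e6r".encode("utf-8")
key, value = line.split(b":", 1)
print(key.decode("utf-8"), "|", value.decode("utf-8"))
```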

Unicode Normalization

As I previously mentioned, Unicode codepoints can be modified using combining characters, and the standard supports precomposed forms of some characters which have decomposed forms. The resulting glyphs are visually indistinguishable after being rendered, and there’s no limitation on using both forms alongside each other in the same bit of text:

>>> import unicodedata
>>> s = 'Störfälle'
>>> len(s)
10
>>> [unicodedata.name(c) for c in s]
['LATIN CAPITAL LETTER S', 'LATIN SMALL LETTER T', 'LATIN SMALL LETTER O WITH DIAERESIS', 'LATIN SMALL LETTER R', 'LATIN SMALL LETTER F', 'LATIN SMALL LETTER A', 'COMBINING DIAERESIS', 'LATIN SMALL LETTER L', 'LATIN SMALL LETTER L', 'LATIN SMALL LETTER E']
>>> # getting the last 4 characters actually picks the last 3 glyphs, plus a combining character
>>> # sometimes the combining character may be mistakenly rendered over the `'` Python prints around the string
>>> s[-4:]
'̈lle'
>>> [unicodedata.name(c) for c in s[-4:]]
['COMBINING DIAERESIS', 'LATIN SMALL LETTER L', 'LATIN SMALL LETTER L', 'LATIN SMALL LETTER E']

This is a significant issue, given how character-centric our understanding of text is: users (and by extension, developers) expect to be able to count what they see as “letters”, in a way that is consistent with how they are printed, shown on screen or inputted in a text field.

Another headache is the fact Unicode also may define special forms for the same letter or group of letters, which are visibly different but understood by humans to be derived from the same symbol.

A very common example of this is the ﬁ (U+FB01), ﬂ (U+FB02), ﬀ (U+FB00) and ﬃ (U+FB03) ligatures, which are ubiquitous in Latin text as a “more readable” form of the fi, fl, ff and ffi digraphs. In general, users expect office, oﬃce and oﬀice to be treated and rendered similarly, because they all represent the same word, even though they are not necessarily visually identical. 10

Canonical and Compatibility Equivalence

To solve this issue, Unicode defines two different types of equivalence between codepoints (or sequences thereof):

  • Canonical equivalence, when two combinations of one or more codepoints represent the same “abstract” character, like in the case of “ñ” and “n + combining tilde”;

  • Compatibility equivalence, when two combinations of one or more codepoints more or less represent the same “abstract” character, while being rendered differently or having different semantics, like in the case of the “ﬁ” ligature (U+FB01), or mathematical signs such as “Mathematical Bold Capital A” (𝐀).

Canonical equivalence is generally considered a stronger form of equivalence than compatibility equivalence: it is critical for text processing tools to be able to treat canonically equivalent characters as the same, otherwise users may be unable to search, edit or operate on text properly. 11 On the other hand, users are aware of compatibility-equivalent characters due to their different semantic and visual features, so their equivalence becomes relevant only in specific circumstances (like textual search, for instance, or when the user tries to copy “fancy” characters from Word to a text box that only accepts plain text).
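A minimal sketch of how the two kinds of equivalence can be checked in Python, assuming that comparing the fully decomposed forms (NFD and NFKD, described in the next section) is an acceptable test. The two function names are hypothetical helpers, not part of any library:

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# precomposed "ñ" vs "n" + COMBINING TILDE
print(canonically_equivalent("\u00f1", "n\u0303"))    # True

# LATIN SMALL LIGATURE FI vs plain "fi"
print(canonically_equivalent("\ufb01", "fi"))         # False
print(compatibility_equivalent("\ufb01", "fi"))       # True

# MATHEMATICAL BOLD CAPITAL A vs plain "A"
print(compatibility_equivalent("\U0001d400", "A"))    # True
```

Since every canonical decomposition is also applied by NFKD, strings that are canonically equivalent are always compatibility-equivalent too, but not the other way around.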

Normalization Forms

Unicode defines four distinct normalization forms, which are specific forms a Unicode text can be in, and which allow for safe comparisons between strings. The standard describes how text can be transformed into any form, following a specific normalization algorithm based on per-character mappings.

The four normalization forms are:

  • NFD, or Normalization Form D, which applies a single canonical decomposition to all characters of a string. In general, this can be assumed to mean that every character that has a canonically-equivalent decomposed form is replaced by it, with all of its modifiers sorted into a canonical order.

    For instance,

      >>> "e\u0302\u0323"
      'ệ'
      >>> [unicodedata.name(c) for c in "e\u0302\u0323"]
      ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT', 'COMBINING DOT BELOW']
      >>> normalized = unicodedata.normalize('NFD', "e\u0302\u0323")
      >>> normalized
      'ệ'
      >>> [unicodedata.name(c) for c in normalized]
      ['LATIN SMALL LETTER E', 'COMBINING DOT BELOW', 'COMBINING CIRCUMFLEX ACCENT']
    

    Notice how the circumflex and the dot below were in a noncanonical order and were swapped by the normalization algorithm.

  • NFC, or Normalization Form C, which first applies a canonical decomposition, followed by a canonical composition. In NFC, all characters are composed into a precomposed character, if possible:

      >>> precomposed = unicodedata.normalize('NFC', "e\u0302\u0323")
      >>> precomposed
      'ệ'
      >>> [unicodedata.name(c) for c in precomposed]
      ['LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW']
    

    Notice that normalizing to NFC is not enough to “count” glyphs, given that some may not be representable with a single codepoint. An example of this is ẹ̄, which has no associated precomposed character:

      >>> [unicodedata.name(c) for c in unicodedata.normalize('NFC', "ẹ̄")]
      ['LATIN SMALL LETTER E WITH DOT BELOW', 'COMBINING MACRON']
    

    A particularly nice property of NFC is that all ASCII text is by definition already in NFC, which means that compilers and other tools do not necessarily have to bother with normalization when dealing with source code or scripts. 12

  • NFKD, or Normalization Form KD, which applies a compatibility decomposition to all characters of a string. Alongside canonical equivalence, Unicode also defines compatibility-equivalent decompositions for certain characters, like the previously mentioned ﬁ ligature, which is decomposed into f and i.

      >>> fi = "\uFB01" # the `ﬁ` ligature
      >>> unicodedata.name(fi)
      'LATIN SMALL LIGATURE FI'
      >>> unicodedata.name(unicodedata.normalize('NFD', fi)) # doesn't do anything, `ﬁ` has no canonical decomposition
      'LATIN SMALL LIGATURE FI'
      >>> decomposed = unicodedata.normalize('NFKD', fi)
      >>> decomposed
      'fi'
      >>> [unicodedata.name(c) for c in decomposed]
      ['LATIN SMALL LETTER F', 'LATIN SMALL LETTER I']
    

    Characters that don’t have a compatibility decomposition are canonically decomposed instead:

      >>> "\u1EC7"
      'ệ'
      >>> [unicodedata.name(c) for c in unicodedata.normalize('NFKD', "\u1EC7")]
      ['LATIN SMALL LETTER E', 'COMBINING DOT BELOW', 'COMBINING CIRCUMFLEX ACCENT']
    
  • NFKC, or Normalization Form KC, which first applies a compatibility decomposition, followed by a canonical composition. In NFKC, all characters are composed into a precomposed character, if possible:

      >>> precomposed = unicodedata.normalize('NFKC', "\uFB01") # this is U+FB01, "LATIN SMALL LIGATURE FI"
      >>> precomposed
      'fi'
      >>> [unicodedata.name(c) for c in precomposed]
      ['LATIN SMALL LETTER F', 'LATIN SMALL LETTER I']
    

    Notice how the composition performed is canonical: there’s no such thing as “compatibility composition” as far as my understanding goes. This means that NFKC never recombines characters into compatibility-equivalent forms, which are thus permanently lost:

      >>> s = "Sou\uFB00l\u0065\u0301" # notice the `ﬀ` ligature (U+FB00)
      >>> s
      'Souﬀlé'
      >>> norm = unicodedata.normalize('NFKC', s) 
      >>> norm
      'Soufflé'
      >>> # the ligature is gone, but the accent is still there
    

All in all, normalization is a fairly complex topic, and it’s especially tricky to implement right due to the sheer amount of special cases, so it’s always best to rely on libraries in order to get it right.
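To put the four forms side by side, here is a small sketch that normalizes the same two-character string (the U+FB01 ligature followed by the precomposed U+1EC7) and compares the resulting lengths. The sample string is my own choice, and `unicodedata.is_normalized` assumes Python 3.8 or newer:

```python
import unicodedata

# LATIN SMALL LIGATURE FI + LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
sample = "\ufb01\u1ec7"

lengths = {
    form: len(unicodedata.normalize(form, sample))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, length in lengths.items():
    print(f"{form}: {length} codepoints")
# NFC: 2 codepoints   (nothing to do, both characters are already composed)
# NFD: 4 codepoints   (only the accented letter is decomposed)
# NFKC: 3 codepoints  (the ligature is split, the accented letter is recomposed)
# NFKD: 5 codepoints  (everything is decomposed)

# Plain ASCII text is already in NFC
print(unicodedata.is_normalized("NFC", "plain ASCII text"))  # True
```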

Unicode in the wild: caveats

Unicode is really the only relevant character set in existence, with UTF-8 holding the status of “best encoding”.

Unfortunately, internationalization support introduces a great deal of complexity into text handling, something that developers are often unaware of:

  1. First and foremost, there’s still a massive amount of software that doesn’t default to (or outright does not support) UTF-8, because it was either designed to work with legacy 8-bit encodings (like ISO-8859-1) or because it was designed in the ’90s to use UCS-2 and it’s permanently stuck with it or with faux “UTF-16”. Software libraries and frameworks like Qt, Java, Unreal Engine and the Win32 API are constantly converting text from UTF-8 (which is the sole Internet standard) to their internal UTF-16 representation. This is a massive waste of CPU cycles, which while more abundant than in the past, are still a finite resource.

     // Linux x86_64, Qt 6.5.1. Encoding is `en_US.UTF-8`.
     #include <iostream>
    
     #include <QCoreApplication>
     #include <QDebug>
    
     int main(int argc, char *argv[]) {
         QCoreApplication app{argc, argv};
    
         // converts UTF-8 (the source file's encoding) to the internal QString representation
         const QString s{"caña"}; 
    
         // prints `"caña"`, using Qt's debugging facilities. This will convert back to UTF-8 in order
         // to print the string to the console
         qDebug() << s;
    
         // prints `caña`, using C++'s IOStreams. This will force Qt to convert the string to
         // a UTF-8 encoded std::string, which will then be printed to the console
         std::cout << s.toStdString() << '\n';
    
         return 0;
     }
    
  2. Case insensitivity in Unicode is a massive headache. First and foremost, the concept itself of “ignoring case” is deeply European-centric due to it being chiefly limited to bicameral scripts such as Latin, Cyrillic or Greek. What is considered the opposite case of a letter may vary as well, depending on the system’s locale:

     public class Up {
         public static void main(final String[] args) {
             final var uc = "CIAO";
             final var lc = "ciao";
    
             System.out.println(uc.toLowerCase());
             System.out.println(lc.toUpperCase());
    
             System.out.printf("uc(\"%s\") == \"%s\": %b\n", lc, uc, lc.toUpperCase().equals(uc));
         }
     }
    
     $ echo $LANG
     en_US.UTF-8
     $ java Up
     ciao
     CIAO
     uc("ciao") == "CIAO": true
    

    This seems to work fine until the runtime locale is switched to Turkish:

     $ env LANG='tr_TR.UTF-8' java Up
     cıao
     CİAO
     uc("ciao") == "CIAO": false
    

    In Turkish, the uppercase of i is İ, and the lowercase of I is ı, which breaks the ASCII-centric assumption the Java13 snippet above is built on. There is a multitude of such examples of “naive” implementations of case insensitivity in Unicode that inevitably end up being incorrect under unforeseen circumstances.

    Taking all edge cases related to Unicode case folding into account is a lot of work, especially since it’s very hard to properly test all possible locales. This is the reason why Unicode handling is always best left to a library. For C/C++ and Java, the Unicode Consortium itself provides a reference implementation of the Unicode algorithms, called ICU, which is used by a large number of frameworks and shipped by almost every major OS.

    While it is quite tricky to get right at times, and more UTF-16 centric than I’d like, using ICU is still way saner than any self-written alternative:

     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
    
     #include <unicode/ucasemap.h>
     #include <unicode/utypes.h>
    
     int main(const int argc, const char *const argv[]) {
         // Support custom locales
         const char* const locale = argc > 1 ? argv[1] : "en_US";
    
         UErrorCode status = U_ZERO_ERROR;
    
         // Create a UCaseMap object for locale-aware case mapping
         UCaseMap* const caseMap = ucasemap_open(locale, 0, &status);
         if (U_FAILURE(status)) {
             printf("Error creating UCaseMap: %s\n", u_errorName(status));
             return EXIT_FAILURE;
         }
    
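         // Lowercase the input string according to the rules of the chosen locale
         const char input[] = "CIAO";
         char lc[100];
         // pass the length without the trailing NUL, so that it's not counted in the output length
         const int32_t lcLength = ucasemap_utf8ToLower(caseMap, lc, sizeof lc, input, (int32_t) strlen(input), &status);
    
         if (U_FAILURE(status)) {
             printf("Error lowercasing the string: %s\n", u_errorName(status));
             ucasemap_close(caseMap);
             return EXIT_FAILURE;
         }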
    
         // Print the lower case string
         printf("lc(\"%s\") == %.*s\n", input, lcLength, lc);
    
         // Clean up resources
         ucasemap_close(caseMap);
    
         return EXIT_SUCCESS;
     }
    
     $ cc -o casefold casefold.c -std=c11 $(icu-config --ldflags)
     $ ./casefold
     lc("CIAO") == ciao
     $ ./casefold tr_TR
     lc("CIAO") == cıao
    

    Unicode generalises “case insensitivity” into the broader concept of character folding, which boils down to a set of rules that define how characters can be transformed into other characters, in order to make them comparable.

  3. Similarly to folding, sorting text in a well-defined order (for instance alphabetical), an operation better known as collation, is also not trivial with Unicode.

    Different languages (and thus locales) may have different sorting rules, even within the Latin script.

    If, perchance, someone wanted to sort the list of words [ "tuck", "löwe", "luck", "zebra"]:

    • In German, ‘Ö’ is placed between ‘O’ and ‘P’, and the rest of the alphabet follows the same order as in English. The correct sorting for that list is thus [ "löwe", "luck", "tuck", "zebra"];
    • In Estonian, ‘Z’ is placed between ‘S’ and ‘T’, and ‘Ö’ is the penultimate letter of the alphabet. The list is then sorted as [ "luck", "löwe", "zebra", "tuck"];
    • In Swedish, ‘Ö’ is the last letter of the alphabet, with the classical Latin letters in their usual order. The list is thus [ "luck", "löwe", "tuck", "zebra"].

    Unicode defines a complex set of rules for collation and provides a reference implementation in ICU through the ucol API (and its respective C++ and Java equivalents).

     #define _GNU_SOURCE // for qsort_r
    
     #include <stddef.h> // for ptrdiff_t
     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
    
     #include <unicode/ustring.h>
     #include <unicode/ucol.h>
     #include <unicode/uloc.h>
    
     int strcmp_helper(const void *const a, const void *const b, void *const ctx) {
         const char *const str1 = *(const char**) a, *const str2 = *(const char**) b;
    
         UErrorCode status = U_ZERO_ERROR;
            
         const UCollationResult cres = ucol_strcollUTF8(ctx, str1, strlen(str1), str2, strlen(str2), &status);
    
         return (cres == UCOL_GREATER) - (cres == UCOL_LESS);
     }
    
     void sort_strings(UCollator *const collator, const char **const strings, const ptrdiff_t n) {
         qsort_r(strings, n, sizeof *strings, strcmp_helper, collator);
     }
    
     int main(const int argc, const char *argv[]) {
         // Support custom locales
         const char* locale = getenv("ICU_LOCALE");
    
         if (!locale) {
             locale = "en_US";
         }
    
         UErrorCode status = U_ZERO_ERROR;
            
         // Create a UCollator object for the requested locale
         UCollator *const coll = ucol_open(locale, &status);
         if (U_FAILURE(status)) {
             printf("Error creating UCollator: %s\n", u_errorName(status));
             return EXIT_FAILURE;
         }
            
         sort_strings(coll, ++argv, argc - 1);
    
         // Clean up resources
         ucol_close(coll);
            
         while (*argv) {
             puts(*argv++);
         }
        
         return EXIT_SUCCESS;
     }
    
     $ env ICU_LOCALE=de_DE ./coll "tuck" "löwe" "luck" "zebra" # German
     löwe
     luck
     tuck
     zebra
     $ env ICU_LOCALE=et_EE ./coll "tuck" "löwe" "luck" "zebra" # Estonian
     luck
     löwe
     zebra
     tuck
     $ env ICU_LOCALE=sv_SE ./coll "tuck" "löwe" "luck" "zebra" # Swedish
     luck
     löwe
     tuck
     zebra
     $ # more complex case: sorting Japanese Kana using the Japanese locale's gojūon order
     $ env ICU_LOCALE=ja ./coll "パンダ" "ありがとう" "パソコン" "さよなら" "カード"
     ありがとう
     カード
     さよなら
     パソコン
     パンダ
    
  4. To facilitate UTF-8 detection when other encodings may be in use, some platforms annoyingly add a UTF-8 BOM (EF BB BF) at the beginning of text files. Microsoft’s Visual Studio is historically a major offender in this regard:

     $  file OldProject.sln
     OldProject.sln: Unicode text, UTF-8 (with BOM) text, with CRLF line terminators
     $ xxd OldProject.sln | head -n 1
     00000000: efbb bf0d 0a4d 6963 726f 736f 6674 2056  .....Microsoft V
    

    The sequence is simply U+FEFF, just like in UTF-16 and UTF-32, but encoded in UTF-8. While it’s not forbidden by the standard per se, it has no real utility besides signaling that the file is in UTF-8 (it makes no sense talking about endianness with single bytes). Programs that need to parse or operate on UTF-8 encoded files should always be aware that a BOM may be present, and probe for it to avoid exposing users to unnecessary complexity they probably don’t care about.

  5. Because of all of the reasons listed above, random, array-like access to Unicode strings is almost always broken; this is true even with UTF-32, due to grapheme clusters. It also follows that operations such as string slicing are not trivial to implement correctly, and the way languages such as Python and JavaScript do it (codepoint by codepoint in Python, UTF-16 code unit by code unit in JavaScript) is IMHO problematic.

    A good example of a modern language that attempts to mitigate this issue is Rust, which has UTF-8 strings that disallow indexed access and only support slicing at byte indices, with UTF-8 validation at runtime:

     fn main() {
         let s = "caña";
    
         // error[E0277]: the type `str` cannot be indexed by `{integer}`
         // let c = s[1];
    
         // char-by-char access requires iterators
         println!("{}", s.chars().nth(2).unwrap()); // OK: ñ
    
         // this will crash the program at runtime:
         // "byte index 3 is not a char boundary; it is inside 'ñ' (bytes 2..4) of `caña`"
         // let slice = &s[1..3];
    
         // the user needs to check UTF-8 character bounds beforehand
         println!("{}", &s[1..4]); // OK: "añ"
     }
    

    The stabilisation of the .chars() method took quite a long time, reflecting the fact that deducing what is or is not a character in Unicode is complex and quite controversial. The method itself ended up implementing iteration over Rust’s chars (aka, Unicode scalar values) instead of grapheme clusters, which is rarely what the user wants. The fact it returns an iterator does at least effectively express that character-by-character access in Unicode is not, indeed, the “simple” operation developers have been so long accustomed to.
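Going back to the case-insensitivity caveat in the list above, this is a minimal sketch of locale-independent caseless matching in Python, assuming the "canonical caseless match" recipe (normalize to NFD, case fold, normalize again) is what is needed. `caseless_key` and `caseless_equal` are hypothetical helper names. Python's `str.casefold()` ignores the locale entirely, so language-specific rules such as the Turkish dotted and dotless i still require a library like ICU.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Build a comparison key: NFD, then case folding, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case and canonical-equivalent spellings."""
    return caseless_key(a) == caseless_key(b)


# German sharp s folds to "ss"
print(caseless_equal("Stra\u00dfe", "STRASSE"))      # True

# precomposed uppercase "Ñ" vs lowercase "n" + COMBINING TILDE
print(caseless_equal("CA\u00d1A", "can\u0303a"))     # True

# "İ" (U+0130) does not fold to a plain "i": no Turkish tailoring here
print(caseless_equal("ciao", "C\u0130AO"))           # False
```

The keys are only meant for comparison, and should not be shown to users or stored in place of the original text.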

Wrapping up

Unicode is a massive standard, and it’s constantly adding new characters14, so for everybody’s safety it’s always best to rely on libraries to provide Unicode support, and if necessary ship fonts that support all the characters you may need (such as Noto Fonts). As previously introduced, C and C++ do not provide great support for Unicode, so it’s always best to just use ICU, which is widely supported and shipped by every major OS (including Windows).

When handling text that may contain non-English characters, it’s always best to stick to UTF-8 when possible and use Unicode-aware libraries for text processing. While writing custom text processing code may seem doable, it’s easy to miss a few corner cases and confuse end users in the process.

This is especially important because the main users of localized text and applications tend to be the least technically savvy: those who may lack the ability to understand why the piece of software they are using is misbehaving, and can’t search for help in a language they don’t understand.

I hope this article may have been useful to shed some light on what is, in my opinion, an often overlooked topic in software development, especially among C++ users. If I had to be honest, I was striving for a shorter article, but I guess I had to make up for all those years I didn’t post a thing :)

As always, feel free to comment underneath or send me a message if anything does not look right, and hopefully, the next post will come before 2025…

  1. This wacky yet ingenious system made it possible to write in Arabic on ASCII-only channels, by using a mixture of Latin script and Western numerals with a passing resemblance to letters not present in English (e.g., 3 in place of ع, …).

  2. Three actually: there was also UTF-1, a variable-width encoding that used 1 byte characters. It was pretty borked, so it never really saw much use. 

  3. 32-bit Unicode was initially resisted by both the Unicode consortium and the industry, due to its wastefulness while representing Latin text and everybody’s heavy investment in 16-bit Unicode. 

  4. And they still do it as of today. They do claim UTF-16 support, but it’s a bald-faced lie given that they don’t support anything outside of the BMP. 

  5. It was basically IPv4 all over again. I guess we’ll never learn. 

  6. A good example of this is Unreal Engine, which pretends to support UTF-16 even though it is actually UCS-2 

  7. UCS-2 also had the same issue, and so it was also in practice two different encodings, UCS-2LE and UCS-2BE. My opinions on this matter can thankfully be represented using Unicode itself with codepoint U+1F92E

  8. Or rather, it is the one Unicode encoding people want to use, as opposed to UTF-16, which is a scourge we’ll (probably) never get rid of. 

  9. I’ve specified “ASCII symbols” because letters may potentially be part of a grapheme cluster, so splitting on an e may, for instance, split an é in two.

  10. For instance, you most definitely expect that searching for “office” in a PDF also matches the words containing the ligature “ﬁ”; string search is another tricky topic by itself.

  11. And not only that: just think of how hard would it be to find a file, or to check a password or username, if there weren’t ways to verify the canonical equivalence between characters. 

  12. While most programming languages are somewhat standardizing around UTF-8 encoded source code, C and C++ still don’t have a standard encoding. Modern languages like Rust, Swift and Go also support Unicode in identifiers, which introduces some interesting challenges - see the relevant Unicode specification for identifiers and parsing for more details.

  13. I’ve used Java as an example here because it hits the right spot as a poster child of all the wrong assumptions of the ’90s: it’s old enough to easily provide naive, Western-centric built-in concepts such as “toUpperCase” and “toLowerCase”, while also attempting to implement them in a “Unicode” way. Unicode support in C and C++ is too barebones to really work as an example (despite C and C++ locales being outstandingly broken), and modern languages such as Rust or Go are usually locale agnostic; they also tend to implement case folding in a “saner” way (for instance, Rust only supports it on ASCII in its standard library).

  14. A prime example of this is emojis, which have been ballooning in number since they were first introduced in 2010. 

Cross compiling made easy, using Clang and LLVM

Anyone who has ever tried to cross-compile a C/C++ program knows how big a PITA the whole process can be. The main reasons for this sorry state of things are generally how byzantine build systems tend to be when configuring for cross-compilation, and how messy it is to set up your cross toolchain in the first place.

One of the main culprits in my experience has been the GNU toolchain, the decades-old behemoth upon which the POSIXish world has been built for years. Like many compilers of yore, GCC and its binutils brethren were never designed with the intent to support multiple targets within a single setup, with the only supported approach being installing a full cross build for each triple you wish to target on any given host.

For instance, assuming you wish to build something for FreeBSD on your Linux machine using GCC, you need:

  • A GCC + binutils install for your host triplet (i.e., x86_64-pc-linux-gnu or similar);
  • A GCC + binutils complete install for your target triplet (i.e. x86_64-unknown-freebsd12.2-gcc, as, nm, etc.);
  • A sysroot containing the necessary libraries and headers, which you can either build yourself or promptly steal from a running installation of FreeBSD.

This process is sometimes made simpler by Linux distributions or hardware vendors offering a selection of prepackaged compilers, but this will never suffice due to the sheer amount of possible host-target combinations. This sometimes means you have to build the whole toolchain yourself, something that, unless you rock a quite beefy CPU, tends to be a massive waste of time and power.

Clang as a cross compiler

This annoying limitation is one of the reasons why I got interested in LLVM (and thus Clang), which is by design a full-fledged cross compiler toolchain and is mostly compatible with GNU. A single install can output and compile code for every supported target, as long as a complete sysroot is available at build time.

I found this to be a game-changer, and, while it still can’t compete in convenience with modern language toolchains (such as Go’s gc and GOARCH/GOOS), it’s night and day better than the rigmarole of setting up GNU toolchains. You can now just fetch whatever your favorite package management system has available in its repositories (as long as it’s not extremely old), and avoid messing around with multiple installs of GCC.

Until a few years ago, the whole process wasn’t as smooth as it could be. Due to LLVM not having a full toolchain yet available, you were still supposed to provide a binutils build specific for your target. While this is generally much more tolerable than building the whole compiler (binutils is relatively fast to build), it was still somewhat of a nuisance, and I’m glad that llvm-mc (LLVM’s integrated assembler) and lld (universal linker) are finally stable and as flexible as the rest of LLVM.

With the toolchain now set, the next step becomes to obtain a sysroot in order to provide the needed headers and libraries to compile and link for your target.

Obtaining a sysroot

A super fast way to find a working system directory for a given OS is to rip it straight out of an existing system (a Docker container image will often also do). For instance, this is how I used tar through ssh as a quick way to extract a working sysroot from a FreeBSD 13-CURRENT AArch64 VM 1:

$ mkdir ~/farm_tree
$ ssh FARM64 'tar cf - /lib /usr/include /usr/lib /usr/local/lib /usr/local/include' | bsdtar xvf - -C $HOME/farm_tree/

Invoking the cross compiler

With everything set, it’s now only a matter of invoking Clang with the right arguments:

$  clang++ --target=aarch64-pc-freebsd --sysroot=$HOME/farm_tree -fuse-ld=lld -stdlib=libc++ -o zpipe zpipe.cc -lz --verbose
clang version 11.0.1
Target: aarch64-pc-freebsd
Thread model: posix
InstalledDir: /usr/bin
 "/usr/bin/clang-11" -cc1 -triple aarch64-pc-freebsd -emit-obj -mrelax-all -disable-free -disable-llvm-verifier -discard-value-names -main-file-name zpipe.cc -mrelocation-model static -mframe-pointer=non-leaf -fno-rounding-math -mconstructor-aliases -munwind-tables -fno-use-init-array -target-cpu generic -target-feature +neon -target-abi aapcs -fallow-half-arguments-and-returns -fno-split-dwarf-inlining -debugger-tuning=gdb -v -resource-dir /usr/lib/clang/11.0.1 -isysroot /home/marco/farm_tree -internal-isystem /home/marco/farm_tree/usr/include/c++/v1 -fdeprecated-macro -fdebug-compilation-dir /home/marco/dummies/cxx -ferror-limit 19 -fno-signed-char -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -faddrsig -o /tmp/zpipe-54f1b1.o -x c++ zpipe.cc
clang -cc1 version 11.0.1 based upon LLVM 11.0.1 default target x86_64-pc-linux-gnu
#include "..." search starts here:
#include <...> search starts here:
 /home/marco/farm_tree/usr/include/c++/v1
 /usr/lib/clang/11.0.1/include
 /home/marco/farm_tree/usr/include
End of search list.
 "/usr/bin/ld.lld" --sysroot=/home/marco/farm_tree --eh-frame-hdr -dynamic-linker /libexec/ld-elf.so.1 --enable-new-dtags -o zpipe /home/marco/farm_tree/usr/lib/crt1.o /home/marco/farm_tree/usr/lib/crti.o /home/marco/farm_tree/usr/lib/crtbegin.o -L/home/marco/farm_tree/usr/lib /tmp/zpipe-54f1b1.o -lz -lc++ -lm -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /home/marco/farm_tree/usr/lib/crtend.o /home/marco/farm_tree/usr/lib/crtn.o
$ file zpipe
zpipe: ELF 64-bit LSB executable, ARM aarch64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 13.0 (1300136), FreeBSD-style, with debug_info, not stripped

In the snippet above, I have managed to compile and link a C++ file into an executable for AArch64 FreeBSD, all while using just the clang and lld I had already installed on my GNU/Linux system.

More in detail:

  1. --target switches the LLVM default target (x86_64-pc-linux-gnu) to aarch64-pc-freebsd, thus enabling cross-compilation.
  2. --sysroot forces Clang to assume the specified path as root when searching headers and libraries, instead of the usual paths. Note that sometimes this setting might not be enough, especially if the target uses GCC and Clang somehow fails to detect its install path. This can be easily fixed by specifying --gcc-toolchain, which clarifies where to search for GCC installations.
  3. -fuse-ld=lld tells Clang to use lld instead of whatever default the platform uses. As I will explain below, it’s highly unlikely that the system linker understands foreign targets, while LLD can natively support almost every binary format and OS 2.
  4. -stdlib=libc++ is needed here due to Clang failing to detect that FreeBSD on AArch64 uses LLVM’s libc++ instead of GCC’s libstdc++.
  5. -lz is also specified to show how Clang can also resolve other libraries inside the sysroot without issues, in this case, zlib.
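When the same invocation has to be repeated for several targets or source files, the flags listed above can be assembled programmatically. The following is only a sketch: `cross_compile_command` is a hypothetical helper that builds the argument list (it does not run the compiler), and the defaults simply mirror the command shown earlier.

```python
import shlex
from pathlib import Path


def cross_compile_command(source, output, target, sysroot, libraries=(),
                          compiler="clang++", stdlib="libc++", linker="lld"):
    """Return the argv list for a Clang cross-compilation (hypothetical helper)."""
    cmd = [
        compiler,
        f"--target={target}",    # switch from the host triple to the target one
        f"--sysroot={sysroot}",  # root used to look up headers and libraries
        f"-fuse-ld={linker}",    # use lld rather than the host's default linker
    ]
    if stdlib:
        cmd.append(f"-stdlib={stdlib}")  # only meaningful for C++
    cmd += ["-o", output, source]
    cmd += [f"-l{lib}" for lib in libraries]
    return cmd


cmd = cross_compile_command(
    "zpipe.cc", "zpipe",
    target="aarch64-pc-freebsd",
    sysroot=str(Path.home() / "farm_tree"),
    libraries=["z"],
)

# Print a shell-quoted version, ready to be pasted into a terminal
print(shlex.join(cmd))
```

The resulting list can be passed as-is to `subprocess.run`, which avoids any shell quoting issues with paths containing spaces.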

The final test is now to copy the binary to our target system (i.e. the VM we ripped the sysroot from before) and check if it works as expected:

$ rsync zpipe FARM64:"~"
$ ssh FARM64
FreeBSD-ARM64-VM $ chmod +x zpipe
FreeBSD-ARM64-VM $ ldd zpipe
zpipe:
        libz.so.6 => /lib/libz.so.6 (0x4029e000)
        libc++.so.1 => /usr/lib/libc++.so.1 (0x402e4000)
        libcxxrt.so.1 => /lib/libcxxrt.so.1 (0x403da000)
        libm.so.5 => /lib/libm.so.5 (0x40426000)
        libc.so.7 => /lib/libc.so.7 (0x40491000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x408aa000)
FreeBSD-ARM64-VM $ ./zpipe -h
zpipe usage: zpipe [-d] < source > dest

Success! It’s now possible to use this cross toolchain to build larger programs, and below I’ll give a quick example of how to use it to build real projects.

Optional: creating an LLVM toolchain directory

LLVM provides a mostly compatible counterpart for almost every tool shipped by binutils (with the notable exception of as 3), prefixed with llvm-.

The most critical of these is LLD, which is a drop-in replacement for a platform’s system linker, capable of replacing both GNU ld.bfd and gold on GNU/Linux or BSD, and Microsoft’s LINK.EXE when targeting MSVC. It supports linking on (almost) every platform supported by LLVM, thus removing the nuisance of having multiple specific linkers installed.

Both GCC and Clang support using ld.lld instead of the system linker (which may well be lld, like on FreeBSD) via the command line switch -fuse-ld=lld.

In my experience, I found that Clang’s driver might get confused when picking the right linker on some uncommon platforms, especially before version 11.0. For some reason, clang sometimes decided to outright ignore the -fuse-ld=lld switch and picked the system linker (ld.bfd in my case), which does not support AArch64.

A fast solution to this is to create a toolchain directory containing symlinks that rename the LLVM utilities to the standard binutils programs:

$  ls -la ~/.llvm/bin/
Permissions Size User  Group Date Modified Name
lrwxrwxrwx    16 marco marco  3 Aug  2020  ar -> /usr/bin/llvm-ar
lrwxrwxrwx    12 marco marco  6 Aug  2020  ld -> /usr/bin/lld
lrwxrwxrwx    21 marco marco  3 Aug  2020  objcopy -> /usr/bin/llvm-objcopy
lrwxrwxrwx    21 marco marco  3 Aug  2020  objdump -> /usr/bin/llvm-objdump
lrwxrwxrwx    20 marco marco  3 Aug  2020  ranlib -> /usr/bin/llvm-ranlib
lrwxrwxrwx    21 marco marco  3 Aug  2020  strings -> /usr/bin/llvm-strings
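A directory like the one listed above can also be created with a few lines of script. This is a sketch under a couple of assumptions: the LLVM tools live in /usr/bin (as in the listing), and `create_toolchain_dir` is a hypothetical helper name. The demo at the bottom deliberately writes to a temporary directory instead of ~/.llvm/bin.

```python
import os
import tempfile
from pathlib import Path

# binutils name -> LLVM counterpart, mirroring the listing above
TOOLS = {
    "ar": "llvm-ar",
    "ld": "lld",
    "objcopy": "llvm-objcopy",
    "objdump": "llvm-objdump",
    "ranlib": "llvm-ranlib",
    "strings": "llvm-strings",
}


def create_toolchain_dir(dest, llvm_bin="/usr/bin"):
    """Create (or refresh) the symlinks inside `dest` and return them."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)

    links = []
    for name, llvm_name in TOOLS.items():
        link = dest_dir / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace stale entries from a previous run
        link.symlink_to(Path(llvm_bin) / llvm_name)
        links.append(link)
    return links


# Demo in a throwaway directory; pass Path.home() / ".llvm/bin" for the real thing
with tempfile.TemporaryDirectory() as tmp:
    for link in create_toolchain_dir(tmp):
        print(f"{link.name} -> {os.readlink(link)}")
```

Note that the script does not check whether the LLVM binaries actually exist, so a missing tool only shows up as a dangling symlink.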

The -B switch can then be used to force Clang (or GCC) to search the required tools in this directory, stopping the issue from ever occurring:

$  clang++ -B$HOME/.llvm/bin -stdlib=libc++ --target=aarch64-pc-freebsd --sysroot=$HOME/farm_tree -std=c++17 -o mvd-farm64 mvd.cc
$ file mvd-farm64
mvd-farm64: ELF 64-bit LSB executable, ARM aarch64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 13.0, FreeBSD-style, with debug_info, not stripped

Optional: creating Clang wrappers to simplify cross-compilation

I happened to notice that certain build systems (and with “certain” I mean some poorly written Makefiles and sometimes Autotools) have a tendency to misbehave when $CC, $CXX or $LD contain spaces or multiple parameters. This might become a recurrent issue if we need to invoke clang with several arguments. 4

Given also how unwieldy it is to remember to write all of the parameters correctly everywhere, I usually write quick wrappers for clang and clang++ in order to simplify building for a certain target:

$ cat ~/.local/bin/aarch64-pc-freebsd-clang
#!/usr/bin/env sh

exec /usr/bin/clang -B$HOME/.llvm/bin --target=aarch64-pc-freebsd --sysroot=$HOME/farm_tree "$@"
$ cat ~/.local/bin/aarch64-pc-freebsd-clang++
#!/usr/bin/env sh

exec /usr/bin/clang++ -B$HOME/.llvm/bin -stdlib=libc++ --target=aarch64-pc-freebsd --sysroot=$HOME/farm_tree "$@"	

If created in a directory inside $PATH, these scripts can be used everywhere as standalone commands:

$ aarch64-pc-freebsd-clang++ -o tst tst.cc -static
$ file tst
tst: ELF 64-bit LSB executable, ARM aarch64, version 1 (FreeBSD), statically linked, for FreeBSD 13.0 (1300136), FreeBSD-style, with debug_info, not stripped

Cross-building with Autotools, CMake and Meson

Autotools, CMake, and Meson are arguably the most popular build systems for C and C++ open source projects (sorry, SCons). All three support cross-compiling out of the box, albeit with some caveats.

Autotools

Over the years, Autotools has been famous for being horrendously clunky and breaking easily. While this reputation is definitely well earned, it’s still widely used by most large GNU projects. Given it’s been around for decades, it’s quite easy to find support online when something goes awry (sadly, the same is not true when writing .ac files). Unlike its more modern brethren, it doesn’t require any toolchain file or extra configuration when cross-compiling, being driven only by command line options.

A ./configure script (either generated by autoconf or shipped by a tarball alongside source code) usually supports the --host flag, allowing the user to specify the triple of the host on which the final artifacts are meant to be run.

This flag activates cross compilation, and causes the “auto-something” array of tools to try to detect the correct compiler for the target, which it generally assumes to be called some-triple-gcc or some-triple-g++.

For instance, let’s try to configure binutils version 2.35.1 for aarch64-pc-freebsd, using the Clang wrapper introduced above:

$ tar xvf binutils-2.35.1.tar.xz
$ mkdir binutils-2.35.1/build # always create a build directory to avoid messing up the source tree
$ cd binutils-2.35.1/build
$ env CC='aarch64-pc-freebsd-clang' CXX='aarch64-pc-freebsd-clang++' AR=llvm-ar ../configure --build=x86_64-pc-linux-gnu --host=aarch64-pc-freebsd --enable-gold=yes
checking build system type... x86_64-pc-linux-gnu
checking host system type... aarch64-pc-freebsd
checking target system type... aarch64-pc-freebsd
checking for a BSD-compatible install... /usr/bin/install -c
checking whether ln works... yes
checking whether ln -s works... yes
checking for a sed that does not truncate output... /usr/bin/sed
checking for gawk... gawk
checking for aarch64-pc-freebsd-gcc... aarch64-pc-freebsd-clang
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... yes
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether aarch64-pc-freebsd-clang accepts -g... yes
checking for aarch64-pc-freebsd-clang option to accept ISO C89... none needed
checking whether we are using the GNU C++ compiler... yes
checking whether aarch64-pc-freebsd-clang++ accepts -g... yes
[...]

The invocation of ./configure above specifies that I want autotools to:

  1. Configure for building on an x86_64-pc-linux-gnu host (which I specified using --build);
  2. Build binaries that will run on aarch64-pc-freebsd, using the --host switch;
  3. Use the Clang wrappers made above as C and C++ compilers;
  4. Use llvm-ar as the target ar.

I also asked for the Gold linker to be built: it is written in C++, which makes it a good test of how well our improvised toolchain handles compiling C++.

If the configuration step doesn’t fail for some reason (it shouldn’t), it’s now time to run GNU Make to build binutils:

$ make -j16 # because I have 16 threads on my system
[ lots of output]
$ mkdir dest
$ make DESTDIR=$PWD/dest install # install into a fake tree

There should now be executable files and libraries inside of the fake tree generated by make install. A quick test using file confirms they have been correctly built for aarch64-pc-freebsd:

$ file dest/usr/local/bin/ld.gold
dest/usr/local/bin/ld.gold: ELF 64-bit LSB executable, ARM aarch64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 13.0 (1300136), FreeBSD-style, with debug_info, not stripped
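
When many artifacts need checking (in a CI job, for example), the fields that file prints can also be read straight from the ELF header. This is a minimal sketch, assuming only the standard ELF layout (class at byte 4, byte order at byte 5, OS ABI at byte 7, e_machine at offset 18); describe_elf and describe_file are hypothetical helper names, and the lookup tables only cover the targets used in this article:

```python
import struct

# Only the values relevant to the targets used in this article.
MACHINES = {62: "x86-64", 183: "aarch64"}
OS_ABIS = {0: "SYSV", 3: "Linux", 9: "FreeBSD"}


def describe_elf(header):
    """Decode the first 20 bytes of an ELF file into a small dict."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}.get(header[4])
    byte_order = {1: "<", 2: ">"}.get(header[5])
    if bits is None or byte_order is None:
        raise ValueError("corrupt ELF identification bytes")
    # e_type and e_machine are two 16-bit fields starting at offset 16
    _e_type, e_machine = struct.unpack_from(byte_order + "HH", header, 16)
    return {
        "bits": bits,
        "endian": "LSB" if byte_order == "<" else "MSB",
        "os_abi": OS_ABIS.get(header[7], header[7]),
        "machine": MACHINES.get(e_machine, e_machine),
    }


def describe_file(path):
    """Read just enough of the file at path to describe it."""
    with open(path, "rb") as f:
        return describe_elf(f.read(20))
```

For the ld.gold binary above, describe_file should report a 64-bit, LSB, FreeBSD, aarch64 executable; unknown machine or ABI codes are returned as raw numbers rather than guessed.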

CMake

The simplest way to set CMake to configure for an arbitrary target is to write a toolchain file. These usually consist of a list of declarations that instructs CMake on how it is supposed to use a given toolchain, specifying parameters like the target operating system, the CPU architecture, the name of the C++ compiler, and such.

One reasonable toolchain file for the aarch64-pc-freebsd triple can be written as follows:

set(CMAKE_SYSTEM_NAME FreeBSD)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(CMAKE_SYSROOT $ENV{HOME}/farm_tree)

set(CMAKE_C_COMPILER aarch64-pc-freebsd-clang)
set(CMAKE_CXX_COMPILER aarch64-pc-freebsd-clang++)
set(CMAKE_AR llvm-ar)

# these variables tell CMake to avoid using any binary it finds in 
# the sysroot, while picking headers and libraries exclusively from it 
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)

In this file, I specified the wrapper created above as the cross compiler for C and C++ for the target. It should be possible to also use plain Clang with the right arguments, but it’s much less straightforward and potentially more error-prone.

In any case, it is very important to indicate the CMAKE_SYSROOT and CMAKE_FIND_ROOT_PATH_MODE_* variables, or otherwise CMake could wrongly pick packages from the host with disastrous results.

It is now only a matter of setting CMAKE_TOOLCHAIN_FILE with the path to the toolchain file when configuring a project. To better illustrate this, I will now also build {fmt} (which is an amazing C++ library you should definitely use) for aarch64-pc-freebsd:

$  git clone https://github.com/fmtlib/fmt
Cloning into 'fmt'...
remote: Enumerating objects: 45, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 24446 (delta 17), reused 12 (delta 7), pack-reused 24401
Receiving objects: 100% (24446/24446), 12.08 MiB | 2.00 MiB/s, done.
Resolving deltas: 100% (16551/16551), done.
$ cd fmt
$ cmake -B build -G Ninja -DCMAKE_TOOLCHAIN_FILE=$HOME/toolchain-aarch64-freebsd.cmake -DBUILD_SHARED_LIBS=ON -DFMT_TEST=OFF .
-- CMake version: 3.19.4
-- The CXX compiler identification is Clang 11.0.1
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/marco/.local/bin/aarch64-pc-freebsd-clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Version: 7.1.3
-- Build type: Release
-- CXX_STANDARD: 11
-- Performing Test has_std_11_flag
-- Performing Test has_std_11_flag - Success
-- Performing Test has_std_0x_flag
-- Performing Test has_std_0x_flag - Success
-- Performing Test SUPPORTS_USER_DEFINED_LITERALS
-- Performing Test SUPPORTS_USER_DEFINED_LITERALS - Success
-- Performing Test FMT_HAS_VARIANT
-- Performing Test FMT_HAS_VARIANT - Success
-- Required features: cxx_variadic_templates
-- Performing Test HAS_NULLPTR_WARNING
-- Performing Test HAS_NULLPTR_WARNING - Success
-- Looking for strtod_l
-- Looking for strtod_l - not found
-- Configuring done
-- Generating done
-- Build files have been written to: /home/marco/fmt/build

Compared with Autotools, the command line passed to cmake is very simple and doesn’t need much explanation. After the configuration step is finished, it’s only a matter of compiling the project and getting ninja or make to install the resulting artifacts somewhere.

$ cmake --build build
[4/4] Creating library symlink libfmt.so.7 libfmt.so
$ mkdir dest
$ env DESTDIR=$PWD/dest cmake --build build -- install
[0/1] Install the project...
-- Install configuration: "Release"
-- Installing: /home/marco/fmt/dest/usr/local/lib/libfmt.so.7.1.3
-- Installing: /home/marco/fmt/dest/usr/local/lib/libfmt.so.7
-- Installing: /home/marco/fmt/dest/usr/local/lib/libfmt.so
-- Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-config.cmake
-- Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-config-version.cmake
-- Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-targets.cmake
-- Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-targets-release.cmake
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/args.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/chrono.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/color.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/compile.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/core.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/format.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/format-inl.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/locale.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/os.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/ostream.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/posix.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/printf.h
-- Installing: /home/marco/fmt/dest/usr/local/include/fmt/ranges.h
-- Installing: /home/marco/fmt/dest/usr/local/lib/pkgconfig/fmt.pc
$  file dest/usr/local/lib/libfmt.so.7.1.3
dest/usr/local/lib/libfmt.so.7.1.3: ELF 64-bit LSB shared object, ARM aarch64, version 1 (FreeBSD), dynamically linked, for FreeBSD 13.0 (1300136), with debug_info, not stripped

Meson

Like CMake, Meson relies on toolchain files (here called “cross files”) to specify which tools should be used when building for a given target. Thanks to being written in a TOML-like language, they are very straightforward:

$ cat meson_aarch64_fbsd_cross.txt
[binaries]
c = '/home/marco/.local/bin/aarch64-pc-freebsd-clang'
cpp = '/home/marco/.local/bin/aarch64-pc-freebsd-clang++'
ld = '/usr/bin/ld.lld'
ar = '/usr/bin/llvm-ar'
objcopy = '/usr/bin/llvm-objcopy'
strip = '/usr/bin/llvm-strip'

[properties]
ld_args = ['--sysroot=/home/marco/farm_tree']

[host_machine]
system = 'freebsd'
cpu_family = 'aarch64'
cpu = 'aarch64'
endian = 'little'

This cross-file can then be specified to meson setup using the --cross-file option 5, with everything else remaining the same as with every other Meson build.

And, well, this is basically it: like with CMake, the whole process is relatively painless and foolproof. For the sake of completeness, this is how to build dav1d, VideoLAN’s AV1 decoder, for aarch64-pc-freebsd:

$ git clone https://code.videolan.org/videolan/dav1d
Cloning into 'dav1d'...
warning: redirecting to https://code.videolan.org/videolan/dav1d.git/
remote: Enumerating objects: 164, done.
remote: Counting objects: 100% (164/164), done.
remote: Compressing objects: 100% (91/91), done.
remote: Total 9377 (delta 97), reused 118 (delta 71), pack-reused 9213
Receiving objects: 100% (9377/9377), 3.42 MiB | 54.00 KiB/s, done.
Resolving deltas: 100% (7068/7068), done.
$ cd dav1d
$ meson setup build --cross-file ../meson_aarch64_fbsd_cross.txt --buildtype release
The Meson build system
Version: 0.56.2
Source dir: /home/marco/dav1d
Build dir: /home/marco/dav1d/build
Build type: cross build
Project name: dav1d
Project version: 0.8.1
C compiler for the host machine: /home/marco/.local/bin/aarch64-pc-freebsd-clang (clang 11.0.1 "clang version 11.0.1")
C linker for the host machine: /home/marco/.local/bin/aarch64-pc-freebsd-clang ld.lld 11.0.1
[ output cut ]
$ meson compile -C build
Found runner: ['/usr/bin/ninja']
ninja: Entering directory `build'
[129/129] Linking target tests/seek_stress
$ mkdir dest
$ env DESTDIR=$PWD/dest meson install -C build
ninja: Entering directory `build'
[1/11] Generating vcs_version.h with a custom command
Installing src/libdav1d.so.5.0.1 to /home/marco/dav1d/dest/usr/local/lib
Installing tools/dav1d to /home/marco/dav1d/dest/usr/local/bin
Installing /home/marco/dav1d/include/dav1d/common.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/data.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/dav1d.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/headers.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/picture.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/build/include/dav1d/version.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/build/meson-private/dav1d.pc to /home/marco/dav1d/dest/usr/local/lib/pkgconfig
$ file dest/usr/local/bin/dav1d
dest/usr/local/bin/dav1d: ELF 64-bit LSB executable, ARM aarch64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 13.0 (1300136), FreeBSD-style, with debug_info, not stripped

Bonus: static linking with musl and Alpine Linux

Statically linking a C or C++ program can sometimes save you a lot of library compatibility headaches, especially when you can’t control what’s going to be installed on whatever you plan to target. Building static binaries is however quite complex on GNU/Linux, due to Glibc actively discouraging people from linking it statically. 6

Musl is a very compatible standard library implementation for Linux that plays much nicer with static linking, and it is now shipped by most major distributions. These packages often suffice in building your code statically, at least as long as you plan to stick with plain C.

The situation gets much more complicated if you plan to use C++, or if you need additional components. Any library shipped by a GNU/Linux system (like libstdc++, libz, libffi and so on) is usually only built for Glibc, meaning that any library you wish to use must be rebuilt to target Musl. This also applies to libstdc++, which inevitably means either recompiling GCC or building a copy of LLVM’s libc++.

Thankfully, there are several distributions out there that target “Musl-plus-Linux”, everyone’s favorite being Alpine Linux. It is thus possible to apply the same strategy we used above to obtain an x86_64-pc-linux-musl sysroot complete with libraries and packages built for Musl, which can then be used by Clang to generate 100% static executables.

Setting up an Alpine container

A good starting point is the minirootfs tarball provided by Alpine, which is meant for containers and tends to be very small:

$ wget -qO - https://dl-cdn.alpinelinux.org/alpine/v3.13/releases/x86_64/alpine-minirootfs-3.13.1-x86_64.tar.gz | gunzip | sudo tar xfp - -C ~/alpine_tree

It is now possible to chroot inside the image in ~/alpine_tree and set it up, installing all the packages you may need. I prefer in general to use systemd-nspawn in lieu of chroot due to it being vastly better and less error prone. 7

$ sudo systemd-nspawn -D alpine_tree
Spawning container alpinetree on /home/marco/alpine_tree.
Press ^] three times within 1s to kill container.
alpinetree:~# 

We can now (optionally) switch to the edge branch of Alpine for newer packages by editing /etc/apk/repositories, and then install the required packages containing any static libraries required by the code we want to build:

alpinetree:~# cat /etc/apk/repositories
https://dl-cdn.alpinelinux.org/alpine/edge/main
https://dl-cdn.alpinelinux.org/alpine/edge/community
alpinetree:~# apk update
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
v3.13.0-1030-gbabf0a1684 [https://dl-cdn.alpinelinux.org/alpine/edge/main]
v3.13.0-1035-ga3ac7373fd [https://dl-cdn.alpinelinux.org/alpine/edge/community]
OK: 14029 distinct packages available
alpinetree:~# apk upgrade
OK: 6 MiB in 14 packages
alpinetree:~# apk add g++ libc-dev
(1/14) Installing libgcc (10.2.1_pre1-r3)
(2/14) Installing libstdc++ (10.2.1_pre1-r3)
(3/14) Installing binutils (2.35.1-r1)
(4/14) Installing libgomp (10.2.1_pre1-r3)
(5/14) Installing libatomic (10.2.1_pre1-r3)
(6/14) Installing libgphobos (10.2.1_pre1-r3)
(7/14) Installing gmp (6.2.1-r0)
(8/14) Installing isl22 (0.22-r0)
(9/14) Installing mpfr4 (4.1.0-r0)
(10/14) Installing mpc1 (1.2.1-r0)
(11/14) Installing gcc (10.2.1_pre1-r3)
(12/14) Installing musl-dev (1.2.2-r1)
(13/14) Installing libc-dev (0.7.2-r3)
(14/14) Installing g++ (10.2.1_pre1-r3)
Executing busybox-1.33.0-r1.trigger
OK: 188 MiB in 28 packages
alpinetree:~# apk add zlib-dev zlib-static
(1/3) Installing pkgconf (1.7.3-r0)
(2/3) Installing zlib-dev (1.2.11-r3)
(3/3) Installing zlib-static (1.2.11-r3)
Executing busybox-1.33.0-r1.trigger
OK: 189 MiB in 31 packages

In this case I installed g++ and libc-dev in order to get a static copy of libstdc++, a static libc.a (Musl) and their respective headers. I also installed zlib-dev and zlib-static to get zlib’s headers and libz.a, respectively. As a general rule, Alpine ships static libraries in somepackage-static packages and headers in somepackage-dev. 8

Also, remember every once in a while to run apk upgrade inside the sysroot in order to keep the local Alpine install up to date.

Compiling static C++ programs

With everything now set, it’s only a matter of running clang++ with the right --target and --sysroot:

$ clang++ -B$HOME/.llvm/bin --gcc-toolchain=$HOME/alpine_tree/usr --target=x86_64-alpine-linux-musl --sysroot=$HOME/alpine_tree -L$HOME/alpine_tree/lib -std=c++17 -o zpipe zpipe.cc -lz -static
$ file zpipe
zpipe: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped

The extra --gcc-toolchain is optional, but may help solve issues where compilation fails due to Clang not detecting where GCC and the various crt*.o files reside in the sysroot. The extra -L for /lib is required because Alpine splits its libraries between /usr/lib and /lib, and the latter is not automatically picked up by clang, which usually expects libraries to be located in $SYSROOT/usr/lib.

Writing a wrapper for static linking with Musl and Clang

Musl packages usually come with the upstream-provided shims musl-gcc and musl-clang, which wrap the system compilers in order to build and link with the alternative libc. In order to provide a similar level of convenience, I quickly whipped up the following Perl script:

#!/usr/bin/env perl

use strict;
use utf8;
use warnings;
use v5.30;

use List::Util 'any';

my $ALPINE_DIR = $ENV{ALPINE_DIR} // "$ENV{HOME}/alpine_tree";
my $TOOLS_DIR = $ENV{TOOLS_DIR} // "$ENV{HOME}/.llvm/bin";

my $CMD_NAME = $0 =~ /\+\+/ ? 'clang++' : 'clang';
my $STATIC = $0 =~ /static/;

sub clang {
	exec $CMD_NAME, @_ or return 0;
}

sub main {
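	# true when no link step is requested (-c or -S); the anchors apply to
	# both alternatives, so flags that merely start with -c do not match
	my $compile = any { /^\s*-(?:c|S)\s*$/ } @ARGV;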

	my @args = (
		 qq{-B$TOOLS_DIR},
		 qq{--gcc-toolchain=$ALPINE_DIR/usr},
		 '--target=x86_64-alpine-linux-musl',
		 qq{--sysroot=$ALPINE_DIR},
		 qq{-L$ALPINE_DIR/lib},
		 @ARGV,
	);

	unshift @args, '-static' if $STATIC and not $compile;

	exit 1 unless clang @args;
}

main;

This wrapper is more refined than the FreeBSD AArch64 wrapper above. For instance, it can infer C++ if invoked as clang++, or always force -static if called from a symlink containing static in its name:

$ ls -la $(which musl-clang++)
lrwxrwxrwx    10 marco marco 26 Jan 21:49  /home/marco/.local/bin/musl-clang++ -> musl-clang
$ ls -la $(which musl-clang++-static)
lrwxrwxrwx    10 marco marco 26 Jan 22:03  /home/marco/.local/bin/musl-clang++-static -> musl-clang
$ musl-clang++-static -std=c++17 -o zpipe zpipe.cc -lz # automatically infers C++ and -static
$ file zpipe
zpipe: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, with debug_info, not stripped

It is thus possible to force Clang to only ever link -static by setting $CC to musl-clang-static, which can be useful with build systems that don’t play nicely with statically linking. From my experience, the worst offenders in this regard are Autotools (sometimes) and poorly written Makefiles.
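
To make the dispatch rules of the wrapper easier to reason about (and to test), the same decisions can be modelled as a pure function that only builds the command line. This is a sketch in Python rather than a port of the script: wrapper_args is a hypothetical name, and unlike the Perl version it looks only at the basename of the invoked path and matches -c and -S exactly.

```python
import os


def wrapper_args(argv0, user_args, alpine_dir, tools_dir):
    """Return the full clang command line the wrapper would execute."""
    name = os.path.basename(argv0)
    compiler = "clang++" if "++" in name else "clang"
    static = "static" in name
    # -c and -S stop before linking, so -static would be pointless there
    compile_only = any(arg in ("-c", "-S") for arg in user_args)

    args = [
        f"-B{tools_dir}",
        f"--gcc-toolchain={alpine_dir}/usr",
        "--target=x86_64-alpine-linux-musl",
        f"--sysroot={alpine_dir}",
        f"-L{alpine_dir}/lib",
        *user_args,
    ]
    if static and not compile_only:
        args.insert(0, "-static")
    return [compiler, *args]
```

With this model, an invocation through musl-clang++-static yields a command starting with clang++ -static, while the same symlink called with -c leaves -static out.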

Conclusions

Cross-compiling C and C++ is and will probably always be an annoying task, but it has got much better since LLVM became production-ready and widely available. Clang’s -target option has saved me countless man-hours that I would have instead wasted building and re-building GCC and Binutils over and over again.

Alas, all that glitters is not gold, as is often the case. There is still code around that only builds with GCC due to nasty GNUisms (I’m looking at you, Glibc). Cross compiling for Windows/MSVC is also borderline unfeasible due to how messy the whole Visual Studio toolchain is.

Furthermore, while targeting arbitrary triples with Clang is now definitely simpler than it was, it still pales in comparison to how trivial cross compiling with Rust or Go is.

One special mention among these new languages should go to Zig, and its goal to also make C and C++ easy to build for other platforms.

The zig cc and zig c++ commands have the potential to become an amazing swiss-army knife tool for cross compiling, thanks to Zig shipping a copy of clang and large chunks of projects such as Glibc, Musl, libc++ and MinGW. Any required library is then built on-the-fly when required:

$ zig c++ --target=x86_64-windows-gnu -o str.exe str.cc
$ file str.exe
str.exe: PE32+ executable (console) x86-64, for MS Windows

While I think this is not yet perfect, it already feels almost like magic. I dare to say, this might really become a killer selling point for Zig, making it attractive even for those who are not interested in using the language itself.

  1. If the transfer is happening across a network and not locally, it’s a good idea to compress the output tarball. 

  2. Sadly, macOS is not supported anymore by LLD due to Mach-O support being largely unmaintained and left to rot over the last years. This leaves ld64 (or a cross-build thereof, if you manage to build it) as the only way to link Mach-O executables (unless ld.bfd from binutils still supports it). 

  3. llvm-mc can be used as a (very cumbersome) assembler but it’s poorly documented. Like gcc, the clang frontend can act as an assembler, making as often redundant. 

  4. This is without talking about those criminals who hardcode gcc in their build scripts, but this is a rant better left for another day. 

  5. In the same fashion, it is also possible to tune the native toolchain for the current machine using a native file and the --native-file toggle. 

  6. Glibc’s builtin name resolution system (NSS) is one of the main culprits, which heavily uses dlopen()/dlsym(). This is due to its heavy usage of plugins, which is meant to provide support for extra third-party resolvers such as mDNS. 

  7. systemd-nspawn can also double as a lighter alternative to VMs, using the --boot option to spawn an init process inside the container. See this very helpful gist to learn how to make bootable containers for distributions based on OpenRC, like Alpine. 

  8. Sadly, Alpine, for reasons unknown to me, does not ship the static version of certain libraries (like libfmt). Given that embedding a local copy of third party dependencies is common practice nowadays for C++, this is not too problematic. 

NAT66: The good, the bad, the ugly

NAT (and NAPT) is one of those technologies everyone has a strong opinion about. It has been for years the necessary evil and invaluable (yet massive) hack that kept IPv4 from falling apart in the face of its abysmally small 32-bit address space - which was, to be honest, an absolutely OK choice for the time the protocol was designed, when computers cost a small fortune and were as big as lorries.

The Internet Protocol, version 4, has been abused for far too long now. We made it into the fundamental building block of the modern Internet, a network of a scale it was never designed for. It is high time to put it to rest and replace it with its controversial, yet problem-solving 128-bit grandchild, IPv6.

So, what should be the place for NAT in the new Internet, which makes the return to the end-to-end principle one of its main tenets?

NAT66 misses the point

Well, none, according to the IETF, which has for years tried to dissuade everyone from dabbling with NAT66 (the name NAT goes by on IPv6); this is not without good reasons, though. For too long, the supposedly stateless, connectionless layer 3 IP protocol has been made into an impromptu “stateful”, connection-oriented protocol by NAT gateways, just for the sake of meeting the demands of an infinite number of devices trying to connect to the Internet.

This is without considering the false sense of security that address masquerading provides; I cannot recall how many times I’ve heard people say that (gasp!) NAT is a fundamental piece in the security of their internal networks (it’s not).

Given that the immensity of the IPv6 address space allows providers to give out full /64s to customers, I’d always failed to see the point in NAT66: it always felt to me as a feature fundamentally dead in the water, a solution seeking a problem, ready to be misused.

Well, this was before discovering how cheap some hosting services could be.

Being cheap: the root of all evils

I was quite glad to see a while ago that my VPS provider had announced IPv6 support; thanks to this, I would finally have been able to provide IPv6 access to the guests of the VPNs I host on that VPS, without having to incur the latency penalties caused by tunneling the traffic through good old services such as Hurricane Electric and SixXS 1. Hooray!

My excitement was unfortunately not going to last for long, and it was indeed barbarically butchered when I discovered that, despite having been granted a full /32 (2^96 IPs), my provider had decided to give its VPS customers just a single /128 address.

JUST. A. SINGLE. ONE.

Oh. God. Why.

Given that IPv6 connectivity was something I really wished for my OpenVPN setup, this was quite a setback. I was left with fundamentally only two reasonable choices:

  1. Get a free /64 from a Hurricane Electric tunnel, and allocate IPv6s for VPN guests from there;
  2. Be a very bad person, set up NAT66, and feel ashamed.

Hurricane Electric is, without doubt, the most orthodox option of the two; it’s free of charge, it gives out /64s, and it’s quite easy to set up.

The main showstopper here is definitely the increased network latency added by two layers of tunneling (VPN -> 6in4 -> IPv6 Internet). Given that by default native IPv6 source addresses are preferred to IPv4 ones, it would have been bad if having a public v6 address slowed down connections that usually have tolerable latencies. Especially if there was a way to get decent RTTs for both IPv6 and IPv4…

And so, with a pang of guilt, I shamefully committed the worst crime.

How to get away with NAT66

The process of setting up NAT usually relies on picking a specially reserved private IP range, to keep our internal network structure from conflicting with the routing rules of the outer network (it may still happen, though, under multiple misconfigured levels of masquerading).

The IPv6 equivalent of 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 was defined in 2005 by the IETF, not without a great deal of confusion first, with the Unique Local Addresses (ULA) specification (RFC 4193). This RFC defines the unique, not publicly routable fc00::/7 range, which is meant for local subnets and comes without the uniqueness guarantees of 2000::/3 (the range from which Global Unicast Addresses (GUA) - i.e. the Internet - are allocated for the time being). Of it, fd00::/8 is the only block really defined so far, and it’s meant to hold all of the /48s your private network may ever need.
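
RFC 4193 asks for the 40 bits that follow the fd byte (the Global ID) to be chosen pseudo-randomly, so that two private networks are unlikely to clash if they are ever merged. The snippet below is a minimal sketch of that layout using Python's ipaddress module; random_ula_prefix is a hypothetical helper, and it draws the Global ID from secrets.randbits instead of following the hash-based procedure suggested by the RFC:

```python
import ipaddress
import secrets

ULA = ipaddress.ip_network("fc00::/7")        # whole ULA range
LOCAL_ULA = ipaddress.ip_network("fd00::/8")  # locally assigned half


def random_ula_prefix():
    """Return a random fdxx:xxxx:xxxx::/48 prefix.

    Layout: 8 bits (0xfd) + 40-bit Global ID = 48-bit prefix.
    """
    global_id = secrets.randbits(40)
    base = (0xFD << 120) | (global_id << 80)
    return ipaddress.IPv6Network((base, 48))


def is_ula(address):
    """True when address falls inside fc00::/7."""
    return ipaddress.ip_address(address) in ULA
```

Every prefix returned by random_ula_prefix() is a /48 inside fd00::/8, leaving 16 bits of subnet ID and 64 bits of interface ID to allocate as needed.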

The next step was to configure my OpenVPN instances to give out ULAs from subnets of my choice to clients, by adding the following lines at the end of my config:

server-ipv6 fd00::1:8:0/112
push "route-ipv6 2000::/3"

I resorted to picking fd00::1:8:0/112 for the UDP server and fd00::1:9:0/112 for the TCP one, due to a limitation in OpenVPN only accepting masks from /64 to /112.
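
This addressing plan can be sanity-checked with Python's ipaddress module. The sketch below only uses the prefixes mentioned in this article; the fd00::1:0:0/96 parent is the one matched by the firewall rules further down:

```python
import ipaddress

udp_net = ipaddress.ip_network("fd00::1:8:0/112")  # UDP server
tcp_net = ipaddress.ip_network("fd00::1:9:0/112")  # TCP server
parent = ipaddress.ip_network("fd00::1:0:0/96")    # prefix used in the firewall

# A /112 leaves 16 host bits, i.e. 65536 addresses per VPN instance
addresses_per_vpn = udp_net.num_addresses

# Both subnets sit inside the /96 and do not overlap each other
both_covered = udp_net.subnet_of(parent) and tcp_net.subnet_of(parent)
disjoint = not udp_net.overlaps(tcp_net)

# The /96 has room for 2**(112 - 96) such /112 subnets
room = 2 ** (112 - parent.prefixlen)

print(addresses_per_vpn, both_covered, disjoint, room)
```

A single rule on the /96 is therefore enough to cover every fd00::1:xxxx:0/112 subnet that may be added later.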

Given that I also want traffic towards the Internet to be forwarded via my NAT, it is also necessary to instruct the server to push a default route to its clients at connection time.

$ ping fd00::1:8:1
PING fd00::1:8:1(fd00::1:8:1) 56 data bytes
64 bytes from fd00::1:8:1: icmp_seq=1 ttl=64 time=40.7 ms

The clients and servers were now able to ping each other through their local addresses without any issue, but the outer network was still unreachable.

I continued the creation of this abomination by configuring the kernel to forward IPv6 packets; this is achieved by setting net.ipv6.conf.all.forwarding = 1 with sysctl or in sysctl.conf (from now on, the rest of this article assumes that you are under Linux).

# cat /etc/sysctl.d/30-ipforward.conf 
net.ipv4.ip_forward=1
net.ipv6.conf.default.forwarding=1
net.ipv6.conf.all.forwarding=1
# sysctl -p /etc/sysctl.d/30-ipforward.conf

Afterwards, the only step left was to set up NAT66, which can be easily done by configuring the stateful firewall provided by Linux’ packet filter.
I personally prefer (and use) the newer nftables to the {ip,ip6,arp,eth}tables mess it is supposed to supersede, because I find it tends to be quite less moronic and clearer to understand (despite the relatively scarce documentation available online, which is sometimes a pain. I wish Linux had the excellent OpenBSD’s pf…).
Feel free to use ip6tables, if that’s what you are already using, and you don’t really feel the need to migrate your ruleset to nft.

This is a shortened, summarised snippet of the rules that I’ve had to put into my nftables.conf to make NAT66 work; I’ve also left the IPv4 rules in for the sake of completeness.

PS: Remember to replace MY_EXTERNAL_IPVx with your own IPv4/IPv6 addresses!

table inet filter {
  [...]
  chain forward {
    type filter hook forward priority 0;

    # allow established/related connections
    ct state {established, related} accept

    # early drop of invalid connections
    ct state invalid drop

    # Allow packets to be forwarded from the VPNs to the outer world
    ip saddr 10.0.0.0/8 iifname "tun*" oifname eth0 accept
    
    # Using fd00::1:0:0/96 allows to match for
    # every fd00::1:xxxx:0/112 I set up
    ip6 saddr fd00::1:0:0/96 iifname "tun*" oifname eth0 accept
  }
  [...]
}
# IPv4 NAT table
table ip nat {
  chain prerouting {
    type nat hook prerouting priority 0; policy accept;
  }
  chain postrouting {
    type nat hook postrouting priority 100; policy accept;
    ip saddr 10.0.0.0/8 oif "eth0" snat to MY_EXTERNAL_IPV4
  }
} 

# IPv6 NAT table
table ip6 nat {
  chain prerouting {
    type nat hook prerouting priority 0; policy accept;
  }
  chain postrouting {
    type nat hook postrouting priority 100; policy accept;

    # Creates a SNAT (source NAT) rule that rewrites the source
    # address of outbound packets to the external IP of eth0
    ip6 saddr fd00::1:0:0/96 oif "eth0" snat to MY_EXTERNAL_IPV6
  }
}

The ip6 nat table and the forward chain in the inet filter table are the most important things to notice here, given that they respectively configure the packet filter to perform NAT66 and to forward packets from the tun* interfaces to the outer world.

After applying the new ruleset with the nft -f <path/to/ruleset> command, I was ready to witness the birth of my little sinful setup. The only thing left was to ping a known IPv6 address from one of the clients, to ensure that forwarding and NAT were working fine. One of the Google DNS servers would suffice:

$ ping 2001:4860:4860::8888
PING 2001:4860:4860::8888(2001:4860:4860::8888) 56 data bytes
64 bytes from 2001:4860:4860::8888: icmp_seq=1 ttl=54 time=48.7 ms
64 bytes from 2001:4860:4860::8888: icmp_seq=2 ttl=54 time=47.5 ms
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=49.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=55 time=50.8 ms

Perfect! NAT66 was working, in its full evil glory, and the client was able to reach the outer IPv6 Internet with round-trip times as fast as IPv4. What was left now was to check if the clients were able to resolve AAAA records; given that I was already using Google’s DNS in /etc/resolv.conf, it should have worked straight away:

$ ping facebook.com
PING facebook.com (157.240.1.35) 56(84) bytes of data.
^C
$ ping -6 facebook.com
PING facebook.com(edge-star-mini6-shv-01-lht6.facebook.com (2a03:2880:f129:83:face:b00c:0:25de)) 56 data bytes
^C

What? Why is ping trying to reach Facebook on its IPv4 address by default instead of trying IPv6 first?

One workaround always leads to another

Well, it turned out that glibc’s getaddrinfo() function, which is generally used to perform name resolution, uses a precedence system to prioritise source-destination address pairs.

I started to suspect that the default behaviour of getaddrinfo() could be to treat local addresses (including ULAs) differently from global IPv6 ones, so I checked gai.conf, the configuration file for getaddrinfo().

label ::1/128       0  # Local IPv6 address
label ::/0          1  # Every IPv6
label 2002::/16     2  # 6to4 IPv6
label ::/96         3  # Deprecated IPv4-compatible IPv6 address prefix
label ::ffff:0:0/96 4  # Every IPv4 address
label fec0::/10     5  # Deprecated
label fc00::/7      6  # ULA
label 2001:0::/32   7  # Teredo addresses

The snippet above shows the default label table used by getaddrinfo().
As I suspected, a ULA is labelled differently (6) from a global unicast address (1). Because the default behaviour specified by RFC 3484 is to prefer source-destination pairs whose labels match, the IPv4 pair (both ends labelled 4) wins over the IPv6 pair (a ULA source labelled 6 and a global destination labelled 1) every time.
Damn, I was so close to committing the perfect crime.
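
To make the label mismatch concrete, here is a rough Python sketch of the lookup: each address gets the label of the longest matching prefix in the table, and a pair is preferred when source and destination labels are equal. It models only the label-matching rule, not the whole RFC 3484 algorithm or glibc’s implementation, and the client addresses fd00::1:1:2 and 10.0.0.2 are hypothetical examples.

```python
import ipaddress

# Default label table, as listed above (prefix -> label).
DEFAULT_LABELS = {
    "::1/128": 0,
    "::/0": 1,
    "2002::/16": 2,
    "::/96": 3,
    "::ffff:0:0/96": 4,
    "fec0::/10": 5,
    "fc00::/7": 6,
    "2001:0::/32": 7,
}

def label_for(address: str, table=DEFAULT_LABELS) -> int:
    """Return the label of the longest prefix in `table` containing `address`."""
    addr = ipaddress.ip_address(address)
    if addr.version == 4:
        # IPv4 addresses are looked up as IPv4-mapped IPv6 (::ffff:a.b.c.d).
        addr = ipaddress.IPv6Address(f"::ffff:{addr}")
    best_len, best_label = -1, None
    for prefix, label in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_label = net.prefixlen, label
    return best_label

def labels_match(source: str, destination: str) -> bool:
    """True when both ends of the pair carry the same label."""
    return label_for(source) == label_for(destination)

# Hypothetical ULA client address towards Facebook's global IPv6 address.
print(labels_match("fd00::1:1:2", "2a03:2880:f129:83:face:b00c:0:25de"))  # False (6 vs 1)
# Hypothetical private IPv4 client address towards Facebook's IPv4 address.
print(labels_match("10.0.0.2", "157.240.1.35"))  # True (4 vs 4)
```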

To make this mess finally work, I had to resort to yet another ugly hack (as if NAT66 using ULAs wasn’t enough): setting a new label table in gai.conf that doesn’t distinguish ULAs from global addresses.

label ::1/128       0  # Local IPv6 address
label ::/0          1  # Every IPv6
label 2002::/16     2  # 6to4 IPv6
label ::/96         3  # Deprecated IPv4-compatible IPv6 address
label ::ffff:0:0/96 4  # Every IPv4 address
label fec0::/10     5  # Deprecated
label 2001:0::/32   7  # Teredo addresses

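With the label for fc00::/7 omitted, ULAs fall under ::/0 together with GUAs, and NATed IPv6 connectivity is used by default. The other entries have to be repeated because glibc stops using its built-in table as soon as gai.conf contains any label line.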
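
As a sanity check, the effect of dropping that one line can be sketched in Python by parsing the label lines and comparing the labels with and without the fc00::/7 entry. The parser below is a simplified assumption that only understands `label <prefix> <value>` lines (it is not glibc’s actual parser), and fd00::1:1:2 is a hypothetical client address.

```python
import ipaddress

def parse_labels(text: str) -> dict:
    """Parse `label <prefix> <value>` lines, ignoring comments and anything else."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 3 and fields[0] == "label":
            table[ipaddress.ip_network(fields[1])] = int(fields[2])
    return table

def label_for(address: str, table: dict) -> int:
    """Return the label of the longest prefix in `table` containing `address`."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    return table[max(matches, key=lambda net: net.prefixlen)]

# The modified table from above, without the fc00::/7 entry.
MODIFIED = """\
label ::1/128       0  # Local IPv6 address
label ::/0          1  # Every IPv6
label 2002::/16     2  # 6to4 IPv6
label ::/96         3  # Deprecated IPv4-compatible IPv6 address
label ::ffff:0:0/96 4  # Every IPv4 address
label fec0::/10     5  # Deprecated
label 2001:0::/32   7  # Teredo addresses
"""
# The default table is the same plus the ULA entry.
DEFAULT = MODIFIED + "label fc00::/7      6  # ULA\n"

ula, gua = "fd00::1:1:2", "2a00:1450:4007:80f::200e"
for name, text in (("default", DEFAULT), ("modified", MODIFIED)):
    table = parse_labels(text)
    print(name, label_for(ula, table), label_for(gua, table))
# default 6 1
# modified 1 1
```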

$ ping google.com
PING google.com(par10s29-in-x0e.1e100.net (2a00:1450:4007:80f::200e)) 56 data bytes

In conclusion

So, yes, NAT66 can be done and it works, but that doesn’t make it any less of a messy, dirty hack. For the sake of getting IPv6 connectivity behind a provider too cheap to give its customers a /64, I had to forgo end-to-end connectivity and bend Unique Local Addresses into doing something they weren’t really devised for.

Was it worth it? Perhaps. My ping under the VPN is now as good on IPv6 as it is on IPv4, and everything works fine, but this came at the cost of an overcomplicated network configuration. This could have been much simpler if everybody had understood how IPv6 differs from IPv4, and that giving out a single address is not the right way to allocate addresses to subscribers anymore.

The NATs we use today are relics of a past where the address space was so small that we had to break the Internet in order to save it. They were a mistake made to fix an even bigger one, a blunder whose effects we now have the chance to undo. We should start taking the ongoing transition period as seriously as it deserves, to avoid falling into the same wrong assumptions yet again.

  1. Ironically, SixXS closed last June because “many ISPs offer IPv6 now”. 

First post!

Welcome, internet stranger, to my humble blog!

I hope I’ll be able to find the time to post, at least once a month, a new story or tutorial about Linux, FreeBSD, system administration or similar CS-related topics, which will, more often than not, involve a full report on something I’ve been tinkering with during my research activity (or just because I liked it).
Everything I publish is written without any pretence of being relevant, correct or even interesting; the only thing I hope for is that this blog will be useful to me in some way, so that I don’t forget what I’ve learned and which mistakes I have already made.

Why?

From the very first moment I turned on a PC in the ’90s, I’ve been hooked on computers and anything revolving around them. Exploring and better understanding how these machines work has been an immense source of entertainment and learning for me, leading to countless hours spent trying every piece of software, gadget or device I was able to lay my hands on.
I cannot say for certain how many times I found myself delving heart and soul into some convoluted install of practically every Linux and BSD distribution I could find, sometimes even resorting to compiling some of them from scratch, just for the sake of better understanding how these complex yet fascinating software packages fit together to create a fully-fledged and functional operating system.

Being as passionate as I was (and still am) about software made the choice of enrolling in Computer Engineering extremely simple. During my university years, I had the time and opportunity to further improve my coding skills, focusing especially on mastering C, C++, Go and, more recently, Rust. I have a passion for compiler technology, and I’ve dabbled in programming language design for a while, implementing a functioning self-hosting compiler, which I hope will be the topic of a future, fully dedicated blog post.

What do you do?

After working for two years at the University of Bologna as both a researcher on distributed ledgers and a system administrator, I decided to change my professional path: I now work as an embedded developer, mostly on the ESP32 platform.

My other hobbies include languages (the ones spoken by people, at least for now!), cooking, writing, astronomy, biology, and science in general.

You wrote something wrong!

If you notice something amiss in either my writing or the contents of the blog, do not hesitate to contact me (in any way you prefer). I plan to add Disqus support directly on blog posts, but in the meantime don’t be shy: simply fork and send me a PR on GitHub, if you wish.