<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>mcilloni's blog</title>
 <link href="https://mcilloni.ovh/atom.xml" rel="self"/>
 <link href="https://mcilloni.ovh/"/>
 <updated>2024-08-24T16:25:37+00:00</updated>
 <id>https://mcilloni.ovh</id>
 <author>
   <name>Marco Cilloni</name>
   <email>cillonimarco@gmail.com</email>
 </author>

 
 <entry>
   <title>Bring your Arch Linux install everywhere</title>
   <link href="https://mcilloni.ovh/2023/10/25/install-arch-on-usb/"/>
   <updated>2023-10-25T00:00:00+00:00</updated>
   <id>https://mcilloni.ovh/2023/10/25/install-arch-on-usb</id>
   <content type="html">&lt;p&gt;The first time I installed Arch Linux was in 2007. In that foregone time, the only supported architecture was 32-bit x86, and the ISOs carried dubious release names such as &lt;em&gt;“0.8 Voodoo”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Despite the fact that Arch shipped an installer that looked &lt;em&gt;suspiciously like&lt;/em&gt; FreeBSD’s, you still had to configure a great deal of stuff by hand, using a slew of files with BSD-sounding names (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rc.conf&lt;/code&gt;, anyone?). Xorg was the biggest PITA&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, but tools such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xorgconfigure&lt;/code&gt; and shoddily patched Xorg servers helped users achieve the &lt;em&gt;“Linux dream”&lt;/em&gt;, which at the time mostly consisted of wobbly Beryl windows and spinny desktop cubes. That was the real deal back then, and nothing gave you more street cred than having windows that wobbled like cubes of jelly.&lt;/p&gt;

&lt;p&gt;Those days are (somewhat sadly) long gone. Today’s GNU/Linux distros are rather simple to install and set up, often with little to no configuration required (unless you are unlucky, of course). Distros targeted at advanced users, such as Arch, still require you to configure everything to your liking by yourself, but the overall stack (kernel, udev, Xorg, Wayland, …) is now exceptionally good at automatically configuring itself based on the current hardware. UEFI also smooths over a lot of warts in the boot process.&lt;/p&gt;

&lt;p&gt;This, alongside ultra-fast USB drive bays, makes self-configuring portable installs a concrete reality. I have been using Arch Linux installs from SSDs in USB caddies for years now, for work, system recovery and easy access to a ready-to-use environment from any computer. Despite the tradeoffs, it’s remarkably solid and convenient.&lt;/p&gt;

&lt;p&gt;In this post, I’ll show step-by-step (with examples) how to install Arch Linux on a USB drive, and how to make it bootable everywhere &lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, including virtual machines. I will try to cover as many corner cases as possible, but as always feel free to comment or contact me if you think something may be missing.&lt;/p&gt;

&lt;p&gt;With a few adaptations, this guide may also be helpful to install Arch Linux on a non-mobile drive, if you so desire.&lt;/p&gt;

&lt;h1 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;An SSD drive in a USB enclosure&lt;/strong&gt;. In my experience, &lt;em&gt;“pre-made”&lt;/em&gt; external USB disks are designed for storage and tend to perform way worse than internal drives in an external caddy. An enclosure also has the extra advantage of improving durability &lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and allows you to pop the SSD in a computer if you &lt;em&gt;really&lt;/em&gt; need to.&lt;/p&gt;

    &lt;p&gt;The SSD can be either SATA or NVMe, with the latter being the &lt;em&gt;obviously&lt;/em&gt; better choice. I often run out of space on my machines, so I tend to use whatever SSD I have lying around. I honestly wouldn’t bother with anything smaller than 128GB, unless you really plan to only use it for system recovery or light browsing.&lt;/p&gt;

    &lt;p&gt;NEVER use mechanical drives anymore or, at least, avoid them for anything that isn’t cold storage or NAS. This is even more true for this project: not only is spinning rust &lt;em&gt;atrociously slow&lt;/em&gt; (most if not all 2.5” drives only spin at 5400 RPM), but such drives are also outrageously fragile, and they almost always draw more power than the average USB port can supply.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;An x86-64 computer&lt;/strong&gt;. It doesn’t have to run Arch Linux, but it makes the whole process easier (especially if you wish to clone your current system to the USB drive - see below). A viable option is also to boot from the Arch Linux ISO and install the system from there, as long as you have another machine with a working internet connection to Google any issues you might encounter.&lt;/p&gt;

    &lt;p&gt;Note: I will only cover x86-64 with UEFI because that’s by far the easiest and most reliable setup. BIOS requires tinkering with more complex bootloaders (such as SYSLINUX), while 32-bit x86 is not supported by Arch Linux anymore. &lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;A working internet connection&lt;/strong&gt; - for obvious reasons. If you don’t have an internet connection, then you probably have bigger problems to worry about than installing Arch Linux on a USB drive.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;setting-up-the-drive&quot;&gt;Setting up the drive&lt;/h1&gt;

&lt;p&gt;Given that we are talking about a portable install, disk encryption is nothing short of mandatory. In general, I think that encrypting your system is ALWAYS a good idea, even if you don’t plan to carry it around much &lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;The choices of filesystem and encryption scheme are up to you, but there are basically three options I’ve used and I can recommend:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LUKS&lt;/code&gt; with a classic filesystem&lt;/strong&gt;, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ext4&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F2FS&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XFS&lt;/code&gt;. This is the simplest option, and it is probably more than enough for most people.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ZFS&lt;/code&gt; with native encryption&lt;/strong&gt;. I must admit, this &lt;em&gt;may&lt;/em&gt; be somewhat overkill, but it’s also my favourite because it’s such a great experience overall. While ZFS probably isn’t the best choice for a removable hard drive, it’s outstandingly solid, and it supports compression, snapshots and checksumming - all things I do want from a system that runs from what’s potentially a flimsy USB cable. &lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; I have yet to lose any data due to ZFS itself, and I have been using it for the best part of a decade now.&lt;/p&gt;

    &lt;p&gt;ZFS is also what I use on all my installs, so I can easily migrate datasets from/to a USB-installed system using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs send&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs receive&lt;/code&gt; commands if I need to, or quickly back up the whole system.&lt;/p&gt;

    &lt;p&gt;Native ZFS encryption, while not as thoroughly tested and secure as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LUKS&lt;/code&gt;, is still probably fine for most people, while also ridiculously convenient to set up. If that’s not enough for you, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ZFS&lt;/code&gt; on top of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LUKS&lt;/code&gt; is still an acceptable choice (albeit more complicated to pull off).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LUKS&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BTRFS&lt;/code&gt;&lt;/strong&gt;. I have also used this setup in the past, and there’s a lot to like about it, such as the fact that BTRFS supports lots of ZFS’s best features without requiring you to install any out-of-tree kernel modules - a very nice plus indeed.&lt;/p&gt;

    &lt;p&gt;Sadly, I have been burnt by BTRFS so many times in the past 12 years that I can’t honestly say I would want to entrust it with my data any time soon. YMMV, so maybe give it a try if you’re curious.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Regardless of that, I will now cover all three options in the next sections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One important note&lt;/strong&gt;: I &lt;em&gt;deliberately&lt;/em&gt; decided to leave kernel images unencrypted (in UKI form) in the ESP, sticking with full encryption for just the root filesystem. My main concern is about protecting the data stored on the drive in case it’s lost or broken, and I assume nobody will attempt evil maid attacks. &lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; Encrypting the kernel is also probably rather pointless without a signed bootloader and kernel - something that’s very hard to set up for a portable USB install.&lt;/p&gt;

&lt;p&gt;I also will not show how to set up UEFI Secure Boot. While having Secure Boot enabled is a good thing in general, it makes setting the system up vastly more complex, for debatable benefits. This setup is in general not meant to be used for security-critical systems, but to provide a convenient way to carry a working environment around between machines you have complete control of.&lt;/p&gt;

&lt;h1 id=&quot;0-optional-obtaining-a-viable-zfs-setup-environment&quot;&gt;0. (Optional) Obtaining a viable ZFS setup environment&lt;/h1&gt;

&lt;p&gt;Unfortunately, ZFS on Linux is an out-of-tree filesystem. This basically means that it’s not bundled with the kernel like the other filesystems; instead, it’s distributed by an independent project and has to be compiled and installed separately. This is due to a complex licensing incompatibility between the &lt;strong&gt;CDDL&lt;/strong&gt; license used by OpenZFS, and the &lt;strong&gt;GPLv2&lt;/strong&gt; license used by the Linux kernel, which makes it impossible to ever bundle ZFS and Linux together.&lt;/p&gt;

&lt;p&gt;If you intend to use ZFS, you must follow these steps first; if not, just skip to section 1.&lt;/p&gt;

&lt;p&gt;This procedure varies depending on the distribution you are using:&lt;/p&gt;

&lt;h2 id=&quot;01-arch-linux&quot;&gt;0.1. Arch Linux&lt;/h2&gt;

&lt;p&gt;Arch doesn’t distribute ZFS due to the aforementioned licensing issues, but it’s readily available and actively maintained by the &lt;a href=&quot;https://github.com/archzfs/archzfs&quot;&gt;ArchZFS project&lt;/a&gt;, both in the form of AUR &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PKGBUILD&lt;/code&gt;s and in the third party repository &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archzfs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The packages you are going to need are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-utils&lt;/code&gt; and a module compatible with your current kernel; the latter can either come from a kernel-specific package (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-linux-lts&lt;/code&gt;), or a DKMS one (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-dkms&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If you opt to install the packages from ArchZFS, add the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[archzfs]&lt;/code&gt; repository to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman.conf&lt;/code&gt; (look at the &lt;a href=&quot;https://wiki.archlinux.org/title/Unofficial_user_repositories#archzfs&quot;&gt;Arch Wiki&lt;/a&gt; for the correct URL), remembering to import the PGP key using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman-key -r KEY&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman-key --lsign-key KEY&lt;/code&gt;.&lt;/p&gt;
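
&lt;p&gt;As a rough sketch of what this looks like in practice: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;URL&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;KEY&lt;/code&gt; below are placeholders for the server URL and key ID currently listed on the Arch Wiki, and installing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-dkms&lt;/code&gt; together with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;linux-headers&lt;/code&gt; assumes you run the stock &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;linux&lt;/code&gt; kernel. Adapt the package names if you use a different kernel or prefer a kernel-specific module package.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cat &amp;lt;&amp;lt;&apos;EOF&apos; &amp;gt;&amp;gt; /etc/pacman.conf

[archzfs]
Server = URL
EOF
# pacman-key -r KEY
# pacman-key --lsign-key KEY
# pacman -Syu zfs-utils zfs-dkms linux-headers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;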

&lt;p&gt;If you need to boot from an ISO, it’s a bit more complicated, and covering the details here would take too long. Take a look at &lt;a href=&quot;https://github.com/eoli3n/archiso-zfs&quot;&gt;this repo&lt;/a&gt; for a quick way to generate one. If you feel adventurous, you can also try to use an ISO with ZFS support from another distribution (such as Ubuntu) and follow the instructions below to set up a working environment.&lt;/p&gt;

&lt;h2 id=&quot;02-other-linux-distributions&quot;&gt;0.2. Other Linux distributions&lt;/h2&gt;

&lt;p&gt;If you are starting from another distribution, you will need to visit the &lt;a href=&quot;https://openzfs.github.io/openzfs-docs/Getting%20Started/index.html&quot;&gt;OpenZFS on Linux&lt;/a&gt; website and follow the instructions for your distribution (if included).&lt;/p&gt;

&lt;p&gt;This will generally involve adding a third party repository (except for Ubuntu, which has ZFS in its main repos), and following the instructions.&lt;/p&gt;

&lt;p&gt;For instance, on Debian it’s recommended to enable the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backports&lt;/code&gt; repository, in order to install a more up-to-date version. This also requires modifying APT’s settings by pinning the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backports&lt;/code&gt; repository to a higher priority for the ZFS packages.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cat &amp;lt;&amp;lt;&apos;EOF&apos; &amp;gt; /etc/apt/sources.list.d/bookworm-backports.list
deb http://deb.debian.org/debian bookworm-backports main contrib
deb-src http://deb.debian.org/debian bookworm-backports main contrib
EOF
# cat &amp;lt;&amp;lt;&apos;EOF&apos; &amp;gt; /etc/apt/preferences.d/90_zfs
Package: src:zfs-linux
Pin: release n=bookworm-backports
Pin-Priority: 990
EOF
# apt update
# apt install dpkg-dev linux-headers-generic linux-image-generic
# apt install zfs-dkms zfsutils-linux
[...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Regardless of what you are using, you should now have a working ZFS setup. You can verify this by running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zpool status&lt;/code&gt;; if it prints &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no pools available&lt;/code&gt; instead of complaining about missing kernel modules, you are good to go, and you may start setting up the drive.&lt;/p&gt;
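
&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zpool status&lt;/code&gt; does complain, a few extra checks can help narrow down whether the module was built and loaded at all. This is only a sketch: it assumes a reasonably recent OpenZFS release (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs version&lt;/code&gt; subcommand is not available on very old ones), and the exact output will vary between systems.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# modprobe zfs
# lsmod | grep -w zfs
# zfs version
# zpool status
no pools available
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;modprobe&lt;/code&gt; fails, the usual suspect is a module built for a different kernel than the one currently running.&lt;/p&gt;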

&lt;h1 id=&quot;1-partitioning&quot;&gt;1. Partitioning&lt;/h1&gt;

&lt;p&gt;From either the Arch Linux ISO or your existing system, run a disk partitioning tool. I’m personally partial to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdisk&lt;/code&gt;, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parted&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fdisk&lt;/code&gt; are also fine &lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parted&lt;/code&gt; also has a graphical frontend, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gparted&lt;/code&gt;, which is very easy to use, in case you are afraid to mess up the partitioning and prefer having clear feedback on what you’re doing &lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;The partitioning scheme is generally up to you, with the bare minimum being:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;A FAT32 &lt;strong&gt;EFI System Partition&lt;/strong&gt; (ESP), possibly with a comfortable size of at least 300 MiB. This is where the UKI (and optionally, a bootloader) will be stored. I do not recommend going for BIOS/MBR, given that x64 computers have supported UEFI for more than a decade now.&lt;/p&gt;

    &lt;p&gt;The ESP will be mounted at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/boot/efi&lt;/code&gt; in the final system.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A &lt;strong&gt;root partition&lt;/strong&gt;. This is where the system will be installed and all files will be stored. The size is up to you, but I would recommend at least 20 GiB for a &lt;em&gt;very&lt;/em&gt; minimal system. While the system itself doesn’t necessarily need a lot of space, with limited storage space you will find yourself often cleaning up the package caches, logs and temporary files that litter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;The root partition will also be our encryption root, and it will be formatted with either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LUKS&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ZFS&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While some guides may suggest also creating a swap partition, I generally don’t recommend using one when booting from USB. Swapping to storage will quickly turn into a massive bottleneck and slow the whole system down to a crawl. If you really need swap, I would recommend looking into alternatives such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zram&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zswap&lt;/code&gt;, which are probably a wiser choice.&lt;/p&gt;
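
&lt;p&gt;For reference, a compressed swap device in RAM can be set up by hand on the running system with something like the following. Treat it as a minimal sketch: the 4 GiB size, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zstd&lt;/code&gt; algorithm and the priority value are assumptions I picked for the example, and the setup does not survive a reboot unless you automate it.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# modprobe zram
# zramctl /dev/zram0 --algorithm zstd --size 4G
# mkswap /dev/zram0
# swapon --priority 100 /dev/zram0
# swapon --show
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;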

&lt;p&gt;Also, it goes without saying, do not hibernate a system that runs from a USB drive, unless you plan on resuming it on the same machine.&lt;/p&gt;

&lt;h2 id=&quot;11-creating-the-partition-label&quot;&gt;1.1 Creating the partition label&lt;/h2&gt;

&lt;p&gt;Feel free to skip to the next step if you already have a partition label on your drive, with two sufficiently sized partitions for the ESP and the root filesystem, and you don’t want to use the whole drive.&lt;/p&gt;

&lt;p&gt;First, identify the device name of your USB drive. In my case, it’s a Samsung 960 EVO 250 GB NVMe drive inside a USB 3.2 enclosure:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l1&lt;/span&gt; /dev/disk/by-id/usb&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;
lrwxrwxrwx 1 root root  9 Aug 10 18:42 /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0 -&amp;gt; ../../sdb
lrwxrwxrwx 1 root root 10 Aug 10 18:42 /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part1 -&amp;gt; ../../sdb1
lrwxrwxrwx 1 root root 10 Aug 10 18:42 /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2 -&amp;gt; ../../sdb2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I can’t stress this enough: &lt;strong&gt;&lt;em&gt;DO NOT USE RAW DEVICE NAMES WHEN PARTITIONING!&lt;/em&gt;&lt;/strong&gt; Always use the stable names under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/disk&lt;/code&gt; when performing destructive operations on block devices - it’s not a matter of &lt;em&gt;if&lt;/em&gt; you will lose data, but &lt;em&gt;when&lt;/em&gt;. &lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/disk/by-id&lt;/code&gt; is by far the best choice due to how it clearly names devices by bus type, which makes it very hard to mix up devices by mistake.&lt;/p&gt;
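
&lt;p&gt;Before touching anything, it doesn’t hurt to double check what the stable name actually points to. A small sketch, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISK&lt;/code&gt; is just a shell variable I introduce for convenience; the transport column printed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lsblk&lt;/code&gt; should read &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;usb&lt;/code&gt; for the drive you intend to wipe:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ DISK=/dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0
$ readlink -f &quot;$DISK&quot;
/dev/sdb
$ lsblk -o NAME,TRAN,SIZE,MODEL &quot;$DISK&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;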

&lt;p&gt;Once you have identified the device name, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdisk&lt;/code&gt; (or whatever you prefer) and create a new GPT label in order to wipe the existing partition table.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-gdisk&quot;&gt;# gdisk /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0
GPT fdisk (gdisk) version 1.0.9.1

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries in memory.

Command (? for help): o
This option deletes all partitions and creates a new protective MBR.
Proceed? (Y/N): Y

Command (? for help): p
Disk /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0: 488397168 sectors, 232.9 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 55F7C0C7-35B3-44C5-A2C4-790FE33014FD
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 488397134
Partitions will be aligned on 2048-sector boundaries
Total free space is 488397101 sectors (232.9 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name

Command (? for help):
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;12-creating-the-efi-system-partition-esp&quot;&gt;1.2. Creating the EFI System Partition (ESP)&lt;/h2&gt;

&lt;p&gt;Now that we have a clean slate, we can create an EFI partition (GUID type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EF00&lt;/code&gt;). The ESP will not be encrypted and will contain the Unified Kernel Image the system will boot from; for this reason, I recommend giving it at least 300 MiB of space in order to avoid unpleasant surprises when updating the kernel.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-gdisk&quot;&gt;Command (? for help): n               
Partition number (1-128, default 1): 1
First sector (34-488397134, default = 2048) or {+-}size{KMGTP}: 
Last sector (2048-488397134, default = 488396799) or {+-}size{KMGTP}: +300M
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): ef00
Changed type of partition to &apos;EFI system partition&apos;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice how I left the first sector blank, and I specified &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+300M&lt;/code&gt; as the last sector. This is because I want &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdisk&lt;/code&gt; to automatically align the partition to the nearest alignment boundary (a multiple of 2048 sectors in this case). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdisk&lt;/code&gt; tends to be quite good at automatically deducing the correct alignment, a process that tends to be finicky with USB enclosures.&lt;/p&gt;
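
&lt;p&gt;If you want to sanity-check the numbers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdisk&lt;/code&gt; comes up with, the arithmetic is simple enough to redo in a shell. The sketch below assumes the 512-byte logical sectors reported for my drive; with a different sector size the figures change accordingly.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ echo $(( 2048 * 512 ))                # alignment boundary in bytes (1 MiB)
1048576
$ echo $(( 300 * 1024 * 1024 / 512 ))   # 300 MiB expressed in 512-byte sectors
614400
$ echo $(( 2048 + 614400 - 1 ))         # last sector of a 300 MiB partition starting at 2048
616447
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The next free sector, 616448, is again a multiple of 2048, so whatever partition follows the ESP starts aligned as well.&lt;/p&gt;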

&lt;p&gt;I also highly recommend giving the partition a GPT name (which will be visible under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/disk/by-partlabel&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-gdisk&quot;&gt;Command (? for help): c
Using 1
Enter name: ExtESP

Command (? for help):
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;13-creating-the-root-partition&quot;&gt;1.3. Creating the root partition&lt;/h2&gt;

&lt;p&gt;Finally, we can create the root partition. This partition will be encrypted, and will contain the system and all user data. I recommend giving it at least 20 GiB of space, but feel free to use more if you have some spare room.&lt;/p&gt;

&lt;p&gt;For instance, the following command will create a partition using all of the remaining space on the drive:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-gdisk&quot;&gt;Command (? for help): n
Partition number (2-128, default 2): 2
First sector (34-488397134, default = 616448) or {+-}size{KMGTP}: 
Last sector (616448-488397134, default = 488396799) or {+-}size{KMGTP}: 
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 
Changed type of partition to &apos;Linux filesystem&apos;

Command (? for help): c
Partition number (1-2): 2
Enter name: ExtRoot
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Feel free to leave the default GUID type (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8300&lt;/code&gt;) for the root partition, as it will be changed when formatting the partition later.&lt;/p&gt;

&lt;h2 id=&quot;14-writing-the-partition-table&quot;&gt;1.4. Writing the partition table&lt;/h2&gt;

&lt;p&gt;Once you are done, you should have a partition table resembling the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-gdisk&quot;&gt;Command (? for help): p
Disk /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0: 488397168 sectors, 232.9 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): AABE7D47-3477-4AB6-A7C1-BC66F87CB1C1
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 488397134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2349 sectors (1.1 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          616447   300.0 MiB   EF00  ExtESP
   2          616448       488396799   232.6 GiB   8300  ExtRoot
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If everything looks OK, proceed to commit the partition table to disk. Again, ensure that you are writing to the correct device, that it does not contain any important data, and no old partition is mounted:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-gdisk&quot;&gt;Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): Y
OK; writing new GUID partition table (GPT) to /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0.
The operation has completed successfully.
&lt;/code&gt;&lt;/pre&gt;
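
&lt;p&gt;If you find yourself repeating this procedure often, the same layout can also be scripted with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sgdisk&lt;/code&gt;, the non-interactive sibling of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdisk&lt;/code&gt;. What follows is a sketch of an equivalent of the interactive session above, not something I would run blindly: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISK&lt;/code&gt; is a shell variable introduced for the example, and the first command wipes the partition table without asking for confirmation.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# DISK=/dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0
# sgdisk --clear &quot;$DISK&quot;
# sgdisk --new=1:0:+300M --typecode=1:ef00 --change-name=1:ExtESP &quot;$DISK&quot;
# sgdisk --new=2:0:0 --typecode=2:8300 --change-name=2:ExtRoot &quot;$DISK&quot;
# sgdisk --print &quot;$DISK&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--new&lt;/code&gt;, a start of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; means “first available aligned sector” and an end of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; means “use all the remaining space”, mirroring the defaults accepted interactively.&lt;/p&gt;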

&lt;h1 id=&quot;2-creating-the-filesystems&quot;&gt;2. Creating the filesystems&lt;/h1&gt;

&lt;p&gt;Now that we created a viable partition layout, we can proceed by creating filesystems on it.&lt;/p&gt;

&lt;p&gt;As I’ve mentioned before, there are several potential choices regarding what filesystems and encryption schemes to use. Regardless of what you’ll end up choosing, the ESP must always be formatted as FAT (either FAT32 or FAT16):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mkfs.fat -F 32 -n EXTESP /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part1
mkfs.fat 4.2 (2021-01-31)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After doing this, proceed depending on what filesystem you want to use.&lt;/p&gt;

&lt;h2 id=&quot;21-straighforward-luks-with-a-native-filesystem&quot;&gt;2.1. Straightforward: LUKS with a native filesystem&lt;/h2&gt;

&lt;p&gt;LUKS with a simple filesystem is by far the simplest solution, and (probably) the “safest” in terms of setup complexity. LUKS can also be used with LVM2 for more “advanced” solutions, but that goes beyond the scope of this post.&lt;/p&gt;

&lt;p&gt;As I’ve mentioned previously, we are going to set up full encryption for system and user data, but not for the kernel, which will reside in UKI form inside the ESP. If you are interested in a more “paranoid” setup, you can find more information in the &lt;a href=&quot;https://wiki.archlinux.org/title/Dm-crypt/Encrypting_an_entire_system#Encrypted_boot_partition_(GRUB)&quot;&gt;Arch Wiki&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;211-creating-the-luks-container&quot;&gt;2.1.1. Creating the LUKS container&lt;/h3&gt;

&lt;p&gt;First, we need to format the previously created partition as a LUKS container, picking a good passphrase in the process. What makes a good passphrase &lt;a href=&quot;https://www.schneier.com/blog/archives/2014/03/choosing_secure_1.html&quot;&gt;is a whole topic in itself&lt;/a&gt;, and recommendations tend to change frequently following the current trends and cracking techniques. Personally, I recommend using a passphrase that is easy for you to remember but hard for a computer to guess, such as a (very) long password full of spaces, letters, numbers and special characters.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cryptsetup luksFormat --label ExtLUKS /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2
WARNING!
========
This will overwrite data on /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2 irrevocably.

Are you sure? (Type &apos;yes&apos; in capital letters): YES 
Enter passphrase for /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2: 
Verify passphrase: 
Ignoring bogus optimal-io size for data device (33553920 bytes).
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that I deliberately stuck with the default settings, which are good enough for most use cases.&lt;/p&gt;

&lt;p&gt;After creating the container, we need to open it in order to format it with a filesystem:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cryptsetup open /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2 ExtLUKS
Enter passphrase for /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExtLUKS&lt;/code&gt; is an arbitrary name I chose for the container - feel free to pick whatever name you like. Whatever your choice is, after successfully unlocking the LUKS container it will be available as a block device under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/mapper/&amp;lt;name&amp;gt;&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ls -l1 /dev/mapper/ExtLUKS
lrwxrwxrwx 1 root root 7 Aug 28 22:45 /dev/mapper/ExtLUKS -&amp;gt; ../dm-0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
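
&lt;p&gt;If you are curious about what the defaults actually resulted in, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cryptsetup&lt;/code&gt; can report both the state of the open mapping and the contents of the LUKS header. This is just an optional check; the cipher, key size and key derivation function you will see depend on your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cryptsetup&lt;/code&gt; version, so I’m not reproducing the output here.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cryptsetup status ExtLUKS
# cryptsetup luksDump /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;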

&lt;h3 id=&quot;212-formatting-the-container&quot;&gt;2.1.2. Formatting the container&lt;/h3&gt;

&lt;p&gt;Now that we have an unlocked LUKS container, we can format it with a “real” filesystem. Note that, if you wish to use LVM2, this would be the right time to create the LVM volumes.&lt;/p&gt;

&lt;p&gt;Whichever filesystem you plan to use over LUKS, the procedure is the same: &lt;em&gt;ext4&lt;/em&gt;, &lt;em&gt;F2FS&lt;/em&gt;, &lt;em&gt;XFS&lt;/em&gt; and &lt;em&gt;Btrfs&lt;/em&gt; are all created via their respective &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkfs&lt;/code&gt; tool:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mke2fs -t ext4 -L ExtRoot /dev/mapper/ExtLUKS # for Ext4
# mkfs.f2fs -l ExtRoot /dev/mapper/ExtLUKS # for F2FS
# mkfs.xfs -L ExtRoot /dev/mapper/ExtLUKS # for XFS
# mkfs.btrfs -L ExtRoot /dev/mapper/ExtLUKS # for Btrfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
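
&lt;p&gt;As an optional sanity check, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lsblk&lt;/code&gt; can show the filesystem type and label that ended up on the unlocked container (assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExtLUKS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExtRoot&lt;/code&gt; names used above):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# lsblk -o NAME,FSTYPE,LABEL /dev/mapper/ExtLUKS
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;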

&lt;h2 id=&quot;22-advanced-btrfs-subvolumes&quot;&gt;2.2. Advanced: Btrfs subvolumes&lt;/h2&gt;

&lt;p&gt;If you picked a “plain” filesystem such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ext4&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F2FS&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XFS&lt;/code&gt;, you can skip this section.&lt;/p&gt;

&lt;p&gt;In case you picked &lt;em&gt;Btrfs&lt;/em&gt;, it’s a good idea to create subvolumes for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt; in order to take advantage of Btrfs’s snapshotting capabilities.&lt;/p&gt;

&lt;p&gt;Compared to older filesystems, Btrfs and ZFS have the built-in capability to create logical subvolumes (&lt;em&gt;datasets&lt;/em&gt; in ZFS parlance) that can be mounted, snapshotted and managed independently. This is somewhat similar to LVM2, but immensely more powerful and flexible; all subvolumes share the same storage pool and can have different properties enabled (such as compression or CoW), or ad-hoc quotas and mount options.&lt;/p&gt;

&lt;p&gt;Unlike most other filesystems, Btrfs (and ZFS) filesystems must be online and mounted before you can perform operations on them, such as scrubbing (an operation akin to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fsck&lt;/code&gt;) and subvolume management.&lt;/p&gt;

&lt;h3 id=&quot;221-mounting-the-root-subvolume&quot;&gt;2.2.1. Mounting the root subvolume&lt;/h3&gt;

&lt;p&gt;Mount the filesystem on a temporary mountpoint:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mount /dev/mapper/ExtLUKS /path/to/temp/mount
# mount | grep ExtLUKS
/dev/mapper/ExtLUKS on /path/to/temp/mount type btrfs (rw,relatime,ssd,space_cache=v2,subvolid=5,subvol=/)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mtab&lt;/code&gt; includes the options &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subvolid=5,subvol=/&lt;/code&gt;. This means that the &lt;em&gt;default subvolume&lt;/em&gt; has been mounted, identified with the ID &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;5&lt;/code&gt; and named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&lt;/code&gt;. This is the subvolume that will be mounted by default, acting as the root parent of all other subvolumes.&lt;/p&gt;

&lt;h3 id=&quot;222-creating-the-subvolumes&quot;&gt;2.2.2. Creating the subvolumes&lt;/h3&gt;

&lt;p&gt;Now we can create the subvolumes for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt;, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@home&lt;/code&gt; respectively:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# btrfs subvolume create /path/to/temp/mount/@     # for /
Created subvolume &apos;/path/to/temp/mount/@&apos;
# btrfs subvolume create /path/to/temp/mount/@home # for /home
Created subvolume &apos;/path/to/temp/mount/@home&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&lt;/code&gt; prefix with Btrfs subvolumes is a long-established convention. The situation should now look like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# btrfs subvolume list -p /path/to/temp/mount
ID 256 gen 8 parent 5 top level 5 path @
ID 257 gen 9 parent 5 top level 5 path @home
# ls -l1 /path/to/temp/mount
@
@home
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice how in &lt;em&gt;Btrfs&lt;/em&gt; subvolumes &lt;em&gt;are also subdirectories&lt;/em&gt; of their parent subvolume. This is very useful when mounting the disk as an external drive. Subvolumes can also be mounted directly by passing either the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subvol&lt;/code&gt; or the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subvolid&lt;/code&gt; option to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mount&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Before moving to the next step, remember to unmount the root subvolume.&lt;/p&gt;
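
&lt;p&gt;Assuming the temporary mountpoint from the previous steps, this boils down to:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# umount /path/to/temp/mount
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;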

&lt;h2 id=&quot;23-advanced-zfs-with-native-encryption&quot;&gt;2.3. Advanced: ZFS with native encryption&lt;/h2&gt;

&lt;p&gt;My personal favourite, ZFS is a rock-solid system that’s ubiquitous in data storage, thanks to its impressive stability record and advanced features such as deduplication, built-in RAID, …&lt;/p&gt;

&lt;p&gt;Albeit arguably less flexible than Btrfs, which was originally designed as a Linux-oriented replacement for the CDDL-encumbered ZFS, in my experience ZFS tends to be vastly more stable and reliable in day-to-day use. In the last 6 years, I have almost exclusively used ZFS on all my computers, and I have yet to lose any data due to ZFS itself.&lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;ZFS is quite different from other filesystems. Instead of filesystems, ZFS works on &lt;em&gt;pools&lt;/em&gt;, which consist of collections of one or more block devices (potentially in &lt;em&gt;RAID&lt;/em&gt; configurations). Every pool can be divided into a hierarchy of &lt;em&gt;datasets&lt;/em&gt;, which are roughly equivalent to subvolumes in Btrfs.&lt;/p&gt;

&lt;p&gt;Datasets can be mounted independently, and can each have their own properties, such as compression, quotas, and so on, which may either be set per-dataset or inherited from the parent dataset.&lt;/p&gt;

&lt;p&gt;Unlike Btrfs, ZFS manages its own mountpoints as inherent properties of the dataset. This is both incredibly useful and bothersome; on one hand, having mountpoints intrinsically related to datasets allows for easier management and more clarity than legacy mounting, but on the other hand it may turn confusing and inflexible when managing complex setups. In any case, you can opt out of letting ZFS manage mountpoints for a given dataset by setting the mountpoint to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;legacy&lt;/code&gt;, and mounting it manually as you would with any other filesystem.&lt;/p&gt;
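
&lt;p&gt;As a rough sketch of what legacy mounting looks like, assuming a hypothetical dataset called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pool/data&lt;/code&gt; (not one of the datasets created in this guide) and an already existing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/mnt/data&lt;/code&gt; directory:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zfs set mountpoint=legacy pool/data # hypothetical dataset name
# mount -t zfs pool/data /mnt/data    # from now on, mounted like any other filesystem
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;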

&lt;h3 id=&quot;231-creating-the-zfs-pool&quot;&gt;2.3.1. Creating the ZFS pool&lt;/h3&gt;

&lt;p&gt;Our case is quite simple, given that we only have a single drive.&lt;/p&gt;

&lt;p&gt;Create a new pool called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extzfs&lt;/code&gt; (or whatever you prefer), being careful to specify an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;altroot&lt;/code&gt; via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-R&lt;/code&gt;. Otherwise, the new mountpoints will override your system ones as soon as you set up the pool:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zpool create -m none -R /tmp/mnt extzfs /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You may have to specify &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-f&lt;/code&gt; if the partition wasn’t empty before. Note the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-m none&lt;/code&gt; option, which will set no mountpoint for the root dataset of the pool itself. Compared to Btrfs, ZFS doesn’t expose datasets as subdirectories of their parent pool, so it makes little sense to allow mounting the root dataset.&lt;/p&gt;
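
&lt;p&gt;To verify that the pool came up as expected, you can query its health and its alternate root. This is an optional check rather than a required step:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zpool status extzfs
# zpool get altroot extzfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;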

&lt;h3 id=&quot;232-creating-an-encrypted-dataset-root&quot;&gt;2.3.2. Creating an encrypted dataset root&lt;/h3&gt;

&lt;p&gt;As mentioned before, we are going to use native ZFS encryption, which is generally considered safe, but it &lt;em&gt;may&lt;/em&gt; not be as water-tight and battle-tested as LUKS; this is generally not a problem for most people except the most paranoid. If you count yourself among their ranks, remember that you can always run ZFS on top of LUKS instead. It may end up being more complex, but it’s a viable option.&lt;/p&gt;

&lt;p&gt;First, we need to create an encrypted dataset; this will act as the encryption root for all the other datasets. We will (arbitrarily) call it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extzfs/encr&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt -o mountpoint=none -o compression=lz4 extzfs/encr
Enter new passphrase:
Re-enter new passphrase:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that we are using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;passphrase&lt;/code&gt; key format alongside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prompt&lt;/code&gt; key location. This means that ZFS will expect the encryption key in the form of a password entered by the user. Another option would be to use a key file, which is arguably more secure but also incredibly more cumbersome to use for the root device, so I’ll leave how to use one as an exercise to the reader.&lt;/p&gt;

&lt;p&gt;As with LUKS, I recommend picking a safe password that’s easy to remember but hard to guess. See paragraph 2.1.1. for more details.&lt;/p&gt;

&lt;p&gt;Also, as in the Btrfs case, I will enable compression in order to save some space on my small SSD. This &lt;em&gt;may&lt;/em&gt; potentially leak a bit of information about the data contained inside the encrypted container, but it’s generally not a problem for most people.&lt;/p&gt;

&lt;h3 id=&quot;233-creating-the-system-dataset&quot;&gt;2.3.3. Creating the system dataset&lt;/h3&gt;

&lt;p&gt;Now that we have an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encryptionroot&lt;/code&gt;, we can create all the datasets we need under it, and they will be encrypted and unlocked automatically along with it.&lt;/p&gt;

&lt;p&gt;Keeping in mind that it’s good practice to create a hierarchy that allows for the quick and easy creation of new boot environments, under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encr&lt;/code&gt; we are going to create:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt; dataset, which will not be mounted, and under which we will place datasets containing system images&lt;/li&gt;
  &lt;li&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;home&lt;/code&gt; dataset, which will act as the root for all user-data datasets&lt;/li&gt;
  &lt;li&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;default&lt;/code&gt; dataset under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt;, which will be mounted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&lt;/code&gt; and contain the system we’re going to install&lt;/li&gt;
  &lt;li&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;logs&lt;/code&gt; dataset for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var/log&lt;/code&gt; under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt;, which is required to be a separate dataset in order to enable the ACLs required by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-journald&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;users&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt; datasets under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;home&lt;/code&gt;, which will respectively be mounted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/root&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zfs create -o mountpoint=none extzfs/encr/root
# zfs create -o mountpoint=none extzfs/encr/home
# zfs create -o mountpoint=/ extzfs/encr/root/default
# zfs create -o mountpoint=/var/log -o acltype=posixacl extzfs/encr/root/logs
# zfs create -o mountpoint=/home extzfs/encr/home/users
# zfs create -o mountpoint=/root extzfs/encr/home/root
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After running the commands above, the situation should look like the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zfs list
NAME                        USED   AVAIL     REFER  MOUNTPOINT
extzfs                     1.20M    225G       24K  none
extzfs/encr                 721K    225G       98K  none
extzfs/encr/home            294K    225G       98K  none
extzfs/encr/home/root        98K    225G       98K  /tmp/mnt/root
extzfs/encr/home/users       98K    225G       98K  /tmp/mnt/home
extzfs/encr/root            329K    225G       98K  none
extzfs/encr/root/default    231K    225G      133K  /tmp/mnt
extzfs/encr/root/logs        98K    225G       98K  /tmp/mnt/var/log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice how all mountpoints are relative to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/mnt&lt;/code&gt;, which is the alternate root the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extzfs&lt;/code&gt; pool was imported with (in this case, created) using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-R&lt;/code&gt; flag. The prefix will be stripped when importing the pool on the final system, leaving only the real mountpoints. This feature makes mounting systems installed on ZFS incredibly convenient, because the entire hierarchy is properly mounted under any directory you choose, allowing you to quickly chroot into the system and perform emergency maintenance operations.&lt;/p&gt;
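
&lt;p&gt;While the pool is still imported, it’s also easy to confirm that every dataset below &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extzfs/encr&lt;/code&gt; is encrypted and shares the same encryption root. A quick sketch, using only standard ZFS properties (output omitted, since it varies with the setup):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zfs get -r encryption,encryptionroot,keystatus extzfs/encr
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;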

&lt;h3 id=&quot;234-setting-the-bootfs&quot;&gt;2.3.4. Setting the bootfs&lt;/h3&gt;

&lt;p&gt;The pool’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bootfs&lt;/code&gt; property can be used to indicate which dataset contains the desired boot environment. This is not necessary, but it helps simplify the kernel command line.&lt;/p&gt;

&lt;p&gt;Run the following command to set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bootfs&lt;/code&gt; property to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extzfs/encr/root/default&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zpool set bootfs=extzfs/encr/root/default extzfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
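
&lt;p&gt;If you want to make sure the property stuck, it can be read back with:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zpool get bootfs extzfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;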

&lt;p&gt;For the sake of consistency, export the pool now before moving to the next step. This is not strictly necessary, but it doesn’t hurt to ensure that the pool can be correctly imported using the given passphrase.&lt;/p&gt;

&lt;p&gt;To export the pool and unmount all datasets, run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zpool export extzfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;3-installing-arch-linux&quot;&gt;3. Installing Arch Linux&lt;/h1&gt;

&lt;p&gt;Installing Arch Linux is not the complex task it was a few decades ago. Arguably, it requires a bit of knowledge and experience, but it’s not out of reach for most tech-savvy users.&lt;/p&gt;

&lt;p&gt;In general, when installing Arch onto a new drive (in this case, our portable SSD), there are two basic approaches:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Install a fresh system from either an existing Linux install&lt;sup id=&quot;fnref:12&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt; or the Arch Linux ISO;&lt;/li&gt;
  &lt;li&gt;Clone an existing system to the new drive.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’ll cover both approaches in the next sections, along with a few tips and tricks I’ve learnt over the years.&lt;/p&gt;

&lt;h2 id=&quot;31-mounting-the-filesystems&quot;&gt;3.1. Mounting the filesystems&lt;/h2&gt;

&lt;p&gt;Regardless of the filesystem or approach you’ve picked, you should now mount the root filesystem on a temporary mountpoint. I will use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/mnt&lt;/code&gt;, but feel free to use whatever you prefer:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mkdir /tmp/mnt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If using LUKS:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cryptsetup open /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part2 ExtLUKS 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and then, depending on the filesystem:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mount /dev/mapper/ExtLUKS /tmp/mnt # for ext4/XFS/F2FS
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mount -o subvol=@,compress=lzo /dev/mapper/ExtLUKS /tmp/mnt # for Btrfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I’ve also enabled compression for Btrfs, which may or may not be a good idea depending on your use case. Notice that compressing data before encrypting it &lt;em&gt;may&lt;/em&gt; hypothetically leak some info about the data contained. Avoid compression if you are concerned about this and/or you have a very large SSD.&lt;/p&gt;

&lt;p&gt;If using ZFS, run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zpool import -l -d /dev/disk/by-id -R /tmp/mnt extzfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and it should do the trick.&lt;/p&gt;
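
&lt;p&gt;Whichever filesystem you picked, a quick way to check that everything has been mounted where expected is to list the mounts below the temporary mountpoint (assuming &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/mnt&lt;/code&gt;, as above):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# findmnt -R /tmp/mnt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;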

&lt;h2 id=&quot;32-installing-from-scratch&quot;&gt;3.2. Installing from scratch&lt;/h2&gt;

&lt;p&gt;If in doubt, just follow the official &lt;a href=&quot;https://wiki.archlinux.org/title/Installation_guide&quot;&gt;Arch Linux installation guide&lt;/a&gt; - it will not cover all the details for advanced installs, but it’s a good starting point.&lt;/p&gt;

&lt;p&gt;In general, the steps somewhat resemble the following, regardless of what filesystem you’ve picked:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mkdir -p /tmp/mnt/boot/efi # we still need to mount the ESP, which is on a separate partition
# mount /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0-part1 /tmp/mnt/boot/efi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;321-installing-from-an-existing-arch-linux-install&quot;&gt;3.2.1. Installing from an existing Arch Linux install&lt;/h3&gt;

&lt;p&gt;If you are running from an existing Arch Linux install or the Arch ISO, installing a base system is as easy as running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacstrap&lt;/code&gt; (from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;arch-install-scripts&lt;/code&gt; package) on the mountpoint of the root filesystem:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# pacstrap -K /tmp/mnt base perl neovim
[lots of output]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I’ve also thrown in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;neovim&lt;/code&gt; because there are no editors installed by default in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt;, but feel free to use whatever you like. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;perl&lt;/code&gt; is also (implicitly) required by several packages, and not installing it may trigger unpredictable issues later.&lt;/p&gt;

&lt;p&gt;Now enter the new system with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;arch-chroot&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# arch-chroot /tmp/mnt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;322-installing-from-a-non-arch-system&quot;&gt;3.2.2. Installing from a non-Arch system&lt;/h3&gt;

&lt;p&gt;All the steps above (except for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacstrap&lt;/code&gt;) can be performed from basically any Linux distribution. If you are running from a non-Arch system, don’t worry - there are workarounds available for that.&lt;/p&gt;

&lt;p&gt;A solution that is always viable is &lt;a href=&quot;https://wiki.archlinux.org/title/Install_Arch_Linux_from_existing_Linux#Method_A:_Using_the_bootstrap_tarball_(recommended)&quot;&gt;to use the bootstrap tarball from an Arch mirror&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A trickier (but arguably more fun) path is to build &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman&lt;/code&gt; from source, and then use it to install the base system. For instance, on Debian:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo apt install build-essential meson cmake libcurl4-openssl-dev libgpgme-dev libssl-dev libarchive-dev pkgconf
[...]
$ wget -O - https://sources.archlinux.org/other/pacman/pacman-6.0.2.tar.xz | tar xvfJ -
$ cd pacman-6.0.2
$ meson setup --default-library static build # avoid linking pacman with the newly built shared libalpm
[...]
$ ninja -C build
[...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You should have a working &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman&lt;/code&gt; binary in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build/pacman&lt;/code&gt;. In order to install the base system, you need to create a minimal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman.conf&lt;/code&gt; file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ cat &amp;lt;&amp;lt;&apos;EOF&apos; &amp;gt;&amp;gt; build/pacman.conf
[options]
Architecture = auto
SigLevel = Never

[core]
Server = https://geo.mirror.pkgbuild.com/$repo/os/$arch

[extra]
Server = https://geo.mirror.pkgbuild.com/$repo/os/$arch
EOF
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For this time only, I have disabled signature verification because going through the whole ordeal of setting up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman-key&lt;/code&gt; and importing the Arch Linux signing keys for a makeshift &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman&lt;/code&gt; install is very troublesome. If you are &lt;em&gt;really&lt;/em&gt; concerned about security, use the bootstrap tarball instead.&lt;/p&gt;

&lt;p&gt;Create the required database directory for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman&lt;/code&gt;, and install the same packages as above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo mkdir -p /tmp/mnt/var/lib/pacman/
$ sudo build/pacman -r /tmp/mnt --config=build/pacman.conf -Sy base perl neovim
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will result in a working Arch Linux chroot, albeit only partially set up.&lt;/p&gt;

&lt;p&gt;Chroot into the new system, and properly set up the Arch Linux keyring:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo mount --make-rslave --rbind /dev /tmp/mnt/dev
$ sudo mount --make-rslave --rbind /sys /tmp/mnt/sys
$ sudo mount --make-rslave --rbind /run /tmp/mnt/run
$ sudo mount -t proc /proc /tmp/mnt/proc
$ sudo cp -L /etc/resolv.conf /tmp/mnt/etc/resolv.conf
$ sudo chroot /tmp/mnt /bin/bash
[root@chroot /]# pacman-key --init
[root@chroot /]# pacman-key --populate archlinux
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can now proceed as if you were installing from an existing Arch Linux system.&lt;/p&gt;

&lt;h3 id=&quot;323-installing-a-kernel&quot;&gt;3.2.3. Installing a kernel&lt;/h3&gt;

&lt;p&gt;In order to install packages inside your chroot, you need to enable at least one Pacman mirror first in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/pacman.d/mirrorlist&lt;/code&gt;. If you used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacstrap&lt;/code&gt; from an existing Arch Linux system, this may be unnecessary.&lt;/p&gt;
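
&lt;p&gt;A minimal way of doing so is appending a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Server&lt;/code&gt; line to the mirrorlist. This sketch assumes the same geo-located mirror that appears in the makeshift &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman.conf&lt;/code&gt; from section 3.2.2; any mirror you trust works just as well:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# echo &apos;Server = https://geo.mirror.pkgbuild.com/$repo/os/$arch&apos; &amp;gt;&amp;gt; /etc/pacman.d/mirrorlist
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The single quotes matter here: they keep the shell from expanding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$repo&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$arch&lt;/code&gt;, which must reach &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman&lt;/code&gt; verbatim.&lt;/p&gt;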

&lt;p&gt;After enabling one or more mirrors, you can install a kernel of your choice:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# pacman -Sy linux-lts linux-lts-headers linux-firmware
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice that I’ve chosen to install the LTS kernel, which is in general a good idea when depending on out-of-tree kernel modules such as ZFS or NVIDIA drivers. Feel free to install the latest kernel if you prefer, but remember to be careful when updating the system due to potential module breakage.&lt;/p&gt;

&lt;p&gt;The command above will also generate an initrd, which we don’t really need (we will use UKI instead). We will have to delete that later.&lt;/p&gt;

&lt;h3 id=&quot;324-installing-the-correct-helpers-for-your-filesystem&quot;&gt;3.2.4. Installing the correct helpers for your filesystem&lt;/h3&gt;

&lt;p&gt;In order for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fsck&lt;/code&gt; to properly run, or to mount ZFS, you need to install the correct package for your filesystem:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;If you’ve installed your system over ZFS, this is a good time to set up the ArchZFS repository in the chroot (see above)&lt;/li&gt;
  &lt;li&gt;If you’ve installed your system over Btrfs, you need to install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfs-progs&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cryptsetup&lt;/code&gt; should already have been pulled in as a dependency of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you are using another filesystem, install the correct package:&lt;/p&gt;

    &lt;p&gt;a. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ext4&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;e2fsprogs&lt;/code&gt; should already have been pulled in by dependencies installed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt; - ensure you can run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;e2fsck&lt;/code&gt; from the chroot.&lt;/p&gt;

    &lt;p&gt;b. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XFS&lt;/code&gt;, install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xfsprogs&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;c. For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F2FS&lt;/code&gt;, install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f2fs-tools&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember to also always install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dosfstools&lt;/code&gt;, which is required to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fsck&lt;/code&gt; the FAT filesystem on the ESP.&lt;/p&gt;
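
&lt;p&gt;As an example, on a Btrfs install this would amount to something like the following (swap &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;btrfs-progs&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xfsprogs&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f2fs-tools&lt;/code&gt; as needed; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--needed&lt;/code&gt; skips packages that are already installed):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# pacman -S --needed dosfstools btrfs-progs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;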

&lt;h2 id=&quot;33-cloning-an-existing-system&quot;&gt;3.3. Cloning an existing system&lt;/h2&gt;

&lt;p&gt;Rather than installing the system from scratch, you may clone an existing one. Just remember, after the move, to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;fix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/fstab&lt;/code&gt; with the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PARTUUID&lt;/code&gt;s&lt;/li&gt;
  &lt;li&gt;give the system a unique configuration (i.e., change the hostname, fix the hostid, …) in order to avoid clashes&lt;/li&gt;
  &lt;li&gt;do not transfer the contents of the ESP - if you use UKI and mount it at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/boot/efi&lt;/code&gt;, you will regenerate its contents later when you reapply the steps from above.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are 3 feasible ways to clone a system.&lt;/p&gt;

&lt;h3 id=&quot;331-use-dd-to-clone-a-partition-block-by-block&quot;&gt;3.3.1. Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dd&lt;/code&gt; to clone a partition block by block.&lt;/h3&gt;

&lt;p&gt;This method has one advantage and a couple of downsides:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;PRO&lt;/strong&gt;: because it literally clones an entire disk, byte by byte, to another, it is the most conservative method of all&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CON&lt;/strong&gt;: because it clones an entire disk byte by byte, issues such as fragmentation and data in unallocated sectors are copied as well&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CON&lt;/strong&gt;: because it clones an entire disk byte by byte, the target partition or disk must be at least as large as the source, or the source must be shrunk beforehand, which is not always possible (XFS, for instance, cannot be shrunk)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you opt for this solution, just run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dd&lt;/code&gt; and copy one or more existing partitions to the LUKS container:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;# dd if=/path/to/source/partition of=/dev/mapper/ExtLUKS bs=1M status=progress&lt;/code&gt;&lt;/p&gt;
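
&lt;p&gt;Since the target must be at least as large as the source, it may be worth comparing the two sizes before running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dd&lt;/code&gt;. The sketch below is only an illustration: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fits&lt;/code&gt; is a hypothetical helper of my own, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blockdev --getsize64&lt;/code&gt; (from util-linux) prints the size of a block device in bytes:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# fits SRC_BYTES DST_BYTES: succeed if the source is no larger than the destination
fits() {
    [ &quot;$1&quot; -le &quot;$2&quot; ]
}

# Usage (as root), with the same placeholder source path used above:
#   src=$(blockdev --getsize64 /path/to/source/partition)
#   dst=$(blockdev --getsize64 /dev/mapper/ExtLUKS)
#   fits &quot;$src&quot; &quot;$dst&quot; || echo &quot;target is too small, do not run dd&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;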

&lt;h3 id=&quot;332-use-rsync-to-clone-a-filesystem-onto-a-new-partition&quot;&gt;3.3.2. Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; to clone a filesystem onto a new partition.&lt;/h3&gt;

&lt;p&gt;This method is the most flexible, because it’s completely agnostic regarding the source and destination filesystems, as long as the destination can fit all contents from the source. Just mount everything where it’s supposed to go, and run (as root):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# rsync -qaHAXS /{bin,boot,etc,home,lib,opt,root,sbin,srv,usr,var} /tmp/mnt/dest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The root has now been cloned, but it’s missing some base directories.&lt;/p&gt;

&lt;p&gt;Given that I assume we are booting from an Arch Linux system, just reinstall &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filesystem&lt;/code&gt; inside the new root:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$  sudo pacman -r /tmp/mnt --config /tmp/mnt/etc/pacman.conf -S filesystem
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will fix up any missing directories and symlinks, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc&lt;/code&gt;, … Notice that, for this one time only, I have used the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-r&lt;/code&gt; parameter. This changes pacman’s root directory, and should always be used with extreme care.&lt;/p&gt;

&lt;h3 id=&quot;333-use-btrfs-snapshotting-and-replication-facilities-to-clone-existing-subvolumes&quot;&gt;3.3.3. Use Btrfs snapshotting and replication facilities to clone existing subvolumes.&lt;/h3&gt;

&lt;p&gt;Btrfs supports incremental snapshotting and sending/receiving them as incremental data streams. This is extremely convenient, because replication ensures that files are transferred perfectly (with the right permissions, metadata, …) without having to copy any unnecessary empty space.&lt;/p&gt;

&lt;p&gt;In order to duplicate a system using Btrfs, partition and format the disk as described above, and then snapshot and send the subvolumes to the new disk. The example below mounts the source root subvolume under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/src&lt;/code&gt; and the destination under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/mnt&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mount -o subvol=/ /path/to/root/dev /tmp/src
# mount -o subvol=/ /dev/mapper/ExtLUKS /tmp/mnt
# btrfs su snapshot -r /tmp/src/@{,-mig}
Create a readonly snapshot of &apos;/tmp/src/@&apos; in &apos;/tmp/src/@-mig&apos;
# btrfs su snapshot -r /tmp/src/@home{,-mig}
Create a readonly snapshot of &apos;/tmp/src/@home&apos; in &apos;/tmp/src/@home-mig&apos;
# btrfs send /tmp/src/@-mig | btrfs receive /tmp/mnt
At subvol /tmp/src/@-mig
At subvol @-mig
# btrfs send /tmp/src/@home-mig | btrfs receive /tmp/mnt
At subvol /tmp/src/@home-mig
At subvol @home-mig
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The system has now been correctly transferred. Rename the subvolumes to their original names and delete the now unnecessary snapshots if you want to reclaim the space &lt;sup id=&quot;fnref:14&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# perl-rename -v &apos;s/\-mig//g&apos; /tmp/mnt/@* 
/tmp/mnt/@-mig -&amp;gt; /tmp/mnt/@
/tmp/mnt/@home-mig -&amp;gt; /tmp/mnt/@home
# btrfs su delete /tmp/src/@*-mig
Delete subvolume (no-commit): &apos;/tmp/src/@-mig&apos;
Delete subvolume (no-commit): &apos;/tmp/src/@home-mig&apos;
# umount /tmp/{src,mnt}
# mount -o subvol=@,compress=lzo /dev/mapper/ExtLUKS /tmp/mnt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unmount the root subvolume and mount the system as you normally would. You are now ready to move to the next step.&lt;/p&gt;

&lt;h3 id=&quot;334-use-zfs-snapshotting-and-replication-facilities-to-clone-existing-datasets&quot;&gt;3.3.4. Use ZFS snapshotting and replication facilities to clone existing datasets.&lt;/h3&gt;

&lt;p&gt;With ZFS, the process is very similar to Btrfs, with a few different steps depending on whether your source datasets are already encrypted or not.&lt;/p&gt;

&lt;p&gt;After creating a pool, snapshot your root disk &lt;em&gt;recursively&lt;/em&gt;. If your system resides on an encrypted dataset, snapshotting the encryption root will also snapshot all the datasets contained within it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zfs snapshot -r zroot/encr@migration # otherwise, snapshot all the required datasets
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After doing that, you can either:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;create a new encrypted dataset and send the unencrypted snapshots to it:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  # zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt -o mountpoint=none -o compression=lz4 extzfs/encr
  # for DATASET in root home ... # note: replace with the actual datasets
  do zfs send zroot/$DATASET@migration | zfs recv extzfs/encr/$DATASET
  done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Migrating unencrypted datasets to an encrypted root dataset requires transferring the snapshots one by one. It’s generally easier to just let the newly received snapshots inherit properties from their parents, and then fixing mountpoints and other properties later using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs set&lt;/code&gt;. You can also do it directly, if necessary, by setting the properties using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-o&lt;/code&gt; flag with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs recv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Ensure that all datasets are correctly mounted before moving to the next step.&lt;/p&gt;

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;clone another encrypted dataset as raw data:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  # zfs send -Rw zroot/encr@migration | zfs recv -F extzfs/encr
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This will recursively clone all the datasets under a new encrypted dataset called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extzfs/encr/encr&lt;/code&gt;. The new encryption root will have the same key as the source dataset, so you will be able to unlock it with the same passphrase. All properties and mountpoints will also be kept.&lt;/p&gt;

&lt;p&gt;Given that all properties have been preserved, it may be enough to run&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  # zfs mount -la
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;to unlock and mount all new datasets. If that doesn’t result in correctly mounted datasets, ensure that all properties (including mountpoints) have been correctly preserved.&lt;/p&gt;

&lt;h3 id=&quot;335-migrating-filesystems-wrapping-up&quot;&gt;3.3.5. Migrating filesystems: wrapping up&lt;/h3&gt;

&lt;p&gt;Regardless of the method you’ve picked, you should now have a working system on the new disk. Chroot into it as described in section 3.2., and then proceed to the next step.&lt;/p&gt;

&lt;h2 id=&quot;34-configuring-the-base-system&quot;&gt;3.4. Configuring the base system&lt;/h2&gt;

&lt;p&gt;Whichever path you took, you should now be in a working Arch Linux chroot.&lt;/p&gt;

&lt;h3 id=&quot;341-basic-configuration&quot;&gt;3.4.1. Basic configuration&lt;/h3&gt;

&lt;p&gt;Most of the pre-boot configuration steps are now basically the same as in a normal Arch Linux install:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# nvim /etc/pacman.d/mirrorlist # enable your favourite mirrors
[root@chroot /]# nvim /etc/locale.gen          # enable your favourite locales (e.g. en_US.UTF-8) 
[root@chroot /]# locale-gen                    # generate the locales configured above
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The next step is to populate the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/fstab&lt;/code&gt; file with the correct entries for all your partitions. Remember to use PARTUUIDs or plain UUIDs, and never rely on disk and partition names (except for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/mapper&lt;/code&gt; device files). The contents of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/fstab&lt;/code&gt; will vary depending on the filesystem you’ve picked. Remember that the initrd will be the one to unlock the LUKS container, so you don’t need to specify it in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/crypttab&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/fstab&lt;/code&gt; for  ext4/XFS/F2FS with LUKS:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# it is not strictly necessary to also include the root partition, but it&apos;s a good idea in case of remounts
/dev/mapper/ExtLUKS                             /            ext4    defaults      0 1
PARTUUID=4a0eab50-7dfc-4dcb-98a6-ad954d344ad7   /boot/efi    vfat    defaults      0 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/fstab&lt;/code&gt; for Btrfs with LUKS:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/dev/mapper/ExtLUKS                             /            btrfs   defaults,subvol=@,compress=lzo       0 0
/dev/mapper/ExtLUKS                             /home        btrfs   defaults,subvol=@home,compress=lzo   0 0
PARTUUID=4a0eab50-7dfc-4dcb-98a6-ad954d344ad7   /boot/efi    vfat    defaults                             0 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;With ZFS, all dataset mountpoints are managed via the filesystem itself. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/fstab&lt;/code&gt; will only contain the ESP (unless you have created legacy mountpoints):&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;PARTUUID=4a0eab50-7dfc-4dcb-98a6-ad954d344ad7   /boot/efi    vfat    defaults      0 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
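
&lt;p&gt;The PARTUUID in the entries above is specific to my drive. One way to look up your own is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blkid&lt;/code&gt; from util-linux. The sketch below wraps that in a tiny helper; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;esp_fstab_line&lt;/code&gt; is a name I made up for illustration, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/sdX1&lt;/code&gt; is a placeholder for the ESP on your external disk:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# esp_fstab_line PARTUUID: print an fstab entry mounting the ESP at /boot/efi
esp_fstab_line() {
    printf &apos;PARTUUID=%s   /boot/efi    vfat    defaults      0 2\n&apos; &quot;$1&quot;
}

# Hypothetical usage from inside the chroot (replace /dev/sdX1 with your ESP):
#   esp_fstab_line &quot;$(blkid -s PARTUUID -o value /dev/sdX1)&quot; &amp;gt;&amp;gt; /etc/fstab
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;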

&lt;p&gt;Then, set a password for root:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# passwd
New password: 
Retype new password: 
passwd: password updated successfully
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and create a standard user. Remember to mount &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt; first if you are using a separate partition or subvolume!&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# mount /home # if using a separate partition or subvolume, not needed with ZFS
[root@chroot /]# useradd -m marco # this is my name
[root@chroot /]# passwd marco
New password:
Retype new password:
passwd: password updated successfully
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before moving to the next step, install all packages required for connectivity, or you may be unable to connect to the internet after you boot up the system.&lt;/p&gt;

&lt;p&gt;For simplicity, I’ll just install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NetworkManager&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# pacman -S networkmanager
[...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As the last step before moving to the next point, remember to configure the correct console layout in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/vconsole.conf&lt;/code&gt;, or you will have a hard time typing your password at boot time (the file will be copied to the &lt;em&gt;initrd&lt;/em&gt;):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# cat &amp;gt; /etc/vconsole.conf &amp;lt;&amp;lt;&apos;EOF&apos;
KEYMAP=us
EOF
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;342-configuring-the-kernel&quot;&gt;3.4.2. Configuring the kernel&lt;/h3&gt;

&lt;p&gt;Configuring the system for booting on multiple machines is easier than it sounds, thanks to how good Linux and the graphical stack have become at automatically configuring themselves depending on the hardware.&lt;/p&gt;

&lt;p&gt;In the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chroot&lt;/code&gt;, run the following preliminary steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;(optional) First, install ZFS (if you are using it); if using the LTS kernel, I recommend using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-dkms&lt;/code&gt;, while for a more up-to-date kernel a “fixed” build such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-linux&lt;/code&gt; is probably safer.&lt;/li&gt;
  &lt;li&gt;In order to support systems with an NVIDIA GPU, install the NVIDIA driver (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-lts&lt;/code&gt;, depending on the kernel you’ve chosen) &lt;sup id=&quot;fnref:13&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;Install the microcode for &lt;em&gt;both&lt;/em&gt; Intel and AMD CPUs (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;intel-ucode&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;amd-ucode&lt;/code&gt; respectively). Only the correct one will be loaded at boot time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the kernel and all necessary modules installed, we can now generate a bootable image.&lt;/p&gt;

&lt;p&gt;For this step I’ve decided to use UKI, which is a novel approach to initramfs that simplifies the process a lot, by merging together kernel and initrd into a single bootable file. This is not strictly necessary, but it allows us to avoid cluttering the ESP with the contents of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/boot&lt;/code&gt;: only UKIs and the (optional) bootloader will need to reside on it.&lt;/p&gt;

&lt;p&gt;UKIs can be generated with several initramfs-generating tools, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dracut&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkinitcpio&lt;/code&gt;. After a somewhat long stint with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dracut&lt;/code&gt;, I’ve recently switched to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkinitcpio&lt;/code&gt; (Arch’s default) due to how simple it is to configure and customize with custom hooks.&lt;/p&gt;

&lt;p&gt;For a portable system, it’s best to always boot using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fallback&lt;/code&gt; preset. The default preset generates an initramfs custom-tailored to the current hardware, which may not work on systems other than the one that generated it. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fallback&lt;/code&gt; preset, on the other hand, generates a generic initramfs that contains by default the modules needed to boot on (almost) any system. The size difference may have been significant in the past, when disk space was scarce and expensive, but nowadays it’s negligible. A UKI image generated with the fallback preset is around 110 MiB in size, which is small enough to fit on our 300 MiB ESP.&lt;/p&gt;

&lt;p&gt;First, we ought to create a file containing the command line arguments for the kernel.&lt;/p&gt;

&lt;p&gt;The kernel command line is a set of arguments passed to the kernel at boot time, which can be used to configure how the kernel, the initramfs or systemd will behave. Under UEFI, these parameters are usually passed by a bootloader as boot arguments to the kernel when invoked from the ESP. UKI differs in this regard by directly embedding the command line in the image itself.&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/kernel/cmdline&lt;/code&gt; with at least the following contents; feel free to add more parameters if you need them.&lt;/p&gt;

&lt;p&gt;For LUKS:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;rw nvidia-drm.modeset=1 cryptdevice=PARTUUID=5c97981e-4e4c-428e-8dcf-a82e2bc1ec0a:ExtLUKS root=/dev/mapper/ExtLUKS rootflags=subvol=@,compress=lzo rootfstype=btrfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Omit the rootflags and rootfstype parameters if you are not using Btrfs.&lt;/p&gt;
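
&lt;p&gt;If you would rather not copy the PARTUUID by hand, the minimal LUKS command line can be assembled with a few lines of shell. This is just a sketch under my own assumptions: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;luks_cmdline&lt;/code&gt; is a hypothetical helper, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/sdX2&lt;/code&gt; stands in for your LUKS partition, and any extra parameter (such as the NVIDIA or Btrfs ones shown above) still has to be appended manually:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# luks_cmdline PARTUUID NAME: print a minimal kernel command line that unlocks
# the LUKS partition identified by PARTUUID as /dev/mapper/NAME and boots from it
luks_cmdline() {
    printf &apos;rw cryptdevice=PARTUUID=%s:%s root=/dev/mapper/%s\n&apos; &quot;$1&quot; &quot;$2&quot; &quot;$2&quot;
}

# Hypothetical usage from inside the chroot (replace /dev/sdX2 with your LUKS partition):
#   luks_cmdline &quot;$(blkid -s PARTUUID -o value /dev/sdX2)&quot; ExtLUKS &amp;gt; /etc/kernel/cmdline
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;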

&lt;p&gt;For ZFS, try something akin to the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;rw nvidia-drm.modeset=1 zfs=extzfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which relies on automatic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bootfs&lt;/code&gt; detection in order to find the root dataset.&lt;/p&gt;

&lt;p&gt;After this, edit the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/mkinitcpio.conf&lt;/code&gt; to add any extra modules and hooks required by the new system.&lt;/p&gt;

&lt;p&gt;You probably want to load &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia&lt;/code&gt; KMS modules early, in order to avoid any issues when booting on systems with an NVIDIA discrete GPU. Notice that this may sometimes cause issues with buggy laptops with hybrid graphics, so keep this tradeoff in mind in case you run into it.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;MODULES=(nvidia nvidia_drm nvidia_uvm nvidia_modeset)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The hooks you pick and the order in which they are run are crucial for a working system. For instance, if you are using encrypted ZFS, this is a safe starting point:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;HOOKS=(base udev autodetect modconf kms block keyboard keymap consolefont zfs filesystems fsck)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For LUKS,&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;HOOKS=(base udev autodetect modconf kms keyboard keymap consolefont block encrypt filesystems fsck)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice how the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;keyboard&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;keymap&lt;/code&gt; hooks have been specified before either the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encrypt&lt;/code&gt; hooks.
This ensures that the keyboard and keymap are correctly configured before reaching the root encryption password prompt.&lt;/p&gt;
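
&lt;p&gt;Because a wrong order only shows up at the password prompt, it can be handy to check it before generating the image. The following is a rough sketch, not part of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkinitcpio&lt;/code&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hook_before&lt;/code&gt; is a hypothetical helper that succeeds only if the first hook appears before the second one in the list it is given:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hook_before FIRST SECOND HOOK...: succeed if FIRST is listed before SECOND
hook_before() {
    first=$1
    second=$2
    shift 2
    seen=0
    for hook in &quot;$@&quot;; do
        if [ &quot;$hook&quot; = &quot;$first&quot; ]; then
            seen=1
        fi
        if [ &quot;$hook&quot; = &quot;$second&quot; ]; then
            # SECOND reached: succeed only if FIRST was already seen
            [ &quot;$seen&quot; -eq 1 ]
            return
        fi
    done
    return 1 # SECOND is not in the list at all
}

# Hypothetical usage from bash (HOOKS is a bash array in mkinitcpio.conf):
#   . /etc/mkinitcpio.conf
#   hook_before keyboard encrypt &quot;${HOOKS[@]}&quot; || echo &quot;keyboard must come before encrypt&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;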

&lt;p&gt;Before triggering the generation of our image, we must enable UKI support in the fallback preset (and disable the default one).&lt;/p&gt;

&lt;p&gt;Edit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/mkinitcpio.d/linux-lts.preset&lt;/code&gt; as follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# mkinitcpio preset file for the &apos;linux-lts&apos; package

#ALL_config=&quot;/etc/mkinitcpio.conf&quot;
ALL_kver=&quot;/boot/vmlinuz-linux-lts&quot;
ALL_microcode=(/boot/*-ucode.img)

PRESETS=(&apos;fallback&apos;)

#default_config=&quot;/etc/mkinitcpio.conf&quot;
#default_image=&quot;/boot/initramfs-linux-lts.img&quot;
#default_uki=&quot;/boot/efi/EFI/Linux/arch-linux-lts.efi&quot;
#default_options=&quot;--splash /usr/share/systemd/bootctl/splash-arch.bmp&quot;

#fallback_config=&quot;/etc/mkinitcpio.conf&quot;
#fallback_image=&quot;/boot/initramfs-linux-lts-fallback.img&quot;
fallback_uki=&quot;/boot/efi/EFI/BOOT/Bootx64.efi&quot;
fallback_options=&quot;-S autodetect&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the preset above, I have completely disabled the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;default&lt;/code&gt; preset by removing it from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PRESETS&lt;/code&gt; and commenting out all of its entries. Under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fallback&lt;/code&gt;, I only kept the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uki&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;options&lt;/code&gt; entries, in order to avoid generating an initramfs image that we have no use for.&lt;/p&gt;

&lt;p&gt;Run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkinitcpio -p linux-lts&lt;/code&gt; to finally generate the UKI under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/boot/efi/EFI/BOOT/Bootx64.efi&lt;/code&gt;, which is the custom path I set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fallback_uki&lt;/code&gt; to. This is the location conventionally associated with the UEFI Fallback bootloader, which will make the external drive bootable on any UEFI system without the need for any configuration or bootloader, as long as booting from USB is allowed (and UEFI Secure Boot is off).&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# mkdir -p /boot/efi/EFI/BOOT # create the target directory
[root@chroot /]# mkinitcpio -p linux-lts
[...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Optionally, clean up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/boot&lt;/code&gt; by removing the initramfs images previously generated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pacman&lt;/code&gt; when installing the kernel package. These are unnecessary when using UKIs, and will never be generated again with the modifications we made to the kernel preset:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# rm /boot/initramfs*.img
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;342-installing-a-bootloader-optional&quot;&gt;3.4.3. Installing a bootloader (optional)&lt;/h3&gt;

&lt;p&gt;In principle, the instructions above make having a bootloader at all somewhat redundant. With UEFI, you can also always tinker with command line arguments using the UEFI Shell, which can be either already installed on the machine you are booting on or copied in the ESP under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\EFI\Shellx64.efi&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In case you want to install a bootloader, change the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fallback_uki&lt;/code&gt; argument to a different path (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/boot/efi/EFI/Linux/arch-linux-lts.efi&lt;/code&gt;) and then just follow &lt;a href=&quot;https://wiki.archlinux.org/title/Systemd-boot&quot;&gt;Arch Wiki’s instructions on how to set up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-boot&lt;/code&gt;&lt;/a&gt; (or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rEFInd&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GRUB&lt;/code&gt;, or whatever you like).&lt;/p&gt;

&lt;p&gt;If you opt for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-boot&lt;/code&gt;, ensure that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bootctl install&lt;/code&gt; copies the bootloader to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\EFI\BOOT\Bootx64.efi&lt;/code&gt;, or it will not get picked up by the UEFI firmware automatically.&lt;/p&gt;

&lt;h2 id=&quot;35-unmounting-the-filesystems&quot;&gt;3.5. Unmounting the filesystems&lt;/h2&gt;

&lt;p&gt;Before attempting to boot the system, remember to unmount all filesystems and close the LUKS container. After ensuring you followed all the steps above correctly, exit the chroot, and then:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[root@chroot /]# exit
$ sudo umount -l /tmp/mnt/{dev,sys,proc,run} # the `-l` flag prevents issues with busy mounts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you used LUKS:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo umount -R /tmp/mnt
$ sudo cryptsetup close ExtLUKS # close the LUKS container once nothing is mounted from it
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you used ZFS, you also have to remember to export the pool - otherwise, the pool will still be in use next boot, and the initrd scripts won’t be able to import it:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo zpool export extzfs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This command may sometimes fail with an error message similar to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cannot export &apos;extzfs&apos;: pool is busy&lt;/code&gt;. This is usually caused by a process still using the pool, such as a shell with its current directory set to a directory inside the pool. If this happens, the fastest way to fix it is to reboot the system, import the pool (without necessarily unlocking any dataset), and then immediately export it. This will ensure that the pool is not in use and untie it from the current system’s hostid.&lt;/p&gt;

&lt;h1 id=&quot;4-booting-the-system&quot;&gt;4. Booting the system&lt;/h1&gt;

&lt;p&gt;If you’ve followed the instructions above, you should now be able to boot into the new system successfully, without any troubleshooting.&lt;/p&gt;

&lt;p&gt;You can either test the new system by booting from native hardware, or inside a virtual machine.&lt;/p&gt;

&lt;h2 id=&quot;41-setting-up-a-vm&quot;&gt;4.1. Setting up a VM&lt;/h2&gt;

&lt;p&gt;In order to spin up a VM, you need a working hypervisor. If you intend to run the VM on a Linux host, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QEMU&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;KVM&lt;/code&gt; is an excellent choice. &lt;sup id=&quot;fnref:15&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:15&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;You can either use QEMU via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libvirt&lt;/code&gt; and tools such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-manager&lt;/code&gt;, or use plain QEMU directly. The former tends to be way easier to set up, but more troublesome to troubleshoot; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libvirt&lt;/code&gt; is unfortunately full of abstractions that make configuring QEMU harder than just invoking it with the right parameters. On the other hand, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libvirt&lt;/code&gt; automatically handles unpleasant parts such as configuring network bridges and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dnsmasq&lt;/code&gt;, which you are otherwise required to configure manually.&lt;/p&gt;

&lt;p&gt;Regardless of what approach you prefer, you should install UEFI support for guests, which is usually provided in packages called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ovmf&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;edk2-ovmf&lt;/code&gt;, or similar.&lt;/p&gt;

&lt;h3 id=&quot;411-using-libvirt&quot;&gt;4.1.1. Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libvirt&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;If you are using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libvirt&lt;/code&gt;, you can use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-manager&lt;/code&gt; to create a new VM (or dabble with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virsh&lt;/code&gt; and XML directly, if that’s more to your liking). If you opt for this approach, remember to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Select &lt;em&gt;the device&lt;/em&gt;, and not partitions or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/mapper&lt;/code&gt; devices. The disk must be unmounted and no partitions should be in use. Pick “Import an image” and then select &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/disk/by-id/usb-XXX&lt;/code&gt;, without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-partN&lt;/code&gt;, via the &lt;em&gt;“Browse local”&lt;/em&gt; button.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Select “Customize configuration before install”, or you won’t be able to enable UEFI support. In the configuration screen, in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Overview&lt;/code&gt; pane, select the “Firmware” tab and pick an x86-64 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OVMF_CODE.fd&lt;/code&gt;. If you don’t see any, check that you’ve installed all the correct packages.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;(optional)&lt;/em&gt; If you wish, you may enable VirGL in order to have a smoother experience while using the VM. If you’re interested, toggle the &lt;em&gt;“OpenGL”&lt;/em&gt; option under the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Display Spice&lt;/code&gt; device section. Also remember to disable the SPICE socket, by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Listen type&lt;/code&gt; for SPICE to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt;. Check that the adapter model is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Virtio&lt;/code&gt;, and enable 3D acceleration. &lt;sup id=&quot;fnref:16&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:16&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;16&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
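
&lt;p&gt;If you would rather script the VM creation than click through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-manager&lt;/code&gt;, roughly the same choices can be expressed with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-install&lt;/code&gt;. Take the following as a rough sketch rather than a recipe: the VM name, memory size and CPU count are made-up values, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;usb-XXX&lt;/code&gt; is the same placeholder used above, and it assumes that your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virt-install&lt;/code&gt; finds a suitable OVMF image on its own when asked for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--boot uefi&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ virt-install --connect qemu:///system \
    --name arch-portable \
    --memory 8192 --vcpus 8 \
    --import \
    --disk path=/dev/disk/by-id/usb-XXX,format=raw,bus=virtio \
    --boot uefi \
    --os-variant archlinux
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This sketch does not cover the optional VirGL setup from the last step.&lt;/p&gt;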

&lt;h3 id=&quot;412-using-raw-qemu&quot;&gt;4.1.2. Using raw &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qemu&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;Using plain Qemu in place of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libvirt&lt;/code&gt; is undoubtedly less convenient. It definitely requires more tinkering for networking (especially if you don’t want to use SLIRP, which is slow and limited), with the advantage of being more versatile and not requiring setting up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libvirt&lt;/code&gt; - which tends to be problematic on machines with complex firewall rules and network configurations.&lt;/p&gt;

&lt;p&gt;First, make a copy of the default UEFI variables file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ cp /usr/share/ovmf/x64/OVMF_VARS.fd ext_vars.fd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, temporarily take ownership of the disk device, in order to avoid having to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qemu&lt;/code&gt; as root:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo chown $(id -u):$(id -g) /dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, run Qemu with the following command line. In this case, I’ll use SLIRP for simplicity, plus I will enable VirGL for a smoother experience:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ qemu-system-x86_64 -enable-kvm -cpu host -m 8G \
    -smp sockets=1,cpus=8,threads=1 \
    -drive if=pflash,format=raw,readonly=on,file=/usr/share/ovmf/x64/OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=$PWD/ext_vars.fd \
    -drive if=virtio,format=raw,cache=none,file=/dev/disk/by-id/usb-Samsung_SSD_960_EVO_250G_012938001243-0:0 \
    -nic user,model=virtio-net-pci \
    -device virtio-vga-gl -display sdl,gl=on \
    -device intel-hda -device hda-duplex
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
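
&lt;p&gt;One limitation of SLIRP is that the guest is not reachable from the host by default. If you want to SSH into the VM, one option is to add a port forward to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-nic&lt;/code&gt; option. In this sketch the host port &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2222&lt;/code&gt; and the user name are arbitrary placeholders, and it assumes an SSH server is already running inside the guest:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# replace the -nic argument of the Qemu command line with:
-nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22

# then, from the host:
$ ssh -p 2222 youruser@localhost
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;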

&lt;h2 id=&quot;42-booting-on-bare-hardware&quot;&gt;4.2. Booting on bare hardware&lt;/h2&gt;

&lt;p&gt;The disk previously created &lt;em&gt;should&lt;/em&gt; be capable of booting on potentially any UEFI-enabled x86-64 system, as long as booting from USB is allowed and Secure Boot is disabled.&lt;sup id=&quot;fnref:17&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:17&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;At machine startup, press the &lt;em&gt;“Boot Menu”&lt;/em&gt; key for your system (usually F12 or F8, but it may vary considerably depending on the vendor) and select the external SSD. The disk may be referred to as “fallback bootloader” - this is normal, given that we’ve placed the UKI image at the fallback bootloader location.&lt;/p&gt;

&lt;h2 id=&quot;43-first-boot&quot;&gt;4.3. First boot&lt;/h2&gt;

&lt;p&gt;If you did everything right in the last few steps, the boot process should stop at a password prompt from either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cryptsetup&lt;/code&gt; (LUKS) or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zpool&lt;/code&gt; (ZFS).&lt;/p&gt;

&lt;p&gt;Enter the password and press Enter. If everything went well, you should now be greeted by a login prompt.&lt;/p&gt;

&lt;p&gt;Log in as root, and proceed with the last missing configuration steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;If you are running on ZFS, you’ll notice that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/root&lt;/code&gt; are not mounted automatically. In order to fix this, immediately run
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# systemctl enable zfs.target zfs-mount.service zfs-import.target
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After doing this, &lt;strong&gt;reboot the system&lt;/strong&gt; and check that the datasets are mounted correctly. You shouldn’t need to enable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-import-cache.service&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-import-scan.service&lt;/code&gt; as they are unnecessary, given that we’re booting from a single pool which is already imported.&lt;/p&gt;
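
&lt;p&gt;One way to check this after the reboot is to list the datasets together with their &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mounted&lt;/code&gt; property, and to confirm that the units are still enabled. Every dataset that is supposed to be mounted automatically should report &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;yes&lt;/code&gt; in the last column:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zfs list -o name,mountpoint,mounted
# systemctl is-enabled zfs.target zfs-mount.service zfs-import.target
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;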

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Enable and start the network manager you installed previously, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NetworkManager&lt;/code&gt;:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;# systemctl enable --now NetworkManager&lt;/code&gt;&lt;/p&gt;

    &lt;p&gt;If you are using a wired connection with DHCP or IPv6, and no special configuration is required, you should see the relevant IPs in the output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ip address&lt;/code&gt;, and Internet access should be working.&lt;/p&gt;

    &lt;p&gt;If you need special configurations, or you must use wireless connectivity, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nmtui&lt;/code&gt; to configure the network.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;With a booted instance of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd&lt;/code&gt;, you can now easily set up everything else you are missing, such as:&lt;/p&gt;
    &lt;ul&gt;
      &lt;li&gt;a &lt;strong&gt;hostname&lt;/strong&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hostnamectl set-hostname&lt;/code&gt;;&lt;/li&gt;
      &lt;li&gt;a &lt;strong&gt;timezone&lt;/strong&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timedatectl set-timezone&lt;/code&gt; (you may need to adjust it depending on where you boot from);&lt;/li&gt;
      &lt;li&gt;if you know for a fact that you are always going to boot from systems with an RTC set to local time, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timedatectl set-local-rtc 1&lt;/code&gt; to avoid having to adjust the time every time you boot. Note that this is arguably one of the most annoying parts of a portable system; I recommend setting every machine you own to UTC and configuring Windows to use UTC as well.&lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;a different locale (generated via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;locale-gen&lt;/code&gt;), in order to change your system’s language settings.&lt;/p&gt;

        &lt;p&gt;As an example:&lt;/p&gt;
        &lt;ul&gt;
          &lt;li&gt;Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localectl set-locale LANG=en_US.UTF-8&lt;/code&gt; to set the default locale to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;en_US.UTF-8&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localectl set-keymap de&lt;/code&gt; to change the keyboard layout to German.&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
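
&lt;p&gt;Put together, the last step might look like the following. The hostname and timezone are just placeholders (use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timedatectl list-timezones&lt;/code&gt; to find yours), and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt; line assumes that the locale you want is still commented out in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/locale.gen&lt;/code&gt;; editing the file by hand works just as well:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hostnamectl set-hostname portable-arch
# timedatectl set-timezone Europe/Rome
# sed -i &apos;s/^#en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/&apos; /etc/locale.gen
# locale-gen
# localectl set-locale LANG=en_US.UTF-8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;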

&lt;h2 id=&quot;44-installing-a-desktop-environment&quot;&gt;4.4. Installing a desktop environment&lt;/h2&gt;

&lt;p&gt;The most useful part about a portable system is to carry a workspace around, so you can work on your projects wherever you are.&lt;/p&gt;

&lt;p&gt;In order to do this, you need to install some kind of desktop environment, which may range from minimal (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dwm&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sway&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fluxbox&lt;/code&gt;) to a full-fledged environment like Plasma, GNOME, XFCE, MATE, …&lt;/p&gt;

&lt;p&gt;Just remember that you are going to use this system on a variety of machines, so it’s useful to avoid anything that requires an excessive amount of tinkering to function properly. For instance, if one or more of the systems you plan to target involve NVIDIA GPUs, you may find running Wayland more annoying than just sticking with X11.&lt;/p&gt;

&lt;h3 id=&quot;441-example-installing-kde-plasma&quot;&gt;4.4.1. Example: Installing KDE Plasma&lt;/h3&gt;

&lt;p&gt;I’m a big fan of KDE Plasma (even though I’ve been using GNOME recently, for a few reasons), so I’ll use it as an example.&lt;/p&gt;

&lt;p&gt;In general, all DEs require you to install a metapackage to pull in all the basic components (like the KF5 frameworks) and an (optional) display manager, plus some or all of the applications that are part of the DE.&lt;/p&gt;

&lt;p&gt;If you plan on running X11, install the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xorg&lt;/code&gt; package group, and then install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plasma&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# pacman -S plasma plasma-wayland-session sddm kde-utilities
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you are using a display manager, enable it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemctl enable --now sddm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Otherwise, configure your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.xinitrc&lt;/code&gt; to start Plasma by appending&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-.xinitrc&quot;&gt;export DESKTOP_SESSION=plasma
exec startplasma-x11
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;startx&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you prefer using Wayland, just run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;startplasma-wayland&lt;/code&gt; instead.&lt;/p&gt;

&lt;h1 id=&quot;5-basic-troubleshooting&quot;&gt;5. Basic troubleshooting&lt;/h1&gt;

&lt;p&gt;If you followed all steps listed above, you &lt;em&gt;should&lt;/em&gt; have a working portable system. Most troubleshooting steps after the initial booting should be identical to those of a normal Arch Linux system. Below you’ll find a very basic list of a few common issues that may arise when attempting to boot the system on different machines.&lt;/p&gt;

&lt;h2 id=&quot;51-device-not-found-or-no-pool-to-import-during-boot&quot;&gt;5.1. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Device not found&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;No pool to import&lt;/code&gt; during boot&lt;/h2&gt;

&lt;p&gt;If the initrd fails to find the root device (or the ZFS pool), it means that it was unable to locate or mount the right drive. This is often due to one of the following three reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The initrd is missing the required drivers. The disk is not appearing under /dev because of this.&lt;/p&gt;

    &lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fallback&lt;/code&gt; initrd is supposed to contain all the storage and USB drivers needed to boot on any system, but it’s possible that some may be missing if your USB controller is either particularly exotic or particularly quirky (e.g. Intel Macs).&lt;/p&gt;

    &lt;p&gt;First, find out which drivers the affected machine uses for its USB controller. You can run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lspci -k&lt;/code&gt; from any Linux system running on that machine that is able to see the external disk:&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ lspci -k
[..]
0a:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
     Subsystem: Gigabyte Technology Co., Ltd Family 17h (Models 00h-0fh) USB 3.0 Host Controller
     Kernel driver in use: xhci_hcd
     Kernel modules: xhci_pci
 [..]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Afterwards, add the relevant module(s) to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MODULES&lt;/code&gt; array in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/mkinitcpio.conf&lt;/code&gt;, and regenerate the initrd.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The kernel command line is incorrect. The initrd either has the wrong device set, or the kernel is not receiving the correct parameters.&lt;/p&gt;

    &lt;p&gt;This happens either due to a bad &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs&lt;/code&gt; line in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/kernel/cmdline&lt;/code&gt;, or because a bootloader or the firmware is passing spurious arguments to the UKI.&lt;/p&gt;

    &lt;p&gt;Double check that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs&lt;/code&gt; line in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/kernel/cmdline&lt;/code&gt; is correct. Also note that some bootloaders, such as &lt;em&gt;rEFInd&lt;/em&gt;, automatically discover bootable files on ESPs; such a bootloader may wrongly assume that the UKI is a plain EFISTUB-capable kernel image and pass incorrect flags to it.&lt;/p&gt;

    &lt;p&gt;In any case, ascertain that the kernel is actually receiving the correct parameters by running&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cat /proc/cmdline
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;from the initrd recovery shell.&lt;/p&gt;

    &lt;p&gt;If you are using ZFS and you only specified the target pool instead of the root dataset, remember to set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bootfs&lt;/code&gt; correctly first.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;(ZFS only)&lt;/strong&gt; An incorrect cachefile has been embedded in the initrd. The initrd is trying to use potentially incorrect pool data instead of scanning &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs&lt;/code&gt; hook embeds &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/zfs/zpool.cache&lt;/code&gt; into the initrd during generation. While this is often useful to reduce boot times, especially with large multi-disk pools, it may cause issues if the cachefile is stale or incorrect. Go back to the setup system, chroot, remove the cachefile and regenerate the UKI. The initrd should now attempt to discover the root pool via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zpool import -d /dev&lt;/code&gt; (or any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs_import_dir&lt;/code&gt; you may have set via the kernel command line) instead of using the cachefile.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
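
&lt;p&gt;As a concrete illustration of the first and third points, this is roughly what the changes could look like from inside the chroot. The module name is simply the one reported by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lspci -k&lt;/code&gt; output above and should be added alongside whatever your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MODULES&lt;/code&gt; array already contains:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# /etc/mkinitcpio.conf (excerpt)
MODULES=(xhci_pci)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Afterwards, regenerate the images. I’m assuming here that your UKI is built through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkinitcpio&lt;/code&gt; presets, so that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkinitcpio -P&lt;/code&gt; rebuilds it; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rm&lt;/code&gt; line only applies to ZFS setups with a suspect cachefile:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# rm /etc/zfs/zpool.cache
# mkinitcpio -P
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;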

&lt;p&gt;If none of the previous steps work, you may want to try to boot the system from a different machine to ensure there’s not a problem in the setup itself.&lt;/p&gt;

&lt;h2 id=&quot;52-the-keyboard-doesnt-work-properly-at-the-password-prompt&quot;&gt;5.2. The keyboard doesn’t work properly at the password prompt&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;If the keyboard doesn’t work when typing the encryption password, it’s probably due to the keyboard hook not being run before the encryption hooks (whatever you are using). Ensure that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;keyboard&lt;/code&gt; is listed before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encrypt&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/mkinitcpio.conf&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If the keyboard is working, but the password is not being accepted, it may be due to an incorrectly set keyboard layout. Ensure that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/vconsole.conf&lt;/code&gt; is set correctly, and that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;keymap&lt;/code&gt; hook is being run before the encryption hooks.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
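
&lt;p&gt;For reference, a hook ordering that satisfies both points could look like the excerpt below. This is only an illustrative LUKS example, and your own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HOOKS&lt;/code&gt; array may well contain different entries; what matters is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;keyboard&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;keymap&lt;/code&gt; come before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encrypt&lt;/code&gt; (or before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs&lt;/code&gt;, which takes its place on ZFS setups). The German keymap is just an example value:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# /etc/mkinitcpio.conf (excerpt)
HOOKS=(base udev modconf kms keyboard keymap consolefont block encrypt filesystems fsck)

# /etc/vconsole.conf
KEYMAP=de
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Remember to regenerate the initrd after changing either file, since the keymap is embedded at build time.&lt;/p&gt;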

&lt;h2 id=&quot;53-the-system-boots-but-the-display-is-not-working&quot;&gt;5.3. The system boots, but the display is not working&lt;/h2&gt;

&lt;p&gt;This is rarely an issue with Intel or AMD GPUs, but it’s pretty common with NVIDIA GPUs, especially on buggy laptops with Optimus hybrid graphics.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Remember to always enable KMS modules early, in order to avoid any issues when booting on systems with an NVIDIA discrete GPU. Append &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-drm.modeset=1&lt;/code&gt; to the kernel command line, and add the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kms&lt;/code&gt; hook right after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;modconf&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/mkinitcpio.conf&lt;/code&gt;. This should force whatever KMS driver you are using to load early in the boot process, which should provide a working display as soon as the initrd is loaded.&lt;/p&gt;

    &lt;p&gt;Note that with NVIDIA the framebuffer resolution is often not increased automatically, which may lead to a poor CLI experience. This is a common issue that unfortunately tends only to affect NVIDIA users.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Add &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia nvidia_modeset nvidia_uvm nvidia_drm&lt;/code&gt; to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MODULES&lt;/code&gt; array in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/mkinitcpio.conf&lt;/code&gt;. This will ensure that the NVIDIA driver is always loaded early in the boot process. The modules will be ignored and unloaded if they are not needed on the system currently in use.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Do not use any legacy kernel option such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;video=&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vga=&lt;/code&gt;. There are lots of old guides still suggesting to use them, but they are not compatible with KMS and should not be used anymore.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
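
&lt;p&gt;Taken together, the first two points amount to something like the following excerpts. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[..]&lt;/code&gt; stands for whatever parameters your command line already contains:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# /etc/mkinitcpio.conf (excerpt)
MODULES=(nvidia nvidia_modeset nvidia_uvm nvidia_drm)

# /etc/kernel/cmdline (excerpt)
[..] nvidia-drm.modeset=1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After regenerating the UKI and rebooting on a machine with an NVIDIA GPU, you can check whether modesetting was actually enabled by reading the module parameter back, which should report &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Y&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cat /sys/module/nvidia_drm/parameters/modeset
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;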

&lt;h2 id=&quot;54-its-impossible-to-log-in-via-a-display-manager-or-logging-from-a-tty-complains-that-the-user-directory-is-missing&quot;&gt;5.4. It’s impossible to log in via a display manager, or logging from a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tty&lt;/code&gt; complains that the user directory is missing&lt;/h2&gt;

&lt;p&gt;This issue is almost always caused by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt; not being mounted correctly. Check either that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt; is correctly configured in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/fstab&lt;/code&gt;, or that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs-mount&lt;/code&gt; is enabled and running alongside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zfs&lt;/code&gt; target.&lt;/p&gt;
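
&lt;p&gt;A quick way to narrow this down from a root shell is to ask the system what, if anything, is mounted on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt;. The last two commands only make sense on ZFS setups:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# findmnt /home
# systemctl status zfs-mount.service
# journalctl -b -u zfs-mount.service
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;findmnt&lt;/code&gt; prints nothing, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home&lt;/code&gt; is not a mount point at all and you are looking at an empty directory on the root filesystem.&lt;/p&gt;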

&lt;h1 id=&quot;6-conclusion&quot;&gt;6. Conclusion&lt;/h1&gt;

&lt;p&gt;This post is a very basic guide on how to set up Arch Linux on a portable SSD, which I think feels less like a manual and more like my personal notes.&lt;/p&gt;

&lt;p&gt;This is intentional: while nothing in this guide is unique (everything can be found in the Arch Wiki, in forums or in other blogs), I felt that it was worth gathering some of my personal experience in a single place, in the hope that it will be useful to someone else besides myself.&lt;/p&gt;

&lt;p&gt;I suspect that after installing Linux (and Arch in particular) an infinite number of times, I grew a bit desensitised to how tricky and error-prone the process can be, especially for newcomers and people who are not accustomed to system administration and troubleshooting. Hopefully, the knowledge written in this article will be a good starting point for anyone who wants to try out Arch Linux, and maybe also get a cool portable system out of it.&lt;/p&gt;

&lt;p&gt;Thanks a lot for reading, and as always feel free to contact me if you find anything incorrect, imprecise or hard to understand.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;and Wi-Fi. Wi-Fi was a PITA too, and don’t get me started on &lt;em&gt;*retches*&lt;/em&gt; USB ADSL modems with Windows-only drivers on mini CDs. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;YMMV. Some devices (e.g. Macs) are notoriously picky about booting from USB drives, but that’s not our system’s fault. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;E.g. if you drop it, there’s a non-zero chance the USB connector and/or the logic board will break. USB enclosures are often very cheap compared to SSDs, so using them is the smarter choice in the long run. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;ARM would be interesting too, if it wasn’t for the fact that there’s nothing akin to PC standards for ARM devices, and even today in 2023 it’s still a hodgepodge of ad-hoc systems and clunky firmware. The fact that lots of ARM devices are also severely locked down doesn’t help, either. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;SSDs and HDDs are complex systems and may fail in several ways, which may lead to situations where the data on the disk is still readable using specialised tools, but cannot be accessed, deleted or overwritten using a normal computer (e.g. if the SSD controller fails). Properly encrypted disks are fundamentally random data, and as long as the encryption scheme is secure and the password is strong, you can chuck a broken disk in the trash without losing sleep over it. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Using ZFS is also a lot of fun IMHO. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If you suspect you may be a potential target for evil maid attacks, you should probably refrain from using a portable install altogether. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A small warning: compared to similar tools, parted writes changes to the disk immediately, so always triple-check what you’re doing before hitting enter. I recommend sticking to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdisk&lt;/code&gt; due to its better support for automatic alignment of partitions. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gparted&lt;/code&gt; also supports advanced features such as resizing filesystems, which is very handy when you don’t want to use the whole disk for the installation. It is also possible to perform such tasks from the command line, but it is in general more complex and error-prone. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Linux has no “absolute” naming policy for raw block devices. In particular, USB mass storage devices are enumerated alongside the SCSI and SATA devices, so it’s not uncommon for a USB disk to suddenly become &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sda&lt;/code&gt; after a reboot. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I once lost a ZFS pool due to a bug in a Git pre-alpha release of OpenZFS. That day, I learnt that running an OS from a pre-alpha filesystem driver is not a hallmark of good judgement. &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If you compile pacman and/or use an Arch chroot, it’s absolutely doable from any distro, really, as long as its kernel is new enough to run Arch-distributed binaries. See section 3.2.2. to learn how to do this. &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:14&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Notice that I’m using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;perl-rename&lt;/code&gt; in place of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rename&lt;/code&gt;, because I honestly think that plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rename&lt;/code&gt; is just outright terrible. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;perl-rename&lt;/code&gt; is a Perl script that can be installed separately (on Arch it is in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;perl-rename&lt;/code&gt; package) and it’s just better than util-linux’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rename&lt;/code&gt; utility in every measurable way. &lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I don’t recommend using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-open&lt;/code&gt; or Nouveau as of the time of writing (October ‘23), due to the immature state of the former and the utter incompleteness of the latter. The closed source &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia&lt;/code&gt; driver is still the best choice for NVIDIA GPUs, even if it sucks due to how “third-party” it feels (its non-Mesa userland is particularly annoying). &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:15&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;On Windows, you can also consider using Hyper-V, which has the advantage of being already included in Windows and supports using real physical drives as virtual disks. &lt;a href=&quot;#fnref:15&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:16&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This feature is known to be buggy under the closed-source NVIDIA driver, so beware. &lt;a href=&quot;#fnref:16&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:17&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Using Secure Boot with an external disk you plan on carrying around is very troublesome for a variety of reasons - first and foremost that you’d either have to enroll your personal keys on every system you plan on booting from, or plan on using Microsoft’s keys, which means fighting with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MokList&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PreLoader.efi&lt;/code&gt;, and going through a lot of pain for very dubious benefits. &lt;a href=&quot;#fnref:17&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Unicode is harder than you think</title>
   <link href="https://mcilloni.ovh/2023/07/23/unicode-is-hard/"/>
   <updated>2023-07-23T00:00:00+00:00</updated>
   <id>https://mcilloni.ovh/2023/07/23/unicode-is-hard</id>
   <content type="html">&lt;p&gt;Reading the excellent article by JeanHeyd Meneide on &lt;a href=&quot;https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape&quot;&gt;how broken string encoding in C/C++ is&lt;/a&gt; made me realise that Unicode is a topic that is often overlooked by a large number of developers. In my experience, there’s a lot of confusion and wrong expectations on what Unicode is, and what best practices to follow when dealing with strings that may contain characters outside of the ASCII range.&lt;/p&gt;

&lt;p&gt;This article attempts to briefly summarise and clarify some of the most common misconceptions I’ve seen people struggle with, and some of the pitfalls that tend to recur in codebases that have to deal with non-ASCII text.&lt;/p&gt;

&lt;h2 id=&quot;the-convenience-of-ascii&quot;&gt;The convenience of ASCII&lt;/h2&gt;

&lt;p&gt;Text is usually represented and stored as a sequence of numerical values in binary form. Wherever its source is, to be represented in a way the user can understand it needs to be decoded from its binary representation, as specified by a given &lt;strong&gt;character encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One such example of this is ASCII, the US-centric standard which has been for decades the de-facto way to represent characters and symbols in C and UNIX. ASCII is a 7-bit encoding, which means that it can represent up to 128 different characters. The first 32 characters, plus the very last one (DEL), are non-printable control characters, and the remaining 95 are printable characters, which include the space, the 26 letters of the English alphabet (in both upper and lower case), the 10 digits, and a few symbols:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex  
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 &quot;  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &amp;amp;  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 &apos;  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C &amp;lt;  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E &amp;gt;  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This table defines a two-way transformation, in jargon a &lt;strong&gt;charset&lt;/strong&gt;, which maps a certain sequence of bits (representing a number) to a given character, and vice versa. This can be easily seen by dumping some text as binary:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; Cat! | xxd
00000000: 4361 7421                                Cat!
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After the leading offset, the central column shows the binary representation of the input string “Cat!” in hexadecimal form. Each character is mapped into a single byte (represented here as two hexadecimal digits):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;43&lt;/code&gt; is the hexadecimal representation of the ASCII character &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt;;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;61&lt;/code&gt; is the hexadecimal representation of the ASCII character &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt;;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;74&lt;/code&gt; is the hexadecimal representation of the ASCII character &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt;;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;21&lt;/code&gt; is the hexadecimal representation of the ASCII character &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
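
&lt;p&gt;The same mapping can be reproduced with a few lines of Python. This is only an illustrative sketch (the variable names are made up for the occasion), relying on the built-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ascii&lt;/code&gt; codec:&lt;/p&gt;

```python
text = "Cat!"

# Encode the string using ASCII: every character becomes exactly one byte.
encoded = text.encode("ascii")

# Pair each character with the hexadecimal value of its byte.
pairs = [(char, format(byte, "02x")) for char, byte in zip(text, encoded)]

for char, hex_value in pairs:
    print(hex_value, "is", char)

# The mapping is two-way: decoding the bytes gives back the original text.
decoded = encoded.decode("ascii")
```

&lt;p&gt;Since the table works in both directions, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decoded&lt;/code&gt; ends up being identical to the starting string.&lt;/p&gt;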

&lt;p&gt;This simple set of characters was for decades considered more than enough by most of the English-speaking world, which was where the vast majority of early computer users and pioneers came from.&lt;/p&gt;

&lt;p&gt;An added benefit of ASCII is that it is a &lt;strong&gt;fixed-width encoding&lt;/strong&gt;: each character is always represented &lt;em&gt;univocally&lt;/em&gt; by the same number of bits, that in turn always represent the same number.&lt;/p&gt;

&lt;p&gt;This leads to some very convenient ergonomics when handling strings in C:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;ctype.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// converts all arguments to uppercase&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// iterate over each character in the string, and print its uppercase&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;putchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toupper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;putchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos; &apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;putchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;\n&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The example above assumes, like a large amount of code written in the last few decades, that the C basic type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt; represents a byte-sized ASCII character. This assumption minimises the mental and runtime overhead of handling text, as strings can be treated as arrays of characters belonging to a very minimal set. Because of this, ASCII strings can be iterated on, addressed individually and transformed or inspected using simple, cheap operations such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;isalpha&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toupper&lt;/code&gt;.&lt;/p&gt;
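
&lt;p&gt;To show how cheap these operations can be under that assumption, here is a rough Python sketch of an ASCII-only uppercase routine. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ascii_upper&lt;/code&gt; helper is hypothetical and only meant to mirror what a naive C implementation would do on a byte-per-byte basis:&lt;/p&gt;

```python
def ascii_upper(data):
    """Uppercase a bytes object, assuming every byte is one ASCII character."""
    result = bytearray()
    for byte in data:
        # Lowercase ASCII letters occupy 0x61..0x7A; the matching uppercase
        # letter sits exactly 0x20 positions earlier in the table.
        if byte in range(0x61, 0x7B):
            byte -= 0x20
        result.append(byte)
    return bytes(result)


shout = ascii_upper(b"hello, world!")
print(shout.decode("ascii"))
```

&lt;p&gt;Every byte can be inspected and transformed in isolation, without ever looking at its neighbours: this is exactly the property that stops holding once an encoding needs more than one byte per character.&lt;/p&gt;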

&lt;h2 id=&quot;the-world-outside&quot;&gt;The world outside&lt;/h2&gt;

&lt;p&gt;However, as computers started to spread worldwide it became clear that it was necessary to devise character sets capable of representing all the characters required in a given locale. For instance, Spanish needs the letter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ñ&lt;/code&gt;, Japanese needs the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;¥&lt;/code&gt; symbol and support for Kana and Kanji, and so on.&lt;/p&gt;

&lt;p&gt;All of this led to a massive proliferation of different character encodings, usually tied to a given language, area or locale. These varied from 8-bit encodings, which either extended ASCII by using its unused eighth bit (like &lt;strong&gt;ISO-8859-1&lt;/strong&gt;) or completely replaced its character set (like &lt;strong&gt;KOI-7&lt;/strong&gt;), to multi-byte encodings for Asian languages with thousands of characters like &lt;strong&gt;Shift-JIS&lt;/strong&gt; and &lt;strong&gt;Big5&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This turned into a huge headache for both developers and users, as it was necessary to know (or deduce via hacky heuristics) which encoding was used for a given piece of text, for instance when receiving a file from the Internet, which was becoming more and more common thanks to email, IRC and the World Wide Web.&lt;/p&gt;

&lt;p&gt;Most crucially, multibyte encodings (a necessity for Asian characters) meant that &lt;em&gt;the assumption “one char = one byte” didn’t hold anymore&lt;/em&gt;, with the small side effect of breaking all code in existence at the time.&lt;/p&gt;
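
&lt;p&gt;A quick way to see this is to compare the number of characters in a string with the number of bytes it takes once encoded. The following is a minimal Python sketch; the sample strings are arbitrary picks of mine, and the codec names are the ones shipped with Python’s standard library:&lt;/p&gt;

```python
samples = {
    "ascii": "Cat",
    "latin-1": "ca\N{LATIN SMALL LETTER N WITH TILDE}a",
    # "neko" (cat), written with two hiragana characters
    "shift_jis": "\N{HIRAGANA LETTER NE}\N{HIRAGANA LETTER KO}",
}

# For each encoding, record (number of characters, number of encoded bytes).
sizes = {}
for encoding, text in samples.items():
    encoded = text.encode(encoding)
    sizes[encoding] = (len(text), len(encoded))
    print(encoding, len(text), "characters,", len(encoded), "bytes")
```

&lt;p&gt;The two 8-bit encodings still use one byte per character, while the Shift-JIS sample needs twice as many bytes as it has characters.&lt;/p&gt;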

&lt;p&gt;For a while, the most common solution was to use a single encoding for each language, and then hope for the best. This often led to garbled text (who hasn’t seen the infamous &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;�&lt;/code&gt; character at least once?), so much so that a specific term was coined to describe it - &lt;em&gt;“mojibake”&lt;/em&gt;, from the Japanese &lt;em&gt;“文字化け”&lt;/em&gt; &lt;em&gt;(“character transformation”)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mcilloni.ovh/public/mojibake.jpg&quot; alt=&quot;KOI8-R text mistakenly written on an envelope as ISO-8859-1 text&quot; title=&quot;I guess they thought it was actual Russian text&quot; class=&quot;center-image&quot; /&gt;&lt;/p&gt;
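
&lt;p&gt;The kind of mix-up shown on the envelope is easy to reproduce. This Python sketch (the sample word is my own pick) encodes some Russian text as KOI8-R and then decodes the very same bytes while wrongly assuming they are ISO-8859-1:&lt;/p&gt;

```python
original = "Привет"  # "hello" in Russian

# The sender stores the text as KOI8-R bytes...
raw = original.encode("koi8_r")

# ...but the receiver wrongly assumes the bytes are ISO-8859-1 (Latin-1).
garbled = raw.decode("latin-1")
print(garbled)

# Nothing was lost: redoing the steps with the right encoding recovers the text.
recovered = garbled.encode("latin-1").decode("koi8_r")
```

&lt;p&gt;Both encodings use a single byte per character, so the garbled string has the same length as the original and nothing hints at the mistake, besides the text being unreadable.&lt;/p&gt;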

&lt;p&gt;In general, for a long time using a non-English locale meant that you had to contend with broken third (often first) party software, patchy support for certain characters, and switching encodings on the fly depending on the context. The inconvenience was such that it was common for non-Latin Internet users to converse in their native languages with the Latin alphabet, using impromptu transliterations if necessary. A prime example of this was the Arabic chat alphabet widespread among Arabic-speaking netizens in the 90’s and 00’s &lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;unicode&quot;&gt;Unicode&lt;/h2&gt;

&lt;p&gt;It was clear to most people back then that the situation was untenable, so much so that as early as the late ’80s people started proposing a universal character encoding capable of covering all modern scripts and symbols in use.&lt;/p&gt;

&lt;p&gt;This led to the creation of &lt;strong&gt;Unicode&lt;/strong&gt;, whose first version was standardised in 1991 after a few years of joint development led by Xerox and Apple (among others). Unicode’s main design goal was, and still is, to define a universal &lt;strong&gt;character set&lt;/strong&gt; capable of representing all the aforementioned characters, alongside a &lt;strong&gt;character encoding&lt;/strong&gt; capable of uniformly representing them all.&lt;/p&gt;

&lt;p&gt;In Unicode, every character, or more properly &lt;strong&gt;code point&lt;/strong&gt;, is represented by a unique number, belonging to a specific &lt;strong&gt;Unicode block&lt;/strong&gt;. Crucially, the first block of Unicode (“Basic Latin”) corresponds point per point to ASCII, so that &lt;em&gt;all ASCII characters correspond to equivalent Unicode codepoints&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Code points are usually represented with the syntax &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+XXXX&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XXXX&lt;/code&gt; is the hexadecimal representation of the code point. For instance, the code point for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; character is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+0041&lt;/code&gt;, while the code point for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ñ&lt;/code&gt; character is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+00F1&lt;/code&gt;.&lt;/p&gt;
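
&lt;p&gt;As a small illustration, this notation can be obtained in Python from the built-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ord()&lt;/code&gt; function. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;code_point&lt;/code&gt; helper below is a made-up name, not part of any library:&lt;/p&gt;

```python
import unicodedata


def code_point(char):
    """Format a single character using the U+XXXX notation."""
    return "U+{:04X}".format(ord(char))


for char in ("A", "\N{LATIN SMALL LETTER N WITH TILDE}"):
    print(code_point(char), unicodedata.name(char))

# Basic Latin mirrors ASCII: each of the first 128 code points encodes to
# the ASCII byte with the very same numeric value.
ascii_matches = all(chr(n).encode("ascii") == bytes([n]) for n in range(128))
```

&lt;p&gt;The last check is a direct consequence of the “Basic Latin” block being laid out exactly like the ASCII table.&lt;/p&gt;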

&lt;p&gt;Unicode 1.0 defined 26 scripts and 7,161 characters, covering most of the world’s languages and lots of commonplace symbols and glyphs.&lt;/p&gt;

&lt;h2 id=&quot;ucs-2-or-how-unicode-made-everything-worse&quot;&gt;UCS-2, or &lt;em&gt;“how Unicode made everything worse”&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;Alongside the first Unicode specification, which defined the character set, two&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; new character encodings, called &lt;strong&gt;UCS-2&lt;/strong&gt; and &lt;strong&gt;UCS-4&lt;/strong&gt; (which came a bit later), were also introduced. UCS-2 was the original Unicode encoding, and it’s an extension of ASCII to 16 bits, representing what Unicode called the &lt;strong&gt;Basic Multilingual Plane&lt;/strong&gt; (&lt;em&gt;“BMP”&lt;/em&gt;); UCS-4 is the same but with 32-bit values. Both were fixed-width encodings, using multiple bytes to represent each single character in a string.&lt;/p&gt;

&lt;p&gt;In particular, UCS-2’s maximum range of 65,536 possible values was good enough to cover the entire Unicode 1.0 set of characters. The storage savings compared with UCS-4 were quite enticing, also - while ’90s machines weren’t as constrained as the ones that came before, representing basic Latin characters with 4 bytes was still seen as an egregious waste.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
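
&lt;p&gt;The difference in size is easy to measure. Python does not ship codecs named UCS-2 or UCS-4, so the sketch below assumes that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;utf-16-le&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;utf-32-le&lt;/code&gt; are acceptable stand-ins, given that they produce the same fixed-size units for plain Latin text:&lt;/p&gt;

```python
text = "Cat!"

# Stand-ins for UCS-2 and UCS-4 (an assumption of this sketch): for text in
# the Basic Multilingual Plane, these codecs emit one fixed-size unit per
# character, without a byte order mark.
ucs2 = text.encode("utf-16-le")
ucs4 = text.encode("utf-32-le")

print("ASCII:", len(text.encode("ascii")), "bytes")
print("UCS-2:", len(ucs2), "bytes")
print("UCS-4:", len(ucs4), "bytes")

# Every ASCII byte is simply padded with zeroes to fill the wider unit.
print(ucs2.hex())
```

&lt;p&gt;For English text, half of the UCS-2 bytes and three quarters of the UCS-4 ones are just zero padding.&lt;/p&gt;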

&lt;p&gt;Thus, 16 bits quickly became the standard size for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wchar_t&lt;/code&gt; type recently added by the C89 standard to support wide characters for encodings like Shift-JIS. Sure, switching from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wchar_t&lt;/code&gt; required developers to rewrite all code to use wide characters and wide functions, but a bit of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt; was a small price to pay for the ability to resolve internationalization, right?&lt;/p&gt;

&lt;p&gt;The C library had also introduced, alongside the new wide char type, a set of functions and types to handle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wchar_t&lt;/code&gt;, wide strings and (poorly designed) locale support functions, including support for multibyte encodings. Some vendors, like Microsoft, even devised tricks to make it possible to optionally switch from legacy 8-bit codepages to UCS-2 by using ad-hoc types like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TCHAR&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LPTSTR&lt;/code&gt; in place of specific character types.&lt;/p&gt;

&lt;p&gt;All of that said, the code snippet above could be rewritten on Win32 as the following:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;ctype.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;tchar.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#if !defined(_UNICODE) &amp;amp;&amp;amp; !defined(UNICODE)
#   include &amp;lt;stdio.h&amp;gt;
#endif
&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_tmain&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TCHAR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// converts all arguments to uppercase&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TCHAR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// iterate over each character in the string, and print its uppercase&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TCHAR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_puttchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_totupper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;it&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;_puttchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos; &apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;_puttchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;\n&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Neat, right? This was indeed considered so convenient that developers jumped on the UCS-2 bandwagon in droves, finally glad the encoding mess was over.&lt;/p&gt;

&lt;p&gt;16-bit Unicode was indeed a huge success, as attested by the number of applications and libraries that adopted it during the ’90s:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Windows NT, 2000 and XP used UCS-2 as their internal character encoding, and exposed it to developers via the Win32 API;&lt;/li&gt;
  &lt;li&gt;Apple’s Cocoa, too, used UCS-2 as its internal character encoding for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NSString&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unichar&lt;/code&gt;;&lt;/li&gt;
  &lt;li&gt;Sun’s Java used UCS-2 as its internal character encoding for all strings, even going as far as to define its String type as an array of 16-bit characters;&lt;/li&gt;
  &lt;li&gt;Javascript, too, didn’t want to be left behind, and basically defined its String type the same way Java did;&lt;/li&gt;
  &lt;li&gt;Qt, the popular C++ GUI framework, used UCS-2 as its internal character encoding, and exposed it to developers via the QString class;&lt;/li&gt;
  &lt;li&gt;Unreal Engine just copied the WinAPI approach and used UCS-2 as its internal character encoding &lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and many more. Every once in a while, I still find out that some piece of code I frequently use is still using UCS-2 (or UTF-16, see later) internally. In general, every time you read something along the lines of “Unicode support” without any reference to UTF, there’s an almost 100% chance that it actually means “UCS-2”, or some borked variant of it.&lt;/p&gt;

&lt;h2 id=&quot;combining-characters&quot;&gt;Combining characters&lt;/h2&gt;

&lt;p&gt;Since its first release, Unicode has supported the concept of &lt;strong&gt;combining characters&lt;/strong&gt;, which are characters meant to be combined with other characters by text processing tools in order to form a single unit (later better defined as a &lt;strong&gt;grapheme cluster&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;In Unicode jargon, these are called &lt;strong&gt;composite sequences&lt;/strong&gt; and were designed to allow Unicode to represent scripts like Arabic, which uses a lot of diacritics and other combining characters, without having to define a separate code point for each possible combination.&lt;/p&gt;

&lt;p&gt;This could have been in principle a neat idea - grapheme clusters allow Unicode to save a massive amount of code points from being pointlessly wasted for easily combinable characters (just think about South Asian languages or Hangul). The real issue was that the Consortium, anxious to help with the transition to Unicode, did not want to drop support for dedicated codepoints for “preassembled” characters such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;è&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ñ&lt;/code&gt;, which were historically supported by the various extended ASCII encodings.&lt;/p&gt;

&lt;p&gt;This led to Unicode supporting &lt;strong&gt;precomposed characters&lt;/strong&gt;, which are codepoints that stand for &lt;em&gt;a glyph that can also be represented using a grapheme cluster&lt;/em&gt;. Examples of this are the Extended Latin characters with accents or diacritics, which can all be represented either by combining the base Latin character with the corresponding modifier, or by using a single code point.&lt;/p&gt;

&lt;p&gt;For instance, let’s try testing out a few things with Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unicodedata&lt;/code&gt; and two seemingly identical strings, “caña” and “caña” (notice how they look the same):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;unicodedata&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;caña&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;caña&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;
&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Uh?&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;caña&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;caña&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The two strings are visually identical - they are rendered the same by our Unicode-enabled terminal - and yet they do not evaluate as equal, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;len()&lt;/code&gt; function returns different lengths. This is because the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ñ&lt;/code&gt; in the second string is a grapheme cluster composed of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+006E LATIN SMALL LETTER N&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+0303 COMBINING TILDE&lt;/code&gt; characters, which the terminal combines into a single glyph.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;c&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;a&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;ñ&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;a&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;c&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;a&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;n&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;̃&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;a&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER C&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER A&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER N WITH TILDE&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER A&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER C&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER A&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER N&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING TILDE&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER A&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is obviously a big departure from the “strings are just arrays of characters” model the average developer is used to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Trivial comparisons like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a == b&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strcmp(a, b)&lt;/code&gt; &lt;strong&gt;are no longer trivial&lt;/strong&gt;. A Unicode-aware algorithm must be implemented in order to actually compare the strings as they are rendered or printed;&lt;/li&gt;
  &lt;li&gt;Random access to characters is no longer safe, because a single glyph can span multiple code points, and thus multiple array elements.&lt;/li&gt;
&lt;/ol&gt;
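
&lt;p&gt;As a minimal sketch of what such a Unicode-aware comparison could look like, the snippet below brings both strings to the same normalization form with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unicodedata.normalize()&lt;/code&gt; before comparing them. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;same_text&lt;/code&gt; helper is a made-up name used only for illustration, and the two strings are spelled with escapes so that the difference between them is visible in the source:&lt;/p&gt;

```python
import unicodedata

precomposed = "ca\u00f1a"   # uses U+00F1 LATIN SMALL LETTER N WITH TILDE
decomposed = "can\u0303a"   # uses U+006E followed by U+0303 COMBINING TILDE


def same_text(x, y, form="NFC"):
    """Hypothetical helper: compare two strings in the same normalization form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)


print(precomposed == decomposed)                        # False
print(same_text(precomposed, decomposed))               # True
print(len(unicodedata.normalize("NFD", precomposed)))   # 5
```

&lt;p&gt;This only covers canonical equivalence; case folding, compatibility forms and locale-specific collation are separate problems that a sketch like this does not attempt to solve.&lt;/p&gt;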

&lt;h2 id=&quot;640k-16-bits-ought-to-be-enough-for-everyone&quot;&gt;&lt;em&gt;“&lt;del&gt;640k&lt;/del&gt; 16 bits ought to be enough for everyone”&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;Anyone with any degree of familiarity with Asian languages will have noticed that 7,161 characters is way too small a number to include the tens of thousands of Chinese characters in existence. This is without counting minor and historical scripts, and the thousands of symbols and glyphs used in mathematics, music, and other fields.&lt;/p&gt;

&lt;p&gt;In the years following 1991, the Unicode character set was thus expanded with tens of thousands of new characters, and it quickly became apparent that UCS-2 was soon going to run out of 16-bit code points.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;To circumvent this issue, the Unicode Consortium decided to expand the character set from 16 to 21 bits. This was a huge breaking change that basically meant &lt;strong&gt;obsoleting UCS-2&lt;/strong&gt; (and thus breaking most software designed in the ’90s), just a few years after its introduction and widespread adoption.&lt;/p&gt;

&lt;p&gt;While UCS-2 was still capable of representing anything inside the BMP, it became clear a new encoding was needed to support the growing set of characters in the UCS.&lt;/p&gt;

&lt;h2 id=&quot;utf&quot;&gt;UTF&lt;/h2&gt;

&lt;p&gt;The acronym &lt;em&gt;“UTF”&lt;/em&gt; stands for &lt;em&gt;“Unicode Transformation Format”&lt;/em&gt;, and represents a family of encodings, most of them &lt;strong&gt;variable-width&lt;/strong&gt;, capable of representing the whole Unicode character set, up to the 2²¹ code points that a 21-bit value could theoretically address. Compared to UCS, UTF encodings specify how a given stream of bytes can be converted into a sequence of Unicode code points, and vice versa (i.e., &lt;em&gt;“transformed”&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Compared to a fixed-width encoding like UCS-2, a variable-width character encoding can employ a variable number of code units to encode each character. This bypasses the “one code unit per character” limitation of fixed-width encodings, and allows the representation of a much larger number of characters—potentially, an infinite number, depending on how many &lt;em&gt;“lead units”&lt;/em&gt; are reserved as markers for multi-unit sequences.&lt;/p&gt;

&lt;p&gt;Excluding the dead-on-arrival UTF-1, there are 4 UTF encodings in use today:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;UTF-8, a variable-width encoding that uses 1-byte code units&lt;/li&gt;
  &lt;li&gt;UTF-16, a variable-width encoding that uses 2-byte code units&lt;/li&gt;
  &lt;li&gt;UTF-32, a fixed-width encoding that uses 4-byte code units&lt;/li&gt;
  &lt;li&gt;UTF-EBCDIC, a variable-width encoding that uses 1-byte code units, designed for IBM’s EBCDIC systems (note: I think it’s safe to argue that using EBCDIC in 2023 edges very close to being a felony)&lt;/li&gt;
&lt;/ul&gt;
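
&lt;p&gt;To get a feel for how the first three differ in practice, here is a small sketch that encodes a handful of characters and counts the resulting bytes. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encoded_sizes&lt;/code&gt; helper is a name invented for this example, and the explicit little-endian codec names are used only to keep byte order marks out of the count:&lt;/p&gt;

```python
def encoded_sizes(ch):
    """Hypothetical helper: bytes needed for one character in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# an ASCII letter, an accented Latin letter, a CJK ideograph and an emoji
for ch in ("a", "\u00f1", "\u4e2d", "\U0001F600"):
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

&lt;p&gt;The ASCII letter takes 1, 2 and 4 bytes respectively, while the emoji takes 4 bytes in all three encodings: four UTF-8 code units, two UTF-16 code units, or a single UTF-32 code unit.&lt;/p&gt;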

&lt;h3 id=&quot;utf-16&quot;&gt;UTF-16&lt;/h3&gt;

&lt;p&gt;To salvage the substantial investments made to support UCS-2, the Unicode Consortium created UTF-16 as a backward-compatible extension of UCS-2. When some piece of software advertises “support for UNICODE”, it almost always means that the software originally supported UCS-2 and switched to UTF-16 sometime later. &lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Like UCS-2, UTF-16 can represent the entirety of the BMP using a single 16-bit value. Every codepoint above &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+FFFF&lt;/code&gt; is represented using a pair of 16-bit values, called a &lt;strong&gt;surrogate pair&lt;/strong&gt;. The first value (the &lt;em&gt;“high surrogate”&lt;/em&gt;) is always a value in the range &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+D800&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+DBFF&lt;/code&gt;, while the second value (the &lt;em&gt;“low surrogate”&lt;/em&gt;) is always a value in the range &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+DC00&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+DFFF&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This, in practice, means that the range reserved for BMP characters never overlaps with surrogates, making it trivial to distinguish between a single 16-bit codepoint and a surrogate pair, which makes UTF-16 &lt;a href=&quot;https://en.wikipedia.org/wiki/Self-synchronizing_code&quot;&gt;&lt;em&gt;self-synchronizing&lt;/em&gt;&lt;/a&gt; over 16-bit values.&lt;/p&gt;

&lt;p&gt;Emojis are an example of characters that lie outside of the BMP; as such, they are always represented using surrogate pairs.
For instance, the character &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+1F600&lt;/code&gt; (😀) is represented in UTF-16 by the surrogate pair &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[0xD83D, 0xDE00]&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# pack the surrogate pair into bytes by hand, and then decode it as UTF-16
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bys&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cp&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xD83D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xDE00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;little&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bys&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;216&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;222&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;utf-16le&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&apos;😀&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
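
&lt;p&gt;The pair itself is not arbitrary. A sketch of the arithmetic, assuming the standard UTF-16 scheme: subtract &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x10000&lt;/code&gt; from the code point, split the remaining 20 bits into two 10-bit halves, and add each half to the base of the corresponding surrogate range. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;to_surrogate_pair&lt;/code&gt; function below is a name made up for this illustration:&lt;/p&gt;

```python
def to_surrogate_pair(code_point):
    """Hypothetical helper: split a code point above U+FFFF into (high, low)."""
    if code_point not in range(0x10000, 0x110000):
        raise ValueError("only U+10000 to U+10FFFF are encoded as surrogate pairs")
    # 20 bits remain once the 0x10000 offset is removed; dividing by 0x400 (2**10)
    # yields the top 10 bits as the quotient and the bottom 10 bits as the remainder
    top, bottom = divmod(code_point - 0x10000, 0x400)
    return 0xD800 + top, 0xDC00 + bottom


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```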

&lt;h3 id=&quot;the-bom&quot;&gt;The BOM&lt;/h3&gt;

&lt;p&gt;Notice that in the example above I had to specify an endianness for the bytes (little-endian in this case) by writing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;utf-16le&quot;&lt;/code&gt; instead of just &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;utf-16&quot;&lt;/code&gt;. This is due to the fact that &lt;em&gt;UTF-16 is actually two different (incompatible) encodings&lt;/em&gt;, UTF-16LE and UTF-16BE, which differ in the endianness of the individual 16-bit code units. &lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The standard calls for UTF-16 streams to start with a &lt;strong&gt;Byte Order Mark&lt;/strong&gt; (BOM), represented by the special codepoint &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+FEFF&lt;/code&gt;. Reading &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xFEFF&lt;/code&gt; indicates that the endianness of a text block is the same as the endianness of the decoding system; reading those bytes flipped, as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0xFFFE&lt;/code&gt;, indicates opposite endianness instead.&lt;/p&gt;

&lt;p&gt;As an example, let’s assume a big-endian system has generated the sequence &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[0xFE, 0xFF, 0x00, 0x61]&lt;/code&gt;. &lt;br /&gt;
All systems, LE or BE, will read the first two bytes as a single 16-bit value, interpreting them according to their own endianness. Then:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A &lt;em&gt;big-endian&lt;/em&gt; system will decode &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+FEFF&lt;/code&gt;, which is the BOM, and thus will assume the text is in UTF-16 in its same byte endianness (BE);&lt;/li&gt;
  &lt;li&gt;A &lt;em&gt;little-endian&lt;/em&gt; system will instead read &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+FFFE&lt;/code&gt;, which is still the BOM but flipped, so it will assume the text is in the opposite endianness (BE in the case of an LE system).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In both cases, the BOM allows the following character to be correctly parsed as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+0061&lt;/code&gt; (a.k.a. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;If no BOM is detected, then most decoders will do as they please (despite the standard recommending to assume UTF-16BE), which most of the time means assuming the endianness of the system:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;byteorder&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&apos;little&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# BOM read as 0xFEFF and system is LE -&amp;gt; will assume UTF-16LE
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xFE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;utf-16&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;span class=&quot;s&quot;&gt;&apos;abc&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# BOM read as 0xFFFE and system is LE -&amp;gt; will assume UTF-16BE
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xFE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;utf-16&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&apos;abc&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# no BOM, text is BE and system is LE -&amp;gt; will assume UTF-16LE and read garbage
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;utf-16&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&apos;愀戀挀&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# no BOM, text is BE and UTF-16BE is explicitly specified -&amp;gt; will read the text correctly
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;utf-16be&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&apos;abc&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some decoders may probe the first few codepoints for zeroes to detect the endianness of the stream, which is in general not an amazing idea. As a rule of thumb, UTF-16 text should &lt;strong&gt;never rely on automated endianness detection&lt;/strong&gt;, and thus either always start with a BOM or assume a fixed endianness value (which in the vast majority of cases is UTF-16LE, which is what Windows does).&lt;/p&gt;
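
&lt;p&gt;That rule of thumb could be sketched roughly as follows. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode_utf16&lt;/code&gt; function is a hypothetical helper, not part of any library: it honours a BOM when one is present, and otherwise falls back to a fixed, caller-chosen byte order instead of guessing. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;codecs.BOM_UTF16_BE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;codecs.BOM_UTF16_LE&lt;/code&gt; are the standard library constants for the two byte sequences, while the little-endian default is an assumption that simply mirrors the Windows convention:&lt;/p&gt;

```python
import codecs


def decode_utf16(data, fallback="utf-16-le"):
    """Hypothetical helper: decode UTF-16 bytes without guessing the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    # no BOM: use the fixed byte order chosen by the caller (assumed LE by default)
    return data.decode(fallback)


print(decode_utf16(bytes([0xFE, 0xFF, 0x00, 0x61])))               # a
print(decode_utf16(bytes([0xFF, 0xFE, 0x61, 0x00])))               # a
print(decode_utf16(bytes([0x00, 0x61]), fallback="utf-16-be"))     # a
```

&lt;p&gt;Unlike the plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;utf-16&quot;&lt;/code&gt; codec, the result of this sketch does not depend on the endianness of the machine running it.&lt;/p&gt;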

&lt;h3 id=&quot;utf-32&quot;&gt;UTF-32&lt;/h3&gt;
&lt;p&gt;Just as UTF-16 is an extension of UCS-2, &lt;strong&gt;UTF-32&lt;/strong&gt; is an evolution of UCS-4. Compared to all other UTF encodings, UTF-32 is by far the simplest, because like its predecessor, it is a &lt;strong&gt;fixed-width encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The major difference between UCS-4 and UTF-32 is that the latter has been limited to 21 bits, down from its maximum of 31 bits (UCS-4 was signed). This has been done to maintain compatibility with UTF-16, which is constrained by its design to only represent codepoints up to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+10FFFF&lt;/code&gt;.&lt;/p&gt;
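
&lt;p&gt;A quick sanity check of where that ceiling comes from, as a sketch: two 10-bit surrogates can address 0x400 × 0x400 code points on top of the 0x10000 that fit in the BMP. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decodes_as_utf32&lt;/code&gt; function is a name invented for this example; it only reports whether Python’s UTF-32 codec accepts a given 32-bit value:&lt;/p&gt;

```python
# the last code point reachable through a surrogate pair
highest = 0x10000 + 0x400 * 0x400 - 1
print(hex(highest))  # 0x10ffff


def decodes_as_utf32(value):
    """Hypothetical helper: True if the codec accepts this 32-bit value."""
    try:
        value.to_bytes(4, "big").decode("utf-32-be")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf32(0x10FFFF))  # True
print(decodes_as_utf32(0x110000))  # False
```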

&lt;p&gt;While UTF-32 seems convenient at first, it is not in practice all that useful, for quite a few reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;UTF-32 is outrageously wasteful because all characters, including those belonging to the ASCII range, are represented using 4 bytes. Given that the vast majority of text uses ASCII characters for markup, content or both, UTF-32 encoded text tends to be mostly comprised of just a few significant bytes scattered in a sea of zeroes:&lt;/p&gt;

    &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# UTF-32BE encoded text with BOM
&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xFE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;utf-32&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;s&quot;&gt;&apos;abc&apos;&lt;/span&gt;
 &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# The same, but in UTF-16BE
&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xFE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;utf-16&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;s&quot;&gt;&apos;abc&apos;&lt;/span&gt;
 &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# The same, but in ASCII
&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0x61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;ascii&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;s&quot;&gt;&apos;abc&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;No major OS or software uses UTF-32 as its internal encoding, as far as I’m aware. While locales in modern UNIX systems usually define &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wchar_t&lt;/code&gt; as representing UTF-32 codepoints, wide characters are seldom used there, because most software in existence assumes that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wchar_t&lt;/code&gt; is 16 bits wide.&lt;/p&gt;

    &lt;p&gt;On Linux, for instance:&lt;/p&gt;

    &lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;locale.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;wchar.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// one of the bajillion ways to set a Unicode locale - we&apos;ll talk UTF-8 later&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;setlocale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LC_ALL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;en_US.UTF-8&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; 
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;wchar_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;L&quot;abc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;sizeof(wchar_t) == %zu&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 4&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;wcslen(s) == %zu&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wcslen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 3&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;bytes in s == %zu&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 16 (12 + 4, due to the null terminator)&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The fact UTF-32 is a fixed-width encoding is only marginally useful, due to &lt;em&gt;grapheme clusters still being a thing&lt;/em&gt;. This means that the equivalence between codepoints and rendered glyphs is still not 1:1, just like in UCS-4:&lt;/p&gt;

    &lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;c1&quot;&gt;// GNU/Linux, x86_64&lt;/span&gt;

 &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;locale.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;wchar.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;setlocale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LC_ALL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;en_US.UTF-8&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// &quot;caña&quot;, with &apos;ñ&apos; written as the grapheme cluster &quot;n&quot; + &quot;combining tilde&quot;&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;wchar_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;L&quot;can\u0303a&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;n&quot;&gt;wprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;L&quot;`%ls`&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// prints &quot;caña&quot; as 4 glyphs&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;wprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;L&quot;`%ls` is %zu codepoints long&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wcslen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 5 codepoints&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;wprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;L&quot;`%ls` is %zu bytes long&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 24 bytes (5 UCS-4 codepoints + null)&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// this other string is the same as the previous one, but with the precomposed &quot;ñ&quot; character&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;wchar_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;L&quot;ca\u00F1a&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;_Bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;different&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wcscmp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// this will always print &quot;different&quot;, because the two strings are not the same despite looking identical&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;wprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;L&quot;`%ls` and `%ls` are %s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;different&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;different&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;equal&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;cc &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; widestr_test widestr_test.c &lt;span class=&quot;nt&quot;&gt;-std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;c11
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./widestr_test
 &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;caña&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
 &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;caña&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt; is 5 codepoints long
 &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;caña&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt; is 24 bytes long
 &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;caña&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt; and &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;caña&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt; are different
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;This is by far the biggest letdown about UTF-32: it is not the ultimate “extended ASCII” encoding most people wished for, because it is still incorrect to iterate over it character by character, and it still requires normalization &lt;em&gt;(see below)&lt;/em&gt; before it can be safely operated on one codepoint at a time.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;utf-8&quot;&gt;UTF-8&lt;/h3&gt;

&lt;p&gt;I left &lt;strong&gt;UTF-8&lt;/strong&gt; for last because it is by far the best of the crop of Unicode encodings&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;. UTF-8 is a variable-width encoding, just like UTF-16, but with the crucial advantage that &lt;strong&gt;UTF-8 uses byte-sized (8-bit) code units&lt;/strong&gt;, just like ASCII.&lt;/p&gt;

&lt;p&gt;This is a major advantage, for a series of reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;All ASCII text is valid UTF-8: ASCII is simply the subset of UTF-8 limited to the codepoints between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+0000&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+007F&lt;/code&gt;.
    &lt;ul&gt;
      &lt;li&gt;This also implies that UTF-8 can encode ASCII text with one byte per character, even when mixed up with non-Latin characters;&lt;/li&gt;
      &lt;li&gt;Editors, terminals and other software can just support UTF-8 without having to support a separate ASCII mode;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;UTF-8 doesn’t require bothering with endianness, because bytes are just that - bytes. This means that UTF-8 does not require a BOM, even though poorly designed software may still add one &lt;em&gt;(see below)&lt;/em&gt;;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;UTF-8 doesn’t need a wide char type, like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wchar_t&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char16_t&lt;/code&gt;. Old APIs can use classic byte-sized &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chars&lt;/code&gt;, and just disregard characters above &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+007F&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following is an arguably poorly designed C program that parses a basic key-value file format defined as follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;key1:value1
key2:value2
key\:3:value3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stddef.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#define BUFFER_SIZE 1024
&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;usage: %s &amp;lt;file&amp;gt;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;FILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fopen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;r&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;error: could not open file `%s`&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;retval&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_SUCCESS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BUFFER_SIZE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;error: could not allocate memory&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;retval&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BUFFER_SIZE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;ptrdiff_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;_Bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;escape&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(;;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fgetc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;switch&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;\\&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;escape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;escape&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;:&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;escape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;fprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;error: extra `:` at position %td&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;retval&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

                    &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

                &lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

                &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;\n&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;escape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;fprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;error: missing `:`&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;retval&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

                &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;key: `%.*s`, value: `%.*s`&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;key_offs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;line_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
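            &lt;span class=&quot;c1&quot;&gt;// keep the old pointer around: if realloc fails, `line` is still valid and gets freed at `end`&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_line&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;realloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;fprintf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;error: could not allocate memory&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;retval&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

                &lt;span class=&quot;k&quot;&gt;goto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;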
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;escape&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nl&quot;&gt;end:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fclose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;retval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;cc &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; kv kv.c &lt;span class=&quot;nt&quot;&gt;-std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;c11
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;kv_test.txt
key1:value1
key2:value2
key&lt;span class=&quot;se&quot;&gt;\:&lt;/span&gt;3:value3
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./kv kv_test.txt
key: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;key1&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;, value: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;value1&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
key: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;key2&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;, value: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;value2&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
key: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;key:3&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;, value: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;value3&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This program operates on files &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt; by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt; (or rather, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt;, but that’s a long story), using whatever the “native” 8-bit (“narrow”) execution character set is to match basic ASCII characters such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\n&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The beauty of UTF-8 is that code that splits, searches, or synchronises using ASCII &lt;em&gt;symbols&lt;/em&gt;&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; will work fine as-is, with little to no modification, even with Unicode text.&lt;/p&gt;

&lt;p&gt;Standard C character literals will still be valid Unicode codepoints, as long as the encoding of the source file is UTF-8. In the file above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;:&apos;&lt;/code&gt; and other ASCII literals will fit in a char (int, really) as long as they are encoded as ASCII (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+003A&lt;/code&gt;).&lt;/p&gt;

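&lt;p&gt;Like UTF-16, UTF-8 is self-synchronizing: the splitting logic above will never match in the middle of a multi-byte UTF-8 sequence, given that the byte values from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x00&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x7F&lt;/code&gt; only ever encode the ASCII codepoints between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+0000&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+007F&lt;/code&gt;, and never appear inside the encoding of any other codepoint. The text can then be returned to the UTF-8 compliant system as it is, and the Unicode text will be correctly rendered.&lt;/p&gt;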

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;kv_test_utf8.txt
tcp:127.0.0.1
Affet, affet:Yalvarıyorum
Why? 😒:blåbær
Spla&lt;span class=&quot;se&quot;&gt;\:&lt;/span&gt;too:3u33
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./kv kv_test_utf8.txt
key: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;tcp&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;, value: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;127.0.0.1&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
key: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;Affet, affet&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;, value: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;Yalvarıyorum&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
key: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;Why? 😒&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;, value: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;blåbær&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
key: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;Spla:too&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;, value: &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;3u33&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
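
&lt;p&gt;The same property can be checked outside of C. What follows is a minimal Python sketch, not part of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kv&lt;/code&gt; program above: the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ASCII_BYTES&lt;/code&gt; is made up for illustration, and the sample line is borrowed from the test file. It splits raw UTF-8 bytes on an ASCII separator and then decodes both halves:&lt;/p&gt;

```python
# In UTF-8, byte values 0x00 to 0x7F are only ever used for ASCII codepoints.
ASCII_BYTES = range(0x80)

# Every byte in the encoding of a non-ASCII codepoint falls outside of that range
# (U+1F612 is the emoji from the test file, U+00E5 and U+00E6 are the Danish letters)...
for ch in "\U0001F612\u00e5\u00e6":
    assert all(b not in ASCII_BYTES for b in ch.encode("utf-8"))

# ...so splitting raw bytes on an ASCII separator cannot cut a codepoint in half.
line = "Why? \U0001F612:bl\u00e5b\u00e6r".encode("utf-8")
key, sep, value = line.partition(b":")

# Both halves are still valid UTF-8 and decode without errors.
print(key.decode("utf-8"))
print(value.decode("utf-8"))
```

&lt;p&gt;Had the split landed inside a multi-byte sequence, one of the two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode()&lt;/code&gt; calls would have raised a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UnicodeDecodeError&lt;/code&gt;.&lt;/p&gt;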

&lt;h2 id=&quot;unicode-normalization&quot;&gt;Unicode Normalization&lt;/h2&gt;

&lt;p&gt;As I previously mentioned, Unicode codepoints can be modified using combining characters, and the standard supports precomposed forms of some characters which have decomposed forms.
The resulting glyphs are visually indistinguishable after being rendered, and there’s no limitation on using both forms alongside each other in the same bit of text:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;unicodedata&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;Störfälle&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN CAPITAL LETTER S&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER T&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER O WITH DIAERESIS&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER R&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER F&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER A&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING DIAERESIS&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER L&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER L&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER E&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# getting the last 4 characters actually picks the last 3 glyphs, plus a combining character
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# sometimes the combining character may be mistakenly rendered over the `&apos;` Python prints around the string
&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&apos;̈lle&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;COMBINING DIAERESIS&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER L&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER L&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER E&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a significant issue, given how character-centric our understanding of text is: users (and by extension, developers) expect to be able to count what they see as “letters”, in a way that is consistent with how they are printed, shown on screen or inputted in a text field.&lt;/p&gt;
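
&lt;p&gt;To give an idea of what counting “letters” involves, here is a rough Python sketch. The helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rough_clusters&lt;/code&gt; is hypothetical, and the approach is only an approximation: it glues combining marks to the codepoint that precedes them, while proper grapheme segmentation (as specified by Unicode’s UAX #29) also has to deal with emoji sequences, Hangul and more.&lt;/p&gt;

```python
import unicodedata

def rough_clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    This is an approximation, not a full grapheme cluster algorithm.
    """
    clusters = []
    for c in text:
        # combining() returns the canonical combining class, which is 0 for base characters
        if clusters and unicodedata.combining(c) != 0:
            clusters[-1] += c
        else:
            clusters.append(c)
    return clusters

# Precomposed U+00F6 for the first umlaut, "a" + combining U+0308 for the second one
s = "St\u00f6rfa\u0308lle"

print(len(s))                  # 10 codepoints
print(len(rough_clusters(s)))  # 9 user-perceived letters
```

&lt;p&gt;Slicing the list returned by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rough_clusters(s)&lt;/code&gt; instead of the string itself keeps the diaeresis attached to its base letter.&lt;/p&gt;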

&lt;p&gt;Another headache is the fact that Unicode may also define special forms for the same letter or group of letters, which are visibly different but understood by humans to be derived from the same symbol.&lt;/p&gt;

&lt;p&gt;Very common examples of this are the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ﬁ&lt;/code&gt; (U+FB01), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ﬂ&lt;/code&gt; (U+FB02), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ﬀ&lt;/code&gt; (U+FB00) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ﬃ&lt;/code&gt; (U+FB03) ligatures, which are ubiquitous in Latin text as a “more readable” form of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fi&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fl&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ff&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffi&lt;/code&gt; letter sequences. In general, users expect &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;office&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ofﬁce&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oﬃce&lt;/code&gt; to be treated and rendered similarly, because they all represent the same word, even though they are not necessarily visually identical. &lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h3 id=&quot;canonical-and-compatibility-equivalence&quot;&gt;Canonical and Compatibility Equivalence&lt;/h3&gt;

&lt;p&gt;To solve this issue, Unicode defines two different types of equivalence between codepoints (or sequences thereof):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Canonical equivalence&lt;/strong&gt;, when two combinations of one or more codepoints represent the same “abstract” character, like in the case of &lt;em&gt;“ñ”&lt;/em&gt; and &lt;em&gt;“n + combining tilde”&lt;/em&gt;;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Compatibility equivalence&lt;/strong&gt;, when two combinations of one or more codepoints more or less represent the same “abstract” character, while being rendered differently or having different semantics, like in the case of &lt;em&gt;“ﬁ”&lt;/em&gt;, or mathematical signs such as “Mathematical Bold Capital A” (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;𝐀&lt;/code&gt;).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Canonical equivalence is generally considered a stronger form of equivalence than compatibility equivalence: it is critical for text processing tools to be able to treat canonically equivalent characters as the same; otherwise, users may be unable to search, edit or operate on text properly.&lt;sup id=&quot;fnref:12&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt; On the other hand, users are aware of compatibility-equivalent characters due to their different semantics and visual features, so their equivalence becomes relevant only in specific circumstances (like textual search, for instance, or when the user tries to copy “fancy” characters from Word to a text box that only accepts plain text).&lt;/p&gt;
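
&lt;p&gt;In code, the two kinds of equivalence boil down to comparing strings after bringing them to a common form. The sketch below relies on Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unicodedata.normalize()&lt;/code&gt; and on the NFD and NFKD forms described in the next section; the two helper names are made up for illustration.&lt;/p&gt;

```python
import unicodedata

def canonically_equal(a, b):
    # Two strings are canonically equivalent if their canonical decompositions match
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equal(a, b):
    # Compatibility equivalence is looser: ligatures and similar forms are unfolded too
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# U+00F1 (precomposed) against "n" followed by U+0303 COMBINING TILDE
print(canonically_equal("\u00f1", "n\u0303"))      # True

# "office" spelled with the U+FB03 ligature against the plain ASCII spelling
print(canonically_equal("o\ufb03ce", "office"))    # False
print(compatibility_equal("o\ufb03ce", "office"))  # True
```

&lt;p&gt;A plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;==&lt;/code&gt; on the unnormalized strings would return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt; in all three cases.&lt;/p&gt;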

&lt;h3 id=&quot;normalization-forms&quot;&gt;Normalization Forms&lt;/h3&gt;

&lt;p&gt;Unicode defines four distinct &lt;strong&gt;normalization forms&lt;/strong&gt;, which are specific forms a Unicode text can be in, and which allow for safe comparisons between strings.
The standard describes how text can be transformed into any form, following a specific &lt;em&gt;normalization algorithm&lt;/em&gt; based &lt;a href=&quot;https://www.unicode.org/charts/normalization/&quot;&gt;on per-glyph mappings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The four normalization forms are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;NFD&lt;/strong&gt;, or &lt;strong&gt;Normalization Form D&lt;/strong&gt;, which applies a single &lt;strong&gt;canonical decomposition&lt;/strong&gt; to all characters of a string. In general, this can be assumed to mean that every character that has a &lt;em&gt;canonically-equivalent&lt;/em&gt; decomposed form is in it, with all of its modifiers &lt;a href=&quot;https://unicode.org/reports/tr15/#Description_Norm&quot;&gt;sorted into a canonical order&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;For instance,&lt;/p&gt;
    &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;e&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\u0302\u0323&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&apos;ệ&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;e&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\u0302\u0323&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER E&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING CIRCUMFLEX ACCENT&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING DOT BELOW&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;NFD&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;e&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\u0302\u0323&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalized&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&apos;ệ&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER E&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING DOT BELOW&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING CIRCUMFLEX ACCENT&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Notice how the circumflex and the dot below were in a noncanonical order and were swapped by the normalization algorithm.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;NFC&lt;/strong&gt;, or &lt;strong&gt;Normalization Form C&lt;/strong&gt;, which first applies a &lt;strong&gt;canonical decomposition&lt;/strong&gt;, followed by a &lt;strong&gt;canonical composition&lt;/strong&gt;. In NFC, all characters are composed into a &lt;em&gt;precomposed character&lt;/em&gt;, if possible:&lt;/p&gt;

    &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precomposed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;NFC&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;e&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\u0302\u0323&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precomposed&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&apos;ệ&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precomposed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Notice that normalizing to NFC &lt;strong&gt;is not enough to “count” glyphs&lt;/strong&gt;, given that some may not be representable with a single codepoint. An example of this is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ẹ̄&lt;/code&gt;, which has no associated precomposed character:&lt;/p&gt;

    &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;NFC&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ẹ̄&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER E WITH DOT BELOW&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING MACRON&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;A particularly nice property of NFC is that all ASCII text is by definition already in NFC, which means that compilers and other tools do not necessarily have to bother with normalization when dealing with source code or scripts. &lt;sup id=&quot;fnref:13&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;NFKD&lt;/strong&gt;, or &lt;strong&gt;Normalization Form KD&lt;/strong&gt;, which applies a &lt;strong&gt;compatibility decomposition&lt;/strong&gt; to all characters of a string. Alongside canonical equivalence, Unicode also defines compatibility-equivalent decompositions for certain characters, like the previously mentioned &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ﬁ&lt;/code&gt; ligature, which is decomposed into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt;.&lt;/p&gt;

    &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ﬁ&quot;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LIGATURE FI&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;NFD&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# doesn&apos;t do anything, `ﬁ` has no canonical decomposition
&lt;/span&gt;  &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LIGATURE FI&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decomposed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;NFKD&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ﬁ&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decomposed&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&apos;fi&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decomposed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER F&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER I&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Characters that don’t have a compatibility decomposition are canonically decomposed instead:&lt;/p&gt;

    &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\u1EC7&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&apos;ệ&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;NFKD&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\u1EC7&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER E&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING DOT BELOW&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;COMBINING CIRCUMFLEX ACCENT&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;NFKC&lt;/strong&gt;, or &lt;strong&gt;Normalization Form KC&lt;/strong&gt;, which first applies a &lt;strong&gt;compatibility decomposition&lt;/strong&gt;, followed by a &lt;strong&gt;canonical composition&lt;/strong&gt;. In NFKC, all characters are composed into a &lt;em&gt;precomposed character&lt;/em&gt;, if possible:&lt;/p&gt;

    &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precomposed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;NFKC&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ﬁ&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# this is U+FB01, &quot;LATIN SMALL LIGATURE FI&quot;
&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precomposed&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;&apos;fi&apos;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unicodedata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;precomposed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER F&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;LATIN SMALL LETTER I&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Notice how the composition performed is &lt;em&gt;canonical&lt;/em&gt;: there’s no such thing as “compatibility composition” as far as my understanding goes. This means that NFKC never recombines characters into compatibility-equivalent forms, which are thus permanently lost:&lt;/p&gt;

    &lt;pre&gt;&lt;code class=&quot;language-Python&quot;&gt;  &amp;gt;&amp;gt;&amp;gt; s = &quot;Souﬀl\u0065\u0301&quot; # notice the `ﬀ` ligature
  &amp;gt;&amp;gt;&amp;gt; s
  &apos;Souﬀlé&apos;
  &amp;gt;&amp;gt;&amp;gt; norm = unicodedata.normalize(&apos;NFKC&apos;, s) 
  &amp;gt;&amp;gt;&amp;gt; norm
  &apos;Soufflé&apos;
  &amp;gt;&amp;gt;&amp;gt; # the ligature is gone, but the accent is still there
&lt;/code&gt;&lt;/pre&gt;
  &lt;/li&gt;
&lt;/ul&gt;
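
&lt;p&gt;As a recap, the four forms can be compared side by side on a single string. This is just a small sketch built on the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unicodedata.normalize()&lt;/code&gt; call used in the examples above; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lengths&lt;/code&gt; dictionary is only there for illustration.&lt;/p&gt;

```python
import unicodedata

# "Souffle" with an acute accent, written with the U+FB00 ligature
# and with "e" followed by U+0301 COMBINING ACUTE ACCENT
s = "Sou\ufb00le\u0301"

# Number of codepoints in each normalization form
lengths = {
    form: len(unicodedata.normalize(form, s))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

print(len(s))   # 7
print(lengths)  # {'NFD': 7, 'NFC': 6, 'NFKD': 8, 'NFKC': 7}
```

&lt;p&gt;NFD leaves the string untouched, because the ligature has no canonical decomposition and the accent is already decomposed. NFC only composes the accent, NFKD only unfolds the ligature, and NFKC does both, which is why it ends up with the same length as the original while being a different string.&lt;/p&gt;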

&lt;p&gt;All in all, normalization is a fairly complex topic, and it’s especially tricky to implement correctly due to the sheer amount of special cases, so it’s always best to rely on a well-tested library instead of rolling your own.&lt;/p&gt;

&lt;h2 id=&quot;unicode-in-the-wild-caveats&quot;&gt;Unicode in the wild: caveats&lt;/h2&gt;

&lt;p&gt;Unicode is really the only relevant character set in existence, with UTF-8 holding the status of “best encoding”.&lt;/p&gt;

&lt;p&gt;Unfortunately, internationalization support introduces a great deal of complexity into text handling, something that developers are often unaware of:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;First and foremost, there’s still a massive amount of software that doesn’t default to (or outright does not support) UTF-8, because it was either designed to work with legacy 8-bit encodings (like ISO-8859-1) or because it was designed in the ’90s to use UCS-2 and it’s permanently stuck with it or with faux &lt;em&gt;“UTF-16”&lt;/em&gt;.
Software libraries and frameworks like Qt, Java, Unreal Engine and the Win32 API are constantly converting text from UTF-8 (which is the sole Internet standard) to their internal UTF-16 representation. This is a massive waste of CPU cycles, which, while more abundant than in the past, are still a finite resource.&lt;/p&gt;

    &lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;c1&quot;&gt;// Linux x86_64, Qt 6.5.1. Encoding is `en_US.UTF-8`.&lt;/span&gt;
 &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;iostream&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;QCoreApplication&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;QDebug&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;QCoreApplication&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// converts UTF-8 (the source file&apos;s encoding) to the internal QString representation&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QString&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;caña&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt; 

     &lt;span class=&quot;c1&quot;&gt;// prints `&quot;caña&quot;`, using Qt&apos;s debugging facilities. This will convert back to UTF-8 in order&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// to print the string to the console&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;qDebug&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// prints `caña`, using C++&apos;s IOStreams. This will force Qt to convert the string to&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// a UTF-8 encoded std::string, which will then be printed to the console&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toStdString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;\n&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
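
    &lt;p&gt;The cost of this back-and-forth can be made visible with a minimal Python sketch. It only mimics the round trip by hand through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encode&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode&lt;/code&gt;; the variable names are made up for illustration, and this is not how Qt implements its conversions:&lt;/p&gt;

```python
# Sketch: the UTF-8 -> UTF-16 -> UTF-8 round trip that a UTF-16 based
# framework ends up performing on text that arrives and leaves as UTF-8.
text_utf8 = "caña".encode("utf-8")  # bytes as read from a file or a socket

# Step 1: decode the UTF-8 bytes and re-encode them as UTF-16
internal = text_utf8.decode("utf-8").encode("utf-16-le")

# Step 2: convert back to UTF-8 in order to print or send the text
back = internal.decode("utf-16-le").encode("utf-8")

print(len(text_utf8))     # 5: 'ñ' takes two bytes, the ASCII letters one each
print(len(internal))      # 8: four 16-bit code units
print(back == text_utf8)  # True: same text, two conversions later

# Outside the Basic Multilingual Plane, UTF-16 needs a surrogate pair,
# which is exactly what UCS-2 software cannot represent
emoji = "\U0001F600"
print(len(emoji.encode("utf-16-le")) // 2)  # 2 code units for one code point
```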
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Case insensitivity in Unicode is a massive headache. First and foremost, the concept itself of “ignoring case” is deeply European-centric due to it being chiefly limited to &lt;em&gt;bicameral scripts&lt;/em&gt; such as Latin, Cyrillic or Greek. What is considered the opposite case of a letter may vary as well, depending on the system’s locale:&lt;/p&gt;

    &lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Up&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;CIAO&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
         &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;ciao&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

         &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toLowerCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
         &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toUpperCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;

         &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;uc(\&quot;%s\&quot;) == \&quot;%s\&quot;: %b\n&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toUpperCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;equals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
     &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
 &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$LANG&lt;/span&gt;
 en_US.UTF-8
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;java Up
 ciao
 CIAO
 uc&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ciao&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;CIAO&quot;&lt;/span&gt;: &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;This seems to work fine until the runtime locale is switched to Turkish:&lt;/p&gt;

    &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;env &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LANG&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;tr_TR.UTF-8&apos;&lt;/span&gt; java Up
 cıao
 CİAO
 uc&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ciao&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;CIAO&quot;&lt;/span&gt;: &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;In Turkish, the uppercase of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;İ&lt;/code&gt;, and the lowercase of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;I&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ı&lt;/code&gt;, which breaks the ASCII-centric assumption the Java&lt;sup id=&quot;fnref:14&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt; snippet above is built on. There is a multitude of such examples of “naive” implementations of case insensitivity in Unicode that inevitably end up being incorrect under unforeseen circumstances.&lt;/p&gt;

    &lt;p&gt;Taking all edge cases related to Unicode case folding into account is a lot of work, especially since it’s very hard to properly test all possible locales. This is the reason why &lt;strong&gt;Unicode handling is always best left to a library&lt;/strong&gt;. For C/C++ and Java, the Unicode Consortium itself provides a reference implementation of the Unicode algorithms, called &lt;a href=&quot;https://unicode-org.github.io/icu/&quot;&gt;ICU&lt;/a&gt;, which is used by a large number of frameworks and shipped by almost every major OS.&lt;/p&gt;

    &lt;p&gt;While ICU can be tricky to use correctly and is more UTF-16 centric than I’d like, it is still way saner than any self-written alternative:&lt;/p&gt;

    &lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;unicode/ucasemap.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;unicode/utypes.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// Support custom locales&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;locale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;en_US&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;n&quot;&gt;UErrorCode&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U_ZERO_ERROR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// Create a UCaseMap object for locale-aware case mapping&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;UCaseMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;caseMap&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ucasemap_open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;locale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;U_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Error creating UCaseMap: %s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u_errorName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// Lowercase the input string (without its NUL terminator) using the rules of the given locale&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;CIAO&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
     &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lcLength&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ucasemap_utf8ToLower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;caseMap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;U_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Error lowercasing the string: %s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u_errorName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;ucasemap_close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;caseMap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// Print the lower case string&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;lc(&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;%s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;) == %.*s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lcLength&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// Clean up resources&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;ucasemap_close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;caseMap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_SUCCESS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;cc &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; casefold casefold.c &lt;span class=&quot;nt&quot;&gt;-std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;c11 &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;icu-config &lt;span class=&quot;nt&quot;&gt;--ldflags&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./casefold
 lc&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;CIAO&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; ciao
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./casefold tr_TR
 lc&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;CIAO&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; cıao
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Unicode generalises “case insensitivity” into the broader concept of &lt;a href=&quot;https://unicode.org/L2/L2002/02186-foldings-0d6.html&quot;&gt;character folding&lt;/a&gt;, which boils down to a set of rules that define how characters can be transformed into other characters, in order to make them comparable.&lt;/p&gt;
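
    &lt;p&gt;As a rough illustration of folding used for comparison, here is a small Python sketch built on the standard library’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;str.casefold()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unicodedata.normalize()&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fold_key&lt;/code&gt; helper is a hypothetical name introduced just for this example:&lt;/p&gt;

```python
import unicodedata


def fold_key(text: str) -> str:
    """Hypothetical helper: a locale-independent key for caseless comparison."""
    # Normalize before and after folding, since folding can undo normalization
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


# Lowercasing is not enough: 'ß' stays 'ß', while 'SS' becomes 'ss'
print("Straße".lower() == "STRASSE".lower())      # False
print(fold_key("Straße") == fold_key("STRASSE"))  # True

# Composed and decomposed accents end up with the same key
print(fold_key("Soufflé") == fold_key("SOUFFLE\u0301"))  # True
```

    &lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;str.casefold()&lt;/code&gt; applies the default, locale-independent folding only, so it does not cover the Turkish dotted and dotless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; rules; that still calls for a locale-aware library such as ICU.&lt;/p&gt;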
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Similarly to folding, sorting text in a well-defined order (for instance alphabetical), an operation better known as &lt;em&gt;&lt;strong&gt;collation&lt;/strong&gt;&lt;/em&gt;, is also not trivial with Unicode.&lt;/p&gt;

    &lt;p&gt;Different languages (and thus &lt;em&gt;locales&lt;/em&gt;) may have different sorting rules, even within the Latin script.&lt;/p&gt;

    &lt;p&gt;For instance, the correct order of the word list &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[ &quot;tuck&quot;, &quot;löwe&quot;, &quot;luck&quot;, &quot;zebra&quot;]&lt;/code&gt; depends on the language:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;In German, ‘Ö’ is placed between ‘O’ and ‘P’, and the rest of the alphabet follows the same order as in English. The correct sorting for that list is thus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[ &quot;löwe&quot;, &quot;luck&quot;, &quot;tuck&quot;, &quot;zebra&quot;]&lt;/code&gt;;&lt;/li&gt;
      &lt;li&gt;In Estonian, ‘Z’ is placed between ‘S’ and ‘T’, and ‘Ö’ is the penultimate letter of the alphabet. The list is then sorted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[ &quot;luck&quot;, &quot;löwe&quot;, &quot;zebra&quot;, &quot;tuck&quot;]&lt;/code&gt;;&lt;/li&gt;
      &lt;li&gt;In Swedish, ‘Ö’ is the last letter of the alphabet, with the classical Latin letters in their usual order. The list is thus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[ &quot;luck&quot;, &quot;löwe&quot;, &quot;tuck&quot;, &quot;zebra&quot;]&lt;/code&gt;.&lt;/li&gt;
    &lt;/ul&gt;
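
    &lt;p&gt;The three orderings above can be reproduced with a toy Python sketch that ranks letters by a hard-coded alphabet per language. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALPHABETS&lt;/code&gt; table and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy_sort&lt;/code&gt; helper are assumptions made up for this example; they only handle lowercase words and ignore nearly everything real collation has to deal with:&lt;/p&gt;

```python
# Toy alphabets following the rules described above; real collation
# involves many more rules (case, accents, digraphs, punctuation...).
ALPHABETS = {
    "de": "abcdefghijklmnoöpqrstuvwxyz",       # 'ö' between 'o' and 'p'
    "et": "abcdefghijklmnopqrsšzžtuvwõäöüxy",  # 'z' after 's', 'ö' near the end
    "sv": "abcdefghijklmnopqrstuvwxyzåäö",     # 'ö' is the last letter
}


def toy_sort(words, lang):
    """Hypothetical helper: sort lowercase words by a per-language alphabet."""
    rank = {letter: i for i, letter in enumerate(ALPHABETS[lang])}
    return sorted(words, key=lambda word: [rank[ch] for ch in word])


words = ["tuck", "löwe", "luck", "zebra"]

print(sorted(words))          # code point order: ['luck', 'löwe', 'tuck', 'zebra']
print(toy_sort(words, "de"))  # ['löwe', 'luck', 'tuck', 'zebra']
print(toy_sort(words, "et"))  # ['luck', 'löwe', 'zebra', 'tuck']
print(toy_sort(words, "sv"))  # ['luck', 'löwe', 'tuck', 'zebra']
```

    &lt;p&gt;Plain code point order happens to match Swedish for this particular list, which is exactly the kind of coincidence that lets naive sorting slip through testing.&lt;/p&gt;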

    &lt;p&gt;Unicode defines &lt;a href=&quot;https://unicode.org/reports/tr10/&quot;&gt;a complex set of rules for collation&lt;/a&gt; and provides a reference implementation in ICU through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ucol&lt;/code&gt; API (and its respective C++ and Java equivalents).&lt;/p&gt;

    &lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;cp&quot;&gt;#define _GNU_SOURCE // for qsort_r
&lt;/span&gt;
 &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;unicode/ustring.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;unicode/ucol.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;#include&lt;/span&gt; &lt;span class=&quot;cpf&quot;&gt;&amp;lt;unicode/uloc.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;cp&quot;&gt;
&lt;/span&gt;
 &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;strcmp_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;str1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;str2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;n&quot;&gt;UErrorCode&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U_ZERO_ERROR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UCollationResult&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cres&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ucol_strcollUTF8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;str1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strlen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;str2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strlen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cres&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UCOL_GREATER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cres&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UCOL_LESS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

 &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;sort_strings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UCollator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ptrdiff_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;qsort_r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strcmp_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

 &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// Support custom locales&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;locale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getenv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ICU_LOCALE&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;locale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;locale&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;en_US&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

     &lt;span class=&quot;n&quot;&gt;UErrorCode&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;U_ZERO_ERROR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        
     &lt;span class=&quot;c1&quot;&gt;// Create a UCollator object for locale-aware collation&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;UCollator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;coll&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ucol_open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;locale&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;U_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Error creating UCollator: %s&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u_errorName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        
     &lt;span class=&quot;n&quot;&gt;sort_strings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// Clean up resources&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;ucol_close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
        
     &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;puts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_SUCCESS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;env &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ICU_LOCALE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;de_DE ./coll &lt;span class=&quot;s2&quot;&gt;&quot;tuck&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;löwe&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;luck&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;zebra&quot;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# German&lt;/span&gt;
 löwe
 luck
 tuck
 zebra
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;env &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ICU_LOCALE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;et_EE ./coll &lt;span class=&quot;s2&quot;&gt;&quot;tuck&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;löwe&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;luck&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;zebra&quot;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Estonian&lt;/span&gt;
 luck
 löwe
 zebra
 tuck
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;env &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ICU_LOCALE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sv_SE ./coll &lt;span class=&quot;s2&quot;&gt;&quot;tuck&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;löwe&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;luck&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;zebra&quot;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Swedish&lt;/span&gt;
 luck
 löwe
 tuck
 zebra
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# more complex case: sorting Japanese Kana using the Japanese locale&apos;s gojūon order&lt;/span&gt;
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;env &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ICU_LOCALE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ja ./coll &lt;span class=&quot;s2&quot;&gt;&quot;パンダ&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;ありがとう&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;パソコン&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;さよなら&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;カード&quot;&lt;/span&gt;
 ありがとう
 カード
 さよなら
 パソコン
 パンダ
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To facilitate UTF-8 detection when other encodings may be in use, some platforms annoyingly add a UTF-8 BOM (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EF BB BF&lt;/code&gt;) at the beginning of text files. Microsoft’s Visual Studio is historically a major offender in this regard:&lt;/p&gt;

    &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt; file OldProject.sln
 OldProject.sln: Unicode text, UTF-8 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;with BOM&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; text, with CRLF line terminators
 &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;xxd OldProject.sln | &lt;span class=&quot;nb&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; 1
 00000000: efbb bf0d 0a4d 6963 726f 736f 6674 2056  .....Microsoft V
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;The sequence is simply &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+FEFF&lt;/code&gt;, just like in UTF-16 and UTF-32, but encoded in UTF-8. While the standard does not forbid it per se, it has no real utility besides signaling that the file is in UTF-8 (talking about endianness makes no sense with single-byte code units). Programs that need to parse or operate on UTF-8 encoded files should always be aware that a BOM may be present, and probe for it to avoid exposing users to unnecessary complexity they probably don’t care about.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For all of the reasons listed above, random, array-like access to Unicode strings is almost always broken; this is true even with UTF-32, due to grapheme clusters. It also follows that operations such as string slicing are not trivial to implement correctly, and the way languages such as Python and JavaScript do it (codepoint by codepoint in Python, UTF-16 code unit by code unit in JavaScript) is IMHO problematic.&lt;/p&gt;

    &lt;p&gt;A good example of a modern language that attempts to mitigate this issue is Rust, which has UTF-8 strings that disallow indexed access and only support slicing at byte indices, with UTF-8 validation at runtime:&lt;/p&gt;

    &lt;div class=&quot;language-rust highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;caña&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// error[E0277]: the type `str` cannot be indexed by `{integer}`&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// let c = s[1];&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// char-by-char access requires iterators&lt;/span&gt;
     &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.chars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.nth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;.unwrap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// OK: ñ&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// this will crash the program at runtime:&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// &quot;byte index 3 is not a char boundary; it is inside &apos;ñ&apos; (bytes 2..4) of `caña`&quot;&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;// let slice = &amp;amp;s[1..3];&lt;/span&gt;

     &lt;span class=&quot;c1&quot;&gt;// the user needs to check UTF-8 character bounds beforehand&lt;/span&gt;
     &lt;span class=&quot;nd&quot;&gt;println!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;{}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// OK: &quot;añ&quot;&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;The stabilisation of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.chars()&lt;/code&gt; method took quite a long time, reflecting the fact that deducing what is or is not a character in Unicode is complex and quite controversial. The method itself ended up implementing iteration over Rust’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;char&lt;/code&gt;s (i.e., Unicode scalar values) instead of grapheme clusters, which is rarely what the user wants. The fact that it returns an iterator does at least express that character-by-character access in Unicode is not the “simple” operation developers have long been accustomed to.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;wrapping-up&quot;&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;Unicode is a massive standard, and it’s constantly adding new characters&lt;sup id=&quot;fnref:15&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:15&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;, so for everybody’s safety it’s always best to rely on libraries to provide Unicode support, and if necessary ship fonts that support all the characters you may need (such as &lt;em&gt;Noto Fonts&lt;/em&gt;). As previously mentioned, C and C++ do not provide great support for Unicode, so it’s always best to just use ICU, which is widely supported and shipped by every major OS (&lt;a href=&quot;https://learn.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu-&quot;&gt;including Windows&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;When handling text that may contain non-English characters, it’s always best to stick to UTF-8 when possible and use Unicode-aware libraries for text processing. While writing custom text processing code may seem doable, it’s easy to miss a few corner cases and confuse end users in the process.&lt;/p&gt;

&lt;p&gt;This is especially important because the main users of localized text and applications often tend to be the least technically savvy: those who may lack the ability to understand why the piece of software they are using is misbehaving, and who can’t search for help in a language they don’t understand.&lt;/p&gt;

&lt;p&gt;I hope this article has helped shed some light on what is, in my opinion, an often overlooked topic in software development, especially among C++ users. To be honest, I was striving for a shorter article, but I guess I had to make up for all those years I didn’t post a thing :)&lt;/p&gt;

&lt;p&gt;As always, feel free to comment underneath or send me a message if anything does not look right, and hopefully, the next post will come before 2025…&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This wacky yet ingenious system made it possible to write in Arabic on ASCII-only channels, by using a mixture of Latin script and Western numerals bearing a passing resemblance to letters not present in English (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3&lt;/code&gt; in place of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ع&lt;/code&gt;, …). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Three actually: there was also UTF-1, a variable-width encoding that used 1-byte code units. It was pretty borked, so it never really saw much use. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;32-bit Unicode was initially resisted by both the Unicode consortium and the industry, due to its wastefulness while representing Latin text and everybody’s heavy investment in 16-bit Unicode. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://docs.unrealengine.com/5.2/en-US/character-encoding-in-unreal-engine/&quot;&gt;And they still do it as of today&lt;/a&gt;. They do claim UTF-16 support, but it’s a bald-faced lie given that they don’t support anything outside of the BMP. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It was basically IPv4 all over again. I guess we’ll never learn. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A good example of this is Unreal Engine, which pretends to support UTF-16 even though it is actually UCS-2. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;UCS-2 also had the same issue, and so it was also in practice two different encodings, UCS-2LE and UCS-2BE. My opinions on this matter can thankfully be represented using Unicode itself with codepoint &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U+1F92E&lt;/code&gt;. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Or rather, it is the one Unicode encoding people &lt;em&gt;want&lt;/em&gt; to use, as opposed to UTF-16, which is a scourge we’ll (probably) never get rid of. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I’ve specified “ASCII &lt;em&gt;symbols&lt;/em&gt;” because letters may potentially be part of a grapheme cluster, so splitting on an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;e&lt;/code&gt; may, for instance, split an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;é&lt;/code&gt; in two. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For instance, you most definitely expect that searching for “office” in a PDF also matches the words containing the ligature “ﬁ”—&lt;a href=&quot;https://unicode-org.github.io/icu/userguide/collation/string-search.html&quot;&gt;string search is another tricky topic by itself&lt;/a&gt;. &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And not only that: just think of how hard it would be to find a file, or to check a password or username, if there weren’t ways to verify the canonical equivalence between characters. &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;While most programming languages are somewhat standardizing around UTF-8 encoded source code, C and C++ still don’t have a standard encoding. Modern languages like Rust, Swift and Go also support Unicode in identifiers, which introduces some interesting challenges; see the relevant &lt;a href=&quot;https://www.unicode.org/reports/tr31/tr31-29.html#Parsing&quot;&gt;Unicode specification for identifiers and parsing&lt;/a&gt; for more details. &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:14&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I’ve used Java as an example here because it hits the right spot as a poster child of all the wrong assumptions of the ’90s: it’s old enough to easily provide naive, Western-centric built-in concepts such as “toUpperCase” and “toLowerCase”, while also attempting to implement them in a “Unicode” way. Unicode support in C and C++ is too barebones to really work as an example (despite C and C++ locales being outstandingly broken), and modern languages such as Rust or Go are usually locale agnostic; they also tend to implement case folding in a “saner” way (for instance, Rust only supports it on ASCII in its standard library). &lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:15&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A prime example of this is emojis, which have been ballooning in number since they were first introduced in 2010. &lt;a href=&quot;#fnref:15&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Cross compiling made easy, using Clang and LLVM</title>
   <link href="https://mcilloni.ovh/2021/02/09/cxx-cross-clang/"/>
   <updated>2021-02-09T00:00:00+00:00</updated>
   <id>https://mcilloni.ovh/2021/02/09/cxx-cross-clang</id>
   <content type="html">&lt;p&gt;Anyone who ever tried to cross-compile a C/C++ program knows how big a PITA the whole process could be. The main reasons for this sorry state of things are generally how byzantine build systems tend to be when configuring for cross-compilation, and how messy it is to set-up your cross toolchain in the first place.&lt;/p&gt;

&lt;p&gt;One of the main culprits in my experience has been the GNU toolchain, the decades-old behemoth upon which the POSIXish world has been built for years.
Like many compilers of yore, GCC and its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt; brethren were never designed with the intent to support multiple targets within a single setup; the only supported approach is to install a full cross build for each triple you wish to target on any given host.&lt;/p&gt;

&lt;p&gt;For instance, assuming you wish to build something for FreeBSD on your Linux machine using GCC, you need:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A GCC + binutils install for your host triplet (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x86_64-pc-linux-gnu&lt;/code&gt; or similar);&lt;/li&gt;
  &lt;li&gt;A complete GCC + binutils install for your target triplet (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x86_64-unknown-freebsd12.2-gcc&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;as&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nm&lt;/code&gt;, etc.);&lt;/li&gt;
  &lt;li&gt;A sysroot containing the necessary libraries and headers, which you can either build yourself or promptly steal from a running installation of FreeBSD.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process is sometimes made simpler by Linux distributions or hardware vendors offering a selection of prepackaged compilers, but this will never suffice due to the sheer amount of possible host-target combinations. This sometimes means you have to build the whole toolchain yourself, something that, unless you rock a quite beefy CPU, tends to be a massive waste of time and power.&lt;/p&gt;

&lt;h2 id=&quot;clang-as-a-cross-compiler&quot;&gt;Clang as a cross compiler&lt;/h2&gt;

&lt;p&gt;This annoying limitation is one of the reasons why I got interested in LLVM (and thus Clang), which is by design a full-fledged cross compiler toolchain and is mostly compatible with GNU. A single install can output and compile code &lt;em&gt;for every supported target&lt;/em&gt;, as long as a complete sysroot is available at build time.&lt;/p&gt;

&lt;p&gt;I found this to be a game-changer, and, while it still can’t compete in convenience with modern language toolchains (such as Go’s gc and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GOARCH&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GOOS&lt;/code&gt;), it’s night and day better than the rigmarole of setting up GNU toolchains. You can now just fetch whatever your favorite package management system has available in its repositories (as long as it’s not extremely old), and avoid messing around with multiple installs of GCC.&lt;/p&gt;

&lt;p&gt;Until a few years ago, the whole process wasn’t as smooth as it could be. Due to LLVM not having a full toolchain yet available, you were still supposed to provide a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt; build specific for your target. While this is generally much more tolerable than building the whole compiler (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt; is relatively fast to build), it was still somewhat of a nuisance, and I’m glad that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llvm-mc&lt;/code&gt; (LLVM’s integrated assembler) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lld&lt;/code&gt; (universal linker) are finally stable and as flexible as the rest of LLVM.&lt;/p&gt;

&lt;p&gt;With the toolchain now set, the next step is to obtain a sysroot in order to provide the needed headers and libraries to compile and link for your target.&lt;/p&gt;

&lt;h2 id=&quot;obtaining-a-sysroot&quot;&gt;Obtaining a sysroot&lt;/h2&gt;
&lt;p&gt;A super fast way to find a working system directory for a given OS is to rip it straight out of an existing system (a Docker container image will often also do).
For instance, this is how I used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tar&lt;/code&gt; through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh&lt;/code&gt; as a quick way to extract a working sysroot from a FreeBSD 13-CURRENT AArch64 VM &lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ mkdir ~/farm_tree
$ ssh FARM64 &apos;tar cf - /lib /usr/include /usr/lib /usr/local/lib /usr/local/include&apos; | bsdtar xvf - -C $HOME/farm_tree/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;invoking-the-cross-compiler&quot;&gt;Invoking the cross compiler&lt;/h2&gt;

&lt;p&gt;With everything set, it’s now only a matter of invoking Clang with the right arguments:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt; clang++ &lt;span class=&quot;nt&quot;&gt;--target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;aarch64-pc-freebsd &lt;span class=&quot;nt&quot;&gt;--sysroot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/farm_tree &lt;span class=&quot;nt&quot;&gt;-fuse-ld&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;lld &lt;span class=&quot;nt&quot;&gt;-stdlib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;libc++ &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; zpipe zpipe.cc &lt;span class=&quot;nt&quot;&gt;-lz&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--verbose&lt;/span&gt;
clang version 11.0.1
Target: aarch64-pc-freebsd
Thread model: posix
InstalledDir: /usr/bin
 &lt;span class=&quot;s2&quot;&gt;&quot;/usr/bin/clang-11&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-cc1&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-triple&lt;/span&gt; aarch64-pc-freebsd &lt;span class=&quot;nt&quot;&gt;-emit-obj&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-mrelax-all&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-disable-free&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-disable-llvm-verifier&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-discard-value-names&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-main-file-name&lt;/span&gt; zpipe.cc &lt;span class=&quot;nt&quot;&gt;-mrelocation-model&lt;/span&gt; static &lt;span class=&quot;nt&quot;&gt;-mframe-pointer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;non-leaf &lt;span class=&quot;nt&quot;&gt;-fno-rounding-math&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-mconstructor-aliases&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-munwind-tables&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fno-use-init-array&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-target-cpu&lt;/span&gt; generic &lt;span class=&quot;nt&quot;&gt;-target-feature&lt;/span&gt; +neon &lt;span class=&quot;nt&quot;&gt;-target-abi&lt;/span&gt; aapcs &lt;span class=&quot;nt&quot;&gt;-fallow-half-arguments-and-returns&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fno-split-dwarf-inlining&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-debugger-tuning&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;gdb &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-resource-dir&lt;/span&gt; /usr/lib/clang/11.0.1 &lt;span class=&quot;nt&quot;&gt;-isysroot&lt;/span&gt; /home/marco/farm_tree &lt;span class=&quot;nt&quot;&gt;-internal-isystem&lt;/span&gt; /home/marco/farm_tree/usr/include/c++/v1 &lt;span class=&quot;nt&quot;&gt;-fdeprecated-macro&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fdebug-compilation-dir&lt;/span&gt; /home/marco/dummies/cxx &lt;span 
class=&quot;nt&quot;&gt;-ferror-limit&lt;/span&gt; 19 &lt;span class=&quot;nt&quot;&gt;-fno-signed-char&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fgnuc-version&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4.2.1 &lt;span class=&quot;nt&quot;&gt;-fcxx-exceptions&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fexceptions&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-faddrsig&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /tmp/zpipe-54f1b1.o &lt;span class=&quot;nt&quot;&gt;-x&lt;/span&gt; c++ zpipe.cc
clang &lt;span class=&quot;nt&quot;&gt;-cc1&lt;/span&gt; version 11.0.1 based upon LLVM 11.0.1 default target x86_64-pc-linux-gnu
&lt;span class=&quot;c&quot;&gt;#include &quot;...&quot; search starts here:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#include &amp;lt;...&amp;gt; search starts here:&lt;/span&gt;
 /home/marco/farm_tree/usr/include/c++/v1
 /usr/lib/clang/11.0.1/include
 /home/marco/farm_tree/usr/include
End of search list.
 &lt;span class=&quot;s2&quot;&gt;&quot;/usr/bin/ld.lld&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--sysroot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/home/marco/farm_tree &lt;span class=&quot;nt&quot;&gt;--eh-frame-hdr&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-dynamic-linker&lt;/span&gt; /libexec/ld-elf.so.1 &lt;span class=&quot;nt&quot;&gt;--enable-new-dtags&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; zpipe /home/marco/farm_tree/usr/lib/crt1.o /home/marco/farm_tree/usr/lib/crti.o /home/marco/farm_tree/usr/lib/crtbegin.o &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;/home/marco/farm_tree/usr/lib /tmp/zpipe-54f1b1.o &lt;span class=&quot;nt&quot;&gt;-lz&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lc&lt;/span&gt;++ &lt;span class=&quot;nt&quot;&gt;-lm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lgcc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--as-needed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lgcc_s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--no-as-needed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lgcc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--as-needed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lgcc_s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--no-as-needed&lt;/span&gt; /home/marco/farm_tree/usr/lib/crtend.o /home/marco/farm_tree/usr/lib/crtn.o
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file zpipe
zpipe: ELF 64-bit LSB executable, ARM aarch64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;FreeBSD&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, dynamically linked, interpreter /libexec/ld-elf.so.1, &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;FreeBSD 13.0 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1300136&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, FreeBSD-style, with debug_info, not stripped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the snippet above, I have managed to compile and link a C++ file into an executable for AArch64 FreeBSD, all while using just the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lld&lt;/code&gt; I had already installed on my GNU/Linux system.&lt;/p&gt;

&lt;p&gt;In more detail:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--target&lt;/code&gt; switches the LLVM default target (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x86_64-pc-linux-gnu&lt;/code&gt;) to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64-pc-freebsd&lt;/code&gt;, thus enabling cross-compilation.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--sysroot&lt;/code&gt; forces Clang to assume the specified path as root when searching headers and libraries, instead of the usual paths. Note that sometimes this setting might not be enough, especially if the target uses GCC and Clang somehow fails to detect its install path. This can be easily fixed by specifying &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--gcc-toolchain&lt;/code&gt;, which clarifies where to search for GCC installations.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fuse-ld=lld&lt;/code&gt; tells Clang to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lld&lt;/code&gt; instead of whatever default the platform uses. As I will explain below, it’s highly unlikely that the system linker understands foreign targets, while LLD can natively support &lt;em&gt;almost&lt;/em&gt; every binary format and OS &lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-stdlib=libc++&lt;/code&gt; is needed here due to Clang failing to detect that FreeBSD on AArch64 uses LLVM’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libc++&lt;/code&gt; instead of GCC’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libstdc++&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-lz&lt;/code&gt; is specified to show that Clang can also resolve other libraries inside the sysroot without issues, in this case &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zlib&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The final test is now to copy the binary to our target system (i.e. the VM we ripped the sysroot from before) and check if it works as expected:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;rsync zpipe FARM64:&lt;span class=&quot;s2&quot;&gt;&quot;~&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;ssh FARM64
FreeBSD-ARM64-VM &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;chmod&lt;/span&gt; +x zpipe
FreeBSD-ARM64-VM &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;ldd zpipe
zpipe:
        libz.so.6 &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; /lib/libz.so.6 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0x4029e000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
        libc++.so.1 &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; /usr/lib/libc++.so.1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0x402e4000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
        libcxxrt.so.1 &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; /lib/libcxxrt.so.1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0x403da000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
        libm.so.5 &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; /lib/libm.so.5 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0x40426000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
        libc.so.7 &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; /lib/libc.so.7 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0x40491000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
        libgcc_s.so.1 &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; /lib/libgcc_s.so.1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0x408aa000&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
FreeBSD-ARM64-VM &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./zpipe &lt;span class=&quot;nt&quot;&gt;-h&lt;/span&gt;
zpipe usage: zpipe &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &amp;lt; &lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; dest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Success! It’s now possible to use this cross toolchain to build larger programs, and below I’ll give a quick example of how to use it to build real projects.&lt;/p&gt;

&lt;h3 id=&quot;optional-creating-an-llvm-toolchain-directory&quot;&gt;Optional: creating an LLVM toolchain directory&lt;/h3&gt;

&lt;p&gt;LLVM provides a mostly compatible counterpart for almost every tool shipped by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt; (with the notable exception of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;as&lt;/code&gt; &lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;), prefixed with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llvm-&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The most critical of these is LLD, which is a drop-in replacement for a platform’s system linker, capable of replacing both GNU &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ld.bfd&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gold&lt;/code&gt; on GNU/Linux or BSD, and Microsoft’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LINK.EXE&lt;/code&gt; when targeting MSVC. It supports linking on (almost) every platform supported by LLVM, thus removing the nuisance of having multiple specific linkers installed.&lt;/p&gt;

&lt;p&gt;Both GCC and Clang support using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ld.lld&lt;/code&gt; instead of the system linker (which may well be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lld&lt;/code&gt;, like on FreeBSD) via the command line switch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fuse-ld=lld&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In my experience, I found that Clang’s driver might get confused when picking the right linker on some uncommon platforms, especially before version 11.0.
For some reason, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang&lt;/code&gt; sometimes decided to outright ignore the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fuse-ld=lld&lt;/code&gt; switch and picked the system linker (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ld.bfd&lt;/code&gt; in my case), which does not support AArch64.&lt;/p&gt;

&lt;p&gt;A fast solution to this is to create a toolchain directory containing symlinks that rename the LLVM utilities to the standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt; programs:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-la&lt;/span&gt; ~/.llvm/bin/
Permissions Size User  Group Date Modified Name
lrwxrwxrwx    16 marco marco  3 Aug  2020  ar -&amp;gt; /usr/bin/llvm-ar
lrwxrwxrwx    12 marco marco  6 Aug  2020  ld -&amp;gt; /usr/bin/lld
lrwxrwxrwx    21 marco marco  3 Aug  2020  objcopy -&amp;gt; /usr/bin/llvm-objcopy
lrwxrwxrwx    21 marco marco  3 Aug  2020  objdump -&amp;gt; /usr/bin/llvm-objdump
lrwxrwxrwx    20 marco marco  3 Aug  2020  ranlib -&amp;gt; /usr/bin/llvm-ranlib
lrwxrwxrwx    21 marco marco  3 Aug  2020  strings -&amp;gt; /usr/bin/llvm-strings
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
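
&lt;p&gt;The listing only shows the end result. A minimal sketch of how such a directory could be put together follows; it assumes the LLVM tools live in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/usr/bin&lt;/code&gt;, as they do on my system (adjust the paths if your distro installs them elsewhere), and it relies on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lld&lt;/code&gt; choosing its linker flavour from the name it is invoked with:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;# create the toolchain directory (same path as in the listing above)
mkdir -p &quot;$HOME/.llvm/bin&quot;

# symlink each LLVM tool under the name of its binutils counterpart
for tool in ar objcopy objdump ranlib strings; do
    ln -sf &quot;/usr/bin/llvm-$tool&quot; &quot;$HOME/.llvm/bin/$tool&quot;
done

# the generic lld driver is exposed under the plain name ld
ln -sf /usr/bin/lld &quot;$HOME/.llvm/bin/ld&quot;
&lt;/code&gt;&lt;/pre&gt;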

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-B&lt;/code&gt; switch can then be used to force Clang (or GCC) to search for the required tools in this directory, stopping the issue from ever occurring:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt; clang++ &lt;span class=&quot;nt&quot;&gt;-B&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/.llvm/bin &lt;span class=&quot;nt&quot;&gt;-stdlib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;libc++ &lt;span class=&quot;nt&quot;&gt;--target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;aarch64-pc-freebsd &lt;span class=&quot;nt&quot;&gt;--sysroot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/farm_tree &lt;span class=&quot;nt&quot;&gt;-std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;c++17 &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; mvd-farm64 mvd.cc
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file mvd-farm64
mvd-farm64: ELF 64-bit LSB executable, ARM aarch64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;FreeBSD&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, dynamically linked, interpreter /libexec/ld-elf.so.1, &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;FreeBSD 13.0, FreeBSD-style, with debug_info, not stripped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;optional-creating-clang-wrappers-to-simplify-cross-compilation&quot;&gt;Optional: creating Clang wrappers to simplify cross-compilation&lt;/h3&gt;

&lt;p&gt;I happened to notice that certain build systems (and by &lt;em&gt;“certain”&lt;/em&gt; I mean some poorly written &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Makefile&lt;/code&gt;s and sometimes Autotools) have a tendency to misbehave when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$CC&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$CXX&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$LD&lt;/code&gt; contain spaces or multiple parameters. This might become a recurrent issue if we need to invoke &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang&lt;/code&gt; with several arguments. &lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Given also how unwieldy it is to remember to write all of the parameters correctly everywhere, I usually write quick wrappers for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang++&lt;/code&gt; in order to simplify building for a certain target:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; ~/.local/bin/aarch64-pc-freebsd-clang
&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env sh&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; /usr/bin/clang &lt;span class=&quot;nt&quot;&gt;-B&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/.llvm/bin &lt;span class=&quot;nt&quot;&gt;--target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;aarch64-pc-freebsd &lt;span class=&quot;nt&quot;&gt;--sysroot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/farm_tree &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$@&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; ~/.local/bin/aarch64-pc-freebsd-clang++
&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env sh&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; /usr/bin/clang++ &lt;span class=&quot;nt&quot;&gt;-B&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/.llvm/bin &lt;span class=&quot;nt&quot;&gt;-stdlib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;libc++ &lt;span class=&quot;nt&quot;&gt;--target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;aarch64-pc-freebsd &lt;span class=&quot;nt&quot;&gt;--sysroot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/farm_tree &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$@&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;	
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If created in a directory inside $PATH and marked as executable, these scripts can be used anywhere as standalone commands:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;aarch64-pc-freebsd-clang++ &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; tst tst.cc &lt;span class=&quot;nt&quot;&gt;-static&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file tst
tst: ELF 64-bit LSB executable, ARM aarch64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;FreeBSD&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, statically linked, &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;FreeBSD 13.0 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1300136&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, FreeBSD-style, with debug_info, not stripped

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;cross-building-with-autotools-cmake-and-meson&quot;&gt;Cross-building with Autotools, CMake and Meson&lt;/h2&gt;

&lt;p&gt;Autotools, CMake, and Meson are arguably the most popular build systems for C and C++ open source projects (sorry, SCons).
All three support cross-compiling out of the box, albeit with some caveats.&lt;/p&gt;

&lt;h3 id=&quot;autotools&quot;&gt;Autotools&lt;/h3&gt;

&lt;p&gt;Over the years, Autotools has been famous for being horrendously clunky and breaking easily. While this reputation is definitely well earned, it’s still widely used by most large GNU projects. Given it’s been around for decades, it’s quite easy to find support online when something goes awry (sadly, the same is not true when writing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.ac&lt;/code&gt; files). Unlike its more modern brethren, it doesn’t require any toolchain file or extra configuration when cross compiling, being driven only by command line options.&lt;/p&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./configure&lt;/code&gt; script (either generated by autoconf or shipped in a tarball alongside the source code) &lt;em&gt;usually&lt;/em&gt; supports the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--host&lt;/code&gt; flag, allowing the user to specify the triple of the &lt;em&gt;host&lt;/em&gt; on which the final artifacts are meant to be run.&lt;/p&gt;

&lt;p&gt;This flag activates cross compilation, and causes the &lt;em&gt;“auto-something”&lt;/em&gt; array of tools to try to detect the correct compiler for the target, which it generally assumes to be called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;some-triple-gcc&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;some-triple-g++&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For instance, let’s try to configure &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt; version 2.35.1 for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64-pc-freebsd&lt;/code&gt;, using the Clang wrapper introduced above:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xvf binutils-2.35.1.tar.xz
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;binutils-2.35.1/build &lt;span class=&quot;c&quot;&gt;# always create a build directory to avoid messing up the source tree&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;binutils-2.35.1/build
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;env &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;aarch64-pc-freebsd-clang&apos;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;CXX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;aarch64-pc-freebsd-clang++&apos;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;AR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;llvm-ar ../configure &lt;span class=&quot;nt&quot;&gt;--build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;x86_64-pc-linux-gnu &lt;span class=&quot;nt&quot;&gt;--host&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;aarch64-pc-freebsd &lt;span class=&quot;nt&quot;&gt;--enable-gold&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking build system type... x86_64-pc-linux-gnu
checking host system type... aarch64-pc-freebsd
checking target system type... aarch64-pc-freebsd
checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;a BSD-compatible install... /usr/bin/install &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt;
checking whether &lt;span class=&quot;nb&quot;&gt;ln &lt;/span&gt;works... &lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking whether &lt;span class=&quot;nb&quot;&gt;ln&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; works... &lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;a &lt;span class=&quot;nb&quot;&gt;sed &lt;/span&gt;that does not &lt;span class=&quot;nb&quot;&gt;truncate &lt;/span&gt;output... /usr/bin/sed
checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;gawk... gawk
checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;aarch64-pc-freebsd-gcc... aarch64-pc-freebsd-clang
checking whether the C compiler works... &lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;C compiler default output file name... a.out
checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;suffix of executables...
checking whether we are cross compiling... &lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;suffix of object files... o
checking whether we are using the GNU C compiler... &lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking whether aarch64-pc-freebsd-clang accepts &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt;... &lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;aarch64-pc-freebsd-clang option to accept ISO C89... none needed
checking whether we are using the GNU C++ compiler... &lt;span class=&quot;nb&quot;&gt;yes
&lt;/span&gt;checking whether aarch64-pc-freebsd-clang++ accepts &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt;... &lt;span class=&quot;nb&quot;&gt;yes&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The invocation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./configure&lt;/code&gt; above specifies that I want Autotools to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Configure for building on an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x86_64-pc-linux-gnu&lt;/code&gt; machine (which I specified using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--build&lt;/code&gt;);&lt;/li&gt;
  &lt;li&gt;Build binaries that will run on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64-pc-freebsd&lt;/code&gt;, using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--host&lt;/code&gt; switch;&lt;/li&gt;
  &lt;li&gt;Use the Clang wrappers made above as C and C++ compilers;&lt;/li&gt;
  &lt;li&gt;Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llvm-ar&lt;/code&gt; as the target &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ar&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I also asked it to build the Gold linker, which is written in C++ and is a good test of how well our improvised toolchain handles compiling C++.&lt;/p&gt;

&lt;p&gt;If the configuration step doesn’t fail for some reason (it shouldn’t), it’s now time to run GNU Make to build &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;make &lt;span class=&quot;nt&quot;&gt;-j16&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# because I have 16 threads on my system&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; lots of output]
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;dest
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;make &lt;span class=&quot;nv&quot;&gt;DESTDIR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PWD&lt;/span&gt;/dest &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# install into a fake tree&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There should now be executable files and libraries inside of the fake tree generated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make install&lt;/code&gt;. A quick test using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;file&lt;/code&gt; confirms they have been correctly built for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64-pc-freebsd&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file dest/usr/local/bin/ld.gold
dest/usr/local/bin/ld.gold: ELF 64-bit LSB executable, ARM aarch64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;FreeBSD&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, dynamically linked, interpreter /libexec/ld-elf.so.1, &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;FreeBSD 13.0 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1300136&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, FreeBSD-style, with debug_info, not stripped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;cmake&quot;&gt;CMake&lt;/h3&gt;

&lt;p&gt;The simplest way to get CMake to configure for an arbitrary target is to write a &lt;em&gt;toolchain file&lt;/em&gt;. These usually consist of a list of declarations that instruct CMake on how it is supposed to use a given toolchain, specifying parameters like the target operating system, the CPU architecture, the name of the C++ compiler, and such.&lt;/p&gt;

&lt;p&gt;One reasonable toolchain file for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64-pc-freebsd&lt;/code&gt; triple could be written as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-CMake&quot;&gt;set(CMAKE_SYSTEM_NAME FreeBSD)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(CMAKE_SYSROOT $ENV{HOME}/farm_tree)

set(CMAKE_C_COMPILER aarch64-pc-freebsd-clang)
set(CMAKE_CXX_COMPILER aarch64-pc-freebsd-clang++)
set(CMAKE_AR llvm-ar)

# these variables tell CMake to avoid using any binary it finds in 
# the sysroot, while picking headers and libraries exclusively from it 
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this file, I specified the wrapper created above as the cross compiler for C and C++ for the target. It should be possible to also use plain Clang with the right arguments, but it’s much less straightforward and potentially more error-prone.&lt;/p&gt;
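
&lt;p&gt;For reference, here is a rough and untested sketch of what the plain Clang variant might look like, written out with a shell heredoc. The file name is made up for this example. It relies on CMake’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CMAKE_&amp;lt;LANG&amp;gt;_COMPILER_TARGET&lt;/code&gt; variables to pass the target triple to Clang, and moves the extra flags the wrappers used to add into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*_FLAGS_INIT&lt;/code&gt; variables; whether this is enough for a given project is an assumption you should verify:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;# write a hypothetical toolchain file that uses plain clang/clang++;
# the quoted heredoc delimiter keeps $ENV{HOME} intact for CMake
cat &gt; &quot;$HOME/toolchain-aarch64-freebsd-plain.cmake&quot; &lt;&lt;&apos;EOF&apos;
set(CMAKE_SYSTEM_NAME FreeBSD)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(CMAKE_SYSROOT $ENV{HOME}/farm_tree)

# CMake turns these into --target=... for Clang
set(CMAKE_C_COMPILER clang)
set(CMAKE_C_COMPILER_TARGET aarch64-pc-freebsd)
set(CMAKE_CXX_COMPILER clang++)
set(CMAKE_CXX_COMPILER_TARGET aarch64-pc-freebsd)
set(CMAKE_AR llvm-ar)

# flags that the wrapper scripts would otherwise provide
set(CMAKE_CXX_FLAGS_INIT &quot;-stdlib=libc++&quot;)
set(CMAKE_EXE_LINKER_FLAGS_INIT &quot;-fuse-ld=lld&quot;)
set(CMAKE_SHARED_LINKER_FLAGS_INIT &quot;-fuse-ld=lld&quot;)
set(CMAKE_MODULE_LINKER_FLAGS_INIT &quot;-fuse-ld=lld&quot;)

set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
EOF
&lt;/code&gt;&lt;/pre&gt;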

&lt;p&gt;In any case, it is &lt;em&gt;very&lt;/em&gt; important to indicate the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CMAKE_SYSROOT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CMAKE_FIND_ROOT_PATH_MODE_*&lt;/code&gt; variables, or otherwise CMake could wrongly pick packages from the host with disastrous results.&lt;/p&gt;

&lt;p&gt;It is now only a matter of setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CMAKE_TOOLCHAIN_FILE&lt;/code&gt; with the path to the toolchain file when configuring a project. To better illustrate this, I will now also build &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{fmt}&lt;/code&gt; (which is an amazing C++ library you should definitely use) for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64-pc-freebsd&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt; git clone https://github.com/fmtlib/fmt
Cloning into &lt;span class=&quot;s1&quot;&gt;&apos;fmt&apos;&lt;/span&gt;...
remote: Enumerating objects: 45, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
remote: Counting objects: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;45/45&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
remote: Compressing objects: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;33/33&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
remote: Total 24446 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;delta 17&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, reused 12 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;delta 7&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, pack-reused 24401
Receiving objects: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;24446/24446&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, 12.08 MiB | 2.00 MiB/s, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
Resolving deltas: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;16551/16551&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd fmt&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;cmake &lt;span class=&quot;nt&quot;&gt;-B&lt;/span&gt; build &lt;span class=&quot;nt&quot;&gt;-G&lt;/span&gt; Ninja &lt;span class=&quot;nt&quot;&gt;-DCMAKE_TOOLCHAIN_FILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/toolchain-aarch64-freebsd.cmake &lt;span class=&quot;nt&quot;&gt;-DBUILD_SHARED_LIBS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ON &lt;span class=&quot;nt&quot;&gt;-DFMT_TEST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;OFF &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; CMake version: 3.19.4
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; The CXX compiler identification is Clang 11.0.1
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Detecting CXX compiler ABI info
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Detecting CXX compiler ABI info - &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Check &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;working CXX compiler: /home/marco/.local/bin/aarch64-pc-freebsd-clang++ - skipped
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Detecting CXX compile features
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Detecting CXX compile features - &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Version: 7.1.3
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Build &lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;: Release
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; CXX_STANDARD: 11
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test has_std_11_flag
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test has_std_11_flag - Success
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test has_std_0x_flag
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test has_std_0x_flag - Success
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test SUPPORTS_USER_DEFINED_LITERALS
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test SUPPORTS_USER_DEFINED_LITERALS - Success
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test FMT_HAS_VARIANT
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test FMT_HAS_VARIANT - Success
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Required features: cxx_variadic_templates
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test HAS_NULLPTR_WARNING
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Performing Test HAS_NULLPTR_WARNING - Success
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Looking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;strtod_l
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Looking &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;strtod_l - not found
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Configuring &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Generating &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Build files have been written to: /home/marco/fmt/build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Compared with Autotools, the command line passed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cmake&lt;/code&gt; is very simple and doesn’t need much explanation. After the configuration step is finished, it’s only a matter of compiling the project and getting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ninja&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make&lt;/code&gt; to install the resulting artifacts somewhere.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;cmake &lt;span class=&quot;nt&quot;&gt;--build&lt;/span&gt; build
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;4/4] Creating library symlink libfmt.so.7 libfmt.so
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;dest
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;env &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;DESTDIR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PWD&lt;/span&gt;/dest cmake &lt;span class=&quot;nt&quot;&gt;--build&lt;/span&gt; build &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;0/1] Install the project...
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Install configuration: &lt;span class=&quot;s2&quot;&gt;&quot;Release&quot;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/libfmt.so.7.1.3
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/libfmt.so.7
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/libfmt.so
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-config.cmake
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-config-version.cmake
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-targets.cmake
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/cmake/fmt/fmt-targets-release.cmake
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/args.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/chrono.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/color.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/compile.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/core.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/format.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/format-inl.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/locale.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/os.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/ostream.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/posix.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/printf.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/include/fmt/ranges.h
&lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; Installing: /home/marco/fmt/dest/usr/local/lib/pkgconfig/fmt.pc
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file dest/usr/local/lib/libfmt.so.7.1.3
dest/usr/local/lib/libfmt.so.7.1.3: ELF 64-bit LSB shared object, ARM aarch64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;FreeBSD&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, dynamically linked, &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;FreeBSD 13.0 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1300136&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, with debug_info, not stripped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;meson&quot;&gt;Meson&lt;/h3&gt;

&lt;p&gt;Like CMake, Meson relies on toolchain files (here called &lt;em&gt;“cross files”&lt;/em&gt;) to specify which tools should be used when building for a given target. Thanks to being written in a TOML-like language, they are very straightforward:&lt;/p&gt;

&lt;div class=&quot;language-meson highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;err&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meson_aarch64_fbsd_cross&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;txt&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;binaries&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/home/marco/.local/bin/aarch64-pc-freebsd-clang&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cpp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/home/marco/.local/bin/aarch64-pc-freebsd-clang++&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ld&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/usr/bin/ld.lld&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ar&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/usr/bin/llvm-ar&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;objcopy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/usr/bin/llvm-objcopy&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;strip&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/usr/bin/llvm-strip&apos;&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;properties&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ld_args&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;--sysroot=/home/marco/farm_tree&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;host_machine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;system&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;freebsd&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cpu_family&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;aarch64&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cpu&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;aarch64&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;endian&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;little&apos;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This cross-file can then be specified to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;meson setup&lt;/code&gt; using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--cross-file&lt;/code&gt; option &lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, with everything else remaining the same as with every other Meson build.&lt;/p&gt;

&lt;p&gt;And, well, this is basically it: like with CMake, the whole process is relatively painless and foolproof. 
For the sake of completeness, this is how to build &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dav1d&lt;/code&gt;, VideoLAN’s AV1 decoder, for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aarch64-pc-freebsd&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;git clone https://code.videolan.org/videolan/dav1d
Cloning into &lt;span class=&quot;s1&quot;&gt;&apos;dav1d&apos;&lt;/span&gt;...
warning: redirecting to https://code.videolan.org/videolan/dav1d.git/
remote: Enumerating objects: 164, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
remote: Counting objects: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;164/164&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
remote: Compressing objects: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;91/91&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
remote: Total 9377 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;delta 97&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, reused 118 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;delta 71&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, pack-reused 9213
Receiving objects: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;9377/9377&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, 3.42 MiB | 54.00 KiB/s, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
Resolving deltas: 100% &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;7068/7068&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, &lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;meson setup build &lt;span class=&quot;nt&quot;&gt;--cross-file&lt;/span&gt; ../meson_aarch64_fbsd_cross.txt &lt;span class=&quot;nt&quot;&gt;--buildtype&lt;/span&gt; release
The Meson build system
Version: 0.56.2
Source &lt;span class=&quot;nb&quot;&gt;dir&lt;/span&gt;: /home/marco/dav1d
Build &lt;span class=&quot;nb&quot;&gt;dir&lt;/span&gt;: /home/marco/dav1d/build
Build &lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;: cross build
Project name: dav1d
Project version: 0.8.1
C compiler &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;the host machine: /home/marco/.local/bin/aarch64-pc-freebsd-clang &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;clang 11.0.1 &lt;span class=&quot;s2&quot;&gt;&quot;clang version 11.0.1&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
C linker &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;the host machine: /home/marco/.local/bin/aarch64-pc-freebsd-clang ld.lld 11.0.1
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; output &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;meson compile &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; build
Found runner: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;/usr/bin/ninja&apos;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
ninja: Entering directory &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;build&lt;span class=&quot;s1&quot;&gt;&apos;
[129/129] Linking target tests/seek_stress
$ mkdir dest
$ env DESTDIR=$PWD/dest meson install -C build
ninja: Entering directory `build&apos;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;1/11] Generating vcs_version.h with a custom &lt;span class=&quot;nb&quot;&gt;command
&lt;/span&gt;Installing src/libdav1d.so.5.0.1 to /home/marco/dav1d/dest/usr/local/lib
Installing tools/dav1d to /home/marco/dav1d/dest/usr/local/bin
Installing /home/marco/dav1d/include/dav1d/common.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/data.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/dav1d.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/headers.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/include/dav1d/picture.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/build/include/dav1d/version.h to /home/marco/dav1d/dest/usr/local/include/dav1d
Installing /home/marco/dav1d/build/meson-private/dav1d.pc to /home/marco/dav1d/dest/usr/local/lib/pkgconfig
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file dest/usr/local/bin/dav1d
dest/usr/local/bin/dav1d: ELF 64-bit LSB executable, ARM aarch64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;FreeBSD&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, dynamically linked, interpreter /libexec/ld-elf.so.1, &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;FreeBSD 13.0 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1300136&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, FreeBSD-style, with debug_info, not stripped

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;bonus-static-linking-with-musl-and-alpine-linux&quot;&gt;Bonus: static linking with musl and Alpine Linux&lt;/h2&gt;

&lt;p&gt;Statically linking a C or C++ program can sometimes save you a lot of library compatibility headaches, especially when you can’t control what’s going to be installed on whatever you plan to target.
Building static binaries is however quite complex on GNU/Linux, due to Glibc actively discouraging people from linking it statically. &lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Musl is a very compatible standard library implementation for Linux that plays much nicer with static linking, and it is now shipped by most major distributions. These packages are often enough to build your code statically, at least as long as you plan to stick with plain C.&lt;/p&gt;

&lt;p&gt;The situation gets much more complicated if you plan to use C++, or if you need additional components. Any library shipped by a GNU/Linux system (like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libstdc++&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libz&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libffi&lt;/code&gt; and so on) is usually only built for Glibc, meaning that any library you wish to use must be rebuilt to target Musl. This also applies to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libstdc++&lt;/code&gt;, which inevitably means either recompiling GCC or building a copy of LLVM’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libc++&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Thankfully, there are several distributions out there that target &lt;em&gt;“Musl-plus-Linux”&lt;/em&gt;, everyone’s favorite being Alpine Linux. It is thus possible to apply the same strategy we used above to obtain an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x86_64-pc-linux-musl&lt;/code&gt; sysroot complete with libraries and packages built for Musl, which can then be used by Clang to generate 100% static executables.&lt;/p&gt;

&lt;h3 id=&quot;setting-up-an-alpine-container&quot;&gt;Setting up an Alpine container&lt;/h3&gt;

&lt;p&gt;A good starting point is the minirootfs tarball provided by Alpine, which is meant for containers and tends to be very small:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;wget &lt;span class=&quot;nt&quot;&gt;-qO&lt;/span&gt; - https://dl-cdn.alpinelinux.org/alpine/v3.13/releases/x86_64/alpine-minirootfs-3.13.1-x86_64.tar.gz | &lt;span class=&quot;nb&quot;&gt;gunzip&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tar &lt;/span&gt;xfp - &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; ~/alpine_tree
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is now possible to chroot inside the image in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/alpine_tree&lt;/code&gt; and set it up, installing all the packages you may need. 
I generally prefer to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-nspawn&lt;/code&gt; in lieu of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chroot&lt;/code&gt;, as it is vastly better and less error-prone. &lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;systemd-nspawn &lt;span class=&quot;nt&quot;&gt;-D&lt;/span&gt; alpine_tree
Spawning container alpinetree on /home/marco/alpine_tree.
Press ^] three &lt;span class=&quot;nb&quot;&gt;times &lt;/span&gt;within 1s to &lt;span class=&quot;nb&quot;&gt;kill &lt;/span&gt;container.
alpinetree:~# 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can now (optionally) switch to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;edge&lt;/code&gt; branch of Alpine for newer packages by editing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/apk/repositories&lt;/code&gt;, and then install the packages containing any static libraries needed by the code we want to build:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;alpinetree:~# &lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; /etc/apk/repositories
https://dl-cdn.alpinelinux.org/alpine/edge/main
https://dl-cdn.alpinelinux.org/alpine/edge/community
alpinetree:~# apk update
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
v3.13.0-1030-gbabf0a1684 &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;https://dl-cdn.alpinelinux.org/alpine/edge/main]
v3.13.0-1035-ga3ac7373fd &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;https://dl-cdn.alpinelinux.org/alpine/edge/community]
OK: 14029 distinct packages available
alpinetree:~# apk upgrade
OK: 6 MiB &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;14 packages
alpinetree:~# apk add g++ libc-dev
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing libgcc &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10.2.1_pre1-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;2/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing libstdc++ &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10.2.1_pre1-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;3/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing binutils &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;2.35.1-r1&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;4/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing libgomp &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10.2.1_pre1-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;5/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing libatomic &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10.2.1_pre1-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;6/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing libgphobos &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10.2.1_pre1-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;7/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing gmp &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;6.2.1-r0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;8/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing isl22 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0.22-r0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;9/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing mpfr4 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;4.1.0-r0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing mpc1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1.2.1-r0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;11/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing gcc &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10.2.1_pre1-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;12/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing musl-dev &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1.2.2-r1&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;13/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing libc-dev &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;0.7.2-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;14/14&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing g++ &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;10.2.1_pre1-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Executing busybox-1.33.0-r1.trigger
OK: 188 MiB &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;28 packages
alpinetree:~# apk add zlib-dev zlib-static
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1/3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing pkgconf &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1.7.3-r0&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;2/3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing zlib-dev &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1.2.11-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;3/3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; Installing zlib-static &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;1.2.11-r3&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Executing busybox-1.33.0-r1.trigger
OK: 189 MiB &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;31 packages
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case I installed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;g++&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libc-dev&lt;/code&gt; in order to get a static copy of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libstdc++&lt;/code&gt;, a static &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libc.a&lt;/code&gt; (Musl) and their respective headers. I also installed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zlib-dev&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zlib-static&lt;/code&gt; to get zlib’s headers and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libz.a&lt;/code&gt;, respectively. 
As a general rule, Alpine usually ships static libraries in separate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-static&lt;/code&gt; packages, and headers in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;somepackage-dev&lt;/code&gt;. &lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
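
&lt;p&gt;To double-check which static archives actually ended up in the sysroot, a small shell function along these lines can be sourced on the host. This is only a sketch: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;list_static_libs&lt;/code&gt; name is made up for this example, and it assumes the sysroot lives in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/alpine_tree&lt;/code&gt; unless an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ALPINE_DIR&lt;/code&gt; environment variable says otherwise.&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper (not part of Alpine or Clang): list the static
# archives (lib*.a) found in the library directories of the sysroot.
# ALPINE_DIR is an assumed variable, defaulting to ~/alpine_tree.
ALPINE_DIR="${ALPINE_DIR:-$HOME/alpine_tree}"

list_static_libs() {
    for dir in "$ALPINE_DIR/usr/lib" "$ALPINE_DIR/lib"; do
        # Skip library directories that do not exist in this sysroot
        [ -d "$dir" ] || continue
        # Print only the file names of the archives, not the full paths
        find "$dir" -maxdepth 1 -name 'lib*.a' -exec basename {} \;
    done | sort -u
}
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;list_static_libs&lt;/code&gt; after installing a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-static&lt;/code&gt; package should show the matching &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.a&lt;/code&gt; file; if it doesn’t, linking that library with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-static&lt;/code&gt; will most likely fail.&lt;/p&gt;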

&lt;p&gt;Also, remember every once in a while to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apk upgrade&lt;/code&gt; inside the &lt;em&gt;sysroot&lt;/em&gt; in order to keep the local Alpine install up to date.&lt;/p&gt;
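
&lt;p&gt;The upgrade doesn’t need an interactive shell inside the container, since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-nspawn&lt;/code&gt; accepts a command to run after its options. A minimal sketch, assuming the sysroot lives in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/alpine_tree&lt;/code&gt;; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;upgrade_sysroot&lt;/code&gt; function and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RUN&lt;/code&gt; variable are names invented for this example:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper: refresh the package index and upgrade the Alpine
# sysroot without opening an interactive shell inside the container.
# ALPINE_DIR is an assumed variable, defaulting to ~/alpine_tree.
ALPINE_DIR="${ALPINE_DIR:-$HOME/alpine_tree}"
# Set RUN=echo to print the commands instead of executing them.
RUN="${RUN:-sudo}"

upgrade_sysroot() {
    # Stop early if the index refresh fails
    $RUN systemd-nspawn -D "$ALPINE_DIR" apk update || return 1
    $RUN systemd-nspawn -D "$ALPINE_DIR" apk upgrade
}
```

&lt;p&gt;Setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RUN=echo&lt;/code&gt; before calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;upgrade_sysroot&lt;/code&gt; prints the two commands instead of running them, which is handy to check what would be executed as root.&lt;/p&gt;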

&lt;h3 id=&quot;compiling-static-c-programs&quot;&gt;Compiling static C++ programs&lt;/h3&gt;

&lt;p&gt;With everything now set, it’s only a matter of running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang++&lt;/code&gt; with the right &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--target&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--sysroot&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;clang++ &lt;span class=&quot;nt&quot;&gt;-B&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/.llvm/bin &lt;span class=&quot;nt&quot;&gt;--gcc-toolchain&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/alpine_tree/usr &lt;span class=&quot;nt&quot;&gt;--target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;x86_64-alpine-linux-musl &lt;span class=&quot;nt&quot;&gt;--sysroot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/alpine_tree &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/alpine_tree/lib &lt;span class=&quot;nt&quot;&gt;-std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;c++17 &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; zpipe zpipe.cc &lt;span class=&quot;nt&quot;&gt;-lz&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-static&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file zpipe
zpipe: ELF 64-bit LSB executable, x86-64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;SYSV&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, statically linked, with debug_info, not stripped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The extra &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--gcc-toolchain&lt;/code&gt; is optional, but may help solve issues where compilation fails due to Clang not detecting where GCC and the various crt*.o files reside in the sysroot.
The extra &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-L&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/lib&lt;/code&gt; is required because Alpine splits its libraries between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/usr/lib&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/lib&lt;/code&gt;, and the latter is not automatically picked up by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang&lt;/code&gt;, which usually expects libraries to be located in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$SYSROOT/usr/lib&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;writing-a-wrapper-for-static-linking-with-musl-and-clang&quot;&gt;Writing a wrapper for static linking with Musl and Clang&lt;/h3&gt;

&lt;p&gt;Musl packages usually come with the upstream-provided shims &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musl-gcc&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musl-clang&lt;/code&gt;, which wrap the system compilers in order to build and link with the alternative libc. 
In order to provide a similar level of convenience, I quickly whipped up the following Perl script:&lt;/p&gt;

&lt;div class=&quot;language-perl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;#!/usr/bin/env perl&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;strict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;utf8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;warnings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;v5&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;.30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;use&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;List::&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;Util&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;any&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&apos;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;my&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$ALPINE_DIR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$ENV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ALPINE_DIR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;sr&quot;&gt;//&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$ENV&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;{HOME}/alpine_tree&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&quot;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;my&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$TOOLS_DIR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$ENV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;TOOLS_DIR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;sr&quot;&gt;//&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$ENV&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;{HOME}/.llvm/bin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&quot;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;my&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CMD_NAME&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=~&lt;/span&gt; &lt;span class=&quot;sr&quot;&gt;/\+\+/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;clang++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&apos;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;clang&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&apos;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;my&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$STATIC&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=~&lt;/span&gt; &lt;span class=&quot;sr&quot;&gt;/static/&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;sub &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;clang&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CMD_NAME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;sub &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;my&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$compile&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;any&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;sr&quot;&gt;/^\s*(?:-c|-S)\s*$/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;@ARGV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;my&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;@args&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
		 &lt;span class=&quot;sx&quot;&gt;qq{-B$TOOLS_DIR}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		 &lt;span class=&quot;sx&quot;&gt;qq{--gcc-toolchain=$ALPINE_DIR/usr}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		 &lt;span class=&quot;p&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;--target=x86_64-alpine-linux-musl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&apos;,&lt;/span&gt;
		 &lt;span class=&quot;sx&quot;&gt;qq{--sysroot=$ALPINE_DIR}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		 &lt;span class=&quot;sx&quot;&gt;qq{-L$ALPINE_DIR/lib}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		 &lt;span class=&quot;nv&quot;&gt;@ARGV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

	&lt;span class=&quot;nb&quot;&gt;unshift&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;@args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;-static&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$STATIC&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$compile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;nb&quot;&gt;exit&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;unless&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;clang&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;@args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This wrapper is more refined than the FreeBSD AArch64 wrapper above.
For instance, it can infer C++ if invoked as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang++&lt;/code&gt;, or always force &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-static&lt;/code&gt; if called from a &lt;em&gt;symlink&lt;/em&gt; containing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;static&lt;/code&gt; in its name:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-la&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;which musl-clang++&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
lrwxrwxrwx    10 marco marco 26 Jan 21:49  /home/marco/.local/bin/musl-clang++ -&amp;gt; musl-clang
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-la&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;which musl-clang++-static&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
lrwxrwxrwx    10 marco marco 26 Jan 22:03  /home/marco/.local/bin/musl-clang++-static -&amp;gt; musl-clang
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;musl-clang++-static &lt;span class=&quot;nt&quot;&gt;-std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;c++17 &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; zpipe zpipe.cc &lt;span class=&quot;nt&quot;&gt;-lz&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# automatically infers C++ and -static&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file zpipe
zpipe: ELF 64-bit LSB executable, x86-64, version 1 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;SYSV&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;, statically linked, with debug_info, not stripped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is thus possible to force Clang to only ever link &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-static&lt;/code&gt; by setting $CC to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;musl-clang-static&lt;/code&gt;, which can be useful with build systems that don’t play nicely with static linking. In my experience, the worst offenders in this regard are Autotools (sometimes) and poorly written Makefiles.&lt;/p&gt;

&lt;h3 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h3&gt;

&lt;p&gt;Cross-compiling C and C++ is and will probably always be an annoying task, but it has got much better since LLVM became production-ready and widely available. Clang’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-target&lt;/code&gt; option has saved me countless man-hours that I would have instead wasted building and re-building GCC and Binutils over and over again.&lt;/p&gt;

&lt;p&gt;Alas, all that glitters is not gold, as is often the case. There is still code around that only builds with GCC due to nasty GNUisms (I’m looking at you, Glibc). Cross compiling for Windows/MSVC is also borderline unfeasible due to how messy the whole Visual Studio toolchain is.&lt;/p&gt;

&lt;p&gt;Furthermore, while targeting arbitrary triples with Clang is now definitely simpler than it was, it still pales in comparison to how trivial cross compiling with Rust or Go is.&lt;/p&gt;

&lt;p&gt;One special mention among these new languages should go to Zig, and its goal to also make C and C++ easy to build for other platforms.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zig cc&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zig c++&lt;/code&gt; commands have the potential to become an amazing Swiss Army knife for cross compiling, thanks to Zig shipping a copy of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang&lt;/code&gt; and large chunks of projects such as Glibc, Musl, libc++ and MinGW. Any required library is then built &lt;em&gt;on-the-fly&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;zig c++ &lt;span class=&quot;nt&quot;&gt;--target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;x86_64-windows-gnu &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; str.exe str.cc
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file str.exe
str.exe: PE32+ executable &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;console&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; x86-64, &lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;MS Windows
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;While I think this is not yet perfect, it already feels almost like magic. I dare to say, this might really become a killer selling point for Zig, making it attractive even for those who are not interested in using the language itself.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If the transfer is happening across a network and not locally, it’s a good idea to compress the output tarball. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Sadly, macOS is not supported anymore by LLD due to Mach-O support being largely unmaintained and left to rot over the last years. This leaves &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ld64&lt;/code&gt; (or a cross-build thereof, if you manage to build it) as the only way to link Mach-O executables (unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ld.bfd&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binutils&lt;/code&gt; still supports it). &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llvm-mc&lt;/code&gt; can be used as a (very cumbersome) assembler but it’s poorly documented. Like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcc&lt;/code&gt;, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clang&lt;/code&gt; frontend can act as an assembler, making &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;as&lt;/code&gt; often redundant. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is without talking about those criminals who hardcode &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcc&lt;/code&gt; in their build scripts, but this is a rant better left for another day. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In the same fashion, it is also possible to tune the native toolchain for the current machine using a &lt;em&gt;native file&lt;/em&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--native-file&lt;/code&gt; toggle. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Glibc’s builtin name resolution system (NSS) is one of the main culprits, which heavily uses dlopen()/dlsym(). This is due to its heavy usage of plugins, which is meant to provide support for extra third-party resolvers such as mDNS. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;systemd-nspawn&lt;/code&gt; can also double as a lighter alternative to VMs, using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--boot&lt;/code&gt; option to spawn an init process inside the container. See &lt;a href=&quot;https://gist.github.com/sfan5/52aa53f5dca06ac3af30455b203d3404&quot;&gt;this very helpful gist&lt;/a&gt; to learn how to make bootable containers for distributions based on OpenRC, like Alpine. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Sadly, Alpine, for reasons unknown to me, does not ship the static version of certain libraries (like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libfmt&lt;/code&gt;). Given that embedding a local copy of third party dependencies is common practice nowadays for C++, this is not too problematic. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>NAT66: The good, the bad, the ugly</title>
   <link href="https://mcilloni.ovh/2018/01/20/oh-god-why-NAT66/"/>
   <updated>2018-01-20T00:00:00+00:00</updated>
   <id>https://mcilloni.ovh/2018/01/20/oh-god-why-NAT66</id>
   <content type="html">&lt;p&gt;NAT (and NAPT) is one of those technologies everyone has a strong opinion about. For years it has been the necessary evil and invaluable (yet massive) hack that kept IPv4 from falling apart in the face of its abysmally small 32-bit address space - which was, to be honest, an absolutely OK choice for the time the protocol was designed, when computers cost a small fortune, and were as big as lorries.&lt;/p&gt;

&lt;p&gt;The Internet Protocol, version 4, has been abused for far too long now. We made it into the fundamental building block of the modern Internet, a network of a scale it was never designed for. It is high time we put it to rest and replaced it with its controversial, yet problem-solving 128-bit grandchild, IPv6.&lt;/p&gt;

&lt;p&gt;So, what should be the place for NAT in the new Internet, which makes the return to the end-to-end principle one of its main tenets?&lt;/p&gt;

&lt;h3 id=&quot;nat66-misses-the-point&quot;&gt;NAT66 misses the point&lt;/h3&gt;

&lt;p&gt;Well, none, according to the IETF, which has for years tried to dissuade everyone from dabbling with NAT66 (the name NAT goes by on IPv6); this is not without good reasons, though. For too long, the supposedly stateless, connectionless layer 3 IP protocol has been made into an impromptu “stateful”, connection-oriented protocol by NAT gateways, just for the sake of meeting the demands of an infinite number of devices trying to connect to the Internet.&lt;/p&gt;

&lt;p&gt;This is without considering the false sense of security that address masquerading provides; I cannot recall how many times I’ve heard people say that &lt;em&gt;(gasp!)&lt;/em&gt; NAT is a fundamental piece in the security of their internal networks (it’s not).&lt;/p&gt;

&lt;p&gt;Given that the immensity of the IPv6 address space allows providers to give out full &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/64&lt;/code&gt;s to customers, I’d always failed to see the point in NAT66: it always felt to me as a feature fundamentally dead in the water, a solution seeking a problem, ready to be misused.&lt;/p&gt;

&lt;p&gt;Well, this was before discovering how cheap some hosting services could be.&lt;/p&gt;

&lt;h3 id=&quot;being-cheap-the-root-of-all-evils&quot;&gt;Being cheap: the root of all evils&lt;/h3&gt;

&lt;p&gt;I was quite glad to see a while ago that my VPS provider had announced IPv6 support; thanks to this, I would finally be able to provide IPv6 access to the guests of the VPNs I host on that VPS, without incurring the latency penalties caused by tunneling the traffic through good old services such as Hurricane Electric and SixXS&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Hooray!&lt;/p&gt;

&lt;p&gt;My excitement was unfortunately not going to last for long, and it was indeed barbarically butchered when I discovered that, despite having been granted a full &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/32&lt;/code&gt; (2&lt;sup&gt;96&lt;/sup&gt; IPs), my provider had decided to give its VPS customers just a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/128&lt;/code&gt; address.&lt;/p&gt;

&lt;p&gt;JUST. A. SINGLE. ONE.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mcilloni.ovh/public/ohgodwhy.png&quot; alt=&quot;Oh. God. Why.&quot; title=&quot;Y U SO CHEAP?&quot; class=&quot;center-image&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Given that IPv6 connectivity was something I really wished for my OpenVPN setup, this was quite a setback.
I was left with fundamentally only two reasonable choices:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Get a free /64 from a Hurricane Electric tunnel, and allocate IPv6s for VPN guests from there;&lt;/li&gt;
  &lt;li&gt;Be a very bad person, set up NAT66, and feel ashamed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hurricane Electric is, without doubt, the most orthodox option between the two; it’s free of charge, it gives out &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/64&lt;/code&gt;s, and it’s quite easy to set up.&lt;/p&gt;

&lt;p&gt;The main showstopper here is definitely the increased network latency added by two layers of tunneling (VPN -&amp;gt; 6in4 -&amp;gt; IPv6 internet), and, given that by default native IPv6 source IPs are preferred to IPv4, it would have been bad if having a public v6 address slowed down connections that usually have tolerable latencies. Especially if there was a way to get decent RTTs for both IPv6 and IPv4…&lt;/p&gt;

&lt;p&gt;And so, with a pang of guilt, I shamefully committed the worst crime.&lt;/p&gt;

&lt;h3 id=&quot;how-to-get-away-with-nat66&quot;&gt;How to get away with NAT66&lt;/h3&gt;

&lt;p&gt;The process of setting up NAT usually relies on picking a specially reserved private IP range, to keep our internal network structure from conflicting with the routing rules of the outer network (it still may happen, though, under multiple misconfigured levels of masquerading).&lt;/p&gt;

&lt;p&gt;The IPv6 equivalent to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10.0.0.0/8&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;172.16.0.0/12&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;192.168.0.0/16&lt;/code&gt; was defined in 2005 by the IETF, not without a great deal of confusion first, with the Unique Local Addresses (ULA) specification, RFC 4193. This RFC defines the unique, not publicly routable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fc00::/7&lt;/code&gt; block that is supposed to be used to define local subnets, without the global uniqueness guarantees of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2000::/3&lt;/code&gt; (the range from which Global Unicast Addresses (GUA) - i.e. the Internet - are allocated for the time being). From it, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd00::/8&lt;/code&gt; is the only block really defined so far, and it’s meant to contain all of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/48&lt;/code&gt;s your private network may ever need.&lt;/p&gt;

&lt;p&gt;The next step was to configure my OpenVPN instances to give out ULAs from subnets of my choice to clients, by adding the following lines at the end of my config:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;server-ipv6 fd00::1:8:0/112
push &quot;route-ipv6 2000::/3&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I resorted to picking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd00::1:8:0/112&lt;/code&gt; for the UDP server and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd00::1:9:0/112&lt;/code&gt; for the TCP one, due to a limitation in OpenVPN only accepting masks from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/64&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/112&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Given that I also want traffic towards the Internet to be forwarded via my NAT, it is also necessary to instruct the server to push a default route to its clients at connection time.&lt;/p&gt;
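
&lt;p&gt;As a sketch, the TCP instance would then end with the same pair of lines, only with its own subnet (the subnet is the one mentioned above; that the pushed route is identical for both servers is an assumption):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;server-ipv6 fd00::1:9:0/112
push &quot;route-ipv6 2000::/3&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;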

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ping fd00::1:8:1
PING fd00::1:8:1(fd00::1:8:1) 56 data bytes
64 bytes from fd00::1:8:1: icmp_seq=1 ttl=64 time=40.7 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The clients and servers were now able to ping each other through their local addresses without any issue, but the outer network was still unreachable.&lt;/p&gt;

&lt;p&gt;I continued the creation of this abomination by configuring the kernel to forward IPv6 packets; this is achieved by setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net.ipv6.conf.all.forwarding = 1&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sysctl&lt;/code&gt; or in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sysctl.conf&lt;/code&gt; (from now on, the rest of this article assumes that you are on Linux).&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cat /etc/sysctl.d/30-ipforward.conf 
net.ipv4.ip_forward=1
net.ipv6.conf.default.forwarding=1
net.ipv6.conf.all.forwarding=1
# sysctl -p /etc/sysctl.d/30-ipforward.conf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Afterwards, the only step left was to set up NAT66, which can be easily done by configuring the stateful firewall provided by Linux’s packet filter.&lt;br /&gt;
I personally prefer (and use) the newer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nftables&lt;/code&gt; to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{ip,ip6,arp,eth}tables&lt;/code&gt; mess it is supposed to supersede, because I find it far less moronic and clearer to understand (despite the relatively scarce documentation available online, which is sometimes a pain. I wish Linux had OpenBSD’s excellent pf…).&lt;br /&gt;
Feel free to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ip6tables&lt;/code&gt; if that’s what you are already using and you don’t really feel the need to migrate your ruleset to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nft&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is a shortened, summarised snippet of the rules that I’ve had to put into my nftables.conf to make NAT66 work; I’ve also left the IPv4 rules in for the sake of completeness.&lt;/p&gt;

&lt;p&gt;&lt;sub&gt;&lt;sup&gt;&lt;em&gt;PS: Remember to replace MY_EXTERNAL_IPVx with your IPv4/6 address!&lt;/em&gt;&lt;/sup&gt;&lt;/sub&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;table inet filter {
  [...]
  chain forward {
    type filter hook forward priority 0;

    # allow established/related connections
    ct state {established, related} accept

    # early drop of invalid connections
    ct state invalid drop

    # Allow packets to be forwarded from the VPNs to the outer world
    ip saddr 10.0.0.0/8 iifname &quot;tun*&quot; oifname eth0 accept
    
    # Using fd00::1:0:0/96 allows to match for
    # every fd00::1:xxxx:0/112 I set up
    ip6 saddr fd00::1:0:0/96 iifname &quot;tun*&quot; oifname eth0 accept
  }
  [...]
}
# IPv4 NAT table
table ip nat {
  chain prerouting {
    type nat hook prerouting priority 0; policy accept;
  }
  chain postrouting {
    type nat hook postrouting priority 100; policy accept;
    ip saddr 10.0.0.0/8 oif &quot;eth0&quot; snat to MY_EXTERNAL_IPV4
  }
} 

# IPv6 NAT table
table ip6 nat {
  chain prerouting {
    type nat hook prerouting priority 0; policy accept;
  }
  chain postrouting {
    type nat hook postrouting priority 100; policy accept;

    # Creates a SNAT (source NAT) rule that changes the source 
    # address of the outbound IPs with the external IP of eth0
    ip6 saddr fd00::1:0:0/96 oif &quot;eth0&quot; snat to MY_EXTERNAL_IPV6
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table ip6 nat&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chain forward&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table inet filter&lt;/code&gt; are the most important things to notice here, given that they respectively configure the packet filter to perform NAT66 and to forward packets from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tun*&lt;/code&gt; interfaces to the outer world.&lt;/p&gt;
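
&lt;p&gt;For those sticking with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ip6tables&lt;/code&gt;, the IPv6 part of the ruleset above could be sketched roughly as follows, in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ip6tables-restore&lt;/code&gt; format. Take it as an untested translation rather than a drop-in replacement: the ACCEPT chain policies are an assumption (the nftables snippet elides them), and MY_EXTERNAL_IPV6 is the same placeholder as before.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# NAT66: rewrite the source address of packets leaving through eth0
*nat
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s fd00::1:0:0/96 -o eth0 -j SNAT --to-source MY_EXTERNAL_IPV6
COMMIT
# Forwarding from the VPN interfaces (tun+ matches every tunN) to eth0
*filter
:FORWARD ACCEPT [0:0]
-A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
-A FORWARD -m conntrack --ctstate INVALID -j DROP
-A FORWARD -s fd00::1:0:0/96 -i tun+ -o eth0 -j ACCEPT
COMMIT
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Keep in mind that loading such a file with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ip6tables-restore&lt;/code&gt; flushes the existing contents of both tables unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--noflush&lt;/code&gt; is given.&lt;/p&gt;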

&lt;p&gt;After applying the new ruleset with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nft -f &amp;lt;path/to/ruleset&amp;gt;&lt;/code&gt; command, I was ready to witness the birth of my little sinful setup.
The only thing left was to ping a known IPv6 address from one of the clients, to ensure that forwarding and NAT were working fine. One of the Google DNS servers would suffice:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ping 2001:4860:4860::8888
PING 2001:4860:4860::8888(2001:4860:4860::8888) 56 data bytes
64 bytes from 2001:4860:4860::8888: icmp_seq=1 ttl=54 time=48.7 ms
64 bytes from 2001:4860:4860::8888: icmp_seq=2 ttl=54 time=47.5 ms
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=49.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=55 time=50.8 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Perfect! NAT66 was working, in its full evil glory, and the client was able to reach the outer IPv6 Internet with round-trip times as fast as IPv4. What was left now was to check if the clients were able to resolve AAAA records; given that I was already using Google’s DNS in /etc/resolv.conf, it should have worked straight away:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ping facebook.com
PING facebook.com (157.240.1.35) 56(84) bytes of data.
^C
$ ping -6 facebook.com
PING facebook.com(edge-star-mini6-shv-01-lht6.facebook.com (2a03:2880:f129:83:face:b00c:0:25de)) 56 data bytes
^C
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What? Why is ping trying to reach Facebook on its IPv4 address by default instead of trying IPv6 first?&lt;/p&gt;

&lt;h3 id=&quot;one-workaround-always-leads-to-another&quot;&gt;One workaround always leads to another&lt;/h3&gt;

&lt;p&gt;Well, it turned out that Glibc’s getaddrinfo() function, which is generally used to perform DNS resolution, uses a precedence system to correctly prioritise source-destination address pairs.&lt;/p&gt;

&lt;p&gt;I started to suspect that the default behaviour of getaddrinfo() could be to treat local addresses (including ULAs) as a separate case from global IPv6 ones; so, I checked &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gai.conf&lt;/code&gt;, the configuration file for getaddrinfo().&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;label ::1/128       0  # Local IPv6 address
label ::/0          1  # Every IPv6
label 2002::/16     2 # 6to4 IPv6
label ::/96         3 # Deprecated IPv4-compatible IPv6 address prefix
label ::ffff:0:0/96 4  # Every IPv4 address
label fec0::/10     5 # Deprecated 
label fc00::/7      6 # ULA
label 2001:0::/32   7 # Teredo addresses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What is shown in the snippet above is the default label table used by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getaddrinfo()&lt;/code&gt;. &lt;br /&gt;
As I suspected, a ULA address is labeled differently (6) than a global unicast one (1), and, because the default behaviour specified by RFC 3484 is to prefer pairs of source-destination addresses with the same label, the IPv4 pair is picked over the one with the IPv6 ULA source every time.&lt;br /&gt;
Damn, I was so close to committing the perfect crime.&lt;/p&gt;

&lt;p&gt;To make this mess finally functional, I had to make yet another ugly hack (as if NAT66 using ULAs wasn’t enough), by setting a new label table in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gai.conf&lt;/code&gt; that didn’t make distinctions between addresses.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;label ::1/128       0  # Local IPv6 address
label ::/0          1  # Every IPv6
label 2002::/16     2 # 6to4 IPv6
label ::/96         3 # Deprecated IPv4-compatible IPv6 address
label ::ffff:0:0/96 4  # Every IPv4 address
label fec0::/10     5 # Deprecated 
label 2001:0::/32   7 # Teredo addresses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;By omitting the label for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fc00::/7&lt;/code&gt;, ULAs are now grouped together with GUAs, and NATed IPv6 connectivity is used by default.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ping google.com
PING google.com(par10s29-in-x0e.1e100.net (2a00:1450:4007:80f::200e)) 56 data bytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;in-conclusion&quot;&gt;In conclusion&lt;/h3&gt;

&lt;p&gt;So, yes, NAT66 can be done and it works, but that doesn’t make it any less of the messy, dirty hack it is.
For the sake of getting IPv6 connectivity behind a provider too cheap to give its customers a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/64&lt;/code&gt;, I had to forgo end-to-end connectivity, hacking Unique Local Addresses to achieve something they weren’t really devised for.&lt;/p&gt;

&lt;p&gt;Was it worth it? Perhaps. My ping under the VPN is now as good on IPv6 as it is on IPv4, and everything works fine, but this came at the cost of an overcomplicated network configuration.
This could have been much simpler if everybody had simply understood how IPv6 differs from IPv4, and that giving out a single address is simply not the right way to allocate addresses to your subscribers anymore.&lt;/p&gt;

&lt;p&gt;The NATs we use today are relics of a past where the address space was so small that we had to break the Internet in order to save it. They were a mistake made to fix an even bigger one, a blunder whose effects we now have the chance to undo.
We should just start to take the ongoing transition period as seriously as it deserves, to avoid falling into the same wrong assumptions yet again.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Ironically, SixXS closed last June because “many ISPs offer IPv6 now”. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>First post!</title>
   <link href="https://mcilloni.ovh/2018/01/03/first-post/"/>
   <updated>2018-01-03T00:00:00+00:00</updated>
   <id>https://mcilloni.ovh/2018/01/03/first-post</id>
   <content type="html">&lt;p&gt;Welcome, internet stranger, to my humble blog!&lt;/p&gt;

&lt;p&gt;I hope to find the time to post, at least once a month, a new story or tutorial about Linux, FreeBSD, system administration or similar CS-related topics. More often than not, these will be full reports on something I’ve been tinkering with during my research activity (or just because I liked it).&lt;br /&gt;
Nothing I publish comes with any pretense of being relevant, correct or even interesting; my only hope is that this blog will be useful at least to myself, so that I don’t forget what I’ve learned and which mistakes I have already made.&lt;/p&gt;

&lt;h3 id=&quot;why&quot;&gt;Why?&lt;/h3&gt;
&lt;p&gt;From the very first moment I turned on a PC in the ’90s, I’ve been hooked on computers and anything revolving around them. Exploring and better understanding how these machines work has been an immense source of entertainment and learning for me, leading to countless hours spent trying every piece of software, gadget or device I could lay my hands on.&lt;br /&gt;
I cannot say for certain how many times I found myself delving heart and soul into some convoluted install of practically every Linux distribution and BSD I could find, sometimes even resorting to compiling some of them from scratch, just for the sake of better understanding how these complex yet fascinating software packages fit together to create a fully-fledged, functional operating system.&lt;/p&gt;

&lt;p&gt;Being as passionate as I was (and still am) about software made the choice of enrolling in Computer Engineering extremely simple. During my university years, I had the time and opportunity to further improve my coding skills, focusing especially on mastering C and C++, Go and, recently, Rust. I have a passion for compiler technology, and I’ve dabbled in programming language design for a while, implementing &lt;a href=&quot;https://github.com/mcilloni/fork&quot;&gt;a functioning self-hosting compiler&lt;/a&gt;, which I hope will be the topic of a future, fully dedicated blog post.&lt;/p&gt;

&lt;h3 id=&quot;what-do-you-do&quot;&gt;What do you do?&lt;/h3&gt;
&lt;p&gt;After working for two years at the University of Bologna as both a researcher on distributed ledgers and a system administrator, I decided to change my professional path and become an embedded developer. I now work mostly on the ESP32 platform.&lt;/p&gt;

&lt;p&gt;My other hobbies include languages (the ones spoken by people, at least for now!), cooking, writing, astronomy, biology, and science in general.&lt;/p&gt;

&lt;h3 id=&quot;you-wrote-something-wrong&quot;&gt;You wrote something wrong!&lt;/h3&gt;

&lt;p&gt;If you notice something amiss with either my writing or the contents of the blog, do not hesitate to contact me (in any way you prefer). I plan to add Disqus support directly on blog posts, but in the meantime don’t be shy: simply fork and PR me on GitHub, if you so wish.&lt;/p&gt;

</content>
 </entry>
 

</feed>
