Nov 7, 2023

By Graham Christensen

Lessons from 1 million Nix Installs

Nix has an official installer that I feel has served the community reasonably well over the years. From the beginning it had some issues that made me long for a better alternative—but not so much that I considered rethinking the installer from the ground up. That changed about a year ago.

I was struggling with the installer's Bash scripts and trying to handle an edge case around temporary directories. Fixing this bug took nearly a week, and the end result? Five lines of code—and even when we were done we still weren't sure if it was right. At that moment, I knew that we at Determinate Systems needed to start with a clean slate and build something great, not in Bash but rather in the much more robust and expressive Rust.

From the beginning we focused on reliability, user experience, and providing a modern Nix experience out of the box. And I firmly believe that we have succeeded. The Determinate Nix Installer, as we came to call it:

So just how successful has the Determinate Nix Installer been? Today I'm pleased to announce that it has successfully installed Nix over one million times since we first introduced it in January of 2023.

Targeting 100% success

We strive for a 100% success rate when installing Nix. That doesn't mean that we put the files in the right place and call it good. Nix has to work the way that users expect it to. A "successful failure" is still a failure in our eyes.

We're not big fans of software "phoning home." Nobody loves it, and every change we make to our diagnostics receives careful reviews and strong critique on principle. At the same time, we could not make the most reliable installer without it.

The Determinate Nix Installer collects a minimal amount of diagnostic data after every installation. This data includes the OS and architecture of the computer, whether the install succeeded, and a small amount of sanitized failure information, such as which part of the installation process failed. Collecting this data is a critical component of improving the installer and targeting the most important problems that users are facing.
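To make "sanitized" concrete, here is a minimal sketch of what such a report could look like. It is not the installer's actual schema: the type names, field names, and the `create_users` step name are all hypothetical. The idea it illustrates is that only a fixed step identifier leaves the machine, while free-form error text (which can contain paths or usernames) is dropped.

```rust
use std::env::consts::{ARCH, OS};

/// Hypothetical failure type: `step` is a fixed identifier, while `detail`
/// may contain paths or usernames and is never copied into the report.
struct InstallFailure {
    step: &'static str,
    detail: String,
}

/// Hypothetical report shape, limited to the kinds of fields listed above.
#[derive(Debug, PartialEq)]
struct DiagnosticReport {
    os: &'static str,
    architecture: &'static str,
    succeeded: bool,
    failed_step: Option<&'static str>,
}

fn build_report(outcome: &Result<(), InstallFailure>) -> DiagnosticReport {
    DiagnosticReport {
        os: OS,
        architecture: ARCH,
        succeeded: outcome.is_ok(),
        // Sanitize by keeping only the step name and dropping `detail`.
        failed_step: outcome.as_ref().err().map(|failure| failure.step),
    }
}

fn main() {
    let outcome = Err(InstallFailure {
        step: "create_users",
        detail: "permission denied for /Users/alice".to_string(),
    });
    println!("{:?}", build_report(&outcome));
}
```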

Reaching 100% may not be possible but we have the results to prove we're striving to give users the best experience every time. Overall, we're tracking approximately a 99.4% success rate.

Rolling deployments

One important way we're able to retain our success rate is through our carefully orchestrated, rolling deployments. We don't just flip a switch and move 100% of our users to new releases all at once. This is risky and it doesn't treat users with the consideration they deserve. Nix and the Determinate Nix Installer are load-bearing components of our users' stacks and we have to respect that.

Our releases start by rolling out to only 20% of requests from GitHub Actions. We start with GitHub Actions because the environment is ephemeral and failure cases are easy to resolve by restarting the job. This means that users on long-term devices don't get a bad experience and CI users are likely to hit "re-run" when they encounter a weird edge case.
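One common way to implement this kind of percentage ramp is deterministic bucketing: hash a stable key into one of 100 buckets and serve the new release to buckets below the current percentage. The sketch below assumes that approach. It is not a description of our actual release infrastructure, and `in_rollout` and the key format are hypothetical names.

```rust
/// 64-bit FNV-1a, used here only because it is stable across runs and platforms.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Returns true if `request_key` falls inside the first `percent` buckets.
/// A key that is included at 20% stays included at every higher percentage.
fn in_rollout(request_key: &str, percent: u8) -> bool {
    let bucket = fnv1a(request_key.as_bytes()) % 100;
    bucket < u64::from(percent.min(100))
}

fn main() {
    // Hypothetical keys standing in for individual CI requests.
    let selected = (0..10_000)
        .filter(|n| in_rollout(&format!("request-{}", n), 20))
        .count();
    println!("{} of 10000 requests would get the new release", selected);
}
```

Because the bucket depends only on the key, ramping from 20% upward only ever adds requests to the new release and never moves one back.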

We carefully track the new release and monitor to see if users are experiencing an increase in installation failures. But we don't stop there. Like I mentioned earlier, our goal is not just to throw Nix down on the host and call it good. Our goal is to deliver a working version of Nix that doesn't break users' setups.

To accomplish this, our GitHub Action reports back anonymized summary data for public GitHub Actions workflows. The data is a little bit noisy but it is also very valuable, and in practice we find that the rate of workflow failures is consistent between two releases unless Nix or the installer is broken. This data along with some diagnostics data enables us to identify problems and regressions in Nix itself for real users in a way that nobody else in the Nix landscape is doing.
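The comparison described here can be sketched as a simple check of failure rates between a baseline release and a candidate release. This is an illustration only: the `ReleaseStats` type, the `looks_regressed` function, and the tolerance value are assumptions, and a production check would likely account for sample size and noise more carefully.

```rust
/// Hypothetical per-release summary of public workflow runs.
#[derive(Clone, Copy)]
struct ReleaseStats {
    workflows: u64,
    failures: u64,
}

impl ReleaseStats {
    /// Fraction of workflows that failed, or None if there is no data yet.
    fn failure_rate(self) -> Option<f64> {
        if self.workflows == 0 {
            None
        } else {
            Some(self.failures as f64 / self.workflows as f64)
        }
    }
}

/// Flags the candidate if its failure rate exceeds the baseline's by more
/// than `tolerance`. Returns None when either release has no data.
fn looks_regressed(baseline: ReleaseStats, candidate: ReleaseStats, tolerance: f64) -> Option<bool> {
    let delta = candidate.failure_rate()? - baseline.failure_rate()?;
    Some(delta > tolerance)
}

fn main() {
    let baseline = ReleaseStats { workflows: 1000, failures: 50 };
    let candidate = ReleaseStats { workflows: 1000, failures: 90 };
    println!("{:?}", looks_regressed(baseline, candidate, 0.01));
}
```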

The Determinate Nix Installer's release ramping dashboard.

Over time, we carefully ramp up the GitHub Actions installations until we reach 100%, and alongside that we also ramp up the upgrade for users outside of CI.

Because our failure rate in CI is so low, we're able to take careful, measured steps to roll out new features and updates without big-bang releases that can spoil many an afternoon.

The long tail of error conditions is long but I do believe that our results speak for themselves.

Lessons learned on our way to 100%

User machines in particular have a uniquely fascinating history (to put it diplomatically).

macOS's security model is robust

Installing Nix on macOS means reckoning with an ever-tightening security model. This is generally good for users but it means that Nix and its installer have to constantly keep up.

Our installer is written in Rust, which means that adapting to the continuous upstream changes is easier; we're not fighting the language as the official installer must do with Bash.

Our installer (and uninstaller!) successfully navigates configuring synthetic.conf and fstab and also creating APFS volumes—with encryption, and more. Using Rust has been crucial to doing things the right way.

We're also able to do experiments and make improvements like switching from named APFS volumes to using UUIDs, which enables us to solve tricky problems surrounding systems that boot without the Nix store mounted.
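As a rough illustration of the difference between the two approaches, the sketch below renders an fstab line that refers to the volume either by label or by UUID. The mount options, the `Nix Store` label, and the placeholder UUID are assumptions for the example rather than what the installer necessarily writes. A UUID survives a volume rename and never needs escaping, which is part of its appeal.

```rust
/// How the fstab entry identifies the APFS volume.
enum VolumeRef<'a> {
    Label(&'a str),
    Uuid(&'a str),
}

/// Renders one fstab line. The mount options here are illustrative only.
fn fstab_entry(volume: &VolumeRef, mount_point: &str) -> String {
    let spec = match volume {
        // fstab fields are whitespace separated, so spaces in a label are
        // written as the octal escape \040.
        VolumeRef::Label(name) => format!("LABEL={}", name.replace(' ', "\\040")),
        VolumeRef::Uuid(uuid) => format!("UUID={}", uuid),
    };
    format!("{} {} apfs rw,noauto,nobrowse", spec, mount_point)
}

fn main() {
    println!("{}", fstab_entry(&VolumeRef::Label("Nix Store"), "/nix"));
    // Placeholder UUID, not a real volume.
    println!(
        "{}",
        fstab_entry(&VolumeRef::Uuid("00000000-0000-0000-0000-000000000000"), "/nix")
    );
}
```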

MUSL builds and nscd/sssd on Linux

For portability across Linux distributions, the Determinate Nix Installer is statically compiled using MUSL. In Rust, this means targeting x86_64-unknown-linux-musl and enabling the crt-static target feature, as documented in the Rust Reference.
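A build along these lines can be requested with something like `cargo build --release --target x86_64-unknown-linux-musl`, optionally with `RUSTFLAGS="-C target-feature=+crt-static"` to make the static linking explicit. To confirm what a given binary was built with, the compile-time configuration can be inspected from Rust itself. The small sketch below does that; the function name and messages are ours and not part of the installer.

```rust
/// Describes how this binary links the C runtime, based on compile-time cfg.
fn build_description() -> &'static str {
    if cfg!(all(target_env = "musl", target_feature = "crt-static")) {
        "statically linked against musl"
    } else if cfg!(target_feature = "crt-static") {
        "statically linked C runtime (not musl)"
    } else {
        "dynamically linked C runtime"
    }
}

fn main() {
    println!("this binary was built with a {}", build_description());
}
```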

This unfortunately brings a new set of unique issues. During one step of the installation process we use nix::unistd::User, which calls the getpwnam_r libc function. We've received several reports indicating that programs like nscd and sssd can override the behavior of getpwnam_r. In these situations, one workaround is to build our installer yourself with cargo install nix-installer and run that. But we don't love this solution and we hope to provide something more compelling, such as creating glibc-based release binaries.

Creating and deleting users

Serially creating ~32 users for Nix takes an annoyingly long time, measuring seconds per user on some machines.

Early in development we experimented with creating users in parallel to speed up the process. This turned out to be problematic on Mac and Linux due to locking and other parallelism-related issues.

We also looked at directly editing /etc/passwd and other files, but we are concerned this may cause further issues in enterprise environments with central user directories.

In addition, we adopted the auto-allocate-uids feature from Nix, which did make installation much faster but caused other issues. On macOS, for example, we experienced problems building Nix (of all things) because whoami no longer worked. We had problems on Linux, too. In issue #539 we noticed that some distributions experienced errors like setting uid: invalid argument. We ultimately rolled the feature back but one day we'd love to find a solution that would let us adopt it again.

And the fun continues.

In issue #33 we found that deleting users on certain Macs sometimes ends with a permissions error. After quite a bit of investigation, as well as referencing articles like Can't delete a macOS user with dscl and When you "can't" delete a user in MacOS, we uncovered the issue. It seems that you can't delete users on macOS if nobody has logged into the machine graphically.

This was a big problem for us since we run a macOS build farm dedicated to building and testing the installer.

We still don't have an automatic fix but we do detect the error and provide instructions on how to resolve it.

Nix's SSL certificate story needed improvement

Issues like #289 and later #516 made it evident that the existing NIX_SSL_CERT_FILE environment variable was causing problems for certain installations, as well as confusion for some users. Running nix build, for example, would sometimes produce errors like this:

warning: error: unable to download '...': SSL peer certificate or SSH remote key was not OK (60); retrying in 337 ms

The problem appeared to stem from inconsistencies in how NIX_SSL_CERT_FILE was being handled.

During discussions with Eelco we concluded that the best solution would be to lift NIX_SSL_CERT_FILE into a configuration option inside users' nix.conf configuration files (nixos/nix#8062).

This appears to have solved most of the issues we were seeing.
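As a sketch of what moving from an environment variable to configuration involves, the function below rewrites the text of a nix.conf so that it carries exactly one certificate setting. It assumes the option is spelled `ssl-cert-file`, which is our understanding of the setting in current Nix releases. The function name and the paths in the example are hypothetical, and the installer's real logic is more involved.

```rust
/// Returns `existing` nix.conf contents with exactly one `ssl-cert-file`
/// setting pointing at `cert_path`. Any previous value is replaced.
fn set_ssl_cert_file(existing: &str, cert_path: &str) -> String {
    let mut lines: Vec<String> = existing
        .lines()
        // Drop any earlier assignment of the same key.
        .filter(|line| line.split('=').next().map(str::trim) != Some("ssl-cert-file"))
        .map(str::to_owned)
        .collect();
    lines.push(format!("ssl-cert-file = {}", cert_path));
    let mut rendered = lines.join("\n");
    rendered.push('\n');
    rendered
}

fn main() {
    let existing = "experimental-features = nix-command flakes\n";
    // Hypothetical certificate bundle path.
    print!("{}", set_ssl_cert_file(existing, "/etc/ssl/certs/corporate-bundle.pem"));
}
```

Keeping the value in nix.conf means every Nix process reads the same setting, instead of depending on which shell or service happened to export the variable.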

Uninstallation order is important

A recurring issue that cropped up on our issue boards was a positively bizarre CA certificate issue on Macs characterized by pull request #608. Our first few reports made little sense. Why was Nix trying to access /etc/ssl/certs/ca-certificates.crt? That path doesn't normally exist on Mac and the install process doesn't involve it!

Reproducing the issue required these steps:

  1. Install Nix
  2. Install nix-darwin
  3. Uninstall Nix either with /nix/nix-installer uninstall or the official guide
  4. Reinstall Nix

Uninstalling Nix before uninstalling nix-darwin leaves a Launch Daemon called org.nixos.activate-system. Leaving this Launch Daemon lingering causes issues with the NIX_SSL_CERT_FILE environment variable, which in turn spoils reinstalls.

This issue prompted us to add new pre-install and pre-uninstall checks to warn about the issue before you hit it. There is a workaround and we hope that a future release of the installer will provide a robust cure for this issue.
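A check of this kind can be quite small. The sketch below looks for a leftover property list for the Launch Daemon in a given directory. It assumes the daemon is registered as a `.plist` file named after its label under `/Library/LaunchDaemons`. The function name is hypothetical, and the installer's real checks may work differently.

```rust
use std::path::{Path, PathBuf};

/// Label of the nix-darwin Launch Daemon named above.
const DAEMON_LABEL: &str = "org.nixos.activate-system";

/// Returns the path of the lingering plist if one exists in `launch_daemons_dir`.
/// The directory is a parameter so the check can be exercised in tests.
fn lingering_daemon(launch_daemons_dir: &Path) -> Option<PathBuf> {
    let plist = launch_daemons_dir.join(format!("{}.plist", DAEMON_LABEL));
    if plist.exists() {
        Some(plist)
    } else {
        None
    }
}

fn main() {
    // Assumed location of system-wide Launch Daemons on macOS.
    match lingering_daemon(Path::new("/Library/LaunchDaemons")) {
        Some(path) => eprintln!(
            "warning: {} is still present; uninstall nix-darwin before continuing",
            path.display()
        ),
        None => println!("no lingering nix-darwin Launch Daemon found"),
    }
}
```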

Containers are complicated

There are several popular container runtimes that differ in subtle ways. Installation is pretty normal when targeting a Podman container with systemd, for example, but Docker containers typically don't run systemd, which complicates the installation.
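One way to tell these environments apart is the convention used by systemd's own sd_booted check: if the directory /run/systemd/system exists, systemd is running as the init system. The sketch below applies that test and picks a plan accordingly. The `InitPlan` type and `plan_for` function are hypothetical and only illustrate the branching, not the installer's actual planner.

```rust
use std::path::Path;

/// Hypothetical choice of how the Nix daemon would be managed.
#[derive(Debug, PartialEq)]
enum InitPlan {
    /// systemd is running, so a service unit can be installed and started.
    Systemd,
    /// No init system detected, so the daemon has to be handled another way.
    NoInit,
}

/// Decides on a plan by checking for /run/systemd/system under `root`.
/// `root` is a parameter so the check can be tested against a scratch directory.
fn plan_for(root: &Path) -> InitPlan {
    if root.join("run/systemd/system").is_dir() {
        InitPlan::Systemd
    } else {
        InitPlan::NoInit
    }
}

fn main() {
    println!("{:?}", plan_for(Path::new("/")));
}
```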

In some runtimes, Nix's sandboxing isn't a viable option due to highly restrictive sandboxing of the container itself.

At one point, we made a matrix of different options that worked for Podman and Docker but the complexity got the best of us. In the end, we found two configurations that worked in the most common use cases. It feels like there is still a story to be told on this particular issue and we'd be glad to find a better solution.

Thank you to our collaborators

Whether it's a carefully described issue, a drive-by pull request, or even seeing your friendly faces at the Installer Working Group meetings, we want to say thank you 🎉😊 for collaborating with us on this project. It's been extremely uplifting to be able to participate in these greater community discussions. We continue to hope that the upstream project adopts the Determinate Nix Installer for itself.

In particular, a big thank you to Abathur and Mkenigs for continued in-depth collaboration.

What's next?

An early design choice was that our installer should have a public API for building custom installers on top of it. Ultimately, this hasn't received a lot of interest and has made creating the user experience we want much more complicated. We haven't yet decided if we want to keep this API, but if this is important to you, please let us know on our Discord.

We're well on our way to our next million installations, but before we get there it'd be great to call it 1.0.0.

If you'd like to chat about Nix and get help with flakes, please join us on our flake-forward Discord!