NixOS uses virtual machines based on QEMU extensively for running its test suite. In order to avoid generating a disk image for every test, the test driver usually boots using a Plan 9 File Protocol (9p) share (server implemented by QEMU) for the Nix store, which contains all the programs and config necessary for the test.
I was working on a VM test that copied a fairly large amount of data (~278k files totaling ~5.3GiB) out of the 9p-mounted Nix store, and was surprised at how long copying this data took. On NVMe devices, I would expect this to take a matter of seconds or minutes, but the test actually ended up taking over 2 hours, most of which was spent copying files from 9p. Since this is untenably long for incremental work, I decided to dig into it a bit, and was able to reduce the duration of the test to only 7 minutes. In this post, I’ll describe the whole journey.
Profiling QEMU
As a preface: I don’t have much experience in debugging performance issues! Most of what I used was novel to me. The first thing I wanted to do was find out where a lot of time was being spent. My guess was that it was in QEMU and not in the guest, though this guess being correct was a matter of pure luck. This stack overflow question described a problem mostly but not entirely unlike mine. This led me down the wonky path of trying to use the poor man’s profiler, a little hack composed of gdb, some shell, and awk.
Stories of surprises and failure: poor man’s profiler
I immediately ran into a minor roadblock with this approach. gdb said:
Nix uses Linux namespaces to provide builds with some isolation from the system running the build in order to reduce the effects that the environment of a particular machine can have on the result of the build (“purity”). This includes PID namespaces, which prevent processes within the namespace from touching any processes outside the namespace. gdb was unhappy with being in a different PID namespace from the process it was targeting! I first attempted to get my gdb inside the sandbox using nsenter. The first surprise I encountered here was that entering a PID namespace does not cause utilities from procps, such as ps, pgrep and top, to report only on processes inside the new namespace:
What!? That’s not the PID1 I’m expecting! And I certainly haven’t been running this build since the 2nd of September.
I’ll omit the details of the hour of frustration that ensued, but it turns out
that the /proc
which ps
and friends were reading from was still that of the
root PID namespace—even though they were no longer running in it! This might
cause some funny unexpected behaviour when using tools like pkill
… But that’s
a problem for another day.
With my newfound knowledge, I was able to work around this issue by also
creating a new mount namespace and mounting the desired proc
filesystem on
top of the /proc
we inherited from the root namespace.
- Locking resources, at 1500 and 743 thread samples. Since the top result was on worker threads, I guessed that wasn’t interesting and they were just waiting for work to become available.
- Waiting for things to happen inside the VM, at 100 and 53 thread samples. That didn’t seem like a relevant issue though—I’d expect a VM monitor to be waiting for its guest to need something most of the time.
The rest of the numbers were small enough that I (erroneously!) considered them uninteresting too. At this point I became frustrated enough with this branch of my investigation that I gave up on it.
I moved on to quickstack, packaged it, and observed that it couldn’t get me any information on threads, before going all the way back to the Stack Overflow question to follow the other link provided in the answer.
Flame graphs with perf
This was what really got me somewhere. After recording performance data using
perf record -F max -a -g -- sleep 20
while the build was running, I was able
to generate a flame graph which made the source of the performance problems
quite apparent. This command:
produced this nifty interactive SVG graph:
The prevalence of “fid”-related functions is quite apparent in this graph. So I jumped into the 9p docs and QEMU source code to find out that fids are numbers which refer to open files in a 9p connection, similar to file descriptors in the POSIX file API—so there is one fid for every file that the guest has open.
Let’s look at the previous implementation of get_fid
, which QEMU’s 9p server
uses in implementations of stat
(getting file metadata), open
(opening a
file), and read
(reading data from an open file), amongst others:
QEMU iterates over fid_list
to find the desired fid
’s data. Finding an entry
in a list by iterating over has a time complexity of O(n)
where n
is the
size of the list—in this case, the list of all the files the guest has open—so this
is expensive! Moreover, some inspection of QSIMPLEQ (QEMU simple queue
) reveals that it’s implemented as a linked list, a data structure which
tends to exhibit poor cache locality.
Since my test copies many files from the 9p filesystem, as well as being booted from it, this lookup happens quite often:
- One
stat
for getting a file’s metadata (permissions in particular) - One
open
to get a handle on a file - One
read
for small files, or many for larger files, to get the contents of the file
That makes at least 3 operations which perform the lookup, for each of the 278000 files, which bring the inefficient lookup into a hot code path. This was the reason for the slowness.
Fixing it
What we really want is a structure where we can look up entries by fid
more
cheaply. We can’t just use an array-based vector, which would consistently give
us O(1)
lookup, because fid
s are chosen by the client: we can’t rely on every
newly allocated fid
just being the smallest unoccupied one, and need to support
arbitrary 32-bit integers. I opted for a hash table, as conveniently
implemented by
glib, which
QEMU already depends on. That provides us with O(1)
best-case complexity,
while keeping the worst case at O(n)
. The exact real-world performance
characteristics are significantly more complex than with the linked list, and
there may be marginally more suitable data structures (or hash table
implementations) out there, but we’re looking for a big win and not
micro-optimisation here.
Rewriting the relevant code was surprisingly uneventful, to the point that once I had it compiling, it just worked (an experience I’m not sure I’ve ever had before with C!). The results were spectacular: my previously >2h test finished in 7 minutes. It also reduces the build time of NixOS’s ZFS AWS image from 19 minutes to 1. It was pretty clear to me that this needed to go upstream.
Find out more about how Determinate Systems is contributing to the Nix community
Contributing the fix
QEMU uses an email-based workflow, where patches are sent as emails to a mailing list which maintainers are subscribed to. I hadn’t done this before, and it went somewhat chaotically, as you can see by looking at the threads (v1 v3 v1a v5 v6) in the list archives. There’s a lot of space for mistakes in email-based patch submission, and I made several:
- Forgetting to reply-all when answering review comments, so that the reply is visible to everyone interested and not just the reviewer
- Missing the version tag on a resubmission (in my case, this was because of misleading docs)
- Sending a resubmission as a reply to the previous thread (that was just me failing to read the docs)
- Losing a bunch of patch-documenting work while using git-publish because
git send-email
failed (nixpkgs does not enable Git’s email functionality in the default package, and git-publish unconditionally deletes the working files once the send command has run).
But the reviewers, especially Christian Schoenebeck (thanks!) were helpful and patient, and in the end the patch made it (though an apparently unrelated bug was discovered) and will soon be finding its way to a QEMU version near you! If you can’t wait and you use Nix, you can pull the patch in with this overlay:
Outcomes
While the core of what I changed wasn’t hugely complex, this adventure had many collateral (positive!) outcomes for me:
- I wrote Nix packages for quickstack (including an upstream PR that makes the build more generic) and git-publish.
- I used
perf
to work on performance issues for the first time, learning how useful it is and basics of how to use it. - I learnt that different methods of profiling can yield wildly different results—the results from perf painted a different picture from the poor man’s profiler; I don’t understand why yet, but I’ll spend some more time learning about them soon.
- I submitted my first patch to QEMU, getting to know the code a little.
- I submitted my second patch to QEMU (trying to improve the patch submission docs).
- I learnt how to submit patches by email, and which mistakes to avoid.
- I learnt about the 9p protocol.
- My tests now run much faster!
- And I got the warm fuzzy feeling of having made a valuable contribution.
My takeaway from it? Digging into a frustrating problem—even if I’m totally unfamiliar with the code involved and the technology I’m using—can be hugely rewarding!
Not only have I made my own work on the tests a lot more pleasant by reducing turnaround time, but this will go out to all QEMU users, and significantly reduce the load generated by installer tests on NixOS’s build farm. This is what I love about working on open source software: if something’s wrong, I have the means to go in there and fix it, and can pass the benefits on to everyone else who uses it.
It doesn’t always go as well as this, but this kind of experience can make a great many less successful adventures worth it.
Also posted on Linus’s own blog