---
title: Storage Daemons for the Root File System
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# systemd and Storage Daemons for the Root File System

a.k.a. _Pax Cellae pro Radix Arbor_

(or something like that, my Latin is a bit rusty)

A number of complex storage technologies on Linux (e.g. RAID, volume
management, networked storage) require user space services to run while the
storage is active and mountable. This requirement becomes tricky as soon as the
root file system of the Linux operating system is stored on such storage
technology. Previously no clear path to make this work was available. This text
tries to clear up the resulting confusion, and to state what is now supported
and what is not.

## A Bit of Background

When complex storage technologies are used as backing for the root file system
this needs to be set up by the initial RAM file system (initrd), i.e. on Fedora
by Dracut. In newer systemd versions tear-down of the root file system backing
is also done by the initrd: after terminating all remaining running processes
and unmounting all file systems it can (which means excluding the root fs)
systemd will jump back into the initrd code allowing it to unmount the final
file systems (and their storage backing) that could not be unmounted as long as
the OS was still running from the main root file system. The initrd's job is to
detach/unmount the root fs, i.e. inverting the exact commands it used to set
them up in the first place. This is not only cleaner, but also allows for the
first time arbitrarily complex stacks of storage technology.

Previous attempts to handle root file system setups with complex storage as
backing usually tried to maintain the root storage with program code stored on
the root storage itself, thus creating a number of dependency loops.
Safely detaching such a root file system becomes messy, since the program code
on the storage needs to stay around longer than the storage, which is
technically contradictory.


## What's new?

As a result, we hereby clarify that we do not support storage technology setups
where the storage daemons are being run from the storage they maintain
themselves. In other words: a storage daemon backing the root file system
cannot be stored on the root file system itself.

What we do support instead is that these storage daemons are started from the
initrd, stay running all the time during normal operation and are terminated
only after we returned control back to the initrd and by the initrd. As such,
storage daemons involved with maintaining the root file system storage
conceptually are more like kernel threads than like normal system services:
from the perspective of the init system (i.e. systemd) these services have been
started before systemd got initialized and stay around until after systemd is
already gone. These daemons can only be updated by updating the initrd and
rebooting; a takeover from initrd-supplied services to replacements from the
root file system is not supported.


## What does this mean?

Near the end of system shutdown, systemd executes a small tool called
systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as
it entirely replaces the systemd init process) then iterates through the
mounted file systems and running processes (as well as a couple of other
resources) and tries to unmount/read-only mount/detach/kill them. It continues
to do this in a tight loop as long as this results in any effect. From this
killing spree a couple of processes are automatically excluded: PID 1 itself of
course, as well as all kernel threads. After the killing/unmounting spree
control is passed back to the initrd, whose job is then to unmount/detach
whatever might be remaining.
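
To make the two default exclusions concrete, here is a minimal sketch of how a
tool walking `/proc/` might decide whether a process is spared. This is an
illustration only, not the actual systemd-shutdown code: the function names
`spared_by_default()` and `pid_spared_by_default()` are hypothetical, and
treating an empty `/proc/<pid>/cmdline` as the sign of a kernel thread is an
assumption made for this example (zombie processes, for instance, can show an
empty command line too).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Pure decision (hypothetical helper): given a PID and the number of bytes
 * found in its /proc/<pid>/cmdline, is the process spared by default? */
bool spared_by_default(pid_t pid, size_t cmdline_len) {
        if (pid == 1)
                return true;    /* PID 1 itself is never killed */

        /* Assumption: an empty command line marks a kernel thread. */
        return cmdline_len == 0;
}

/* Hypothetical wrapper that looks the command line up in /proc.
 * Returns 1 if spared, 0 if not, -1 if the process could not be inspected
 * (for example because it exited in the meantime). */
int pid_spared_by_default(pid_t pid) {
        char path[64], buf[64];
        size_t n;
        FILE *f;

        assert(pid > 0);

        if (pid == 1)
                return 1;

        snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long) pid);
        f = fopen(path, "r");
        if (!f)
                return -1;

        /* Only emptiness matters here, so a short read is enough. */
        n = fread(buf, 1, sizeof(buf), f);
        if (ferror(f)) {
                fclose(f);
                return -1;
        }
        fclose(f);

        return spared_by_default(pid, n) ? 1 : 0;
}
```

A caller would invoke `pid_spared_by_default()` once per numeric directory
entry in `/proc/` and skip every process for which it returns 1.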

The same killing spree logic (but not the unmount/detach/read-only logic) is
applied during the transition from the initrd to the main system (i.e. the
"`switch_root`" operation), so that no processes from the initrd survive to the
main system.

To implement the supported logic proposed above (i.e. where storage daemons
needed for the root fs which are started by the initrd stay around during
normal operation and are only killed after control is passed back to the
initrd) we need to exclude these daemons from the shutdown/switch_root killing
spree. To accomplish this the following logic is available starting with
systemd 38:

Processes (run by the root user) whose first character of the zeroth command
line argument is `@` are excluded from the killing spree, much the same way as
kernel threads are excluded too. Thus, a daemon which wants to take advantage
of this logic needs to place the following at the top of its `main()` function:

```c
...
argv[0][0] = '@';
...
```

And that's already it. Note that this functionality is only to be used by
programs running from the initrd, and **not** for programs running from the
root file system itself. Programs which use this functionality and are running
from the root file system are considered buggy since they effectively prohibit
clean unmounting/detaching of the root file system and its backing storage.

_Again: if your code is being run from the root file system, then this logic
suggested above is **NOT** for you. Sorry. Talk to us, we can probably help you
to find a different solution to your problem._

The recommended way to distinguish between run-from-initrd and run-from-rootfs
for a daemon is to check for `/etc/initrd-release` (which exists on all modern
initrd implementations, see the [initrd Interface](INITRD_INTERFACE.md) for
details): if the file exists, set `argv[0][0]` to `@`, and otherwise leave it
untouched.
Something like this:

```c
#include <unistd.h>

int main(int argc, char *argv[]) {
        ...
        /* Only mark the process when it is running from the initrd */
        if (access("/etc/initrd-release", F_OK) >= 0)
                argv[0][0] = '@';
        ...
}
```

Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without
precedent: traditionally Unix login shells set `argv[0][0]` to `-` to clarify
they are login shells. This logic is also very easy to implement. We have been
looking for other ways to mark processes for exclusion from the killing spree,
but could not find any that was equally simple to implement and quick to read
when traversing through `/proc/`. As a side effect, replacing the first
character of `argv[0]` with `@` also visually invalidates the path normally
stored in `argv[0]` (which usually starts with `/`), thus helping the
administrator to understand that your daemon is actually not originating from
the actual root file system, but from a path in a completely different
namespace (i.e. the initrd namespace). Other than that we just think that `@`
is a cool character which looks pretty in the ps output...

Note that your code should only modify `argv[0][0]` and leave the comm name
(i.e. `/proc/self/comm`) of your process untouched.

## To which technologies does this apply?

These recommendations apply to those storage daemons which need to stay around
until after the storage they maintain is unmounted. If your storage daemon is
fine with being shut down before its storage device is unmounted you may ignore
the recommendations above.

This all applies to storage technology only, not to daemons with any other
(non-storage related) purposes.

## What else to keep in mind?

If your daemon implements the logic pointed out above it should work nicely
from initrd environments.
In many cases it might be necessary to additionally support starting storage
daemons from within the actual OS, for example when complex storage setups are
used for auxiliary file systems, i.e. not the root file system, or created by
the administrator during runtime. Here are a few additional notes for
supporting these setups:

* If your storage daemon is run from the main OS (i.e. not the initrd) it will
  also be terminated when the OS shuts down (i.e. before we pass control back
  to the initrd). Your daemon needs to handle this properly.

* It is not acceptable to spawn off background processes transparently from
  user commands or udev rules. Whenever a process is forked off on Unix it
  inherits a multitude of process attributes (ranging from the obvious to the
  not-so-obvious such as security contexts or audit trails) from its parent
  process. It is practically impossible to fully detach a service from the
  process context of the spawning process. In particular, systemd tracks which
  processes belong to a service or login session very closely, and by spawning
  off your storage daemon from udev or an administrator command you thus make
  it part of its service/login session. Effectively this means that whenever
  udev is shut down, your storage daemon is killed too, and whenever the login
  session goes away your storage might be terminated as well. (Also note that
  recent udev versions will automatically kill all long running background
  processes forked off udev rules now.) So, in summary: double-forking off
  processes from user commands or udev rules is **NOT** OK!

* To automatically spawn storage daemons from udev rules or administrator
  commands, the recommended technology is socket-based activation as
  implemented by systemd. Transparently for your client code, connecting to the
  socket of your storage daemon will result in the storage daemon being started.
  For that it is simply necessary to inform systemd about the socket you'd like
  it to listen on on behalf of your daemon and minimally modify the daemon to
  receive the listening socket for its services from systemd instead of
  creating it on its own. Such modifications can be minimal, and are easily
  written in a way that does not negatively impact usability on non-systemd
  systems. For more information on making use of socket activation in your
  program consult this blog story: [Socket
  Activation](http://0pointer.de/blog/projects/socket-activation.html)

* Consider having a look at the [initrd Interface of systemd](INITRD_INTERFACE.md).
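
As an illustration of the "minimal modification" mentioned in the socket
activation note above, the following sketch shows how a daemon might detect
listening sockets handed over by systemd without linking against any library.
It relies on the environment variables `$LISTEN_PID` and `$LISTEN_FDS` and on
passed descriptors starting at file descriptor 3, as described in
`sd_listen_fds(3)`. The function names `listen_fds_from_env()`,
`passed_listen_fds()` and `listen_fd_at()` are hypothetical and exist only for
this example. Unlike the real `sd_listen_fds()`, this sketch neither sets
`FD_CLOEXEC` on the descriptors nor unsets the environment variables, so real
code is probably better off calling `sd_listen_fds()` from libsystemd.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define LISTEN_FDS_START 3      /* first passed descriptor */

/* Parse a complete decimal string; returns 0 on success, -EINVAL otherwise. */
static int parse_decimal(const char *s, long *ret) {
        char *end;
        long v;

        if (!s || !*s)
                return -EINVAL;

        errno = 0;
        v = strtol(s, &end, 10);
        if (errno != 0 || *end != '\0')
                return -EINVAL;

        *ret = v;
        return 0;
}

/* Hypothetical helper: returns the number of passed sockets, 0 if the
 * variables are unset or meant for another process, -EINVAL if malformed. */
int listen_fds_from_env(const char *listen_pid, const char *listen_fds, pid_t self) {
        long pid, n;

        if (!listen_pid || !listen_fds)
                return 0;       /* not socket activated */

        if (parse_decimal(listen_pid, &pid) < 0 || pid <= 0)
                return -EINVAL;
        if (pid != (long) self)
                return 0;       /* inherited variables, not meant for us */

        if (parse_decimal(listen_fds, &n) < 0 || n < 0 || n > INT_MAX - LISTEN_FDS_START)
                return -EINVAL;

        return (int) n;
}

/* Convenience wrapper using the real environment of the calling process. */
int passed_listen_fds(void) {
        return listen_fds_from_env(getenv("LISTEN_PID"), getenv("LISTEN_FDS"), getpid());
}

/* Descriptor number of the i-th passed socket (0-based). */
int listen_fd_at(int n_fds, int i) {
        assert(i >= 0 && i < n_fds);
        return LISTEN_FDS_START + i;
}
```

At startup the daemon would call `passed_listen_fds()`: if the result is
positive it can `accept()` on `listen_fd_at(n, 0)` right away, and if it is 0 it
falls back to creating and binding its socket as before, which keeps the
behavior on non-systemd systems unchanged.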