Merge pull request #57519 (systemd-confinement)

Currently, if you want to properly chroot a systemd service, you can
either use BindReadOnlyPaths=/nix/store or use a separate derivation
that gathers the runtime closure of the service you want to chroot. The
former is the easier method, and systemd directly offers a related
option called ProtectSystem, but both still leave the whole store
accessible. The latter is a bit more involved, because you need to
bind-mount each store path of the runtime closure of the service you
want to chroot.

This can be achieved using pkgs.closureInfo and a small derivation that
packs everything into a systemd unit, which later can be added to
systemd.packages.
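
For illustration, a minimal sketch of that manual approach could look
like the following. It assumes the placeholder package pkgs.myservice
and a unit called myservice.service, it does not handle store paths that
are symlinks, and RootDirectory and TemporaryFileSystem would still have
to be set on the service itself:

  { pkgs, ... }:
  let
    # Generates a unit fragment with one bind mount per store path.
    chrootUnit = pkgs.runCommand "myservice-chroot-paths" {
      # pkgs.myservice is a placeholder for the service's package.
      closureInfo = pkgs.closureInfo { rootPaths = [ pkgs.myservice ]; };
    } ''
      mkdir -p "$out/lib/systemd/system"
      unit="$out/lib/systemd/system/myservice.service"
      echo '[Service]' > "$unit"
      # closureInfo lists the whole runtime closure in store-paths.
      while read storePath; do
        echo "BindReadOnlyPaths=$storePath"
      done < "$closureInfo/store-paths" >> "$unit"
    '';
  in {
    systemd.packages = [ chrootUnit ];
  }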

However, this process is a bit tedious, so the changes here implement
this in a more generic way.

Now if you want to chroot a systemd service, all you need to do is:

  {
    systemd.services.myservice = {
      description = "My Shiny Service";
      wantedBy = [ "multi-user.target" ];

      confinement.enable = true;
      serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
    };
  }

If more than the dependencies of the ExecStart* and ExecStop* options
(which also cover script and {pre,post}Start) needs to be in the
chroot, the extra paths can be added using the confinement.packages
option. In the default full-apivfs confinement mode, a user namespace is
set up as well and /proc, /sys and /dev are mounted appropriately.
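
As a hedged example of these two options, the following sketch adds one
extra store path and switches to the chroot-only mode. The choice of
pkgs.cacert is purely illustrative and pkgs.myservice is a placeholder:

  {
    systemd.services.myservice = {
      confinement.enable = true;
      # Extra store paths beyond what the Exec* options reference.
      # pkgs.cacert is only an example of such a path.
      confinement.packages = [ pkgs.cacert ];
      # Only set up the chroot, without /proc, /sys, /dev and without
      # the user namespace.
      confinement.mode = "chroot-only";
      serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
    };
  }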

In addition, and by default, a /bin/sh executable is provided, which is
useful for most programs that use the system() C library call to
execute commands via a shell.
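
The shell can be changed or dropped via the confinement.binSh option. A
small sketch, assuming dash is wanted as /bin/sh inside the chroot (the
service name is a placeholder):

  {
    systemd.services.myservice.confinement = {
      enable = true;
      # Make dash available as /bin/sh inside the chroot ...
      binSh = "${pkgs.dash}/bin/dash";
      # ... or provide no /bin/sh at all:
      # binSh = null;
    };
  }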

Unfortunately, there are a few limitations at the moment. The first is
that DynamicUser doesn't work in conjunction with tmpfs, because systemd
seems to ignore the TemporaryFileSystem option if DynamicUser is
enabled. I started implementing a workaround for this, but I decided
to not include it as part of this pull request, because it needs a lot
more testing to ensure it's consistent with the behaviour without
DynamicUser.
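
Until then, a dedicated user has to be created via users.users instead
of relying on DynamicUser. A minimal sketch, where the user, group and
package names are placeholders:

  {
    users.groups.myservice = {};
    users.users.myservice = {
      description = "My Shiny Service user";
      group = "myservice";
    };

    systemd.services.myservice = {
      confinement.enable = true;
      serviceConfig = {
        # Statically allocated user instead of DynamicUser.
        User = "myservice";
        Group = "myservice";
        ExecStart = "${pkgs.myservice}/bin/myservice";
      };
    };
  }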

The second limitation/issue is that RootDirectoryStartOnly doesn't work
right now, because it only affects the RootDirectory option and doesn't
include/exclude the individual bind mounts or the tmpfs.

A quirk we do have right now is that systemd tries to create a /usr
directory within the chroot, which subsequently fails. Fortunately, this
is just an ugly error and not a hard failure.

The changes also come with a changelog entry for NixOS 19.03, which is
why I asked for a vote of the NixOS 19.03 stable maintainers whether to
include it (I admit it's a bit late a few days before official release,
sorry for that):

  @samueldr:

    Via pull request comment[1]:

      +1 for backporting as this only enhances the feature set of nixos,
      and does not (at a glance) change existing behaviours.

    Via IRC:

      new feature: -1, tests +1, we're at zero, self-contained, with no
      global effects without actively using it, +1, I think it's good

  @lheckemann:

    Via pull request comment[2]:

      I'm neutral on backporting. On the one hand, as @samueldr says,
      this doesn't change any existing functionality. On the other hand,
      it's a new feature and we're well past the feature freeze, which
      AFAIU is intended so that new, potentially buggy features aren't
      introduced in the "stabilisation period". It is a cool feature
      though? :)

A few other people on IRC weren't opposed to late inclusion in
NixOS 19.03 either:

  @edolstra:  "I'm not against it"
  @Infinisil: "+1 from me as well"
  @grahamc:   "IMO its up to the RMs"

So that makes +1 from @samueldr, 0 from @lheckemann, 0 from @edolstra
and +1 from @Infinisil (even though he's not a release manager) and no
opposition from anyone, which is the reason why I'm merging this right
now.

I also would like to thank @Infinisil, @edolstra and @danbst for their
reviews.

[1]: https://github.com/NixOS/nixpkgs/pull/57519#issuecomment-477322127
[2]: https://github.com/NixOS/nixpkgs/pull/57519#issuecomment-477548395
aszlig 2019-03-29 04:37:53 +01:00
commit dcf40f7c24
6 changed files with 384 additions and 5 deletions

@@ -68,6 +68,17 @@
<xref linkend="sec-kubernetes"/> for details.
</para>
</listitem>
<listitem>
<para>
There is now a set of <option>confinement</option> options for
<option>systemd.services</option>, which allows restricting services
to a <citerefentry>
<refentrytitle>chroot</refentrytitle>
<manvolnum>2</manvolnum>
</citerefentry>ed environment that only contains the store paths from
the runtime closure of the service.
</para>
</listitem>
</itemizedlist>
</section>

@@ -172,6 +172,7 @@
./security/rtkit.nix
./security/wrappers/default.nix
./security/sudo.nix
./security/systemd-confinement.nix
./services/admin/oxidized.nix
./services/admin/salt/master.nix
./services/admin/salt/minion.nix

@@ -0,0 +1,199 @@
{ config, pkgs, lib, ... }:
let
toplevelConfig = config;
inherit (lib) types;
inherit (import ../system/boot/systemd-lib.nix {
inherit config pkgs lib;
}) mkPathSafeName;
in {
options.systemd.services = lib.mkOption {
type = types.attrsOf (types.submodule ({ name, config, ... }: {
options.confinement.enable = lib.mkOption {
type = types.bool;
default = false;
description = ''
If set, all the required runtime store paths for this service are
bind-mounted into a <literal>tmpfs</literal>-based <citerefentry>
<refentrytitle>chroot</refentrytitle>
<manvolnum>2</manvolnum>
</citerefentry>.
'';
};
options.confinement.fullUnit = lib.mkOption {
type = types.bool;
default = false;
description = ''
Whether to include the full closure of the systemd unit file into the
chroot, instead of just the dependencies for the executables.
<warning><para>While it may be tempting to just enable this option to
make things work quickly, please be aware that this might add paths
to the closure of the chroot that you didn't anticipate. It's better
to use <option>confinement.packages</option> to <emphasis
role="strong">explicitly</emphasis> add additional store paths to the
chroot.</para></warning>
'';
};
options.confinement.packages = lib.mkOption {
type = types.listOf (types.either types.str types.package);
default = [];
description = let
mkScOption = optName: "<option>serviceConfig.${optName}</option>";
in ''
Additional packages or strings with context to add to the closure of
the chroot. By default, this includes all the packages from the
${lib.concatMapStringsSep ", " mkScOption [
"ExecReload" "ExecStartPost" "ExecStartPre" "ExecStop"
"ExecStopPost"
]} and ${mkScOption "ExecStart"} options. If you want to have all the
dependencies of this systemd unit, you can use
<option>confinement.fullUnit</option>.
<note><para>The store paths listed in <option>path</option> are
<emphasis role="strong">not</emphasis> included in the closure as
well as paths from other options except those listed
above.</para></note>
'';
};
options.confinement.binSh = lib.mkOption {
type = types.nullOr types.path;
default = toplevelConfig.environment.binsh;
defaultText = "config.environment.binsh";
example = lib.literalExample "\${pkgs.dash}/bin/dash";
description = ''
The program to make available as <filename>/bin/sh</filename> inside
the chroot. If this is set to <literal>null</literal>, no
<filename>/bin/sh</filename> is provided at all.
This is useful for some applications, which for example use the
<citerefentry>
<refentrytitle>system</refentrytitle>
<manvolnum>3</manvolnum>
</citerefentry> library function to execute commands.
'';
};
options.confinement.mode = lib.mkOption {
type = types.enum [ "full-apivfs" "chroot-only" ];
default = "full-apivfs";
description = ''
The value <literal>full-apivfs</literal> (the default) sets up
private <filename class="directory">/dev</filename>, <filename
class="directory">/proc</filename>, <filename
class="directory">/sys</filename> and <filename
class="directory">/tmp</filename> file systems in a separate user
name space.
If this is set to <literal>chroot-only</literal>, only the file
system name space is set up along with the call to <citerefentry>
<refentrytitle>chroot</refentrytitle>
<manvolnum>2</manvolnum>
</citerefentry>.
<note><para>This doesn't cover network namespaces and is solely for
file system level isolation.</para></note>
'';
};
config = let
rootName = "${mkPathSafeName name}-chroot";
inherit (config.confinement) binSh fullUnit;
wantsAPIVFS = lib.mkDefault (config.confinement.mode == "full-apivfs");
in lib.mkIf config.confinement.enable {
serviceConfig = {
RootDirectory = pkgs.runCommand rootName {} "mkdir \"$out\"";
TemporaryFileSystem = "/";
PrivateMounts = lib.mkDefault true;
# https://github.com/NixOS/nixpkgs/issues/14645 is a future attempt
# to change some of these to default to true.
#
# If we run in chroot-only mode, having something like PrivateDevices
# set to true by default will mount /dev within the chroot, whereas
# with "chroot-only" it's expected that there are no /dev, /proc and
# /sys file systems available.
#
# However, if this suddenly becomes true, the attack surface will
# increase, so let's explicitly set these options to true/false
# depending on the mode.
MountAPIVFS = wantsAPIVFS;
PrivateDevices = wantsAPIVFS;
PrivateTmp = wantsAPIVFS;
PrivateUsers = wantsAPIVFS;
ProtectControlGroups = wantsAPIVFS;
ProtectKernelModules = wantsAPIVFS;
ProtectKernelTunables = wantsAPIVFS;
};
confinement.packages = let
execOpts = [
"ExecReload" "ExecStart" "ExecStartPost" "ExecStartPre" "ExecStop"
"ExecStopPost"
];
execPkgs = lib.concatMap (opt: let
isSet = config.serviceConfig ? ${opt};
in lib.optional isSet config.serviceConfig.${opt}) execOpts;
unitAttrs = toplevelConfig.systemd.units."${name}.service";
allPkgs = lib.singleton (builtins.toJSON unitAttrs);
unitPkgs = if fullUnit then allPkgs else execPkgs;
in unitPkgs ++ lib.optional (binSh != null) binSh;
};
}));
};
config.assertions = lib.concatLists (lib.mapAttrsToList (name: cfg: let
whatOpt = optName: "The 'serviceConfig' option '${optName}' for"
+ " service '${name}' is enabled in conjunction with"
+ " 'confinement.enable'";
in lib.optionals cfg.confinement.enable [
{ assertion = !cfg.serviceConfig.RootDirectoryStartOnly or false;
message = "${whatOpt "RootDirectoryStartOnly"}, but right now systemd"
+ " doesn't support restricting bind-mounts to 'ExecStart'."
+ " Please either define a separate service or find a way to run"
+ " commands other than ExecStart within the chroot.";
}
{ assertion = !cfg.serviceConfig.DynamicUser or false;
message = "${whatOpt "DynamicUser"}. Please create a dedicated user via"
+ " the 'users.users' option instead as this combination is"
+ " currently not supported.";
}
]) config.systemd.services);
config.systemd.packages = lib.concatLists (lib.mapAttrsToList (name: cfg: let
rootPaths = let
contents = lib.concatStringsSep "\n" cfg.confinement.packages;
in pkgs.writeText "${mkPathSafeName name}-string-contexts.txt" contents;
chrootPaths = pkgs.runCommand "${mkPathSafeName name}-chroot-paths" {
closureInfo = pkgs.closureInfo { inherit rootPaths; };
serviceName = "${name}.service";
excludedPath = rootPaths;
} ''
mkdir -p "$out/lib/systemd/system"
serviceFile="$out/lib/systemd/system/$serviceName"
echo '[Service]' > "$serviceFile"
# /bin/sh is special here, because the option value could contain a
# symlink and we need to properly resolve it.
${lib.optionalString (cfg.confinement.binSh != null) ''
binsh=${lib.escapeShellArg cfg.confinement.binSh}
realprog="$(readlink -e "$binsh")"
echo "BindReadOnlyPaths=$realprog:/bin/sh" >> "$serviceFile"
''}
while read storePath; do
if [ -L "$storePath" ]; then
# Currently, systemd can't cope with symlinks in Bind(ReadOnly)Paths,
# so let's just bind-mount the target to that location.
echo "BindReadOnlyPaths=$(readlink -e "$storePath"):$storePath"
elif [ "$storePath" != "$excludedPath" ]; then
echo "BindReadOnlyPaths=$storePath"
fi
done < "$closureInfo/store-paths" >> "$serviceFile"
'';
in lib.optional cfg.confinement.enable chrootPaths) config.systemd.services);
}

@@ -9,12 +9,11 @@ in rec {
   shellEscape = s: (replaceChars [ "\\" ] [ "\\\\" ] s);
+  mkPathSafeName = lib.replaceChars ["@" ":" "\\" "[" "]"] ["-" "-" "-" "" ""];
   makeUnit = name: unit:
-    let
-      pathSafeName = lib.replaceChars ["@" ":" "\\" "[" "]"] ["-" "-" "-" "" ""] name;
-    in
     if unit.enable then
-      pkgs.runCommand "unit-${pathSafeName}"
+      pkgs.runCommand "unit-${mkPathSafeName name}"
         { preferLocalBuild = true;
           allowSubstitutes = false;
           inherit (unit) text;
@@ -24,7 +23,7 @@ in rec {
           echo -n "$text" > $out/${shellEscape name}
         ''
     else
-      pkgs.runCommand "unit-${pathSafeName}-disabled"
+      pkgs.runCommand "unit-${mkPathSafeName name}-disabled"
         { preferLocalBuild = true;
           allowSubstitutes = false;
         }

@@ -221,6 +221,7 @@ in
switchTest = handleTest ./switch-test.nix {};
syncthing-relay = handleTest ./syncthing-relay.nix {};
systemd = handleTest ./systemd.nix {};
systemd-confinement = handleTest ./systemd-confinement.nix {};
taskserver = handleTest ./taskserver.nix {};
telegraf = handleTest ./telegraf.nix {};
tomcat = handleTest ./tomcat.nix {};

@@ -0,0 +1,168 @@
import ./make-test.nix {
name = "systemd-confinement";
machine = { pkgs, lib, ... }: let
testServer = pkgs.writeScript "testserver.sh" ''
#!${pkgs.stdenv.shell}
export PATH=${lib.escapeShellArg "${pkgs.coreutils}/bin"}
${lib.escapeShellArg pkgs.stdenv.shell} 2>&1
echo "exit-status:$?"
'';
testClient = pkgs.writeScriptBin "chroot-exec" ''
#!${pkgs.stdenv.shell} -e
output="$(echo "$@" | nc -NU "/run/test$(< /teststep).sock")"
ret="$(echo "$output" | sed -nre '$s/^exit-status:([0-9]+)$/\1/p')"
echo "$output" | head -n -1
exit "''${ret:-1}"
'';
mkTestStep = num: { description, config ? {}, testScript }: {
systemd.sockets."test${toString num}" = {
description = "Socket for Test Service ${toString num}";
wantedBy = [ "sockets.target" ];
socketConfig.ListenStream = "/run/test${toString num}.sock";
socketConfig.Accept = true;
};
systemd.services."test${toString num}@" = {
description = "Confined Test Service ${toString num}";
confinement = (config.confinement or {}) // { enable = true; };
serviceConfig = (config.serviceConfig or {}) // {
ExecStart = testServer;
StandardInput = "socket";
};
} // removeAttrs config [ "confinement" "serviceConfig" ];
__testSteps = lib.mkOrder num ''
subtest '${lib.escape ["\\" "'"] description}', sub {
$machine->succeed('echo ${toString num} > /teststep');
${testScript}
};
'';
};
in {
imports = lib.imap1 mkTestStep [
{ description = "chroot-only confinement";
config.confinement.mode = "chroot-only";
testScript = ''
$machine->succeed(
'test "$(chroot-exec ls -1 / | paste -sd,)" = bin,nix',
'test "$(chroot-exec id -u)" = 0',
'chroot-exec chown 65534 /bin',
);
'';
}
{ description = "full confinement with APIVFS";
testScript = ''
$machine->fail(
'chroot-exec ls -l /etc',
'chroot-exec ls -l /run',
'chroot-exec chown 65534 /bin',
);
$machine->succeed(
'test "$(chroot-exec id -u)" = 0',
'chroot-exec chown 0 /bin',
);
'';
}
{ description = "check existence of bind-mounted /etc";
config.serviceConfig.BindReadOnlyPaths = [ "/etc" ];
testScript = ''
$machine->succeed('test -n "$(chroot-exec cat /etc/passwd)"');
'';
}
{ description = "check if User/Group really runs as non-root";
config.serviceConfig.User = "chroot-testuser";
config.serviceConfig.Group = "chroot-testgroup";
testScript = ''
$machine->succeed('chroot-exec ls -l /dev');
$machine->succeed('test "$(chroot-exec id -u)" != 0');
$machine->fail('chroot-exec touch /bin/test');
'';
}
(let
symlink = pkgs.runCommand "symlink" {
target = pkgs.writeText "symlink-target" "got me\n";
} "ln -s \"$target\" \"$out\"";
in {
description = "check if symlinks are properly bind-mounted";
config.confinement.packages = lib.singleton symlink;
testScript = ''
$machine->fail('chroot-exec test -e /etc');
$machine->succeed('chroot-exec cat ${symlink} >&2');
$machine->succeed('test "$(chroot-exec cat ${symlink})" = "got me"');
'';
})
{ description = "check if StateDirectory works";
config.serviceConfig.User = "chroot-testuser";
config.serviceConfig.Group = "chroot-testgroup";
config.serviceConfig.StateDirectory = "testme";
testScript = ''
$machine->succeed('chroot-exec touch /tmp/canary');
$machine->succeed('chroot-exec "echo works > /var/lib/testme/foo"');
$machine->succeed('test "$(< /var/lib/testme/foo)" = works');
$machine->succeed('test ! -e /tmp/canary');
'';
}
{ description = "check if /bin/sh works";
testScript = ''
$machine->succeed(
'chroot-exec test -e /bin/sh',
'test "$(chroot-exec \'/bin/sh -c "echo bar"\')" = bar',
);
'';
}
{ description = "check if suppressing /bin/sh works";
config.confinement.binSh = null;
testScript = ''
$machine->succeed(
'chroot-exec test ! -e /bin/sh',
'test "$(chroot-exec \'/bin/sh -c "echo foo"\')" != foo',
);
'';
}
{ description = "check if we can set /bin/sh to something different";
config.confinement.binSh = "${pkgs.hello}/bin/hello";
testScript = ''
$machine->succeed(
'chroot-exec test -e /bin/sh',
'test "$(chroot-exec /bin/sh -g foo)" = foo',
);
'';
}
{ description = "check if only Exec* dependencies are included";
config.environment.FOOBAR = pkgs.writeText "foobar" "eek\n";
testScript = ''
$machine->succeed('test "$(chroot-exec \'cat "$FOOBAR"\')" != eek');
'';
}
{ description = "check if all unit dependencies are included";
config.environment.FOOBAR = pkgs.writeText "foobar" "eek\n";
config.confinement.fullUnit = true;
testScript = ''
$machine->succeed('test "$(chroot-exec \'cat "$FOOBAR"\')" = eek');
'';
}
];
options.__testSteps = lib.mkOption {
type = lib.types.lines;
description = "All of the test steps combined as a single script.";
};
config.environment.systemPackages = lib.singleton testClient;
config.users.groups.chroot-testgroup = {};
config.users.users.chroot-testuser = {
description = "Chroot Test User";
group = "chroot-testgroup";
};
};
testScript = { nodes, ... }: ''
$machine->waitForUnit('multi-user.target');
${nodes.machine.config.__testSteps}
'';
}