Drop the next_char() and is_delimiter() macros. They are difficult to
understand, because they both have a state-variable (escaped).
Instead, the state of whether we handle an escape or not, shall only
depend on the current line of code.
The caller should make a conscious decision which delimiters to use.
Unfortunately, there is a variety of different demiters in use. This
should be unitfied and the callers should use one of a few specific
set of delimiters.
This could be unified by (re)using a define as delimiters, like
strv = nm_utils_strsplit_set_full (value, MULTILIST_WITH_ESCAPE_CHARS, NM_UTILS_STRSPLIT_SET_FLAGS_ALLOW_ESCAPING);
where MULTILIST_WITH_ESCAPE_CHARS has a particular meaning that should
be reused for similar uses.
However, leaving the delimiter at NULL is not good because it's unclear who
wants that default behavior (and what the default should be). Don't allow that.
There are almost no callers that relied on this default anyway.
If NM fails to connect to teamd, it currently just sets the device
state to FAILED and waits that deactivate() is called later. However,
the 5 seconds timeout on teamd process start can hit in the meantime,
which fails with an assertion "nm_device_is_activating (device)".
Clean up the device state when the connection to teamd fails.
https://bugzilla.redhat.com/show_bug.cgi?id=1697900
We call GetConnectionUnixProcessID and GetConnectionUnixUser *a lot*.
And we do so synchronously. Both is a problem.
To avoid the first problem, cache the last few requests with each cached
value being valid for one second.
On a quick test, this saves 98% of the requests:
59 GetConnectionUnixProcessID(*)
3201 GetConnectionUnixProcessID(*) (served from cache)
59 GetConnectionUnixUser(*)
3201 GetConnectionUnixUser(*) (served from cache)
Note that now as we serve requests from the cache, it might be the case
that the D-Bus endpoint already disconnected. Previously, the request would
have failed but now we return the cached user-id and process-id. This
problem is mitigated by only caching the values for up to one second.
Also, it's not really a problem because we cache sender names. Those
are supposed to be unique and not repeat. So, even if the peer already
disconnected, it is still true that the corresponding PID/UID was as
we have cached it. We don't use this API for checking whether the peer
is still connected, but what UID/PID it has/had. That answer is still
correct for the cached value after the peer disconnected.
The proxy does nothing for us, except overhead.
We can directly subscribe to "NameOwnerChanged" signals on the
GDBusConnection. Also, instead of asynchronously creating the
GDBusProxy, asynchronously call "GetNameOwner". That's what the
proxy does anyway.
GDBusConnection is actually a decent API. We don't need another layer on
top of that, for functionality that we don't use.
Also, don't use G_BUS_TYPE_SYSTEM, but use the GDBusConnection that
also the bus-manager uses. For all practical purposes, that is the
connection was want to use also in NMDnsSystemdResolved.
Every (failed) attempt to D-Bus activate a service results in log-messages
from dbus-daemon. It must be avoided to spam the logs that way.
Let connectivity check not only ask whether systemd-resolved is enabled
(and NetworkManager would like to push information there), but also
whether it looks like the service is actually available. That is,
either it has a name-owner or it's not blocked from starting.
The previous workaround was to configure main.systemd-resolved=no
in NetworkManager.conf. But that requires explict configuration.
Previously, we would create the D-Bus proxy without
%G_DBUS_PROXY_FLAGS_DO_NOT_AUTO_START_AT_CONSTRUCTION
flag.
That means, when systemd-resolved was not available or masked, the creation
of the D-Bus proxy would fail with
dns-sd-resolved[0x561905dc92d0]: failure to create D-Bus proxy for systemd-resolved: Error calling StartServiceByName for org.freedesktop.resolve1: GDBus.Error:org.freedesktop.systemd1.NoSuchUnit: Unit dbus-org.freedesktop.resolve1.service not found.
and never retried.
Now, when creating the D-Bus proxy don't autostart the instance.
Instead, each D-Bus call will try to poke and start the service.
There is a problem however: if systemd-resolved is not available, then
we must not constantly trying to start it, because it results in a slur
or syslog messages from dbus-daemon:
dbus-daemon[991]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.23' (uid=0 pid=1012 comm="/usr/bin/NetworkManager --no-daemon ")
dbus-daemon[991]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
dbus-daemon[991]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.23' (uid=0 pid=1012 comm="/usr/bin/NetworkManager --no-daemon ")
Avoid that by watching the name owner.
But, since systemd-resolved is D-Bus activated, watching the name owner
alone is not enough to know whether we should try to autostart the service.
Instead:
- if we have a name owner, assume the service runs and we send the update
- if we have no name owner, and we did not recently try to start
the service by name, poke it via "StartServiceByName". The idea
is, that in total we only try this once and remember a previous
attempt in priv->try_start_blocked.
- if we get a name-owner, priv->try_start_blocked gets reset.
Either it was us who started the service, or somebody else.
Either way, we are good to send updates again.
The nice thing is that we only try once to start resolved and only
generate one logging message from dbus-daemon about failure to do so.
But still, after blocking start on failure, when somebody else starts
resolved, we notice it and start using it again.
As we frequently send updates to systemd-resolved and for each update
send multiple messages, it can happen that we log a large number of
warnings if they all fail.
Rate limit the warnings to only warn once (until the failure is
recovered).
Currently, if systemd-resolved is not installed (or disabled) we already
fail once to create the D-Bus proxy (and never retry). That should be
fixed, to create the proxy with G_DBUS_PROXY_FLAGS_DO_NOT_AUTO_START_AT_CONSTRUCTION.
If we allow creating the proxy we would repeatedly try to send messages
and they would all fail. This is one example, where we need to ratelimit
the warning.
And no longer use "fedora:lastest". While "fedora:rawhide" names the very
latest branch (and we want to test that), for all proper releases we want
name them explicitly.
When we delete the runner.name property, the runner object itself gets
deleted if that was the only property, and @runner becomes invalid.
==13818== Invalid read of size 1
==13818== at 0x55EAF4: nm_streq (nm-macros-internal.h:869)
==13818== by 0x55EAF4: _json_team_normalize_defaults (nm-utils.c:5573)
==13818== by 0x566C89: _nm_utils_team_config_set (nm-utils.c:6057)
==13818== by 0x5498A6: _nm_utils_json_append_gvalue (nm-utils-private.h:228)
==13818== by 0x5498A6: set_property (nm-setting-team.c:1622)
==13818== Address 0x182a9330 is 0 bytes inside a block of size 13 free'd
==13818== at 0x4839A0C: free (vg_replace_malloc.c:530)
==13818== by 0x4857868: json_delete_string (value.c:763)
==13818== by 0x4857868: json_delete (value.c:975)
==13818== by 0x4851FA1: UnknownInlinedFun (jansson.h:129)
==13818== by 0x4851FA1: hashtable_do_del (hashtable.c:131)
==13818== by 0x4851FA1: hashtable_del (hashtable.c:289)
==13818== by 0x55DFDD: _json_del_object (nm-utils.c:5384)
==13818== by 0x55EA70: _json_delete_object_on_string_match (nm-utils.c:5532)
==13818== by 0x55EADB: _json_team_normalize_defaults (nm-utils.c:5549)
==13818== by 0x566C89: _nm_utils_team_config_set (nm-utils.c:6057)
==13818== by 0x5498A6: _nm_utils_json_append_gvalue (nm-utils-private.h:228)
==13818== by 0x5498A6: set_property (nm-setting-team.c:1622)
==13818== Block was alloc'd at
==13818== at 0x483880B: malloc (vg_replace_malloc.c:299)
==13818== by 0x4852E8C: lex_scan_string (load.c:389)
==13818== by 0x4852E8C: lex_scan (load.c:620)
==13818== by 0x4853458: parse_object (load.c:738)
==13818== by 0x4853458: parse_value (load.c:862)
==13818== by 0x4853466: parse_object (load.c:739)
==13818== by 0x4853466: parse_value (load.c:862)
==13818== by 0x4853655: parse_json.constprop.7 (load.c:899)
==13818== by 0x48537CF: json_loads (load.c:959)
==13818== by 0x566780: _nm_utils_team_config_set (nm-utils.c:5961)
==13818== by 0x5498A6: _nm_utils_json_append_gvalue (nm-utils-private.h:228)
==13818== by 0x5498A6: set_property (nm-setting-team.c:1622)
Fixes: a5642fd93a ('libnm-core: team: rework defaults management on runner properties')
Just in case anyone uses this file as-is; I believe that these drivers
don't need to be in a blacklist. There's a good chance they either work
or we can detect failure to change a MAC address on runtime and act
accordingly.
Blacklisting the driver gives the driver maintainers no chance to fix
the behavior and we are not able to log a warning on failure.
Open vSwitch is the special kid on the block -- it likes to be in charge of
the link lifetime and so we shouldn't be. This means that we shouldn't be
attempting to remove the link: we'd just (gracefully) fail anyways.
More importantly, this also means that we shouldn't care if we see the link
go away.
https://bugzilla.redhat.com/show_bug.cgi?id=1543557
If the ovsdb entry gets removed without the device being deactivated,
it's because its parent was removed and we should use the
DEPENDENCY_FAILED reason.
This is important because, with that reason, policy knows not to
autoconnect and bring the port that was being removed back.
Going directly to unmanaged just to prevent auto-connection turns out to
be the wrong thing to do. Perhaps we're reactivating the device, and
unmanaging it would interfere with the new activation.
This reverts commit 045b88a5b5.
In general shortcutting state is a no-no. But putting a device to FAILED
state because its master is going down is a crime. It's the wrong state:
the devices should enter it when their connections themselves failed
unexpectedly, and can potentially recover with another actiation.
Otherwise bad things happen,
In particular, the devices automatically enter DISCONNECTED state and
eventually retry autoconnecting. In this case they would attempt to
bring the master back up. Ugh.
This situation happens when a topomost master of multiple levels of
master-slave relationship is deactivated.
Aside from that, shortcutting to DISCONNECTED on unknown change reason
doesn't make sense either. Like, wtf, just traverse through DEACTIVATING
like all the other kids do.
Seems on a busy system, we can hit this timeout. Increase it.
ERROR:../src/platform/tests/test-common.c:939:_ip_address_add: code should not be reached
Enabling eBPF causes src/devices/tests/test-acd to fail:
strace: bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=4, value_size=1, max_entries=8, map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 112) = -1 EPERM (Operation not permitted)
NetworkManager-Message: 10:07:04.404: <warn> [1554631624.4046] acd[0xa2b400,10]: couldn't init ACD for announcing addresses on interface 'nm-test-veth0': Operation not permitted
Interestingly it does not always fail. Seems to depend on the kernel
which is used in the containerized test environments of gitlab-ci.
For now, just disable eBPF and use the fallback implementation.
Connection defaults should correspond in range to the per-profile values.
"infiniband.mtu" is required to be not larger than 65520, so we also
need to honor that when parsing the connection default.
Traditionally, the MTU in "datagram" transport mode was restricted to
2044. That is no longer the case, relax that.
In fact, choose a very large maximum and don't differenciate between
"connected" mode (they now both use now 65520). This is only the
limitation of the connection profile. Whether setting such large MTUs
actually works must be determined when activating the profile.
Initscripts "ifup-ib" from rdma-core package originally had a limit of 2044.
This was raised to 4092 in rh#1186498. It is suggested to raise it further
in bug rh#1647541.
In general, kernel often does not allow setting large MTUs. And even if it
allows it, it may not work because it also requires the entire network to
be configured accordingly. But that means, it is generally not helpful to
limit the MTU in the connection profile too strictly. Just allow large
MTUs, we need to see at activation time whether the configuration works.
Note also that all other setting types don't validate the range for MTU at
all.
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1186498
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1593334
(rdma-core: raise limit from 2044 to 4092 in ifup-ib)
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1647541
(rdma-core: raise limit beyond 4092 in ifup-ib)
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1532638#c4
(rdma-core: MTU related discussion)
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1534869
(NetworkManager bug about this topic, but with lots of unrelated
discussion. See in particular #c16)
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1653494
Previously, this would re-implement what nm_strstrip_avoid_copy()
was doing.
Use nm_strstrip_avoid_copy_a() instead, which avoids the code
duplication and the heap allocation in common cases.