When we're deactivating an externally created device that has a master
because we're activating a connection on it, actually remove the device
from the master. Otherwise unpleasant things happen:
active-connection[0x55ed7ba78400]: constructed (NMActRequest, version-id 4, type managed)
device[0a458361f9fed8f5] (dummy0): sys-iface-state: external -> managed
device[0a458361f9fed8f5] (dummy0): queue activation request waiting for currently active connection to disconnect
device (dummy0): disconnecting for new activation request.
device (dummy0): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
device (br0): master: release one slave 0a458361f9fed8f5/dummy0 (enslaved)(no-config)
Note the "no-config" above. We'set priv->master = NULL, but didn't
communicate the change to the platform. I believe this is not good.
This patch changes that.
device (br0): bridge port dummy0 was detached
device (dummy0): released from master device br0
active-connection[0x55ed7ba782e0]: set state deactivating (was activated)
device (dummy0): ip4: set state none (was done, reason: ip-state-clear)
device (dummy0): ip6: set state none (was done, reason: ip-state-clear)
device (dummy0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
platform: (dummy0) emit signal link-changed changed: 102: dummy0
<NOARP,UP,LOWER_UP;broadcast,noarp,up,running,lowerup> mtu 1500 master 101 arp 1 dummy* init
addrgenmode none addr EA:8D:DD:DF:1F:B7 brd FF:FF:FF:FF:FF:FF driver dummy rx:0,0 tx:39,4746
Now the platform sent us a new link, the "master" property is still set.
device[0a458361f9fed8f5] (dummy0): queued link change for ifindex 102
device[0a458361f9fed8f5] (dummy0): deactivating device (reason 'new-activation') [60]
device (dummy0): ip: set (combined) state none (was done, reason: ip-state-clear)
config: device-state: write #102 (/run/NetworkManager/devices/102); managed=managed, perm-hw-addr-fake=EA:8D:DD:DF:1F:B7, route-metric-default=0-0
active-connection[0x55ed7ba782e0]: set state deactivated (was deactivating)
active-connection[0x55ed7ba782e0]: check-master-ready: already signalled (state deactivated, master 0x55ed7ba781c0 is in state activated)
device (dummy0): Activation: starting connection 'dummy1' (ec6fca51-84e6-4a5b-a297-f602252c9f69)
device[0a458361f9fed8f5] (dummy0): activation-stage: schedule activate_stage1_device_prepare
l3cfg[ae290b5c1f585d6c,ifindex=102]: emit signal (platform-change-on-idle, obj-type-flags=0x2a)
device (br0): master: add one slave 0a458361f9fed8f5/dummy0
Amidst the new activation we're processing the netlink message we got.
We set priv->master back, effectively nullifying the release above. Sad.
device (dummy0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
device[0a458361f9fed8f5] (dummy0): add_pending_action (2): 'in-state-change'
active-connection[0x55ed7ba78400]: set state activating (was unknown)
manager: NetworkManager state is now CONNECTING
active-connection[0x55ed7ba78400]: check-master-ready: not signalling (state activating, no master)
device[8fff58d61c7686ce] (br0): slave dummy0 state change 30 (disconnected) -> 40 (prepare)
device[0a458361f9fed8f5] (dummy0): remove_pending_action (1): 'in-state-change'
device (br0): master: release one slave 0a458361f9fed8f5/dummy0 (not enslaved) (force-configure)
platform: (dummy0) link: releasing 102 from master 'br0' (101)
device (br0): detached bridge port dummy0
Now things go south. The stage1 cleans the device up, removing it from
the master and the device itself decides it should deactivate itself
because it lots its master regardless of the fact that it should not
have one and it's in fact an unwanted carryover from previous activation.
I believe this is also wrong.
device[0a458361f9fed8f5] (dummy0): Activation: connection 'dummy1' master deactivated
device (dummy0): ip4: set state none (was pending, reason: ip-state-clear)
device (dummy0): ip6: set state none (was pending, reason: ip-state-clear)
device[0a458361f9fed8f5] (dummy0): add_pending_action (2): 'queued-state-change-deactivating'
device[0a458361f9fed8f5] (dummy0): queue-state[deactivating, reason:connection-assumed, id:298]: queue state change
device[0a458361f9fed8f5] (dummy0): activation-stage: synchronously invoke activate_stage2_device_config
device (dummy0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Now things are really weird. We synchronously go to config, effectively
overriding the queued deactivation. We've really messed up.
Sometimes weird things happen.
Let dummy0 be an externally created device that has a master. We decide
to activate a connection that has no master on it:
active-connection[0x55ed7ba78400]: constructed (NMActRequest, version-id 4, type managed)
device[0a458361f9fed8f5] (dummy0): sys-iface-state: external -> managed
device[0a458361f9fed8f5] (dummy0): queue activation request waiting for currently active connection to disconnect
device (dummy0): disconnecting for new activation request.
device (dummy0): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
device (br0): master: release one slave 0a458361f9fed8f5/dummy0 (enslaved)(no-config)
Note the "no-config" above. We'set priv->master = NULL, but didn't
communicate the change to the platform. I believe this is not good.
device (br0): bridge port dummy0 was detached
device (dummy0): released from master device br0
active-connection[0x55ed7ba782e0]: set state deactivating (was activated)
device (dummy0): ip4: set state none (was done, reason: ip-state-clear)
device (dummy0): ip6: set state none (was done, reason: ip-state-clear)
device (dummy0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
platform: (dummy0) emit signal link-changed changed: 102: dummy0
<NOARP,UP,LOWER_UP;broadcast,noarp,up,running,lowerup> mtu 1500 master 101 arp 1 dummy* init
addrgenmode none addr EA:8D:DD:DF:1F:B7 brd FF:FF:FF:FF:FF:FF driver dummy rx:0,0 tx:39,4746
Now the platform sent us a new link, the "master" property is still set.
device[0a458361f9fed8f5] (dummy0): queued link change for ifindex 102
device[0a458361f9fed8f5] (dummy0): deactivating device (reason 'new-activation') [60]
device (dummy0): ip: set (combined) state none (was done, reason: ip-state-clear)
config: device-state: write #102 (/run/NetworkManager/devices/102); managed=managed, perm-hw-addr-fake=EA:8D:DD:DF:1F:B7, route-metric-default=0-0
active-connection[0x55ed7ba782e0]: set state deactivated (was deactivating)
active-connection[0x55ed7ba782e0]: check-master-ready: already signalled (state deactivated, master 0x55ed7ba781c0 is in state activated)
device (dummy0): Activation: starting connection 'dummy1' (ec6fca51-84e6-4a5b-a297-f602252c9f69)
device[0a458361f9fed8f5] (dummy0): activation-stage: schedule activate_stage1_device_prepare
l3cfg[ae290b5c1f585d6c,ifindex=102]: emit signal (platform-change-on-idle, obj-type-flags=0x2a)
device (br0): master: add one slave 0a458361f9fed8f5/dummy0
Amidst the new activation we're processing the netlink message we got.
We set priv->master back, effectively nullifying the release above.
device (dummy0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
device[0a458361f9fed8f5] (dummy0): add_pending_action (2): 'in-state-change'
active-connection[0x55ed7ba78400]: set state activating (was unknown)
manager: NetworkManager state is now CONNECTING
active-connection[0x55ed7ba78400]: check-master-ready: not signalling (state activating, no master)
device[8fff58d61c7686ce] (br0): slave dummy0 state change 30 (disconnected) -> 40 (prepare)
device[0a458361f9fed8f5] (dummy0): remove_pending_action (1): 'in-state-change'
device (br0): master: release one slave 0a458361f9fed8f5/dummy0 (not enslaved) (force-configure)
platform: (dummy0) link: releasing 102 from master 'br0' (101)
device (br0): detached bridge port dummy0
Now stage1 cleans the device up, removing it from the master.
device[0a458361f9fed8f5] (dummy0): Activation: connection 'dummy1' master deactivated
device (dummy0): ip4: set state none (was pending, reason: ip-state-clear)
device (dummy0): ip6: set state none (was pending, reason: ip-state-clear)
device[0a458361f9fed8f5] (dummy0): add_pending_action (2): 'queued-state-change-deactivating'
We decide to deal with this by enqueuing a deactivation. That is not
great -- we shouldn't even have had this master!
This patch takes the deactivation path only if we were willingly
enslaved to the master in question.
The @bond_mode_8023ad test has been seen failing, with a log like this:
<debug> [...3.0484] device[...] (eth1): Activation: connection 'bond0.0' master deactivated
<debug> [...3.0484] device[...] (eth1): add_pending_action (2): 'queued-state-change-deactivating'
<debug> [...3.0484] device[...] (eth1): queue-state[deactivating, reason:new-activation, id:709]: queue state change
What happened is that eth1 has been activating. It was already enslaved
to a bond and was in an ip-config state when the bond was removed.
A change to "deactivating" state has been enqueued. But then this
happened:
<trace> [...3.0942] device[...] (eth1): ip4: check-state: state done => done, is_failed=0, is_pending=0,
is_started=0 temp_na=0, may-fail-4=1, may-fail-6=1; disabled4; manualip4=done; ignore6 manualip6=done
<trace> [...3.0942] device[...] (eth1): ip: check-state: (combined) state pending => done
<debug> [...3.0943] device[...] (eth1): ip: set (combined) state done (was pending, reason: check-ip-state)
<info> [...3.0943] device (eth1): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
<debug> [...3.0943] device[...] (eth1): add_pending_action (3): 'in-state-change'
<debug> [...3.0943] device[...] (eth1): queue-state[deactivating, reason:new-activation, id:709]: clear queued state change
The IP config succeeded and the queued "deactivating" change was
overriden by the IP4 check result, prompting a change to "ip-check".
With the master still missing. Not good.
Let's terminate the appempts to check the IP state when we cancel the
activation, so that it doesn't override the enqueued state change.
Fixes-test: @bond_mode_8023ad
https://bugzilla.redhat.com/show_bug.cgi?id=2080928https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1245
pppd also tries to configure addresses by itself through some
ioctls. If we remove between those calls an address that was added,
pppd fails and quits.
To avoid this race condition, don't remove addresses while IPCP and
IPV6CP are running. Once pppd sends an IP configuration, it has
finished configuring the interface and we can proceed normally.
https://bugzilla.redhat.com/show_bug.cgi?id=2085382
Currently we call nm_device_update_dynamic_ip_setup() in
carrier_changed() every time the carrier goes up again and the device
is activating, to kick a restart of DHCP.
Since we process link events in a idle handler, it can happen that the
handler is called only once for different events; in particular
device_link_changed() might be called once for a link-down/link-up
sequence.
carrier_changed() is "level-triggered" - it cares only about the
current carrier state. nm_device_update_dynamic_ip_setup() should
instead be "edge-triggered" - invoked every time the link goes from
down to up. We have a mechanism for that in device_link_changed(), use
it.
Fixes-test: @ipv4_spurious_leftover_route
https://bugzilla.redhat.com/show_bug.cgi?id=2079406https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1250
ipv6 DNS received on ppp interface were being ignored because their
priority was not set.
Fix this by using default priority in impl_ppp_manager_set_ip6_config(),
as was done for ip4_config in b2e559fab2 ("core: initialize l3cd
dns-priority for ppp and wwan")
Fixes: 58287cbcc0 ('core: rework IP configuration in NetworkManager using layer 3 configuration')
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1022
Yes, we anyway log the timestamps for every log message. So one could
always calculate the offset. However, when you read a logfile, it can be
cumbersome to stop looking at where you currently are to find the
start/end of a call. For convenience, log the duration explicitly.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1251
- add code comments explaining some things.
- for NM_CMP_FIELD*() variants have a corresponding NM_CMP_DIRECT*()
macro and use it (aside the "memcmp" variants, which don't translate
directly).
l3cd instances must be removed from the old l3cfg before calling
_cleanup_ip_pre(). Otherwise, _cleanup_ip_pre() unregisters them from
the device, and later _dev_l3_register_l3cds(self, l3cfg_old, FALSE,
FALSE) does nothing because the device doesn't have any l3cd.
Previously the l3cds would linger in the l3cfg, keeping a reference to
it and causing a memory leak; the leak was not detected by valgrind
because the l3cfg was still referenced by the NMNetns.
Fixes: 58287cbcc0 ('core: rework IP configuration in NetworkManager using layer 3 configuration')
Fixes-test: @stable_mem_consumption2
https://bugzilla.redhat.com/show_bug.cgi?id=2083453https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1252
7db7dc4bab53 probe: merge branch 'th/decline-fixes'
bb61737788dd probe: fix internal state after declining lease
c5d0f38ab7a9 probe: maintain the probe's lease list in "n-dhcp4-c-probe.c"
48bf2788336e probe: return error when calling accept/decline/select in unexpected state
git-subtree-dir: src/n-dhcp4
git-subtree-split: 7db7dc4bab5312218135464d8550a86845ca6fdd
On python2 the following error is raised:
`LookupError: unknown encoding: unicode`
Seems like `unicode` is a correct encoding in Python 3 but not 2.
Fix:
1. Change encoding to `utf-8`
2. Pass output path string instead of opening file and passing
opened file object. Python2 and 3 might need different file
modes, passing just path lets ElementTree select appropriate
file mode.
Fixes: f00e90923c ('tools: Use ElementTree to write XML in generate-docs-nm-settings-docs-gir.py')
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1249
Currently, for all tests we have python3 installed. So effectively,
even on CentOS 7, we would test the build with python3 only.
The package build on CentOS7/epel7 however uses python2. This happens
for example for our copr builds.
Also test that configuration in gitlab-ci.
This was working for internal plugin in the past, but broken by l3cfg
rework with 1.36. Re-add it. Not it also works with dhclient. For other
plugins, it's not really working, because we can't decline.
Now NMDhcpClient does ACD (using NML3Cfg) and abstracts that from
the caller (NMDevice).
It is complicated. Because there is state involved, meaning, we need
to remember the current state for ACD and react on and handle a
multitude of events. Getting this right, is non-trivial.
What we want is that if ACD fails, we decline the lease (and don't use
it).
https://bugzilla.redhat.com/show_bug.cgi?id=1713380
dhclient itself doesn't do ACD. However, it expects the dhclient-script
to exit with non-zero status, which causes dhclient to send a DECLINE.
`man dhclient-script`:
BOUND:
Before actually configuring the address, dhclient-script should
somehow ARP for it and exit with a nonzero status if it receives a
reply. In this case, the client will send a DHCPDECLINE message to
the server and acquire a different address. This may also be done in
the RENEW, REBIND, or REBOOT states, but is not required, and indeed may
not be desirable.
See also Fedora's dhclient-script ([1]).
https://gitlab.isc.org/isc-projects/dhcp/-/issues/67#note_9722633226f2d76/client/dhclient.c (L1652)
[1] a8f6fd046f/f/dhclient-script (_878)https://bugzilla.redhat.com/show_bug.cgi?id=1713380
- assign the result of NM_DHCP_CLIENT_GET_CLASS() to a local variable.
It feels nicer to only call the macro once. Of course, the macro
expands to plain pointer dereferences, so there is little difference
in terms of executed code.
- handle the default case with no virtual function first.
It's pretty pointless to log
<trace> [1653389116.6288] dhcp4 (br0): client event 7
<debug> [1653389116.6288] dhcp4 (br0): received OFFER of 192.168.121.110 from 192.168.121.1
where the obscure event #7 is only telling you that we are going
to log something. Handle logging events first.
In general, drop the "client event %d" message and make sure that all
code paths log something (useful), so we can see in the log that the
event was reached.
When we accept/decline a lease, then that only works if we are in state
GRANTED. n-dhcp4 API also requires us, to provide the exact lease, that
we were announced earlier.
As such, we need to make sure that we don't accept/decline in the wrong
state. That means, to keep track of what we are doing more carefully.
The functions _dhcp_client_accept()/_dhcp_client_decline() now take
a l3cd argument, the one that we announced earlier. And we check that it
still matches.
They are no longer used from outside, NMDhcpClient fully handles this.
Make them static and internal.
Also, decline is currently unused. It will be used soon, with ACD
support.
Previously, during decline we would clear probe->current_lease,
however leave the state at GRANTED.
That is a wrong state, and can easily lead to a crash later.
For example, on the next timeout we will end up at
n_dhcp4_client_dispatch_timer(), then current-lease gets
accessed unconditionally:
case N_DHCP4_CLIENT_PROBE_STATE_GRANTED:
if (ns_now >= probe->current_lease->lifetime) {
Instead, return to INIT state and schedule a timer. As suggested
by RFC 2131, section 3.1, 5) ([1]).
[1] https://datatracker.ietf.org/doc/html/rfc2131#section-3.1
The lease list and the probe's state are strongly related. That is
evidenced by the fact that sometimes we check the state and then
access probe->current_lease without further checking.
The code in "n-dhcp4-c-probe.c" (select_lease, accept, decline) already
changes and maintains the state, it should also maintain the lease list.
Move the code.
The caller is supposed to call accept/decline/select with the lease that
was just announced. Calling it in the wrong state or with the wrong
lease is a user error.
Return an error when called in the wrong state, so that the user
notices they did something wrong.
The same check is also for nettools' n-dhcp4 client. It's useful to
being able to rely on certain things, like that an DHCPv4 lease always
has exactly one address (not equal to 0.0.0.0).
The l3_cfg_notify_cb() handler is used for different purposes, and
different events will be considered.
Usually a switch statement is very nice for enums, especially if all
enum values should be handled (because the compiler can warn about
unhandled cases). In this case, not all events are supposed to be
handled. At this point, it seems nicer to just use an if block. It
better composes.
The compiler should be able to optimize both variants to the same
result. In any case, checking some integers for equality is in any case
going to be efficient.
I think the previous was technically correct in any case too.
Still change it, because I feel with union and struct initialization,
we should always explicitly pick one union member that we fully
initialize.