platform: avoid routes resync for routes that we don't track
When we receive a Netlink message with a "route change" event, we normally just ignore it if it's for a route that we don't track (i.e. because of its route protocol). However, it's not that easy if the message has the NLM_F_REPLACE flag, because that means it might be replacing another route. If the kernel has several similar routes that are candidates for the replacement, it's hard for NM to guess which one of them is being replaced, as the kernel doesn't provide a "route ID" or similar field to indicate it. Moreover, the kernel might replace a route that we don't have in the cache, so we know nothing about it.

It is important to note that we cannot just discard Netlink messages for routes that we don't track if they have the NLM_F_REPLACE flag. For example, if we are tracking a route with proto=static, we might receive a replace message changing that route to proto=other_proto_that_we_dont_track. We need to process that message and remove the route from our cache.

As NM doesn't know which route is being replaced, trying to guess would lead to errors that leave the cache in an inconsistent state. Because of that, it just does a resync of the route cache.

For IPv4 there was already an optimization for this: if the cache has no candidate route for the replacement, there are only 2 possible outcomes: either add the new route to the cache, or discard it if we are not interested in it. No resync is needed for that. This commit extends that optimization to IPv6 routes. There is no reason why it shouldn't work the same way as it does for IPv4.

This optimization only works well as long as we look for candidate routes the same way the kernel does (comparing the same fields). NM calls this "comparing by WEAK_ID". A mismatch there could lead to wrong results, but the same risk already exists for IPv4 routes.

It is worth enabling this optimization because there are routing daemons using custom routing protocols that make tens or hundreds of updates per second.
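A minimal C sketch of the decision described above, assuming the route cache is a plain array. `Route`, `route_weak_id_equal()`, `handle_route_change()` and `example_is_tracked()` are hypothetical names for illustration, not NM's platform code. The weak-ID field set used here (table, destination, prefix length, metric) is an assumed simplification; NM's real WEAK_ID comparison is family-specific and not reproduced. Only `NLM_F_REPLACE` comes from the kernel's Netlink API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef NLM_F_REPLACE
#define NLM_F_REPLACE 0x100 /* same value as in <linux/netlink.h> */
#endif

#define EXAMPLE_RTPROT_STATIC 4 /* RTPROT_STATIC in <linux/rtnetlink.h> */

/* Hypothetical, simplified route record: only the fields used below. */
typedef struct {
    uint32_t table;
    uint8_t  dst[16]; /* IPv4 would use the first 4 bytes, rest zeroed */
    uint8_t  plen;
    uint32_t metric;
    uint8_t  protocol; /* rtm_protocol */
} Route;

typedef enum {
    ACTION_IGNORE, /* untracked route, nothing to do */
    ACTION_ADD,    /* tracked route: add (or update) it in the cache */
    ACTION_RESYNC, /* ambiguous replace: re-dump the routes from the kernel */
} RouteAction;

/* Hypothetical "weak ID" comparison: two routes the kernel could treat as
 * the same slot for a replace. Assumed field set, for illustration only. */
static bool route_weak_id_equal(const Route *a, const Route *b)
{
    return a->table == b->table
        && a->plen == b->plen
        && a->metric == b->metric
        && memcmp(a->dst, b->dst, sizeof a->dst) == 0;
}

/* Hypothetical tracking policy for the sketch: only proto=static. */
static bool example_is_tracked(const Route *r)
{
    return r->protocol == EXAMPLE_RTPROT_STATIC;
}

static RouteAction handle_route_change(const Route *cache, size_t n_cache,
                                       const Route *route, uint16_t nlmsg_flags,
                                       bool (*is_tracked)(const Route *))
{
    assert(route != NULL);

    /* Without NLM_F_REPLACE the message cannot affect any other route. */
    if (!(nlmsg_flags & NLM_F_REPLACE))
        return is_tracked(route) ? ACTION_ADD : ACTION_IGNORE;

    /* A cached candidate might be the replaced route: we can't tell which. */
    for (size_t i = 0; i < n_cache; i++) {
        if (route_weak_id_equal(&cache[i], route))
            return ACTION_RESYNC;
    }

    /* No candidate in the cache: only 2 outcomes remain, no resync needed. */
    return is_tracked(route) ? ACTION_ADD : ACTION_IGNORE;
}
```

With a cached proto=static route, a replace that matches it by weak ID yields `ACTION_RESYNC` even when the new protocol is untracked (the proto=static to other-protocol case above), while a replace with no cached candidate is simply added or ignored.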
When such daemons use NLM_F_REPLACE, NM ended up doing a resync hundreds of times per second, leading to 100% CPU usage: https://issues.redhat.com/browse/RHEL-26195

An additional, smaller optimization is done in this commit: if we receive a route message for a route that we don't track AND it doesn't have the NLM_F_REPLACE flag, we can ignore the entire message, thus avoiding the memory allocation of the nmp_object. That nmp_object would have been ignored later anyway, so it's better to avoid these allocations, which, with the routing daemon from the example above, can happen hundreds of times per second.

With these changes, the CPU usage while running `ip route replace` 300 times/s drops from 100% to 1%. Running `ip route replace` as fast as possible, without any rate limiting, still keeps NM at 3% CPU usage on the system I used for testing.
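The early discard of untracked, non-replace messages can be sketched in C as well. This is a minimal sketch under assumptions: `RouteMsgHeader`, `RouteObject`, `parse_route_msg()`, `protocol_is_tracked()` and the `n_allocations` counter are hypothetical stand-ins (NM's real object is the nmp_object mentioned above), and the tracking policy is reduced to proto=static. What it illustrates is the order of operations: the two header fields are checked before any allocation happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifndef NLM_F_REPLACE
#define NLM_F_REPLACE 0x100 /* same value as in <linux/netlink.h> */
#endif

/* Hypothetical view of the two header fields checked before full parsing:
 * nlmsghdr.nlmsg_flags and rtmsg.rtm_protocol. */
typedef struct {
    uint16_t nlmsg_flags;
    uint8_t  rtm_protocol;
} RouteMsgHeader;

/* Hypothetical stand-in for the cached object (nmp_object in NM). */
typedef struct {
    uint8_t protocol;
    /* addresses, metric, etc. would be parsed into here */
} RouteObject;

static size_t n_allocations; /* counts allocations, for illustration */

/* Hypothetical tracking policy: only RTPROT_STATIC (4). */
static bool protocol_is_tracked(uint8_t proto)
{
    return proto == 4;
}

/* Returns a newly allocated object, or NULL when the message is dropped
 * (in this sketch, NULL is also returned if malloc fails). */
static RouteObject *parse_route_msg(const RouteMsgHeader *hdr)
{
    assert(hdr != NULL);

    /* Untracked protocol and not a replace: the message cannot affect any
     * cached route, so drop it before allocating anything. */
    if (!protocol_is_tracked(hdr->rtm_protocol)
        && !(hdr->nlmsg_flags & NLM_F_REPLACE))
        return NULL;

    RouteObject *obj = malloc(sizeof *obj);
    if (obj == NULL)
        return NULL;
    n_allocations++;
    obj->protocol = hdr->rtm_protocol;
    return obj;
}
```

An untracked message that does carry NLM_F_REPLACE still gets parsed, since it may be replacing a tracked route and has to go through the resync decision.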
NEWS
@@ -18,6 +18,8 @@ USE AT YOUR OWN RISK. NOT RECOMMENDED FOR PRODUCTION USE!
 * Fix detection of 6 GHz band capability for WiFi devices
 * Allow IPv6 SLAAC and static IPv6 DNS server assignment for modem broadband
   when IPv6 device address was not explicitly passed on by ModemManager
+* Fix a performance issue that was leading to 100% CPU usage by NetworkManager
+  if external programs were doing a big amount of routes updates.
 
 =============================================
 NetworkManager-1.46