Message ID | 20210109172624.2028156-15-olteanv@gmail.com
---|---
State | New
Series | [v6,net-next,01/15] net: mark dev_base_lock for deprecation
On Sat, 2021-01-09 at 19:26 +0200, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
>
> There is an effort to convert .ndo_get_stats64 to sleepable context,
> and for that to work, we need to prevent callers of dev_get_stats
> from using atomic locking.
>
> The bonding driver retrieves its statistics recursively from its
> lower interfaces, with additional care to only count packets
> sent/received while those lowers were actually enslaved to the bond -
> see commit 5f0c5f73e5ef ("bonding: make global bonding stats more
> reliable").
>
> Since commit 87163ef9cda7 ("bonding: remove last users of bond->lock
> and bond->lock itself"), the bonding driver uses the following
> protection for its array of slaves: RCU for readers and rtnl_mutex
> for updaters.
>
> The aforementioned commit removed an interesting comment:
>
>         /* [...] we can't hold bond->lock [...] because we'll
>          * deadlock. The only solution is to rely on the fact
>          * that we're under rtnl_lock here, and the slaves
>          * list won't change. This doesn't solve the problem
>          * of setting the slave's MTU while it is
>          * transmitting, but the assumption is that the base
>          * driver can handle that.
>          *
>          * TODO: figure out a way to safely iterate the slaves
>          * list, but without holding a lock around the actual
>          * call to the base driver.
>          */
>
> The above summarizes pretty well the challenges we have with nested
> bonding interfaces (bond over bond over bond over...) and locking for
> their slaves.
>
> To solve the nesting problem, the simple way is to not hold any locks
> when recursing into the slave netdev operation. We can "cheat" and
> use dev_hold to take a reference on the slave net_device, which is
> enough to ensure that netdev_wait_allrefs() waits until we finish,
> and the kernel won't fault.
>
> However, the slave structure might no longer be valid, just its
> associated net_device. So we need to do some more work to ensure that
> the slave exists after we took the statistics, and if it still does,
> reapply the logic from Andy's commit 5f0c5f73e5ef.
>
> Tested using the following two scripts running in parallel:
>
>         #!/bin/bash
>
>         while :; do
>                 ip link del bond0
>                 ip link del bond1
>                 ip link add bond0 type bond mode 802.3ad
>                 ip link add bond1 type bond mode 802.3ad
>                 ip link set sw0p1 down && ip link set sw0p1 master bond0 && ip link set sw0p1 up
>                 ip link set sw0p2 down && ip link set sw0p2 master bond0 && ip link set sw0p2 up
>                 ip link set sw0p3 down && ip link set sw0p3 master bond0 && ip link set sw0p3 up
>                 ip link set bond0 down && ip link set bond0 master bond1 && ip link set bond0 up
>                 ip link set sw1p1 down && ip link set sw1p1 master bond1 && ip link set sw1p1 up
>                 ip link set bond1 up
>                 ip -s -s link show
>                 cat /sys/class/net/bond1/statistics/*
>         done
>
>         #!/bin/bash
>
>         while :; do
>                 echo spi2.0 > /sys/bus/spi/drivers/sja1105/unbind
>                 echo spi2.0 > /sys/bus/spi/drivers/sja1105/bind
>                 sleep 30
>         done
>
> where the sja1105 driver was explicitly modified for the purpose of
> this test to have a msleep(500) in its .ndo_get_stats64 method, to
> catch some more potential races.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
> [...]
>
> +/* Helpers for reference counting the struct net_device behind the bond slaves.
> + * These can be used to propagate the net_device_ops from the bond to the
> + * slaves while not holding rcu_read_lock() or the rtnl_mutex.
> + */
> +struct bonding_slave_dev {
> +        struct net_device *ndev;
> +        struct list_head list;
> +};
> +
> +static inline void bond_put_slaves(struct list_head *slaves)
> +{
> +        struct bonding_slave_dev *s, *tmp;
> +
> +        list_for_each_entry_safe(s, tmp, slaves, list) {
> +                dev_put(s->ndev);
> +                list_del(&s->list);
> +                kfree(s);
> +        }
> +}
> +
> +static inline int bond_get_slaves(struct bonding *bond,
> +                                  struct list_head *slaves,
> +                                  int *num_slaves)
> +{
> +        struct list_head *iter;
> +        struct slave *slave;
> +
> +        INIT_LIST_HEAD(slaves);
> +        *num_slaves = 0;
> +
> +        rcu_read_lock();
> +
> +        bond_for_each_slave_rcu(bond, slave, iter) {
> +                struct bonding_slave_dev *s;
> +
> +                s = kzalloc(sizeof(*s), GFP_ATOMIC);

GFP_ATOMIC is a little bit aggressive, especially when user daemons are
periodically reading stats. This can be avoided.

You can pre-allocate with GFP_KERNEL an array with an "approximate"
size, then fill the array up with whatever slaves the bond has at that
moment. num_slaves can be less than, equal to or more than the array
you just allocated, but we shouldn't care.

Something like:

        rcu_read_lock();
        nslaves = bond_get_num_slaves();
        rcu_read_unlock();

        sarray = kcalloc(nslaves, sizeof(struct bonding_slave_dev),
                         GFP_KERNEL);

        rcu_read_lock();
        bond_fill_slaves_array(bond, sarray); // also do: dev_hold()
        rcu_read_unlock();

        bond_get_slaves_array_stats(sarray);

        bond_put_slaves_array(sarray);

> +                if (!s) {
> +                        rcu_read_unlock();
> +                        bond_put_slaves(slaves);
> +                        return -ENOMEM;
> +                }
> +
> +                s->ndev = slave->dev;
> +                dev_hold(s->ndev);
> +                list_add_tail(&s->list, slaves);
> +                (*num_slaves)++;
> +        }
> +
> +        rcu_read_unlock();
> +
> +        return 0;
> +}
> +
>  #define BOND_PRI_RESELECT_ALWAYS        0
>  #define BOND_PRI_RESELECT_BETTER        1
>  #define BOND_PRI_RESELECT_FAILURE       2
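For concreteness, Saeed's pseudocode could be fleshed out roughly as
below. This is only a sketch of the array-based idea, not a tested
implementation: bond_fill_slaves_array() and bond_put_slaves_array()
are the hypothetical helpers his sketch names, not functions that exist
in the kernel, and this array variant of struct bonding_slave_dev drops
the list_head that the list-based patch needs.

        /* Hypothetical array-based slave snapshot. The caller sizes and
         * allocates the array with GFP_KERNEL; here we only fill it, in
         * a single RCU read-side critical section, taking a reference
         * on each slave's net_device with dev_hold().
         */
        struct bonding_slave_dev {
                struct net_device *ndev;
        };

        /* Returns the number of entries actually filled, which may be
         * smaller than @max if slaves were removed since the array was
         * sized.
         */
        static int bond_fill_slaves_array(struct bonding *bond,
                                          struct bonding_slave_dev *sarray,
                                          int max)
        {
                struct list_head *iter;
                struct slave *slave;
                int n = 0;

                rcu_read_lock();
                bond_for_each_slave_rcu(bond, slave, iter) {
                        if (n >= max)
                                break;  /* more slaves than we sized for */
                        sarray[n].ndev = slave->dev;
                        dev_hold(sarray[n].ndev);
                        n++;
                }
                rcu_read_unlock();

                return n;
        }

        static void bond_put_slaves_array(struct bonding_slave_dev *sarray,
                                          int n)
        {
                int i;

                for (i = 0; i < n; i++)
                        dev_put(sarray[i].ndev);
                kfree(sarray);
        }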
On Mon, Jan 11, 2021 at 03:38:49PM -0800, Saeed Mahameed wrote:
> GFP_ATOMIC is a little bit aggressive, especially when user daemons
> are periodically reading stats. This can be avoided.
>
> You can pre-allocate with GFP_KERNEL an array with an "approximate"
> size, then fill the array up with whatever slaves the bond has at
> that moment. num_slaves can be less than, equal to or more than the
> array you just allocated, but we shouldn't care.
>
> Something like:
>
>         rcu_read_lock();
>         nslaves = bond_get_num_slaves();
>         rcu_read_unlock();
>
>         sarray = kcalloc(nslaves, sizeof(struct bonding_slave_dev),
>                          GFP_KERNEL);
>
>         rcu_read_lock();
>         bond_fill_slaves_array(bond, sarray); // also do: dev_hold()
>         rcu_read_unlock();
>
>         bond_get_slaves_array_stats(sarray);
>
>         bond_put_slaves_array(sarray);

I don't know what to say about acquiring the RCU read lock twice and
traversing the list of interfaces three or four times.

On the other hand, what's the worst that can happen if the GFP_ATOMIC
memory allocation fails? It's not like there is any data loss. User
space will retry when there is less memory pressure.
On Tue, 2021-01-12 at 16:37 +0200, Vladimir Oltean wrote:
> On Mon, Jan 11, 2021 at 03:38:49PM -0800, Saeed Mahameed wrote:
> > GFP_ATOMIC is a little bit aggressive, especially when user daemons
> > are periodically reading stats. This can be avoided.
> >
> > You can pre-allocate with GFP_KERNEL an array with an "approximate"
> > size, then fill the array up with whatever slaves the bond has at
> > that moment. num_slaves can be less than, equal to or more than the
> > array you just allocated, but we shouldn't care.
> >
> > Something like:
> >
> >         rcu_read_lock();
> >         nslaves = bond_get_num_slaves();
> >         rcu_read_unlock();
> >
> >         sarray = kcalloc(nslaves, sizeof(struct bonding_slave_dev),
> >                          GFP_KERNEL);
> >
> >         rcu_read_lock();
> >         bond_fill_slaves_array(bond, sarray); // also do: dev_hold()
> >         rcu_read_unlock();
> >
> >         bond_get_slaves_array_stats(sarray);
> >
> >         bond_put_slaves_array(sarray);
>
> I don't know what to say about acquiring the RCU read lock twice and
> traversing the list of interfaces three or four times.

You can optimize this by tracking the number of slaves.

> On the other hand, what's the worst that can happen if the GFP_ATOMIC
> memory allocation fails? It's not like there is any data loss. User
> space will retry when there is less memory pressure.

Anyway, up to you. I just don't like it when we use GFP_ATOMIC when it
can be avoided, especially for periodic jobs like stats polling.
Saeed Mahameed <saeed@kernel.org> wrote:
>On Tue, 2021-01-12 at 16:37 +0200, Vladimir Oltean wrote:
>> On Mon, Jan 11, 2021 at 03:38:49PM -0800, Saeed Mahameed wrote:
>> > GFP_ATOMIC is a little bit aggressive, especially when user daemons
>> > are periodically reading stats. This can be avoided.
>> >
>> > You can pre-allocate with GFP_KERNEL an array with an "approximate"
>> > size, then fill the array up with whatever slaves the bond has at
>> > that moment. num_slaves can be less than, equal to or more than the
>> > array you just allocated, but we shouldn't care.
>> >
>> > Something like:
>> >
>> >         rcu_read_lock();
>> >         nslaves = bond_get_num_slaves();
>> >         rcu_read_unlock();

Can be nslaves = READ_ONCE(bond->slave_cnt), or, for just the active
slaves:

        struct bond_up_slave *slaves;

        slaves = rcu_dereference(bond->slave_arr);
        nslaves = slaves ? READ_ONCE(slaves->count) : 0;

>> >         sarray = kcalloc(nslaves, sizeof(struct bonding_slave_dev),
>> >                          GFP_KERNEL);
>> >
>> >         rcu_read_lock();
>> >         bond_fill_slaves_array(bond, sarray); // also do: dev_hold()
>> >         rcu_read_unlock();
>> >
>> >         bond_get_slaves_array_stats(sarray);
>> >
>> >         bond_put_slaves_array(sarray);
>>
>> I don't know what to say about acquiring the RCU read lock twice and
>> traversing the list of interfaces three or four times.
>
>You can optimize this by tracking the number of slaves.

I think that the set of active slaves changing between the two calls
will be a rare exception, and that the number of slaves is generally
small (more than two is uncommon in my experience).

>> On the other hand, what's the worst that can happen if the GFP_ATOMIC
>> memory allocation fails? It's not like there is any data loss. User
>> space will retry when there is less memory pressure.
>
>Anyway, up to you. I just don't like it when we use GFP_ATOMIC when it
>can be avoided, especially for periodic jobs like stats polling.

And, for the common case, I suspect that an array allocation will have
lower overhead than a loop that allocates once per slave.

        -J

---
        -Jay Vosburgh, jay.vosburgh@canonical.com
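Combining Jay's sizing trick with the array-based flow, the caller
could look roughly like the fragment below. It is a sketch under the
same assumptions as the earlier one (the hypothetical
bond_fill_slaves_array()/bond_put_slaves_array() helpers); only
bond->slave_cnt is a real member of struct bonding.

        struct bonding_slave_dev *sarray;
        int nslaves;

        /* A stale count is harmless: slaves added after this read are
         * simply not counted on this pass, and the fill stops at the
         * array bound anyway.
         */
        nslaves = READ_ONCE(bond->slave_cnt);

        sarray = kcalloc(nslaves, sizeof(*sarray), GFP_KERNEL);
        if (!sarray)
                return -ENOMEM;

        nslaves = bond_fill_slaves_array(bond, sarray, nslaves);

        /* ... call dev_get_stats() on each sarray[i].ndev with no
         * locks held (sleeping is allowed here), then fold the
         * results ...
         */

        bond_put_slaves_array(sarray, nslaves);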
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b70ca0e41142..84d4e97007cb 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3705,80 +3705,75 @@ static void bond_fold_stats(struct rtnl_link_stats64 *_res,
         }
 }
 
-#ifdef CONFIG_LOCKDEP
-static int bond_get_lowest_level_rcu(struct net_device *dev)
-{
-        struct net_device *ldev, *next, *now, *dev_stack[MAX_NEST_DEV + 1];
-        struct list_head *niter, *iter, *iter_stack[MAX_NEST_DEV + 1];
-        int cur = 0, max = 0;
-
-        now = dev;
-        iter = &dev->adj_list.lower;
-
-        while (1) {
-                next = NULL;
-                while (1) {
-                        ldev = netdev_next_lower_dev_rcu(now, &iter);
-                        if (!ldev)
-                                break;
-
-                        next = ldev;
-                        niter = &ldev->adj_list.lower;
-                        dev_stack[cur] = now;
-                        iter_stack[cur++] = iter;
-                        if (max <= cur)
-                                max = cur;
-                        break;
-                }
-
-                if (!next) {
-                        if (!cur)
-                                return max;
-                        next = dev_stack[--cur];
-                        niter = iter_stack[cur];
-                }
-
-                now = next;
-                iter = niter;
-        }
-
-        return max;
-}
-#endif
-
 static int bond_get_stats(struct net_device *bond_dev,
                           struct rtnl_link_stats64 *stats)
 {
         struct bonding *bond = netdev_priv(bond_dev);
-        struct rtnl_link_stats64 temp;
-        struct list_head *iter;
-        struct slave *slave;
-        int nest_level = 0;
-        int res = 0;
+        struct rtnl_link_stats64 *dev_stats;
+        struct bonding_slave_dev *s;
+        struct list_head slaves;
+        int res, num_slaves;
+        int i = 0;
 
-        rcu_read_lock();
-#ifdef CONFIG_LOCKDEP
-        nest_level = bond_get_lowest_level_rcu(bond_dev);
-#endif
+        res = bond_get_slaves(bond, &slaves, &num_slaves);
+        if (res)
+                return res;
 
-        spin_lock_nested(&bond->stats_lock, nest_level);
-        memcpy(stats, &bond->bond_stats, sizeof(*stats));
+        dev_stats = kcalloc(num_slaves, sizeof(*dev_stats), GFP_KERNEL);
+        if (!dev_stats) {
+                bond_put_slaves(&slaves);
+                return -ENOMEM;
+        }
 
-        bond_for_each_slave_rcu(bond, slave, iter) {
-                res = dev_get_stats(slave->dev, &temp);
+        /* Recurse with no locks taken */
+        list_for_each_entry(s, &slaves, list) {
+                res = dev_get_stats(s->ndev, &dev_stats[i]);
                 if (res)
                         goto out;
 
+                i++;
+        }
+
+        spin_lock(&bond->stats_lock);
+
+        memcpy(stats, &bond->bond_stats, sizeof(*stats));
+
+        /* Because we released the RCU lock in bond_get_slaves, the new slave
+         * array might be different from the original one, so we need to take
+         * it again and only update the stats of the slaves that still exist.
+         */
+        rcu_read_lock();
+
+        i = 0;
+
+        list_for_each_entry(s, &slaves, list) {
+                struct list_head *iter;
+                struct slave *slave;
 
-                bond_fold_stats(stats, &temp, &slave->slave_stats);
+                bond_for_each_slave_rcu(bond, slave, iter) {
+                        if (slave->dev != s->ndev)
+                                continue;
 
-                /* save off the slave stats for the next run */
-                memcpy(&slave->slave_stats, &temp, sizeof(temp));
+                        bond_fold_stats(stats, &dev_stats[i],
+                                        &slave->slave_stats);
+
+                        /* save off the slave stats for the next run */
+                        memcpy(&slave->slave_stats, &dev_stats[i],
+                               sizeof(dev_stats[i]));
+                        break;
+                }
+
+                i++;
         }
 
+        rcu_read_unlock();
+
         memcpy(&bond->bond_stats, stats, sizeof(*stats));
-out:
 
-        rcu_read_unlock();
+        spin_unlock(&bond->stats_lock);
+
+out:
+        kfree(dev_stats);
+        bond_put_slaves(&slaves);
 
         return res;
 }
diff --git a/include/net/bonding.h b/include/net/bonding.h
index adc3da776970..2df1f7ea9036 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -449,6 +449,59 @@ static inline void bond_hw_addr_copy(u8 *dst, const u8 *src, unsigned int len)
         memcpy(dst, src, len);
 }
 
+/* Helpers for reference counting the struct net_device behind the bond slaves.
+ * These can be used to propagate the net_device_ops from the bond to the
+ * slaves while not holding rcu_read_lock() or the rtnl_mutex.
+ */
+struct bonding_slave_dev {
+        struct net_device *ndev;
+        struct list_head list;
+};
+
+static inline void bond_put_slaves(struct list_head *slaves)
+{
+        struct bonding_slave_dev *s, *tmp;
+
+        list_for_each_entry_safe(s, tmp, slaves, list) {
+                dev_put(s->ndev);
+                list_del(&s->list);
+                kfree(s);
+        }
+}
+
+static inline int bond_get_slaves(struct bonding *bond,
+                                  struct list_head *slaves,
+                                  int *num_slaves)
+{
+        struct list_head *iter;
+        struct slave *slave;
+
+        INIT_LIST_HEAD(slaves);
+        *num_slaves = 0;
+
+        rcu_read_lock();
+
+        bond_for_each_slave_rcu(bond, slave, iter) {
+                struct bonding_slave_dev *s;
+
+                s = kzalloc(sizeof(*s), GFP_ATOMIC);
+                if (!s) {
+                        rcu_read_unlock();
+                        bond_put_slaves(slaves);
+                        return -ENOMEM;
+                }
+
+                s->ndev = slave->dev;
+                dev_hold(s->ndev);
+                list_add_tail(&s->list, slaves);
+                (*num_slaves)++;
+        }
+
+        rcu_read_unlock();
+
+        return 0;
+}
+
 #define BOND_PRI_RESELECT_ALWAYS        0
 #define BOND_PRI_RESELECT_BETTER        1
 #define BOND_PRI_RESELECT_FAILURE       2
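As for the test instrumentation mentioned in the commit message, it
amounts to something like the fragment below. Only the msleep(500)
inside .ndo_get_stats64 is taken from the patch description; the
function name and the int-returning signature (which this series
appears to introduce for .ndo_get_stats64) are illustrative.

        static int sja1105_get_stats64(struct net_device *ndev,
                                       struct rtnl_link_stats64 *s)
        {
                /* Artificially widen the race window between the bond
                 * snapshotting its slaves and each slave reporting its
                 * counters. Sleeping here is only legal because this
                 * series converts .ndo_get_stats64 to sleepable context.
                 */
                msleep(500);

                /* ... normal statistics collection for the port ... */

                return 0;
        }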