Message ID: 20210609103326.278782-1-toke@redhat.com
Series: Clean up and document RCU-based object protection for XDP_REDIRECT
On Wed, Jun 09, 2021 at 12:33:14PM +0200, Toke Høiland-Jørgensen wrote:
> The ena driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP
> program invocations. However, the actual lifetime of the objects referred
> by the XDP program invocation is longer, all the way through to the call to
> xdp_do_flush(), making the scope of the rcu_read_lock() too small. This
> turns out to be harmless because it all happens in a single NAPI poll
> cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock()
> misleading.
>
> Rather than extend the scope of the rcu_read_lock(), just get rid of it
> entirely. With the addition of RCU annotations to the XDP_REDIRECT map
> types that take bh execution into account, lockdep even understands this to
> be safe, so there's really no reason to keep it around.

It might be worth adding a comment, perhaps where the rcu_read_lock() used
to be, stating what the protection is. Maybe something like this?

	/*
	 * This code is invoked within a single NAPI poll cycle
	 * and thus under local_bh_disable(), which provides the
	 * needed RCU protection.
	 */

							Thanx, Paul

> Cc: Guy Tzalik <gtzalik@amazon.com>
> Cc: Saeed Bishara <saeedb@amazon.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 881f88754bf6..a4378b14af4c 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -385,7 +385,6 @@ static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
>  	u64 *xdp_stat;
>  	int qid;
>
> -	rcu_read_lock();
>  	xdp_prog = READ_ONCE(rx_ring->xdp_bpf_prog);
>
>  	if (!xdp_prog)
> @@ -443,8 +442,6 @@ static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
>
>  	ena_increase_stat(xdp_stat, 1, &rx_ring->syncp);
>  out:
> -	rcu_read_unlock();
> -
>  	return verdict;
>  }
>
> --
> 2.31.1
On 09/06/2021 13:33, Toke Høiland-Jørgensen wrote:
> The cpsw driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP
> program invocations. However, the actual lifetime of the objects referred
> by the XDP program invocation is longer, all the way through to the call to
> xdp_do_flush(), making the scope of the rcu_read_lock() too small. This
> turns out to be harmless because it all happens in a single NAPI poll
> cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock()
> misleading.
>
> Rather than extend the scope of the rcu_read_lock(), just get rid of it
> entirely. With the addition of RCU annotations to the XDP_REDIRECT map
> types that take bh execution into account, lockdep even understands this to
> be safe, so there's really no reason to keep it around.
>
> Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> Cc: linux-omap@vger.kernel.org
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  drivers/net/ethernet/ti/cpsw_priv.c | 10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)

Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
On 6/9/21 3:33 AM, Toke Høiland-Jørgensen wrote:
> During the discussion[0] of Hangbin's multicast patch series, Martin pointed out
> that the lifetime of the RCU-protected map entries used by XDP_REDIRECT is by
> no means obvious. I promised to look into cleaning this up, and Paul helpfully
> provided some hints and a new unrcu_pointer() helper to aid in this.
>
> This is mostly a documentation exercise, clearing up the description of the
> lifetime expectations and adding __rcu annotations so sparse and lockdep can
> help verify it.
>
> Patches 1-2 are prepatory: Patch 1 adds Paul's unrcu_pointer() helper (which has
> already been added to his tree) and patch 2 is a small fix for
> dev_get_by_index_rcu() so lockdep understands _bh-disabled access to it. Patch 3
> is the main bit that adds the __rcu annotations and updates documentation
> comments, and the rest are patches updating the drivers, with one patch per
> distinct maintainer.
>
> Unfortunately I don't have any hardware to test any of the driver patches;
> Jesper helpfully verified that it doesn't break anything on i40e, but the rest
> of the driver patches are only compile-tested.
>
> [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
>
> Paul E. McKenney (1):
>   rcu: Create an unrcu_pointer() to remove __rcu from a pointer
>
> Toke Høiland-Jørgensen (16):
>   bpf: allow RCU-protected lookups to happen from bh context
>   dev: add rcu_read_lock_bh_held() as a valid check when getting a RCU
>     dev ref
>   xdp: add proper __rcu annotations to redirect map entries
>   ena: remove rcu_read_lock() around XDP program invocation
>   bnxt: remove rcu_read_lock() around XDP program invocation
>   thunderx: remove rcu_read_lock() around XDP program invocation
>   freescale: remove rcu_read_lock() around XDP program invocation
>   net: intel: remove rcu_read_lock() around XDP program invocation
>   marvell: remove rcu_read_lock() around XDP program invocation
>   mlx4: remove rcu_read_lock() around XDP program invocation
>   nfp: remove rcu_read_lock() around XDP program invocation
>   qede: remove rcu_read_lock() around XDP program invocation
>   sfc: remove rcu_read_lock() around XDP program invocation
>   netsec: remove rcu_read_lock() around XDP program invocation
>   stmmac: remove rcu_read_lock() around XDP program invocation
>   net: ti: remove rcu_read_lock() around XDP program invocation
>
>  drivers/net/ethernet/amazon/ena/ena_netdev.c  |  3 --
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  2 -
>  .../net/ethernet/cavium/thunder/nicvf_main.c  |  2 -
>  .../net/ethernet/freescale/dpaa/dpaa_eth.c    |  8 +--
>  .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  3 --
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 -
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +--
>  drivers/net/ethernet/intel/ice/ice_txrx.c     |  6 +--
>  drivers/net/ethernet/intel/ice/ice_xsk.c      |  6 +--
>  drivers/net/ethernet/intel/igb/igb_main.c     |  2 -
>  drivers/net/ethernet/intel/igc/igc_main.c     |  7 +--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 -
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  6 +--
>  .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 -
>  drivers/net/ethernet/marvell/mvneta.c         |  2 -
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  4 --
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  8 +--
>  .../ethernet/netronome/nfp/nfp_net_common.c   |  2 -
>  drivers/net/ethernet/qlogic/qede/qede_fp.c    |  6 ---
>  drivers/net/ethernet/sfc/rx.c                 |  9 +---
>  drivers/net/ethernet/socionext/netsec.c       |  3 --
>  .../net/ethernet/stmicro/stmmac/stmmac_main.c | 10 +---
>  drivers/net/ethernet/ti/cpsw_priv.c           | 10 +---
>  include/linux/rcupdate.h                      | 14 +++++
>  include/net/xdp_sock.h                        |  2 +-
>  kernel/bpf/cpumap.c                           | 14 +++--
>  kernel/bpf/devmap.c                           | 52 ++++++++-----------
>  kernel/bpf/hashtab.c                          | 21 +++++---
>  kernel/bpf/helpers.c                          |  6 +--
>  kernel/bpf/lpm_trie.c                         |  6 ++-
>  net/core/dev.c                                |  2 +-
>  net/core/filter.c                             | 28 ++++++++++
>  net/xdp/xsk.c                                 |  4 +-
>  net/xdp/xsk.h                                 |  4 +-
>  net/xdp/xskmap.c                              | 29 ++++++-----
>  35 files changed, 134 insertions(+), 159 deletions(-)

Martin, could you help review this patch set? You had participated in the
early discussions related to this series. Thanks!
On Wed, 9 Jun 2021 at 13:33, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> The netsec driver has a rcu_read_lock()/rcu_read_unlock() pair around the
> full RX loop, covering everything up to and including xdp_do_flush(). This
> is actually the correct behaviour, but because it all happens in a single
> NAPI poll cycle (and thus under local_bh_disable()), it is also technically
> redundant.
>
> With the addition of RCU annotations to the XDP_REDIRECT map types that
> take bh execution into account, lockdep even understands this to be safe,
> so there's really no reason to keep the rcu_read_lock() around anymore, so
> let's just remove it.
>
> Cc: Jassi Brar <jaswinder.singh@linaro.org>
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  drivers/net/ethernet/socionext/netsec.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
> index dfc85cc68173..20d148c019d8 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -958,7 +958,6 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
>
>  	xdp_init_buff(&xdp, PAGE_SIZE, &dring->xdp_rxq);
>
> -	rcu_read_lock();
>  	xdp_prog = READ_ONCE(priv->xdp_prog);
>  	dma_dir = page_pool_get_dma_dir(dring->page_pool);
>
> @@ -1069,8 +1068,6 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
>  	}
>  	netsec_finalize_xdp_rx(priv, xdp_act, xdp_xmit);
>
> -	rcu_read_unlock();
> -
>  	return done;
>  }
>
> --
> 2.31.1
>

Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
On 6/9/2021 1:33 PM, Toke Høiland-Jørgensen wrote:
> The mlx4 driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP
> program invocations. However, the actual lifetime of the objects referred
> by the XDP program invocation is longer, all the way through to the call to
> xdp_do_flush(), making the scope of the rcu_read_lock() too small. This
> turns out to be harmless because it all happens in a single NAPI poll
> cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock()
> misleading.
>
> Rather than extend the scope of the rcu_read_lock(), just get rid of it
> entirely. With the addition of RCU annotations to the XDP_REDIRECT map
> types that take bh execution into account, lockdep even understands this to
> be safe, so there's really no reason to keep it around. Also switch the RCU
> dereferences in the driver loop itself to the _bh variants.
>
> Cc: Tariq Toukan <tariqt@nvidia.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---

Thanks for your patch.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

Regards,
Tariq
On Wed, Jun 9, 2021 at 7:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> XDP programs are called from a NAPI poll context, which means the RCU
> reference liveness is ensured by local_bh_disable(). Add
> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
> lockdep understands that the dereferences are safe from inside *either* an
> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
> preparation for removing the redundant rcu_read_lock()s from the drivers.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>  kernel/bpf/helpers.c  |  6 +++---
>  kernel/bpf/lpm_trie.c |  6 ++++--
>  3 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6f6681b07364..72c58cc516a3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>  	struct htab_elem *l;
>  	u32 hash, key_size;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());

It's not clear to me whether rcu_read_lock_held() is still needed.
All comments sound like rcu_read_lock_bh_held() is a superset of rcu
that includes bh.
But reading the rcu source code it looks like RCU_BH is its own rcu
flavor... which is confusing.
On Wed, Jun 09, 2021 at 12:33:11PM +0200, Toke Høiland-Jørgensen wrote:
> XDP programs are called from a NAPI poll context, which means the RCU
> reference liveness is ensured by local_bh_disable(). Add
> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
> lockdep understands that the dereferences are safe from inside *either* an
> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
> preparation for removing the redundant rcu_read_lock()s from the drivers.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>  kernel/bpf/helpers.c  |  6 +++---
>  kernel/bpf/lpm_trie.c |  6 ++++--
>  3 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6f6681b07364..72c58cc516a3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>  	struct htab_elem *l;
>  	u32 hash, key_size;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>
>  	key_size = map->key_size;
>
> @@ -989,7 +990,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		/* unknown flags */
>  		return -EINVAL;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>
>  	key_size = map->key_size;
>
> @@ -1082,7 +1084,8 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		/* unknown flags */
>  		return -EINVAL;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>
>  	key_size = map->key_size;
>
> @@ -1148,7 +1151,8 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
>  		/* unknown flags */
>  		return -EINVAL;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>
>  	key_size = map->key_size;
>
> @@ -1202,7 +1206,8 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
>  		/* unknown flags */
>  		return -EINVAL;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>
>  	key_size = map->key_size;
>
> @@ -1276,7 +1281,8 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
>  	u32 hash, key_size;
>  	int ret;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>
>  	key_size = map->key_size;
>
> @@ -1311,7 +1317,8 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
>  	u32 hash, key_size;
>  	int ret;
>
> -	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
> +		     !rcu_read_lock_bh_held());
>
>  	key_size = map->key_size;
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 544773970dbc..e880f6bb6f28 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -28,7 +28,7 @@
>   */
>  BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>  {
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());

There is a discrepancy in rcu_read_lock_trace_held() here, but I think the
patch_map_ops_generic step in the verifier has skipped these helper calls.
It is unrelated and can be addressed later when it is needed.

Acked-by: Martin KaFai Lau <kafai@fb.com>

>  	return (unsigned long) map->ops->map_lookup_elem(map, key);
>  }
>
> @@ -44,7 +44,7 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
>  BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
>  	   void *, value, u64, flags)
>  {
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>  	return map->ops->map_update_elem(map, key, value, flags);
>  }
>
> @@ -61,7 +61,7 @@ const struct bpf_func_proto bpf_map_update_elem_proto = {
>
>  BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
>  {
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
>  	return map->ops->map_delete_elem(map, key);
>  }
On Wed, Jun 09, 2021 at 12:33:13PM +0200, Toke Høiland-Jørgensen wrote:
[ ... ]

> @@ -551,7 +551,8 @@ static void cpu_map_free(struct bpf_map *map)
>  	for (i = 0; i < cmap->map.max_entries; i++) {
>  		struct bpf_cpu_map_entry *rcpu;
>
> -		rcpu = READ_ONCE(cmap->cpu_map[i]);
> +		rcpu = rcu_dereference_check(cmap->cpu_map[i],
> +					     rcu_read_lock_bh_held());

Is rcu_read_lock_bh_held() true during map_free()?

[ ... ]

> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>  			       u64 map_flags)
>  {
>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
> -	struct xdp_sock *xs, *old_xs, **map_entry;
> +	struct xdp_sock __rcu **map_entry;
> +	struct xdp_sock *xs, *old_xs;
>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
>  	struct xsk_map_node *node;
>  	struct socket *sock;
> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	}
>
>  	spin_lock_bh(&m->lock);
> -	old_xs = READ_ONCE(*map_entry);
> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());

Is it actually protected by the m->lock at this point?

[ ... ]

>  void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
> -			     struct xdp_sock **map_entry)
> +			     struct xdp_sock __rcu **map_entry)
>  {
>  	spin_lock_bh(&map->lock);
> -	if (READ_ONCE(*map_entry) == xs) {
> -		WRITE_ONCE(*map_entry, NULL);
> +	if (rcu_dereference(*map_entry) == xs) {

nit. rcu_access_pointer()?

> +		rcu_assign_pointer(*map_entry, NULL);
>  		xsk_map_sock_delete(xs, map_entry);
>  	}
>  	spin_unlock_bh(&map->lock);
> --
> 2.31.1
>
Hi Paul,

On 6/10/21 8:38 PM, Alexei Starovoitov wrote:
> On Wed, Jun 9, 2021 at 7:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> XDP programs are called from a NAPI poll context, which means the RCU
>> reference liveness is ensured by local_bh_disable(). Add
>> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
>> lockdep understands that the dereferences are safe from inside *either* an
>> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
>> preparation for removing the redundant rcu_read_lock()s from the drivers.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>>  kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>>  kernel/bpf/helpers.c  |  6 +++---
>>  kernel/bpf/lpm_trie.c |  6 ++++--
>>  3 files changed, 21 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 6f6681b07364..72c58cc516a3 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>>         struct htab_elem *l;
>>         u32 hash, key_size;
>>
>> -       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>> +       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>> +                    !rcu_read_lock_bh_held());
>
> It's not clear to me whether rcu_read_lock_held() is still needed.
> All comments sound like rcu_read_lock_bh_held() is a superset of rcu
> that includes bh.
> But reading rcu source code it looks like RCU_BH is its own rcu flavor...
> which is confusing.

The series is a bit confusing to me as well. I recall we had a discussion with
Paul, but it was back in 2016 aka very early days of XDP to get some clarifications
about RCU vs RCU-bh flavour on this. Paul, given the series in here, I assume the
below is not true anymore, and in this case (since we're removing rcu_read_lock()
from drivers), the RCU-bh acts as a real superset?

Back then from your clarifications this was not the case:

On Mon, Jul 25, 2016 at 11:26:02AM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 25, 2016 at 11:03 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
[...]
>>> The crux of the question is whether a particular driver rx handler, when
>>> called from __do_softirq, needs to add an additional rcu_read_lock or
>>> whether it can rely on the mechanics of softirq.
>>
>> If it was rcu_read_lock_bh(), you could.
>>
>> But you didn't say rcu_read_lock_bh(), you instead said rcu_read_lock(),
>> which means that you absolutely cannot rely on softirq semantics.
>>
>> In particular, in CONFIG_PREEMPT=y kernels, rcu_preempt_check_callbacks()
>> will notice that there is no rcu_read_lock() in effect and report
>> a quiescent state for that CPU. Because rcu_preempt_check_callbacks()
>> is invoked from the scheduling-clock interrupt, it absolutely can
>> execute during do_softirq(), and therefore being in softirq context
>> in no way provides rcu_read_lock()-style protection.
>>
>> Now, Alexei's question was for CONFIG_PREEMPT=n kernels. However, in
>> that case, rcu_read_lock() and rcu_read_unlock() generate no code
>> in recent production kernels, so there is no performance penalty for
>> using them. (In older kernels, they implied a barrier().)
>>
>> So either way, with or without CONFIG_PREEMPT, you should use
>> rcu_read_lock() to get RCU protection.
>>
>> One alternative might be to switch to rcu_read_lock_bh(), but that
>> will add local_disable_bh() overhead to your read paths.
>>
>> Does that help, or am I missing the point of the question?
>
> thanks a lot for explanation.

Glad you liked it!

> I mistakenly assumed that _bh variants are 'stronger' and
> act as inclusive, but sounds like they're completely orthogonal
> especially with preempt_rcu=y.

Yes, they are pretty much orthogonal.

> With preempt_rcu=n and preempt=y, it would be the case, since
> bh disables preemption and rcu_read_lock does the same as well,
> right? Of course, the code shouldn't be relying on that, so we
> have to fix our stuff.

Indeed, especially given that the kernel currently won't allow you
to configure CONFIG_PREEMPT_RCU=n and CONFIG_PREEMPT=y. If it does,
please let me know, as that would be a bug that needs to be fixed.
(For one thing, I do not test that combination.)

							Thanx, Paul

And now, fast-forward again to 2021 ... :)

Thanks,
Daniel
Daniel Borkmann <daniel@iogearbox.net> writes:

> Hi Paul,
>
> On 6/10/21 8:38 PM, Alexei Starovoitov wrote:
>> On Wed, Jun 9, 2021 at 7:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> XDP programs are called from a NAPI poll context, which means the RCU
>>> reference liveness is ensured by local_bh_disable(). Add
>>> rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so
>>> lockdep understands that the dereferences are safe from inside *either* an
>>> rcu_read_lock() section *or* a local_bh_disable() section. This is done in
>>> preparation for removing the redundant rcu_read_lock()s from the drivers.
>>>
>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>> ---
>>>  kernel/bpf/hashtab.c  | 21 ++++++++++++++-------
>>>  kernel/bpf/helpers.c  |  6 +++---
>>>  kernel/bpf/lpm_trie.c |  6 ++++--
>>>  3 files changed, 21 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 6f6681b07364..72c58cc516a3 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -596,7 +596,8 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
>>>         struct htab_elem *l;
>>>         u32 hash, key_size;
>>>
>>> -       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held());
>>> +       WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
>>> +                    !rcu_read_lock_bh_held());
>>
>> It's not clear to me whether rcu_read_lock_held() is still needed.
>> All comments sound like rcu_read_lock_bh_held() is a superset of rcu
>> that includes bh.
>> But reading rcu source code it looks like RCU_BH is its own rcu flavor...
>> which is confusing.
>
> The series is a bit confusing to me as well. I recall we had a discussion with
> Paul, but it was back in 2016 aka very early days of XDP to get some clarifications
> about RCU vs RCU-bh flavour on this. Paul, given the series in here, I assume the
> below is not true anymore, and in this case (since we're removing rcu_read_lock()
> from drivers), the RCU-bh acts as a real superset?
>
> Back then from your clarifications this was not the case:
>
> On Mon, Jul 25, 2016 at 11:26:02AM -0700, Alexei Starovoitov wrote:
> > On Mon, Jul 25, 2016 at 11:03 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> [...]
> >>> The crux of the question is whether a particular driver rx handler, when
> >>> called from __do_softirq, needs to add an additional rcu_read_lock or
> >>> whether it can rely on the mechanics of softirq.
> >>
> >> If it was rcu_read_lock_bh(), you could.
> >>
> >> But you didn't say rcu_read_lock_bh(), you instead said rcu_read_lock(),
> >> which means that you absolutely cannot rely on softirq semantics.
> >>
> >> In particular, in CONFIG_PREEMPT=y kernels, rcu_preempt_check_callbacks()
> >> will notice that there is no rcu_read_lock() in effect and report
> >> a quiescent state for that CPU. Because rcu_preempt_check_callbacks()
> >> is invoked from the scheduling-clock interrupt, it absolutely can
> >> execute during do_softirq(), and therefore being in softirq context
> >> in no way provides rcu_read_lock()-style protection.
> >>
> >> Now, Alexei's question was for CONFIG_PREEMPT=n kernels. However, in
> >> that case, rcu_read_lock() and rcu_read_unlock() generate no code
> >> in recent production kernels, so there is no performance penalty for
> >> using them. (In older kernels, they implied a barrier().)
> >>
> >> So either way, with or without CONFIG_PREEMPT, you should use
> >> rcu_read_lock() to get RCU protection.
> >>
> >> One alternative might be to switch to rcu_read_lock_bh(), but that
> >> will add local_disable_bh() overhead to your read paths.
> >>
> >> Does that help, or am I missing the point of the question?
> >
> > thanks a lot for explanation.
>
> Glad you liked it!
>
> > I mistakenly assumed that _bh variants are 'stronger' and
> > act as inclusive, but sounds like they're completely orthogonal
> > especially with preempt_rcu=y.
>
> Yes, they are pretty much orthogonal.
>
> > With preempt_rcu=n and preempt=y, it would be the case, since
> > bh disables preemption and rcu_read_lock does the same as well,
> > right? Of course, the code shouldn't be relying on that, so we
> > have to fix our stuff.
>
> Indeed, especially given that the kernel currently won't allow you
> to configure CONFIG_PREEMPT_RCU=n and CONFIG_PREEMPT=y. If it does,
> please let me know, as that would be a bug that needs to be fixed.
> (For one thing, I do not test that combination.)
>
>							Thanx, Paul
>
> And now, fast-forward again to 2021 ... :)

We covered this in the thread I linked from the cover letter. Specifically,
this seems to have been a change from v4.20, see Paul's reply here:

https://lore.kernel.org/bpf/20210417002301.GO4212@paulmck-ThinkPad-P17-Gen-1/

and the follow-up covering -rt here:

https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/

-Toke
Martin KaFai Lau <kafai@fb.com> writes:

> On Wed, Jun 09, 2021 at 12:33:13PM +0200, Toke Høiland-Jørgensen wrote:
> [ ... ]
>
>> @@ -551,7 +551,8 @@ static void cpu_map_free(struct bpf_map *map)
>>  	for (i = 0; i < cmap->map.max_entries; i++) {
>>  		struct bpf_cpu_map_entry *rcpu;
>>
>> -		rcpu = READ_ONCE(cmap->cpu_map[i]);
>> +		rcpu = rcu_dereference_check(cmap->cpu_map[i],
>> +					     rcu_read_lock_bh_held());
> Is rcu_read_lock_bh_held() true during map_free()?

Hmm, no, I guess not, since that's called from a workqueue. Will fix!

>> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>>  			       u64 map_flags)
>>  {
>>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
>> -	struct xdp_sock *xs, *old_xs, **map_entry;
>> +	struct xdp_sock __rcu **map_entry;
>> +	struct xdp_sock *xs, *old_xs;
>>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
>>  	struct xsk_map_node *node;
>>  	struct socket *sock;
>> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>>  	}
>>
>>  	spin_lock_bh(&m->lock);
>> -	old_xs = READ_ONCE(*map_entry);
>> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());
> Is it actually protected by the m->lock at this point?

True, can just add that to the check.

>>  void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>> -			     struct xdp_sock **map_entry)
>> +			     struct xdp_sock __rcu **map_entry)
>>  {
>>  	spin_lock_bh(&map->lock);
>> -	if (READ_ONCE(*map_entry) == xs) {
>> -		WRITE_ONCE(*map_entry, NULL);
>> +	if (rcu_dereference(*map_entry) == xs) {
> nit. rcu_access_pointer()?

Yup.
On Fri, Jun 11, 2021 at 01:19:04AM +0200, Toke Høiland-Jørgensen wrote:
> >> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> >>  			       u64 map_flags)
> >>  {
> >>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
> >> -	struct xdp_sock *xs, *old_xs, **map_entry;
> >> +	struct xdp_sock __rcu **map_entry;
> >> +	struct xdp_sock *xs, *old_xs;
> >>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
> >>  	struct xsk_map_node *node;
> >>  	struct socket *sock;
> >> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> >>  	}
> >>
> >>  	spin_lock_bh(&m->lock);
> >> -	old_xs = READ_ONCE(*map_entry);
> >> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());
> > Is it actually protected by the m->lock at this point?
>
> True, can just add that to the check.

This should be enough:

	rcu_dereference_protected(*map_entry, lockdep_is_held(&m->lock));
Martin KaFai Lau <kafai@fb.com> writes:

> On Fri, Jun 11, 2021 at 01:19:04AM +0200, Toke Høiland-Jørgensen wrote:
>> >> @@ -149,7 +152,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>> >>  			       u64 map_flags)
>> >>  {
>> >>  	struct xsk_map *m = container_of(map, struct xsk_map, map);
>> >> -	struct xdp_sock *xs, *old_xs, **map_entry;
>> >> +	struct xdp_sock __rcu **map_entry;
>> >> +	struct xdp_sock *xs, *old_xs;
>> >>  	u32 i = *(u32 *)key, fd = *(u32 *)value;
>> >>  	struct xsk_map_node *node;
>> >>  	struct socket *sock;
>> >> @@ -179,7 +183,7 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
>> >>  	}
>> >>
>> >>  	spin_lock_bh(&m->lock);
>> >> -	old_xs = READ_ONCE(*map_entry);
>> >> +	old_xs = rcu_dereference_check(*map_entry, rcu_read_lock_bh_held());
>> > Is it actually protected by the m->lock at this point?
>>
>> True, can just add that to the check.
> this should be enough
> rcu_dereference_protected(*map_entry, lockdep_is_held(&m->lock));

Right, that's what I had in mind as well :)

-Toke