Message ID | 20210209204533.327360-9-alobakin@pm.me |
---|---|
State | New |
Series | None |
Hello,

I'm sorry for the late feedback, I could not step in before.

Also adding Jesper for awareness, as he introduced the bulk free
infrastructure.

On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>   */
>  struct sk_buff *build_skb(void *data, unsigned int frag_size)
>  {
> -	struct sk_buff *skb = __build_skb(data, frag_size);
> +	struct sk_buff *skb = __build_skb(data, frag_size, true);

I must admit I'm a bit scared of this. There are several high-speed
device drivers that will move to bulk allocation, and we don't have any
performance figures for them.

In my experience with (low-end) MIPS boards, the cost of cache misses
tends to be much less visible there compared to reasonably recent
server H/W, because the CPU/memory access time difference is much
lower.

When moving to higher-end H/W, the performance gain you measured could
be completely countered by less optimal cache usage.

I also fear latency spikes - I'm unsure whether a 32-skb allocation vs
a single skb would be visible e.g. in a round-robin test. Generally
speaking, bulk-allocating 32 skbs looks like a bit too much. IIRC, when
Edward added listification to GRO, he did several measurements with
different list sizes and found 8 to be the optimal value (for the
tested workload). Above that number, the list became too big and the
pressure on the cache outweighed the bulking benefits.

Perhaps giving the device drivers the ability to opt in on this infra
via a new helper - as done back then with napi_consume_skb() - would
make this change safer?

> @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
>  	kfree_skbmem(skb);
>  }
>  
> -static inline void _kfree_skb_defer(struct sk_buff *skb)
> +static void napi_skb_cache_put(struct sk_buff *skb)
>  {
>  	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> +	u32 i;
>  
>  	/* drop skb->head and call any destructors for packet */
>  	skb_release_all(skb);
>  
> -	/* record skb to CPU local list */
> +	kasan_poison_object_data(skbuff_head_cache, skb);
>  	nc->skb_cache[nc->skb_count++] = skb;
>  
> -#ifdef CONFIG_SLUB
> -	/* SLUB writes into objects when freeing */
> -	prefetchw(skb);
> -#endif

It looks like this chunk has been lost. Is that intentional?

Thanks!

Paolo
From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 10 Feb 2021 11:21:06 +0100

> Hello,

Hi!

> I'm sorry for the late feedback, I could not step in before.
> 
> Also adding Jesper for awareness, as he introduced the bulk free
> infrastructure.
> 
> On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> >   */
> >  struct sk_buff *build_skb(void *data, unsigned int frag_size)
> >  {
> > -	struct sk_buff *skb = __build_skb(data, frag_size);
> > +	struct sk_buff *skb = __build_skb(data, frag_size, true);
> 
> I must admit I'm a bit scared of this. There are several high-speed
> device drivers that will move to bulk allocation, and we don't have any
> performance figures for them.
> 
> In my experience with (low-end) MIPS boards, the cost of cache misses
> tends to be much less visible there compared to reasonably recent
> server H/W, because the CPU/memory access time difference is much
> lower.
> 
> When moving to higher-end H/W, the performance gain you measured could
> be completely countered by less optimal cache usage.
> 
> I also fear latency spikes - I'm unsure whether a 32-skb allocation vs
> a single skb would be visible e.g. in a round-robin test. Generally
> speaking, bulk-allocating 32 skbs looks like a bit too much. IIRC, when
> Edward added listification to GRO, he did several measurements with
> different list sizes and found 8 to be the optimal value (for the
> tested workload). Above that number, the list became too big and the
> pressure on the cache outweighed the bulking benefits.

I can change the logic so that it allocates only the first 8.
I think I've already seen this batch value somewhere in the XDP code,
so this might be a balanced one.

Regarding bulk-freeing: does the batch size matter when freeing, or is
it okay to wipe 32 (currently 64 in the baseline) in a row?

> Perhaps giving the device drivers the ability to opt in on this infra
> via a new helper - as done back then with napi_consume_skb() - would
> make this change safer?

That's actually a very nice idea. Only a little code needs to change
to make taking heads from the cache optional. This way developers
could switch to it when needed.

Thanks for the suggestions! I'll definitely absorb them into the code
and give it a test.

> > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
> >  	kfree_skbmem(skb);
> >  }
> >  
> > -static inline void _kfree_skb_defer(struct sk_buff *skb)
> > +static void napi_skb_cache_put(struct sk_buff *skb)
> >  {
> >  	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > +	u32 i;
> >  
> >  	/* drop skb->head and call any destructors for packet */
> >  	skb_release_all(skb);
> >  
> > -	/* record skb to CPU local list */
> > +	kasan_poison_object_data(skbuff_head_cache, skb);
> >  	nc->skb_cache[nc->skb_count++] = skb;
> >  
> > -#ifdef CONFIG_SLUB
> > -	/* SLUB writes into objects when freeing */
> > -	prefetchw(skb);
> > -#endif
> 
> It looks like this chunk has been lost. Is that intentional?

Yep. This prefetchw() assumed that skbuff_heads would be wiped
immediately or at the end of the network softirq. Reusing this cache
means that heads can be reused later or be kept in the cache for some
time, so prefetching makes no sense anymore.

> Thanks!
> 
> Paolo

Al
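As a rough illustration of the opt-in idea discussed above (the helper name napi_build_skb() is an assumption here, not something proposed in the thread), a driver-facing wrapper could look like this:

/* Sketch of an opt-in helper: only drivers that explicitly call it get
 * skbuff_heads from the per-CPU NAPI cache, while build_skb() keeps
 * the plain kmem_cache_alloc() behaviour. head_frag/pfmemalloc
 * handling is omitted for brevity - it would mirror build_skb().
 */
struct sk_buff *napi_build_skb(void *data, unsigned int frag_size)
{
	/* "true": the caller guarantees NAPI/BH-disabled context, so
	 * __build_skb() may take the head straight from the cache.
	 */
	return __build_skb(data, frag_size, true);
}

A driver that opts in would then switch its Rx path from build_skb() to this helper; everything else would stay the same.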
On Wed, 2021-02-10 at 12:25 +0000, Alexander Lobakin wrote:
> Paolo Abeni <pabeni@redhat.com> on Wed, 10 Feb 2021 11:21:06 +0100 wrote:
> > Perhaps giving the device drivers the ability to opt in on this infra
> > via a new helper - as done back then with napi_consume_skb() - would
> > make this change safer?
> 
> That's actually a very nice idea. Only a little code needs to change
> to make taking heads from the cache optional. This way developers
> could switch to it when needed.
> 
> Thanks for the suggestions! I'll definitely absorb them into the code
> and give it a test.

Quick reply before it's too late: I suggest waiting a bit for others'
opinions before coding - if others dislike this, I would regret having
wasted your time.

Cheers,

Paolo
Alexander Lobakin <alobakin@pm.me> wrote:
> we're in such a context. This includes: build_skb() (called only
> from NIC drivers in NAPI Rx context) and {,__}napi_alloc_skb()
> (called from the same place or from kernel network softirq
> functions).

build_skb() is called from sleepable context in drivers/net/tun.c.
Perhaps there are other cases.
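One way this could be handled (a sketch on my side, not something posted in the thread) is to stop hardcoding true in build_skb(), so that callers outside softirq context fall back to the plain slab allocation inside napi_skb_cache_get():

/* Hypothetical adjustment, not the posted code: pass "false" instead
 * of a hardcoded "true". napi_skb_cache_get() then re-checks
 * in_serving_softirq() itself and only touches the per-CPU cache when
 * the call really happens in softirq context; a process-context
 * caller such as drivers/net/tun.c gets a plain kmem_cache_alloc().
 */
struct sk_buff *build_skb(void *data, unsigned int frag_size)
{
	struct sk_buff *skb = __build_skb(data, frag_size, false);

	if (skb && frag_size) {
		skb->head_frag = 1;
		if (page_is_pfmemalloc(virt_to_head_page(data)))
			skb->pfmemalloc = 1;
	}

	return skb;
}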
On Wed, 10 Feb 2021 12:25:04 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Paolo Abeni <pabeni@redhat.com>
> Date: Wed, 10 Feb 2021 11:21:06 +0100
> 
> > I'm sorry for the late feedback, I could not step in before.
> > 
> > Also adding Jesper for awareness, as he introduced the bulk free
> > infrastructure.

Thanks (and Alexander Duyck also did part of the work while at Red
Hat).

In the initial versions of my patchsets I actually also had reuse of
the SKBs that were defer-freed during NAPI context, but I ended up
dropping that part because it was getting nitpicked and the merge
window was getting close.

> > On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> > > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > >   */
> > >  struct sk_buff *build_skb(void *data, unsigned int frag_size)
> > >  {
> > > -	struct sk_buff *skb = __build_skb(data, frag_size);
> > > +	struct sk_buff *skb = __build_skb(data, frag_size, true);
> > 
> > I must admit I'm a bit scared of this. There are several high-speed
> > device drivers that will move to bulk allocation, and we don't have any
> > performance figures for them.
> > 
> > In my experience with (low-end) MIPS boards, the cost of cache misses
> > tends to be much less visible there compared to reasonably recent
> > server H/W, because the CPU/memory access time difference is much
> > lower.
> > 
> > When moving to higher-end H/W, the performance gain you measured could
> > be completely countered by less optimal cache usage.
> > 
> > I also fear latency spikes - I'm unsure whether a 32-skb allocation vs
> > a single skb would be visible e.g. in a round-robin test. Generally
> > speaking, bulk-allocating 32 skbs looks like a bit too much. IIRC, when
> > Edward added listification to GRO, he did several measurements with
> > different list sizes and found 8 to be the optimal value (for the
> > tested workload). Above that number, the list became too big and the
> > pressure on the cache outweighed the bulking benefits.
> 
> I can change the logic so that it allocates only the first 8.
> I think I've already seen this batch value somewhere in the XDP code,
> so this might be a balanced one.

(Speaking about SLUB code:) The bulk ALLOC side disables interrupts
and can call the slow path (___slab_alloc), which is bad for
latency-sensitive workloads. Thus I don't recommend large bulk
ALLOCATIONS.

> Regarding bulk-freeing: does the batch size matter when freeing, or is
> it okay to wipe 32 (currently 64 in the baseline) in a row?

(Speaking about SLUB code:) You can bulk FREE a large amount of
objects without hurting latency-sensitive workloads, because it
doesn't disable interrupts (I'm quite proud that this was possible).

> > Perhaps giving the device drivers the ability to opt in on this infra
> > via a new helper - as done back then with napi_consume_skb() - would
> > make this change safer?
> 
> That's actually a very nice idea. Only a little code needs to change
> to make taking heads from the cache optional. This way developers
> could switch to it when needed.

Well, I actually disagree that this should be hidden behind a switch
for drivers to enable, as it will take forever to get properly
enabled.

> Thanks for the suggestions! I'll definitely absorb them into the code
> and give it a test.
> 
> > > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
> > >  	kfree_skbmem(skb);
> > >  }
> > >  
> > > -static inline void _kfree_skb_defer(struct sk_buff *skb)
> > > +static void napi_skb_cache_put(struct sk_buff *skb)
> > >  {
> > >  	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > +	u32 i;
> > >  
> > >  	/* drop skb->head and call any destructors for packet */
> > >  	skb_release_all(skb);
> > >  
> > > -	/* record skb to CPU local list */
> > > +	kasan_poison_object_data(skbuff_head_cache, skb);
> > >  	nc->skb_cache[nc->skb_count++] = skb;
> > >  
> > > -#ifdef CONFIG_SLUB
> > > -	/* SLUB writes into objects when freeing */
> > > -	prefetchw(skb);
> > > -#endif
> > 
> > It looks like this chunk has been lost. Is that intentional?
> 
> Yep. This prefetchw() assumed that skbuff_heads would be wiped
> immediately or at the end of the network softirq. Reusing this cache
> means that heads can be reused later or be kept in the cache for some
> time, so prefetching makes no sense anymore.

I agree with this statement, the prefetchw() is no longer needed.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
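To make the alloc/free asymmetry concrete, here is a rough sketch of how the two batch sizes could be decoupled along the lines discussed above; the NAPI_SKB_CACHE_BULK name and the value 8 are assumptions for illustration, not part of the posted patch, and KASAN handling is omitted:

/* Sketch only: small refill batch on the allocation side (SLUB bulk
 * alloc disables interrupts and may hit ___slab_alloc), while the
 * freeing side can still flush half of the 64-entry cache at once,
 * since kmem_cache_free_bulk() does not disable interrupts.
 */
#define NAPI_SKB_CACHE_SIZE	64
#define NAPI_SKB_CACHE_BULK	8	/* hypothetical alloc batch */
#define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)

static struct sk_buff *napi_skb_cache_get(void)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	/* refill with a small bulk only when the cache runs empty */
	if (unlikely(!nc->skb_count))
		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
						      GFP_ATOMIC,
						      NAPI_SKB_CACHE_BULK,
						      nc->skb_cache);
	if (unlikely(!nc->skb_count))
		return NULL;

	return nc->skb_cache[--nc->skb_count];
}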
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0e0707296098..5bb443d37bf4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1082,7 +1082,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
 			    int node);
-struct sk_buff *__build_skb(void *data, unsigned int frag_size);
+struct sk_buff *__build_skb(void *data, unsigned int frag_size, bool napi_safe);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 860a9d4f752f..8747566a8136 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -120,6 +120,7 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 }
 
 #define NAPI_SKB_CACHE_SIZE	64
+#define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)
 
 struct napi_alloc_cache {
 	struct page_frag_cache page;
@@ -164,6 +165,30 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_frag_align);
 
+static struct sk_buff *napi_skb_cache_get(bool napi_safe)
+{
+	struct napi_alloc_cache *nc;
+	struct sk_buff *skb;
+
+	if (!napi_safe && unlikely(!in_serving_softirq()))
+		return kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+
+	nc = this_cpu_ptr(&napi_alloc_cache);
+
+	if (unlikely(!nc->skb_count))
+		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
+						      GFP_ATOMIC,
+						      NAPI_SKB_CACHE_HALF,
+						      nc->skb_cache);
+	if (unlikely(!nc->skb_count))
+		return NULL;
+
+	skb = nc->skb_cache[--nc->skb_count];
+	kasan_unpoison_object_data(skbuff_head_cache, skb);
+
+	return skb;
+}
+
 /* Caller must provide SKB that is memset cleared */
 static void __build_skb_around(struct sk_buff *skb, void *data,
 			       unsigned int frag_size)
@@ -210,11 +235,11 @@ static void __build_skb_around(struct sk_buff *skb, void *data,
  * before giving packet to stack.
  * RX rings only contains data buffers, not full skbs.
  */
-struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+struct sk_buff *__build_skb(void *data, unsigned int frag_size, bool napi_safe)
 {
 	struct sk_buff *skb;
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	skb = napi_skb_cache_get(napi_safe);
 	if (unlikely(!skb))
 		return NULL;
 
@@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
  */
 struct sk_buff *build_skb(void *data, unsigned int frag_size)
 {
-	struct sk_buff *skb = __build_skb(data, frag_size);
+	struct sk_buff *skb = __build_skb(data, frag_size, true);
 
 	if (skb && frag_size) {
 		skb->head_frag = 1;
@@ -443,7 +468,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
+	skb = __build_skb(data, len, false);
 	if (unlikely(!skb)) {
 		skb_free_frag(data);
 		return NULL;
@@ -507,7 +532,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
+	skb = __build_skb(data, len, true);
 	if (unlikely(!skb)) {
 		skb_free_frag(data);
 		return NULL;
@@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
 	kfree_skbmem(skb);
 }
 
-static inline void _kfree_skb_defer(struct sk_buff *skb)
+static void napi_skb_cache_put(struct sk_buff *skb)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	u32 i;
 
 	/* drop skb->head and call any destructors for packet */
 	skb_release_all(skb);
 
-	/* record skb to CPU local list */
+	kasan_poison_object_data(skbuff_head_cache, skb);
 	nc->skb_cache[nc->skb_count++] = skb;
 
-#ifdef CONFIG_SLUB
-	/* SLUB writes into objects when freeing */
-	prefetchw(skb);
-#endif
-
-	/* flush skb_cache if it is filled */
 	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
-		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
-				     nc->skb_cache);
-		nc->skb_count = 0;
+		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
+			kasan_unpoison_object_data(skbuff_head_cache,
+						   nc->skb_cache[i]);
+
+		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
+				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
+		nc->skb_count = NAPI_SKB_CACHE_HALF;
 	}
 }
+
 void __kfree_skb_defer(struct sk_buff *skb)
 {
-	_kfree_skb_defer(skb);
+	napi_skb_cache_put(skb);
 }
 
 void napi_consume_skb(struct sk_buff *skb, int budget)
@@ -887,7 +912,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
 		return;
 	}
 
-	_kfree_skb_defer(skb);
+	napi_skb_cache_put(skb);
 }
 EXPORT_SYMBOL(napi_consume_skb);
 
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dd488938447f..afba4e11a526 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1190,7 +1190,7 @@ static struct sk_buff *netlink_alloc_large_skb(unsigned int size,
 	if (data == NULL)
 		return NULL;
 
-	skb = __build_skb(data, size);
+	skb = __build_skb(data, size, false);
 	if (skb == NULL)
 		vfree(data);
 	else
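To put the diff in driver context, a NAPI poll loop that would exercise the new fast paths might look roughly like the following; the ring and descriptor helpers (struct my_ring, struct my_rx_desc, my_ring_*) are made up for illustration, only build_skb() and napi_consume_skb() are the real entry points touched by the patch:

/* Illustrative driver-side view, not from the patch: build_skb() now
 * takes skbuff_heads from the per-CPU NAPI cache, and
 * napi_consume_skb() with a non-zero budget parks Tx-completed heads
 * back into the same cache for reuse on the next Rx.
 */
static int my_napi_poll(struct napi_struct *napi, int budget)
{
	struct my_ring *ring = container_of(napi, struct my_ring, napi);
	struct sk_buff *skb;
	int done = 0;

	while (done < budget) {
		struct my_rx_desc *desc = my_ring_next_rx(ring);

		if (!desc)
			break;

		/* Rx: the skbuff_head comes from the NAPI cache */
		skb = build_skb(desc->data, desc->truesize);
		if (unlikely(!skb))
			break;

		skb_put(skb, desc->len);
		skb->protocol = eth_type_trans(skb, ring->netdev);
		napi_gro_receive(napi, skb);
		done++;
	}

	/* Tx completion: heads go back into the same cache */
	while ((skb = my_ring_next_done_tx(ring)))
		napi_consume_skb(skb, budget);

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}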
Instead of just bulk-flushing skbuff_heads queued up through
napi_consume_skb() or __kfree_skb_defer(), try to reuse them on the
allocation path.

If the cache is empty on allocation, bulk-allocate the first half,
which is more efficient than per-skb allocation. If the cache is full
on freeing, bulk-wipe the second half. This also includes custom KASAN
poisoning/unpoisoning to be doubly sure there are no use-after-free
cases.

Functions that got the cache fastpath:
 - {,__}build_skb();
 - {,__}netdev_alloc_skb();
 - {,__}napi_alloc_skb().

Note on the "napi_safe" argument: the NAPI cache should be accessed
only from BH-disabled or (better) NAPI context. To make sure the
access is safe, an in_serving_softirq() check is used. However, there
are plenty of cases when we know for sure that we're in such a
context. This includes: build_skb() (called only from NIC drivers in
NAPI Rx context) and {,__}napi_alloc_skb() (called from the same place
or from kernel network softirq functions). We can use that knowledge
to avoid unnecessary checks.

Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Unified cache part
Suggested-by: Eric Dumazet <edumazet@google.com>   # KASAN poisoning
Suggested-by: Dmitry Vyukov <dvyukov@google.com>   # Help with KASAN
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
---
 include/linux/skbuff.h   |  2 +-
 net/core/skbuff.c        | 61 ++++++++++++++++++++++++++++------------
 net/netlink/af_netlink.c |  2 +-
 3 files changed, 45 insertions(+), 20 deletions(-)
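For readers unfamiliar with the KASAN hooks mentioned above, the pattern used by the patch boils down to the following; cache_park()/cache_take() are hypothetical names used only to isolate the two halves of it:

/* Hypothetical helpers isolating the KASAN pattern from the patch:
 * a head parked in the per-CPU cache is poisoned, so any access to it
 * while it sits in the cache is reported as a use-after-free; it is
 * unpoisoned again right before being handed back out. The patch does
 * the same unpoisoning before kmem_cache_free_bulk() on the flushed
 * half of the cache.
 */
static void cache_park(struct napi_alloc_cache *nc, struct sk_buff *skb)
{
	kasan_poison_object_data(skbuff_head_cache, skb);
	nc->skb_cache[nc->skb_count++] = skb;
}

static struct sk_buff *cache_take(struct napi_alloc_cache *nc)
{
	struct sk_buff *skb = nc->skb_cache[--nc->skb_count];

	kasan_unpoison_object_data(skbuff_head_cache, skb);
	return skb;
}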