Message ID | 20210210162732.80467-9-alobakin@pm.me |
---|---|
State | Superseded |
Headers | show |
Series | [v4,net-next,01/11] skbuff: move __alloc_skb() next to the other skb allocation functions | expand |
On Wed, 10 Feb 2021 16:30:23 +0000 Alexander Lobakin <alobakin@pm.me> wrote: > Instead of just bulk-flushing skbuff_heads queued up through > napi_consume_skb() or __kfree_skb_defer(), try to reuse them > on allocation path. Maybe you are already aware of this dynamics, but high speed NICs will usually run the TX "cleanup" (opportunistic DMA-completion) in the napi poll function call, and often before processing RX packets. Like ixgbe_poll[1] calls ixgbe_clean_tx_irq() before ixgbe_clean_rx_irq(). If traffic is symmetric (or is routed-back same interface) then this SKB recycle scheme will be highly efficient. (I had this part of my initial patchset and tested it on ixgbe). [1] https://elixir.bootlin.com/linux/v5.11-rc7/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L3149 > If the cache is empty on allocation, bulk-allocate the first > 16 elements, which is more efficient than per-skb allocation. > If the cache is full on freeing, bulk-wipe the second half of > the cache (32 elements). > This also includes custom KASAN poisoning/unpoisoning to be > double sure there are no use-after-free cases. > > To not change current behaviour, introduce a new function, > napi_build_skb(), to optionally use a new approach later > in drivers. > > Note on selected bulk size, 16: > - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE > and especially VETH_XDP_BATCH, which is also used to > bulk-allocate skbuff_heads and was tested on powerful > setups; > - this also showed the best performance in the actual > test series (from the array of {8, 16, 32}). > > Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves > Suggested-by: Eric Dumazet <edumazet@google.com> # KASAN poisoning > Cc: Dmitry Vyukov <dvyukov@google.com> # Help with KASAN > Cc: Paolo Abeni <pabeni@redhat.com> # Reduced batch size > Signed-off-by: Alexander Lobakin <alobakin@pm.me> > --- > include/linux/skbuff.h | 2 + > net/core/skbuff.c | 94 ++++++++++++++++++++++++++++++++++++------ > 2 files changed, 83 insertions(+), 13 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 0e0707296098..906122eac82a 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); > struct sk_buff *build_skb_around(struct sk_buff *skb, > void *data, unsigned int frag_size); > > +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); > + > /** > * alloc_skb - allocate a network buffer > * @size: size to allocate > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 860a9d4f752f..9e1a8ded4acc 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) > } > > #define NAPI_SKB_CACHE_SIZE 64 > +#define NAPI_SKB_CACHE_BULK 16 > +#define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2) > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
From: Jesper Dangaard Brouer <brouer@redhat.com> Date: Thu, 11 Feb 2021 13:54:59 +0100 > On Wed, 10 Feb 2021 16:30:23 +0000 > Alexander Lobakin <alobakin@pm.me> wrote: > > > Instead of just bulk-flushing skbuff_heads queued up through > > napi_consume_skb() or __kfree_skb_defer(), try to reuse them > > on allocation path. > > Maybe you are already aware of this dynamics, but high speed NICs will > usually run the TX "cleanup" (opportunistic DMA-completion) in the napi > poll function call, and often before processing RX packets. Like > ixgbe_poll[1] calls ixgbe_clean_tx_irq() before ixgbe_clean_rx_irq(). Sure. 1G MIPS is my home project (I'll likely migrate to ARM64 cluster in 2-3 months). I mostly work with 10-100G NICs at work. > If traffic is symmetric (or is routed-back same interface) then this > SKB recycle scheme will be highly efficient. (I had this part of my > initial patchset and tested it on ixgbe). > > [1] https://elixir.bootlin.com/linux/v5.11-rc7/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L3149 That's exactly why I introduced this feature. Firstly driver enriches the cache with the consumed skbs from Tx completion queue, and then it just decaches them back on Rx completion cycle. That's how things worked most of the time on my test setup. The reason why Paolo proposed this as an option, and why I agreed it's safer to do instead of unconditional switching, is that different platforms and setup may react differently on this. We don't have an ability to test the entire zoo, so we propose an option for driver and network core developers to test and use "on demand". As I wrote in reply to Paolo, there might be cases when even the core networking code may benefit from this. > > If the cache is empty on allocation, bulk-allocate the first > > 16 elements, which is more efficient than per-skb allocation. > > If the cache is full on freeing, bulk-wipe the second half of > > the cache (32 elements). > > This also includes custom KASAN poisoning/unpoisoning to be > > double sure there are no use-after-free cases. > > > > To not change current behaviour, introduce a new function, > > napi_build_skb(), to optionally use a new approach later > > in drivers. > > > > Note on selected bulk size, 16: > > - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE > > and especially VETH_XDP_BATCH, which is also used to > > bulk-allocate skbuff_heads and was tested on powerful > > setups; > > - this also showed the best performance in the actual > > test series (from the array of {8, 16, 32}). > > > > Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves > > Suggested-by: Eric Dumazet <edumazet@google.com> # KASAN poisoning > > Cc: Dmitry Vyukov <dvyukov@google.com> # Help with KASAN > > Cc: Paolo Abeni <pabeni@redhat.com> # Reduced batch size > > Signed-off-by: Alexander Lobakin <alobakin@pm.me> > > --- > > include/linux/skbuff.h | 2 + > > net/core/skbuff.c | 94 ++++++++++++++++++++++++++++++++++++------ > > 2 files changed, 83 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > index 0e0707296098..906122eac82a 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); > > struct sk_buff *build_skb_around(struct sk_buff *skb, > > void *data, unsigned int frag_size); > > > > +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); > > + > > /** > > * alloc_skb - allocate a network buffer > > * @size: size to allocate > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 860a9d4f752f..9e1a8ded4acc 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) > > } > > > > #define NAPI_SKB_CACHE_SIZE 64 > > +#define NAPI_SKB_CACHE_BULK 16 > > +#define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2) > > > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer Thanks, Al
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0e0707296098..906122eac82a 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1087,6 +1087,8 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size); +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size); + /** * alloc_skb - allocate a network buffer * @size: size to allocate diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 860a9d4f752f..9e1a8ded4acc 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -120,6 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr) } #define NAPI_SKB_CACHE_SIZE 64 +#define NAPI_SKB_CACHE_BULK 16 +#define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2) struct napi_alloc_cache { struct page_frag_cache page; @@ -164,6 +166,25 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask) } EXPORT_SYMBOL(__netdev_alloc_frag_align); +static struct sk_buff *napi_skb_cache_get(void) +{ + struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + struct sk_buff *skb; + + if (unlikely(!nc->skb_count)) + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + GFP_ATOMIC, + NAPI_SKB_CACHE_BULK, + nc->skb_cache); + if (unlikely(!nc->skb_count)) + return NULL; + + skb = nc->skb_cache[--nc->skb_count]; + kasan_unpoison_object_data(skbuff_head_cache, skb); + + return skb; +} + /* Caller must provide SKB that is memset cleared */ static void __build_skb_around(struct sk_buff *skb, void *data, unsigned int frag_size) @@ -265,6 +286,53 @@ struct sk_buff *build_skb_around(struct sk_buff *skb, } EXPORT_SYMBOL(build_skb_around); +/** + * __napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __build_skb() that uses NAPI percpu caches to obtain + * skbuff_head instead of inplace allocation. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +static struct sk_buff *__napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + + skb = napi_skb_cache_get(); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + __build_skb_around(skb, data, frag_size); + + return skb; +} + +/** + * napi_build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Version of __napi_build_skb() that takes care of skb->head_frag + * and skb->pfmemalloc when the data is a page or page fragment. + * + * Returns a new &sk_buff on success, %NULL on allocation failure. + */ +struct sk_buff *napi_build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb = __napi_build_skb(data, frag_size); + + if (likely(skb) && frag_size) { + skb->head_frag = 1; + skb_propagate_pfmemalloc(virt_to_head_page(data), skb); + } + + return skb; +} +EXPORT_SYMBOL(napi_build_skb); + /* * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells * the caller if emergency pfmemalloc reserves are being used. If it is and @@ -838,31 +906,31 @@ void __consume_stateless_skb(struct sk_buff *skb) kfree_skbmem(skb); } -static inline void _kfree_skb_defer(struct sk_buff *skb) +static void napi_skb_cache_put(struct sk_buff *skb) { struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache); + u32 i; /* drop skb->head and call any destructors for packet */ skb_release_all(skb); - /* record skb to CPU local list */ + kasan_poison_object_data(skbuff_head_cache, skb); nc->skb_cache[nc->skb_count++] = skb; -#ifdef CONFIG_SLUB - /* SLUB writes into objects when freeing */ - prefetchw(skb); -#endif - - /* flush skb_cache if it is filled */ if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) { - kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE, - nc->skb_cache); - nc->skb_count = 0; + for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++) + kasan_unpoison_object_data(skbuff_head_cache, + nc->skb_cache[i]); + + kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF, + nc->skb_cache + NAPI_SKB_CACHE_HALF); + nc->skb_count = NAPI_SKB_CACHE_HALF; } } + void __kfree_skb_defer(struct sk_buff *skb) { - _kfree_skb_defer(skb); + napi_skb_cache_put(skb); } void napi_consume_skb(struct sk_buff *skb, int budget) @@ -887,7 +955,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget) return; } - _kfree_skb_defer(skb); + napi_skb_cache_put(skb); } EXPORT_SYMBOL(napi_consume_skb);
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be double sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use a new approach later in drivers. Note on selected bulk size, 16: - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups; - this also showed the best performance in the actual test series (from the array of {8, 16, 32}). Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves Suggested-by: Eric Dumazet <edumazet@google.com> # KASAN poisoning Cc: Dmitry Vyukov <dvyukov@google.com> # Help with KASAN Cc: Paolo Abeni <pabeni@redhat.com> # Reduced batch size Signed-off-by: Alexander Lobakin <alobakin@pm.me> --- include/linux/skbuff.h | 2 + net/core/skbuff.c | 94 ++++++++++++++++++++++++++++++++++++------ 2 files changed, 83 insertions(+), 13 deletions(-)