Message ID: 20240530201616.1316526-1-almasrymina@google.com
Series: Device Memory TCP
On 2024-05-30 13:16, Mina Almasry wrote:
[...]
> +err_start_queue:
> +	/* Restarting the queue with old_mem should be successful as we haven't
> +	 * changed any of the queue configuration, and there is not much we can
> +	 * do to recover from a failure here.
> +	 *
> +	 * WARN if the we fail to recover the old rx queue, and at least free
> +	 * old_mem so we don't also leak that.
> +	 */
> +	if (dev->queue_mgmt_ops->ndo_queue_start(dev, old_mem, rxq_idx)) {
> +		WARN(1,
> +		     "Failed to restart old queue in error path. RX queue %d may be unhealthy.",
> +		     rxq_idx);
> +		dev->queue_mgmt_ops->ndo_queue_mem_free(dev, &old_mem);

This should be ->ndo_queue_mem_free(dev, old_mem).
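[Editor's note: a minimal sketch of how the error branch might look with the suggested one-character fix applied. It is based only on the hunk quoted above; the surrounding function is elided.]

	err_start_queue:
		/* old_mem is already the void * handle the driver hands back for
		 * its queue memory, so it is passed by value. Taking &old_mem
		 * would give ndo_queue_mem_free() the address of the local
		 * variable rather than the queue memory itself.
		 */
		if (dev->queue_mgmt_ops->ndo_queue_start(dev, old_mem, rxq_idx)) {
			WARN(1,
			     "Failed to restart old queue in error path. RX queue %d may be unhealthy.",
			     rxq_idx);
			dev->queue_mgmt_ops->ndo_queue_mem_free(dev, old_mem);
		}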
On Thu, May 30, 2024 at 08:16:12PM +0000, Mina Almasry wrote:
> Add documentation outlining the usage and details of devmem TCP.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>

The doc LGTM, thanks!

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
On 5/30/24 21:16, Mina Almasry wrote:
> Add netdev_rx_queue_restart() function to netdev_rx_queue.h
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
...
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> new file mode 100644
> index 0000000000000..b3899358e5a9c
> --- /dev/null
> +++ b/net/core/netdev_rx_queue.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#include <linux/netdevice.h>
> +#include <net/netdev_queues.h>
> +#include <net/netdev_rx_queue.h>
> +
> +int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
> +{
> +	void *new_mem, *old_mem;
> +	int err;

I believe it should also do:

	if (!dev->queue_mgmt_ops)
		return -EOPNOTSUPP;

> +
> +	if (!dev->queue_mgmt_ops->ndo_queue_stop ||
> +	    !dev->queue_mgmt_ops->ndo_queue_mem_free ||
> +	    !dev->queue_mgmt_ops->ndo_queue_mem_alloc ||
> +	    !dev->queue_mgmt_ops->ndo_queue_start)
> +		return -EOPNOTSUPP;
> +
> +	DEBUG_NET_WARN_ON_ONCE(!rtnl_is_locked());
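[Editor's note: one possible way to fold the suggested NULL check into the existing capability test, sketched from the quoted hunk only; the rest of the restart sequence is elided.]

	int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
	{
		void *new_mem, *old_mem;
		int err;

		/* Drivers without queue management support leave queue_mgmt_ops
		 * NULL, so test the ops table itself before dereferencing any of
		 * its callbacks.
		 */
		if (!dev->queue_mgmt_ops ||
		    !dev->queue_mgmt_ops->ndo_queue_stop ||
		    !dev->queue_mgmt_ops->ndo_queue_mem_free ||
		    !dev->queue_mgmt_ops->ndo_queue_mem_alloc ||
		    !dev->queue_mgmt_ops->ndo_queue_start)
			return -EOPNOTSUPP;

		DEBUG_NET_WARN_ON_ONCE(!rtnl_is_locked());

		/* ... remainder of the restart sequence from the posted patch ... */
	}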
On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index d82f92d7cf9ce..d5fac8edf621d 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
>  	kfree(owner);
>  }
>
> +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)

Minor nit: please no 'inline' keyword in c files.

Thanks,

Paolo
On Tue, 04 Jun 2024 12:13:15 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index d82f92d7cf9ce..d5fac8edf621d 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> >  	kfree(owner);
> >  }
> >
> > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
>
> Minor nit: please no 'inline' keyword in c files.

I'm curious. Is this a networking rule? I use 'inline' in my C code all the
time.

-- Steve
On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> On Tue, 04 Jun 2024 12:13:15 +0200
> Paolo Abeni <pabeni@redhat.com> wrote:
>
> > On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > --- a/net/core/devmem.c
> > > +++ b/net/core/devmem.c
> > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > >  	kfree(owner);
> > >  }
> > >
> > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
> >
> > Minor nit: please no 'inline' keyword in c files.
>
> I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> time.

It mostly comes from Documentation/process/coding-style.rst:

15) The inline disease
----------------------

There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called ``inline``. While the use of inlines can be
appropriate (for example as a means of replacing macros, see Chapter 12), it
very often is not. Abundant use of the inline keyword leads to a much bigger
kernel, which in turn slows the system as a whole down, due to a bigger
icache footprint for the CPU and simply because there is less memory
available for the pagecache. Just think about it; a pagecache miss causes a
disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
that can go into these 5 milliseconds.

A reasonable rule of thumb is to not put inline at functions that have more
than 3 lines of code in them. An exception to this rule are the cases where
a parameter is known to be a compiletime constant, and as a result of this
constantness you *know* the compiler will be able to optimize most of your
function away at compile time. For a good example of this later case, see
the kmalloc() inline function.

Often people argue that adding inline to functions that are static and used
only once is always a win since there is no space tradeoff. While this is
technically correct, gcc is capable of inlining these automatically without
help, and the maintenance issue of removing the inline when a second user
appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.

Jason
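[Editor's note: a small standalone illustration of the rule of thumb and its documented exception. All names are invented; this is not code from the posted patch.]

	#include <stddef.h>

	struct example_iov {
		unsigned long dma_addr;
		unsigned long offset;
	};

	/* In a .c file the style guide says to leave the decision to the
	 * compiler: a small static helper like this is inlined automatically
	 * at -O2 without any keyword.
	 */
	static unsigned long example_get_dma_addr(const struct example_iov *iov)
	{
		return iov->dma_addr + iov->offset;
	}

	/* The documented exception: when a parameter is usually a compile-time
	 * constant, most of the body folds away at compile time, which is why
	 * a header helper in the style of kmalloc() may reasonably be marked
	 * inline.
	 */
	static inline size_t example_slot_bytes(size_t nr_slots)
	{
		return nr_slots * sizeof(struct example_iov);
	}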
On Tue, 4 Jun 2024 13:31:58 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> > On Tue, 04 Jun 2024 12:13:15 +0200
> > Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > > On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > > --- a/net/core/devmem.c
> > > > +++ b/net/core/devmem.c
> > > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > > >  	kfree(owner);
> > > >  }
> > > >
> > > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
> > >
> > > Minor nit: please no 'inline' keyword in c files.
> >
> > I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> > time.
>
> It mostly comes from Documentation/process/coding-style.rst:
>
> 15) The inline disease
> ----------------------
>
> There appears to be a common misperception that gcc has a magic "make me
> faster" speedup option called ``inline``. While the use of inlines can be
> appropriate (for example as a means of replacing macros, see Chapter 12), it
> very often is not. Abundant use of the inline keyword leads to a much bigger
> kernel, which in turn slows the system as a whole down, due to a bigger
> icache footprint for the CPU and simply because there is less memory
> available for the pagecache. Just think about it; a pagecache miss causes a
> disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
> that can go into these 5 milliseconds.
>
> A reasonable rule of thumb is to not put inline at functions that have more
> than 3 lines of code in them. An exception to this rule are the cases where
> a parameter is known to be a compiletime constant, and as a result of this
> constantness you *know* the compiler will be able to optimize most of your
> function away at compile time. For a good example of this later case, see
> the kmalloc() inline function.
>
> Often people argue that adding inline to functions that are static and used
> only once is always a win since there is no space tradeoff. While this is
> technically correct, gcc is capable of inlining these automatically without
> help, and the maintenance issue of removing the inline when a second user
> appears outweighs the potential value of the hint that tells gcc to do
> something it would have done anyway.
>

Interesting, as I sped up the ftrace ring buffer by a substantial amount by
adding strategic __always_inline, noinline, likely() and unlikely()
throughout the code. It had to do with what was considered the fast path
and slow path, and not actually the size of the function. gcc got it
horribly wrong.

-- Steve
> Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> adding strategic __always_inline, noinline, likely() and unlikely()
> throughout the code. It had to do with what was considered the fast path
> and slow path, and not actually the size of the function. gcc got it
> horribly wrong.

And what did the compiler people say when you reported gcc was getting
it wrong?

Our assumption is, the compiler is better than a human at deciding
this. Or at least, a human who does not spend a long time profiling
and tuning. If this assumption is not true, we probably should be
trying to figure out why, and improving the compiler when
possible. That will benefit everybody.

	Andrew
On Wed, 5 Jun 2024 01:44:37 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > adding strategic __always_inline, noinline, likely() and unlikely()
> > throughout the code. It had to do with what was considered the fast path
> > and slow path, and not actually the size of the function. gcc got it
> > horribly wrong.
>
> And what did the compiler people say when you reported gcc was getting
> it wrong?
>
> Our assumption is, the compiler is better than a human at deciding
> this. Or at least, a human who does not spend a long time profiling
> and tuning. If this assumption is not true, we probably should be
> trying to figure out why, and improving the compiler when
> possible. That will benefit everybody.
>

How is the compiler going to know which path is going to be taken the most?
There's two main paths in the ring buffer logic. One when an event stays on
the sub-buffer, the other when the event crosses over to a new sub buffer.
As there's 100s of events that happen on the same sub-buffer for every one
time there's a cross over, I optimized the paths that stayed on the
sub-buffer, which caused the time for those events to go from 250ns down to
150ns. That's a 40% speed up.

I added the unlikely/likely and 'always_inline' and 'noinline' paths to
make sure the "staying on the buffer" path was always the hot path, and
keeping it tight in cache.

How is a compiler going to know that?

-- Steve
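[Editor's note: a standalone sketch of the annotation pattern being described, with invented names; the real ftrace ring buffer code is far more involved. likely()/unlikely() are defined locally here only so the sketch compiles outside the kernel, which already provides them.]

	#include <stddef.h>

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)

	struct demo_subbuf {
		char	*data;
		size_t	size;
		size_t	used;
	};

	/* Hot path: forced inline so the common case stays one tight,
	 * cache-friendly instruction stream.
	 */
	static inline __attribute__((always_inline))
	void *demo_reserve_fast(struct demo_subbuf *sb, size_t len)
	{
		void *p = sb->data + sb->used;

		sb->used += len;
		return p;
	}

	/* Cold path: the crossover to a new sub-buffer happens once per
	 * hundreds of events, so keep it out of line to avoid bloating the
	 * hot path. Here it just wraps back to the start of the buffer.
	 */
	static __attribute__((noinline))
	void *demo_reserve_slow(struct demo_subbuf *sb, size_t len)
	{
		if (len > sb->size)
			return NULL;
		sb->used = len;
		return sb->data;
	}

	static void *demo_reserve(struct demo_subbuf *sb, size_t len)
	{
		/* Hundreds of events fit in the current sub-buffer for every
		 * crossover, so tell the compiler which branch is hot.
		 */
		if (likely(sb->used + len <= sb->size))
			return demo_reserve_fast(sb, len);

		return demo_reserve_slow(sb, len);
	}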
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150ns. That's a 40% speed up.
>
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
>
> How is a compiler going to know that?

It might have some heuristics to try to guess unlikely/likely, but
that is not what we are talking about here.

How much difference did 'always_inline' and 'noinline' make? Hopefully
the likely is enough of a clue that it should prefer to inline whatever is
in that branch, whereas for the unlikely case it can do a function
call.

But compilers is not my thing, which is why I would reach out to the
compiler people and ask them, is it expected to get this wrong, could
it be made better?

	Andrew
On Wed, 5 Jun 2024 02:52:29 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > How is a compiler going to know that?
>
> It might have some heuristics to try to guess unlikely/likely, but
> that is not what we are talking about here.
>
> How much difference did 'always_inline' and 'noinline' make? Hopefully
> the likely is enough of a clue that it should prefer to inline whatever is
> in that branch, whereas for the unlikely case it can do a function
> call.

Perhaps, but one of the issues was that I have lots of small functions that
are used all over the place, and gcc tends to change them to function calls,
instead of duplicating them. I did this analysis back in 2016, so maybe it
became better.

>
> But compilers is not my thing, which is why I would reach out to the
> compiler people and ask them, is it expected to get this wrong, could
> it be made better?

Well, I actually do work with the compiler folks, and we are actually trying
to get a session at GNU Cauldron where Linux kernel folks can talk with the
gcc compiler folks.

I've stared at so many objdump outputs that I can now pretty much see the
assembly that my C code makes ;-)

-- Steve
On Tue, 2024-06-04 at 20:27 -0400, Steven Rostedt wrote:
> On Wed, 5 Jun 2024 01:44:37 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > > adding strategic __always_inline, noinline, likely() and unlikely()
> > > throughout the code. It had to do with what was considered the fast path
> > > and slow path, and not actually the size of the function. gcc got it
> > > horribly wrong.
> >
> > And what did the compiler people say when you reported gcc was getting
> > it wrong?
> >
> > Our assumption is, the compiler is better than a human at deciding
> > this. Or at least, a human who does not spend a long time profiling
> > and tuning. If this assumption is not true, we probably should be
> > trying to figure out why, and improving the compiler when
> > possible. That will benefit everybody.
> >
>
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150ns. That's a 40% speed up.
>
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
>
> How is a compiler going to know that?
>
> -- Steve

Isn't this basically a perfect example of something where profile guided
optimization should work?

Thanks,
Niklas
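[Editor's note: a minimal userspace sketch of the profile-guided optimization workflow being suggested, to show how real branch frequencies can replace hand-placed likely()/unlikely() hints. File name, function names, and numbers are invented; this says nothing about whether PGO is practical for the kernel ring buffer itself.]

	/* Build, train, and rebuild with GCC:
	 *
	 *   gcc -O2 -fprofile-generate pgo_demo.c -o pgo_demo
	 *   ./pgo_demo              # representative run, writes a .gcda profile
	 *   gcc -O2 -fprofile-use pgo_demo.c -o pgo_demo
	 *
	 * With the profile applied, the compiler observes that the "stays in the
	 * current buffer" branch dominates and lays out code (and makes inlining
	 * decisions) accordingly, without source annotations.
	 */
	#include <stdio.h>

	static int crossed;

	static void handle_crossover(void)
	{
		crossed++;	/* cold: roughly 1 in 500 events in this toy workload */
	}

	static void handle_event(unsigned int i)
	{
		if (i % 500 == 0)	/* rare path */
			handle_crossover();
		/* hot path: nothing else to do in this toy example */
	}

	int main(void)
	{
		for (unsigned int i = 0; i < 1000000; i++)
			handle_event(i);

		printf("crossovers: %d\n", crossed);
		return 0;
	}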