Message ID: 20161130123810.581ba21f@redhat.com
State: New
On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> I've played with a somewhat similar patch (from Achiad Shochat) for
> mlx5 (attached). While it gives huge improvements, the problem I ran
> into was that; TX performance became a function of the TX completion
> time/interrupt and could easily be throttled if configured too
> high/slow.
>
> Can your patch be affected by this too?

Like all TX business, you should know this Jesper.
No need to constantly remind us something very well known.

> Adjustable via:
>  ethtool -C mlx5p2 tx-usecs 16 tx-frames 32
>

Generally speaking, these settings impact TX throughput/latencies.

Since NIC IRQs then trigger softirqs, we already know that IRQ handling
is critical, and some tuning can help, depending on the particular load.

Now the trick is when you want a generic kernel, being deployed on hosts
shared by millions of jobs.

> On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
> >
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:      eth0   9.50 5707925.60   0.99 585285.69   0.00   0.00   0.50
> > lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:      eth0   2.40 9985214.90   0.31 1023874.60   0.00   0.00   0.50
>
> These +75% number is pktgen without "burst", and definitely show that
> your patch activate xmit_more.
> What is the pps performance number when using pktgen "burst" option?

About the same really. Almost all packets now get the xmit_more effect.

> > And about 11 % improvement on an mono-flow UDP_STREAM test.
> >
> > skb_set_owner_w() is now the most consuming function.
> >
> > lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> > [1] 13696
> > lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:      eth0   9.00 1307380.50   1.00 308356.18   0.00   0.00   0.50
> > lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> > lpaa23:~# sar -n DEV 1 10|grep eth0
> [...]
> > Average:      eth0   3.10 1459558.20   0.44 344267.57   0.00   0.00   0.50
>
> The +11% number seems consistent with my perf observations that approx
> 12% was "fakely" spend on the xmit spin_lock.
>
> [...]
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> [...]
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> >  {
> > -	return ring->prod - ring->cons > ring->full_size;
> > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> [...]
>
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	}
> >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >
> > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +	 * will happen shortly.
> > +	 */
> > +	if (send_doorbell &&
> > +	    dev->doorbell_opt &&
> > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>
> It would be nice with a function call with an appropriate name, instead
> of an open-coded queue size check. I'm also confused by the "ncons" name.
>
> > +		send_doorbell = false;
> > +
> [...]
>
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > index 574bcbb1b38f..c3fd0deda198 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> >  	 */
> >  	u32		last_nr_txbb;
> >  	u32		cons;
> > +	u32		ncons;
>
> Maybe we can find a better name than "ncons" ?

That's because 'cons' in this driver really means 'old cons'

and new cons = old cons + last_nr_txbb;

> >  	unsigned long	wake_queue;
> >  	struct netdev_queue *tx_queue;
> >  	u32		(*free_tx_desc)(struct mlx4_en_priv *priv,
> > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> >
> >  	/* cache line used and dirtied in mlx4_en_xmit() */
> >  	u32		prod ____cacheline_aligned_in_smp;
> > +	u32		prod_bell;
>
> Good descriptive variable name.

> >  	unsigned int	tx_dropped;
> >  	unsigned long	bytes;
> >  	unsigned long	packets;
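For illustration, Jesper's request for a named helper could look like the
sketch below. It is only a sketch against Eric's WIP: the helper name is
invented here, while prod_bell, ncons and doorbell_opt come from the patch
quoted above.

/* Sketch only: a named wrapper for the open-coded test in the WIP.
 * prod_bell is the last producer index for which a doorbell was rung,
 * ncons is the "new" consumer index (cons + last_nr_txbb).  If doorbelled
 * work is still pending completion, the upcoming TX completion can ring
 * the doorbell for us, so mlx4_en_xmit() may skip it.
 */
static inline bool mlx4_en_tx_completion_pending(const struct mlx4_en_tx_ring *ring)
{
	return (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0;
}

/* ...so the hot-path test in mlx4_en_xmit() could then read:
 *
 *	if (send_doorbell && dev->doorbell_opt &&
 *	    mlx4_en_tx_completion_pending(ring))
 *		send_doorbell = false;
 */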
On Wed, 30 Nov 2016 07:56:26 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-11-30 at 12:38 +0100, Jesper Dangaard Brouer wrote:
> > I've played with a somewhat similar patch (from Achiad Shochat) for
> > mlx5 (attached). While it gives huge improvements, the problem I ran
> > into was that; TX performance became a function of the TX completion
> > time/interrupt and could easily be throttled if configured too
> > high/slow.
> >
> > Can your patch be affected by this too?
>
> Like all TX business, you should know this Jesper.
> No need to constantly remind us something very well known.

Don't take it as critique, Eric. I was hoping your patch would have
solved this issue of being sensitive to TX completion adjustments. You
usually have good solutions for difficult issues. I basically rejected
Achiad's approach/patch because it was too sensitive to these kinds of
adjustments.

> > On Mon, 28 Nov 2016 22:58:36 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> > These +75% number is pktgen without "burst", and definitely show that
> > your patch activate xmit_more.
> > What is the pps performance number when using pktgen "burst" option?
>
> About the same really. Almost all packets now get the xmit_more effect.

Perfect!

> > [...]
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > index 4b597dca5c52..affebb435679 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > [...]
> > > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > >  {
> > > -	return ring->prod - ring->cons > ring->full_size;
> > > +	return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > >  }
> > [...]
> >
> > > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > >  	}
> > >  	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > >
> > > +	/* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > > +	 * will happen shortly.
> > > +	 */
> > > +	if (send_doorbell &&
> > > +	    dev->doorbell_opt &&
> > > +	    (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> >
> > It would be nice with a function call with an appropriate name, instead
> > of an open-coded queue size check. I'm also confused by the "ncons" name.
> >
> > > +		send_doorbell = false;
> > > +
> > [...]
> >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > index 574bcbb1b38f..c3fd0deda198 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > > @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
> > >  	 */
> > >  	u32		last_nr_txbb;
> > >  	u32		cons;
> > > +	u32		ncons;
> >
> > Maybe we can find a better name than "ncons" ?
>
> That's because 'cons' in this driver really means 'old cons'
>
> and new cons = old cons + last_nr_txbb;

It was not clear to me that "n" meant "new". It was also not clear that
this driver has an issue where "cons" (consumer) is tracking the "old"
cons.
> > >  	unsigned long	wake_queue;
> > >  	struct netdev_queue *tx_queue;
> > >  	u32		(*free_tx_desc)(struct mlx4_en_priv *priv,
> > > @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
> > >
> > >  	/* cache line used and dirtied in mlx4_en_xmit() */
> > >  	u32		prod ____cacheline_aligned_in_smp;
> > > +	u32		prod_bell;
> >
> > Good descriptive variable name.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
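If the naming concern is the only issue, the "new cons = old cons +
last_nr_txbb" relationship Eric describes could also be spelled out by an
accessor. This is purely an illustrative sketch, not part of either patch
(Eric's WIP keeps a dedicated ncons field updated in the completion path;
reading two fields here is a simplification):

/* Illustrative only: make the relationship explicit instead of exposing
 * a tersely named ncons field to the hot path.
 */
static inline u32 mlx4_en_tx_new_cons(const struct mlx4_en_tx_ring *ring)
{
	return READ_ONCE(ring->cons) + READ_ONCE(ring->last_nr_txbb);
}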
On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:

> Don't take it as critique, Eric. I was hoping your patch would have
> solved this issue of being sensitive to TX completion adjustments. You
> usually have good solutions for difficult issues. I basically rejected
> Achiad's approach/patch because it was too sensitive to these kinds of
> adjustments.

Well, this patch can hurt latencies, because a doorbell can be delayed,
and softirqs can be delayed by many hundreds of usec in some cases.

I would not enable this behavior by default.
On Wed, 2016-11-30 at 23:30 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 30 Nov 2016 11:30:00 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > On Wed, 2016-11-30 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> >
> > > Don't take it as critique, Eric. I was hoping your patch would have
> > > solved this issue of being sensitive to TX completion adjustments. You
> > > usually have good solutions for difficult issues. I basically rejected
> > > Achiad's approach/patch because it was too sensitive to these kinds of
> > > adjustments.
> >
> > Well, this patch can hurt latencies, because a doorbell can be delayed,
> > and softirqs can be delayed by many hundreds of usec in some cases.
> >
> > I would not enable this behavior by default.
>
> What about another scheme, where dev_hard_start_xmit() can return an
> indication that the driver chose not to flush (based on TX queue depth),
> thereby requesting the stack to call flush at a later point.
>
> Would that introduce less latency issues?

Again, how is it going to change anything when your netperf UDP sends
one packet at a time? qdisc gets one packet, dequeues it immediately.

No pressure. -> doorbell will be sent as before.

Some TX producers do not even use a qdisc to begin with.

Please think of a solution that does not involve the qdisc layer at all.
Another issue I found during my tests last days, is a problem with BQL,
and more generally when driver stops/starts the queue.

When under stress and BQL stops the queue, driver TX completion does a
lot of work, and servicing CPU also takes over further qdisc_run().

The work-flow is :

1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
unmap things, queue skbs for freeing.

2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);

   if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
           netif_schedule_queue(dev_queue);

This leaves a very tiny window where other cpus could grab
__QDISC_STATE_SCHED (They absolutely have no chance to grab it)

So we end up with one cpu doing the ndo_start_xmit() and TX completions,
and RX work.

This problem is magnified when XPS is used, if one mono-threaded
application deals with thousands of TCP sockets.

We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
to allow another cpu to service the qdisc and spare us.
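To make the work-flow above concrete, here is a condensed, hypothetical
sketch. The example_* names and ring type are made up; only
netdev_tx_completed_queue() and the BQL/XOFF behaviour it wraps are real.

#include <linux/netdevice.h>

/* Hypothetical ring type, for the sketch only. */
struct example_tx_ring {
	struct netdev_queue *tx_queue;
	/* descriptor bookkeeping elided */
};

static bool example_poll_tx_cq(struct example_tx_ring *ring, int budget)
{
	unsigned int packets = 0, bytes = 0;

	/* Step 1 (elided): walk the TX ring up to 'budget' descriptors,
	 * unmap DMA, count packets/bytes, queue skbs for
	 * napi_consume_skb().
	 */

	/* Step 2: BQL accounting.  If BQL had stopped the queue,
	 * netdev_tx_completed_queue() clears __QUEUE_STATE_STACK_XOFF and
	 * calls netif_schedule_queue(), which raises NET_TX_SOFTIRQ on
	 * *this* cpu.  Other cpus see the restarted queue only for a tiny
	 * window, so this cpu keeps winning __QDISC_STATE_SCHED and ends
	 * up doing qdisc_run(), TX completion and RX work alone.
	 */
	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);

	return packets < budget;
}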
On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Another issue I found during my tests last days, is a problem with BQL,
> and more generally when driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().
>
Hmm, hard to say if this is a problem or a feature I think ;-). One way
to "solve" this would be to use IRQ round robin, that would spread the
load of networking across CPUs, but that gives us no additional
parallelism and reduces locality-- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work or try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for queues they manage.

Do you have a test or application that shows this is causing pain?

> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
>    if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>            netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>
Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assume that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers so that the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.)-- that's promising up to the point that we need
a lot of heuristics and measurement; but then we lose determinism and
risk edge case where we get completely unsatisfied results (one of my
concerns with the recent attempt to put configuration in the kernel).

> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>
Wouldn't this need to be an active operation? That is to queue the
qdisc on another CPU's output_queue?

Tom
On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Another issue I found during my tests last days, is a problem with BQL,
> > and more generally when driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
> Hmm, hard to say if this is a problem or a feature I think ;-). One way
> to "solve" this would be to use IRQ round robin, that would spread the
> load of networking across CPUs, but that gives us no additional
> parallelism and reduces locality-- it's generally considered a bad
> idea. The question might be: is it better to continuously ding one CPU
> with all the networking work or try to artificially spread it out?
> Note this is orthogonal to MQ also, so we should already have multiple
> CPUs doing netif_schedule_queue for queues they manage.
>
> Do you have a test or application that shows this is causing pain?

Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.

super_netperf is not good for that, because you would need 10,000
processes and would spend too much cycles just dealing with an enormous
working set, you would not activate BQL.

> > The work-flow is :
> >
> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> > unmap things, queue skbs for freeing.
> >
> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
> >
> >    if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
> >            netif_schedule_queue(dev_queue);
> >
> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> > (They absolutely have no chance to grab it)
> >
> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> > and RX work.
> >
> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assume that workloads
> tend towards a normal distribution. Such mechanisms can become
> problematic under asymmetric loads, but then I would expect these are
> likely dedicated servers so that the mechanisms can be tuned
> accordingly. For instance, XPS can be configured to map more than one
> queue to a CPU. Alternatively, IIRC Windows has some functionality to
> tune networking for the load (spin up queues, reconfigure things
> similar to XPS/RPS, etc.)-- that's promising up to the point that we need
> a lot of heuristics and measurement; but then we lose determinism and
> risk edge case where we get completely unsatisfied results (one of my
> concerns with the recent attempt to put configuration in the kernel).

We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.

Very often, applications writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care of doing this very fast. They also do not want to
hurt other applications sharing the NIC.

Very often, process scheduler will also run this single thread in a
single cpu, ie avoiding expensive migrations if they are not needed.

Problem is this behavior locks one TX queue for the duration of the
multicast, since XPS will force all the TX packets to go to one TX
queue.
Other flows that would share the locked CPU experience high P99
latencies.

> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is to queue the
> qdisc on another CPU's output_queue?

I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb() and maybe RX processing,
since most NICs handle TX completion before doing RX processing from
their napi poll handler).

Should be doable with few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right?
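A rough sketch of that idea follows. The __QDISC_STATE_DEFERRED name and
the example_* helpers are hypothetical (upstream only has
__QDISC_STATE_SCHED for this), so treat it as an illustration of the
intent rather than a patch.

#include <linux/interrupt.h>
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

/* Hypothetical bit: "someone should service this qdisc soon", as opposed
 * to __QDISC_STATE_SCHED, which ties the qdisc to the scheduling cpu's
 * output_queue right away.  Value chosen arbitrarily for the sketch.
 */
#define __QDISC_STATE_DEFERRED	3

/* TX-completion side: mark the qdisc and kick the softirq, but do not
 * claim ownership yet; any cpu entering dev_queue_xmit() meanwhile can
 * run the qdisc itself.
 */
static void example_defer_qdisc_service(struct Qdisc *q)
{
	if (!test_and_set_bit(__QDISC_STATE_DEFERRED, &q->state))
		raise_softirq(NET_TX_SOFTIRQ);
}

/* net_tx_action()-like side, possibly ~100 usec later: only service the
 * qdisc if nobody else already did.
 */
static void example_service_deferred(struct Qdisc *q)
{
	if (test_and_clear_bit(__QDISC_STATE_DEFERRED, &q->state) &&
	    qdisc_run_begin(q)) {
		__qdisc_run(q);
		qdisc_run_end(q);
	}
}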
On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:

> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from
> their napi poll handler).
>
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right?

Hmm... this is silly. Code already implements a different bit.

qdisc_run() seems to run more often from net_tx_action(), I have to find
out why.
On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
>> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Another issue I found during my tests last days, is a problem with BQL,
>> > and more generally when driver stops/starts the queue.
>> >
>> > When under stress and BQL stops the queue, driver TX completion does a
>> > lot of work, and servicing CPU also takes over further qdisc_run().
>> >
>> Hmm, hard to say if this is a problem or a feature I think ;-). One way
>> to "solve" this would be to use IRQ round robin, that would spread the
>> load of networking across CPUs, but that gives us no additional
>> parallelism and reduces locality-- it's generally considered a bad
>> idea. The question might be: is it better to continuously ding one CPU
>> with all the networking work or try to artificially spread it out?
>> Note this is orthogonal to MQ also, so we should already have multiple
>> CPUs doing netif_schedule_queue for queues they manage.
>>
>> Do you have a test or application that shows this is causing pain?
>
> Yes, just launch enough TCP senders (more than 10,000) to fully utilize
> the NIC, with small messages.
>
> super_netperf is not good for that, because you would need 10,000
> processes and would spend too much cycles just dealing with an enormous
> working set, you would not activate BQL.
>
>>
>> > The work-flow is :
>> >
>> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
>> > unmap things, queue skbs for freeing.
>> >
>> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>> >
>> >    if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>> >            netif_schedule_queue(dev_queue);
>> >
>> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
>> > (They absolutely have no chance to grab it)
>> >
>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>> > and RX work.
>> >
>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>> > thousands of TCP sockets.
>> >
>> Do you know of an application doing this? The typical way XPS and most
>> of the other steering mechanisms are configured assume that workloads
>> tend towards a normal distribution. Such mechanisms can become
>> problematic under asymmetric loads, but then I would expect these are
>> likely dedicated servers so that the mechanisms can be tuned
>> accordingly. For instance, XPS can be configured to map more than one
>> queue to a CPU. Alternatively, IIRC Windows has some functionality to
>> tune networking for the load (spin up queues, reconfigure things
>> similar to XPS/RPS, etc.)-- that's promising up to the point that we need
>> a lot of heuristics and measurement; but then we lose determinism and
>> risk edge case where we get completely unsatisfied results (one of my
>> concerns with the recent attempt to put configuration in the kernel).
>
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, applications writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care of doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, process scheduler will also run this single thread in a
> single cpu, ie avoiding expensive migrations if they are not needed.
>
> Problem is this behavior locks one TX queue for the duration of the
> multicast, since XPS will force all the TX packets to go to one TX
> queue.
>
The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potential configurations
that present different tradeoffs. For instance, TX queues are plentiful
nowadays to the point that we could map a number of queues to each CPU
while respecting NUMA locality between the sending CPU and where TX
completions occur. If the set is sufficiently large this would also
help to address device lock contention. The other thing I'm wondering
is if Willem's concept of distributing DOS attacks in RPS might be
applicable in XPS. If we could detect that a TX queue is "under attack"
maybe we can automatically backoff to distributing the load to more
queues in XPS somehow.

Tom

> Other flows that would share the locked CPU experience high P99
> latencies.
>
>>
>> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
>> > to allow another cpu to service the qdisc and spare us.
>> >
>> Wouldn't this need to be an active operation? That is to queue the
>> qdisc on another CPU's output_queue?
>
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NICs handle TX completion before doing RX processing from
> their napi poll handler).
>
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right?
>>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>>> > and RX work.

This problem is somewhat tangential to the doorbell avoidance
discussion.

>>> >
>>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>>> > thousands of TCP sockets.
>>> >
>> We have thousands of applications, and some of them 'kind of multicast'
>> events to a broad number of TCP sockets.
>>
>> Very often, applications writers use a single thread for doing this,
>> when all they need is to send small packets to 10,000 sockets, and they
>> do not really care of doing this very fast. They also do not want to
>> hurt other applications sharing the NIC.
>>
>> Very often, process scheduler will also run this single thread in a
>> single cpu, ie avoiding expensive migrations if they are not needed.
>>
>> Problem is this behavior locks one TX queue for the duration of the
>> multicast, since XPS will force all the TX packets to go to one TX
>> queue.
>>
> The fact that XPS is forcing TX packets to go over one CPU is a result
> of how you've configured XPS. There are other potential configurations
> that present different tradeoffs.

Right, XPS supports multiple txqueue mappings, using skb_tx_hash to
decide among them. Unfortunately cross-cpu is more expensive than tx +
completion on the same core, so this is suboptimal for the common case
where there is no excessive load imbalance.

> For instance, TX queues are plentiful nowadays to the point that we
> could map a number of queues to each CPU while respecting NUMA locality
> between the sending CPU and where TX completions occur. If the set is
> sufficiently large this would also help to address device lock
> contention. The other thing I'm wondering is if Willem's concept of
> distributing DOS attacks in RPS might be applicable in XPS. If we could
> detect that a TX queue is "under attack" maybe we can automatically
> backoff to distributing the load to more queues in XPS somehow.

If only targeting states of imbalance, that indeed could work.

For the 10,000 socket case, instead of load balancing qdisc servicing,
we could perhaps modify tx queue selection in __netdev_pick_tx to
choose another queue if the initial choice is paused or otherwise
backlogged.
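As a hedged sketch of that fallback idea: the helper name and the
"try the next queue" policy below are assumptions (upstream
__netdev_pick_tx has no such logic), and a real version would also have
to respect the XPS CPU mask and look at qdisc backlog rather than only
the driver stop state.

#include <linux/netdevice.h>

/* Illustrative only: given the queue index that XPS (or the flow hash)
 * would normally select, fall back to a neighbouring queue when the
 * chosen one is stopped/frozen by the driver.
 */
static u16 example_pick_tx_with_fallback(struct net_device *dev, u16 qindex)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, qindex);

	if (netif_xmit_stopped(txq) && dev->real_num_tx_queues > 1)
		qindex = (qindex + 1) % dev->real_num_tx_queues; /* naive policy */

	return qindex;
}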
On Wed, 2016-11-30 at 18:50 -0800, Eric Dumazet wrote:
> On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:
>
> > I simply suggest we try to queue the qdisc for further servicing as we
> > do today, from net_tx_action(), but we might use a different bit, so
> > that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> > before we grab it from net_tx_action(), maybe 100 usec later (time to
> > flush all skbs queued in napi_consume_skb() and maybe RX processing,
> > since most NICs handle TX completion before doing RX processing from
> > their napi poll handler).
> >
> > Should be doable with few changes in __netif_schedule() and
> > net_tx_action(), plus some control paths that will need to take care of
> > the new bit at dismantle time, right?
>
> Hmm... this is silly. Code already implements a different bit.
>
> qdisc_run() seems to run more often from net_tx_action(), I have to find
> out why.

After more analysis I believe TSQ was one of the bottlenecks.

I prepared a patch series that helped my use cases.
From: Tariq Toukan <tariqt@mellanox.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>, Achiad Shochat <achiad@mellanox.com>,
    Rana Shahout <ranas@mellanox.com>, Saeed Mahameed <saeedm@mellanox.com>
Subject: [PATCH] net/mlx5e: force tx skb bulking
Date: Wed, 17 Aug 2016 12:14:51 +0300
Message-Id: <1471425291-1782-1-git-send-email-tariqt@mellanox.com>

From: Achiad Shochat <achiad@mellanox.com>

To improve SW message rate in case HW is faster.

Heuristically detect cases where the message rate is high and there is
still no skb bulking, and if so, stop the txq for a while trying to
force the bulking.
Change-Id: Icb925135e69b030943cb4666117c47d1cc04da97
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    | 5 +++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 9 ++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 74edd01..78a0661 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -394,6 +394,10 @@ enum {
 	MLX5E_SQ_STATE_TX_TIMEOUT,
 };
 
+enum {
+	MLX5E_SQ_STOP_ONCE,
+};
+
 struct mlx5e_ico_wqe_info {
 	u8  opcode;
 	u8  num_wqebbs;
@@ -403,6 +407,7 @@ struct mlx5e_sq {
 	/* data path */
 
 	/* dirtied @completion */
+	unsigned long              flags;
 	u16                        cc;
 	u32                        dma_fifo_cc;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e073bf59..034eef0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -351,8 +351,10 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
 		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 
-	if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
+	if (test_bit(MLX5E_SQ_STOP_ONCE, &sq->flags) ||
+	    unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_SQ_STOP_ROOM))) {
 		netif_tx_stop_queue(sq->txq);
+		clear_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 		sq->stats.stopped++;
 	}
 
@@ -429,6 +431,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	u32 dma_fifo_cc;
 	u32 nbytes;
 	u16 npkts;
+	u16 ncqes;
 	u16 sqcc;
 	int i;
 
@@ -439,6 +442,7 @@
 	npkts = 0;
 	nbytes = 0;
+	ncqes = 0;
 
 	/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
 	 * otherwise a cq overrun may occur
@@ -458,6 +462,7 @@
 			break;
 
 		mlx5_cqwq_pop(&cq->wq);
+		ncqes++;
 
 		wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
@@ -508,6 +513,8 @@
 	sq->dma_fifo_cc = dma_fifo_cc;
 	sq->cc = sqcc;
 
+	if ((npkts > 7) && ((npkts >> (ilog2(ncqes))) < 8))
+		set_bit(MLX5E_SQ_STOP_ONCE, &sq->flags);
 	netdev_tx_completed_queue(sq->txq, npkts, nbytes);
 
-- 
1.8.3.1
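The stop-once heuristic in mlx5e_poll_tx_cq() amounts to: if the
completion batch was big enough to matter (npkts > 7) but averaged fewer
than 8 packets per completed CQE, treat this as "high message rate but no
skb bulking" and stop the txq once so the stack is forced to queue up a
burst. A standalone, hedged illustration of that arithmetic (not driver
code; ilog2_u32 stands in for the kernel's ilog2()):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for kernel ilog2(): floor(log2(v)), v > 0. */
static unsigned int ilog2_u32(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* The patch's condition: npkts >> ilog2(ncqes) approximates
 * npkts / ncqes, the average number of packets per completed CQE.
 * A low average suggests packets are being completed (and likely
 * doorbelled) one at a time rather than in batches.
 */
static bool should_stop_once(unsigned int npkts, unsigned int ncqes)
{
	return npkts > 7 && (npkts >> ilog2_u32(ncqes)) < 8;
}

int main(void)
{
	/* 64 packets spread over 64 CQEs: ~1 pkt/CQE -> force bulking. */
	printf("%d\n", should_stop_once(64, 64));	/* prints 1 */
	/* 64 packets over 4 CQEs: ~16 pkts/CQE -> already bulking.    */
	printf("%d\n", should_stop_once(64, 4));	/* prints 0 */
	return 0;
}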