Message ID | 20200928183529.471328-1-nitesh@redhat.com |
---|---|
Headers | show |
Series | isolation: limit msix vectors to housekeeping CPUs | expand |
On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote: > Nitesh Narayan Lal (4): > sched/isolation: API to get number of housekeeping CPUs > sched/isolation: Extend nohz_full to isolate managed IRQs > i40e: Limit msix vectors to housekeeping CPUs > PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs > > drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++- > drivers/pci/msi.c | 18 ++++++++++++++++++ > include/linux/sched/isolation.h | 9 +++++++++ > kernel/sched/isolation.c | 2 +- > 4 files changed, 30 insertions(+), 2 deletions(-) Acked-by: Frederic Weisbecker <frederic@kernel.org> Peter, if you're ok with the set, I guess this should go through the scheduler tree? Thanks.
On 10/1/20 11:49 AM, Frederic Weisbecker wrote: > On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote: >> Nitesh Narayan Lal (4): >> sched/isolation: API to get number of housekeeping CPUs >> sched/isolation: Extend nohz_full to isolate managed IRQs >> i40e: Limit msix vectors to housekeeping CPUs >> PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs >> >> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++- >> drivers/pci/msi.c | 18 ++++++++++++++++++ >> include/linux/sched/isolation.h | 9 +++++++++ >> kernel/sched/isolation.c | 2 +- >> 4 files changed, 30 insertions(+), 2 deletions(-) > Acked-by: Frederic Weisbecker <frederic@kernel.org> > > Peter, if you're ok with the set, I guess this should go through > the scheduler tree? > > Thanks. Hi Peter, I wanted follow up to check if you have any concerns/suggestions that I should address in this patch-set before this can be picked?
On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote: > >> + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ); > >> + > >> + /* > >> + * If we have isolated CPUs for use by real-time tasks, to keep the > >> + * latency overhead to a minimum, device-specific IRQ vectors are moved > >> + * to the housekeeping CPUs from the userspace by changing their > >> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from > >> + * running out of IRQ vectors. > >> + */ > >> + if (hk_cpus < num_online_cpus()) { > >> + if (hk_cpus < min_vecs) > >> + max_vecs = min_vecs; > >> + else if (hk_cpus < max_vecs) > >> + max_vecs = hk_cpus; > > is that: > > > > max_vecs = clamp(hk_cpus, min_vecs, max_vecs); > > Yes, I think this will do. > > > > > Also, do we really need to have that conditional on hk_cpus < > > num_online_cpus()? That is, why can't we do this unconditionally? > > FWIU most of the drivers using this API already restricts the number of > vectors based on the num_online_cpus, if we do it unconditionally we can > unnecessary duplicate the restriction for cases where we don't have any > isolated CPUs. unnecessary isn't really a concern here, this is a slow path. What's important is code clarity. > Also, different driver seems to take different factors into consideration > along with num_online_cpus while finding the max_vecs to request, for > example in the case of mlx5: > MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() + > MLX5_EQ_VEC_COMP_BASE > > Having hk_cpus < num_online_cpus() helps us ensure that we are only > changing the behavior when we have isolated CPUs. > > Does that make sense? That seems to want to allocate N interrupts per cpu (plus some random static amount, which seems weird, but whatever). This patch breaks that. So I think it is important to figure out what that driver really wants in the nohz_full case. If it wants to retain N interrupts per CPU, and only reduce the number of CPUs, the proposed interface is wrong. > > And what are the (desired) semantics vs hotplug? Using a cpumask without > > excluding hotplug is racy. > > The housekeeping_mask should still remain constant, isn't? > In any case, I can double check this. The goal is very much to have that dynamically configurable.
On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote: > On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote: > > >> + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ); > > >> + > > >> + /* > > >> + * If we have isolated CPUs for use by real-time tasks, to keep the > > >> + * latency overhead to a minimum, device-specific IRQ vectors are moved > > >> + * to the housekeeping CPUs from the userspace by changing their > > >> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from > > >> + * running out of IRQ vectors. > > >> + */ > > >> + if (hk_cpus < num_online_cpus()) { > > >> + if (hk_cpus < min_vecs) > > >> + max_vecs = min_vecs; > > >> + else if (hk_cpus < max_vecs) > > >> + max_vecs = hk_cpus; > > > is that: > > > > > > max_vecs = clamp(hk_cpus, min_vecs, max_vecs); > > > > Yes, I think this will do. > > > > > > > > Also, do we really need to have that conditional on hk_cpus < > > > num_online_cpus()? That is, why can't we do this unconditionally? > > > > FWIU most of the drivers using this API already restricts the number of > > vectors based on the num_online_cpus, if we do it unconditionally we can > > unnecessary duplicate the restriction for cases where we don't have any > > isolated CPUs. > > unnecessary isn't really a concern here, this is a slow path. What's > important is code clarity. > > > Also, different driver seems to take different factors into consideration > > along with num_online_cpus while finding the max_vecs to request, for > > example in the case of mlx5: > > MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() + > > MLX5_EQ_VEC_COMP_BASE > > > > Having hk_cpus < num_online_cpus() helps us ensure that we are only > > changing the behavior when we have isolated CPUs. > > > > Does that make sense? > > That seems to want to allocate N interrupts per cpu (plus some random > static amount, which seems weird, but whatever). This patch breaks that. On purpose. For the isolated CPUs we don't want network device interrupts (in this context). > So I think it is important to figure out what that driver really wants > in the nohz_full case. If it wants to retain N interrupts per CPU, and > only reduce the number of CPUs, the proposed interface is wrong. It wants N interrupts per non-isolated (AKA housekeeping) CPU. Zero interrupts for isolated interrupts. > > > And what are the (desired) semantics vs hotplug? Using a cpumask without > > > excluding hotplug is racy. > > > > The housekeeping_mask should still remain constant, isn't? > > In any case, I can double check this. > > The goal is very much to have that dynamically configurable. Yes, but this patch is a fix for customer bug in the old, static on-boot isolation CPU configuration. --- Discussing the dynamic configuration (not this patch!) case: Would need to enable/disable interrupts for a particular device on a per-CPU basis. Such interface does not exist yet. Perhaps that is what you are looking for when writing "proposed interface is wrong" Peter?
On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote: > > > And what are the (desired) semantics vs hotplug? Using a cpumask without > > > excluding hotplug is racy. > > > > The housekeeping_mask should still remain constant, isn't? > > In any case, I can double check this. > > The goal is very much to have that dynamically configurable. Right but unfortunately we are not there before a little while. And the existing code in these drivers allocating vectors doesn't even take into account hotplug as you spotted. So I agreed to let Nitesh fix this issue on top of the existing code until he can look into providing some infrastructure for these kind of vectors allocation. The first step would be to consolidate similar code from other drivers, then maybe handle hotplug and later dynamic housekeeping.
On 10/19/20 10:00 AM, Marcelo Tosatti wrote: > On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote: >> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote: [...] >>>> Also, do we really need to have that conditional on hk_cpus < >>>> num_online_cpus()? That is, why can't we do this unconditionally? >>> FWIU most of the drivers using this API already restricts the number of >>> vectors based on the num_online_cpus, if we do it unconditionally we can >>> unnecessary duplicate the restriction for cases where we don't have any >>> isolated CPUs. >> unnecessary isn't really a concern here, this is a slow path. What's >> important is code clarity. Right, I can skip that check then. >> >>> Also, different driver seems to take different factors into consideration >>> along with num_online_cpus while finding the max_vecs to request, for >>> example in the case of mlx5: >>> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() + >>> MLX5_EQ_VEC_COMP_BASE >>> >>> Having hk_cpus < num_online_cpus() helps us ensure that we are only >>> changing the behavior when we have isolated CPUs. >>> >>> Does that make sense? >> That seems to want to allocate N interrupts per cpu (plus some random >> static amount, which seems weird, but whatever). This patch breaks that. > On purpose. For the isolated CPUs we don't want network device > interrupts (in this context). > >> So I think it is important to figure out what that driver really wants >> in the nohz_full case. If it wants to retain N interrupts per CPU, and >> only reduce the number of CPUs, the proposed interface is wrong. > It wants N interrupts per non-isolated (AKA housekeeping) CPU. > Zero interrupts for isolated interrupts. Right, otherwise we may end up in a situation where we run out of per CPU vectors while we move the IRQs from isolated CPUs to housekeeping.
On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote: > > So I think it is important to figure out what that driver really wants > > in the nohz_full case. If it wants to retain N interrupts per CPU, and > > only reduce the number of CPUs, the proposed interface is wrong. > > It wants N interrupts per non-isolated (AKA housekeeping) CPU. Then the patch is wrong and the interface needs changing from @min_vecs, @max_vecs to something that expresses the N*nr_cpus relation.
On 10/20/20 3:30 AM, Peter Zijlstra wrote: > On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote: >>> So I think it is important to figure out what that driver really wants >>> in the nohz_full case. If it wants to retain N interrupts per CPU, and >>> only reduce the number of CPUs, the proposed interface is wrong. >> It wants N interrupts per non-isolated (AKA housekeeping) CPU. > Then the patch is wrong and the interface needs changing from @min_vecs, > @max_vecs to something that expresses the N*nr_cpus relation. Reading Marcelo's comment again I think what is really expected is 1 interrupt per non-isolated (housekeeping) CPU (not N interrupts). My bad that I missed it initially.
On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote: > > On 10/20/20 3:30 AM, Peter Zijlstra wrote: > > On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote: > >>> So I think it is important to figure out what that driver really wants > >>> in the nohz_full case. If it wants to retain N interrupts per CPU, and > >>> only reduce the number of CPUs, the proposed interface is wrong. > >> It wants N interrupts per non-isolated (AKA housekeeping) CPU. > > Then the patch is wrong and the interface needs changing from @min_vecs, > > @max_vecs to something that expresses the N*nr_cpus relation. > > Reading Marcelo's comment again I think what is really expected is 1 > interrupt per non-isolated (housekeeping) CPU (not N interrupts). Then what is the point of them asking for N*nr_cpus when there is no isolation? Either everybody wants 1 interrupts per CPU and we can do the clamp unconditionally, in which case we should go fix this user, or they want multiple per cpu and we should go fix the interface. It cannot be both.
On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote: > > + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ); > + > + /* > + * If we have isolated CPUs for use by real-time tasks, to keep the > + * latency overhead to a minimum, device-specific IRQ vectors are moved > + * to the housekeeping CPUs from the userspace by changing their > + * affinity mask. Limit the vector usage to keep housekeeping CPUs from > + * running out of IRQ vectors. > + */ This is not true for managed interrupts. The interrupts affinity of those cannot be changed by user space. > + if (hk_cpus < num_online_cpus()) { > + if (hk_cpus < min_vecs) > + max_vecs = min_vecs; > + else if (hk_cpus < max_vecs) > + max_vecs = hk_cpus; > + } So now with that assume a 16 core machine (HT off for simplicity) 17 Requested interrupts (1 general, 16 queues) Managed interrupts will allocate 1 general interrupt which is free movable by user space 16 managed interrupts for queues (one per CPU) This allows the driver to have 16 queues, i.e. one queue per CPU. These interrupts are only used when an application on a CPU issues I/O. With the above change this will result 1 general interrupt which is free movable by user space 1 managed interrupts (possible affinity to all 16 CPUs, but routed to housekeeping CPU as long as there is one online) So the device is now limited to a single queue which also affects the housekeeping CPUs because now they have to share a single queue. With larger machines this gets even worse. So no. This needs way more thought for managed interrupts and you cannot do that at the PCI layer. Only the affinity spreading mechanism can do the right thing here. Thanks, tglx
On 10/20/20 9:41 AM, Peter Zijlstra wrote: > On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote: >> On 10/20/20 3:30 AM, Peter Zijlstra wrote: >>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote: >>>>> So I think it is important to figure out what that driver really wants >>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and >>>>> only reduce the number of CPUs, the proposed interface is wrong. >>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU. >>> Then the patch is wrong and the interface needs changing from @min_vecs, >>> @max_vecs to something that expresses the N*nr_cpus relation. >> Reading Marcelo's comment again I think what is really expected is 1 >> interrupt per non-isolated (housekeeping) CPU (not N interrupts). > Then what is the point of them asking for N*nr_cpus when there is no > isolation? > > Either everybody wants 1 interrupts per CPU and we can do the clamp > unconditionally, in which case we should go fix this user, or they want > multiple per cpu and we should go fix the interface. > > It cannot be both. Based on my understanding I don't think this is consistent, the number of interrupts any driver can request varies to an extent that some consumer of this API even request just one interrupt for its use. This was one of the reasons why I thought of having a conditional restriction. But I agree there is a lack of consistency.
On 10/20/20 10:16 AM, Thomas Gleixner wrote: > On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote: >> >> + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ); >> + >> + /* >> + * If we have isolated CPUs for use by real-time tasks, to keep the >> + * latency overhead to a minimum, device-specific IRQ vectors are moved >> + * to the housekeeping CPUs from the userspace by changing their >> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from >> + * running out of IRQ vectors. >> + */ > This is not true for managed interrupts. The interrupts affinity of > those cannot be changed by user space. Ah Yes. Perhaps s/IRQs/non-managed IRQ ? > >> + if (hk_cpus < num_online_cpus()) { >> + if (hk_cpus < min_vecs) >> + max_vecs = min_vecs; >> + else if (hk_cpus < max_vecs) >> + max_vecs = hk_cpus; >> + } > So now with that assume a 16 core machine (HT off for simplicity) > > 17 Requested interrupts (1 general, 16 queues) > > Managed interrupts will allocate > > 1 general interrupt which is free movable by user space > 16 managed interrupts for queues (one per CPU) > > This allows the driver to have 16 queues, i.e. one queue per CPU. These > interrupts are only used when an application on a CPU issues I/O. Correct. > > With the above change this will result > > 1 general interrupt which is free movable by user space > 1 managed interrupts (possible affinity to all 16 CPUs, but routed > to housekeeping CPU as long as there is one online) > > So the device is now limited to a single queue which also affects the > housekeeping CPUs because now they have to share a single queue. > > With larger machines this gets even worse. Yes, the change can impact the performance, however, if we don't do that we may have a latency impact instead. Specifically, on larger systems where most of the CPUs are isolated as we will definitely fail in moving all of the IRQs away from the isolated CPUs to the housekeeping. > > So no. This needs way more thought for managed interrupts and you cannot > do that at the PCI layer. Maybe we should not be doing anything in the case of managed IRQs as they are anyways pinned to the housekeeping CPUs as long as we have the 'managed_irq' option included in the kernel cmdline. > Only the affinity spreading mechanism can do > the right thing here. I can definitely explore this further. However, IMHO we would still need a logic to prevent the devices from creating excess vectors. Do you agree? > > Thanks, > > tglx >
On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: > On 10/20/20 10:16 AM, Thomas Gleixner wrote: >> With the above change this will result >> >> 1 general interrupt which is free movable by user space >> 1 managed interrupts (possible affinity to all 16 CPUs, but routed >> to housekeeping CPU as long as there is one online) >> >> So the device is now limited to a single queue which also affects the >> housekeeping CPUs because now they have to share a single queue. >> >> With larger machines this gets even worse. > > Yes, the change can impact the performance, however, if we don't do that we > may have a latency impact instead. Specifically, on larger systems where > most of the CPUs are isolated as we will definitely fail in moving all of the > IRQs away from the isolated CPUs to the housekeeping. For non managed interrupts I agree. >> So no. This needs way more thought for managed interrupts and you cannot >> do that at the PCI layer. > > Maybe we should not be doing anything in the case of managed IRQs as they > are anyways pinned to the housekeeping CPUs as long as we have the > 'managed_irq' option included in the kernel cmdline. Exactly. For the PCI side this vector limiting has to be restricted to the non managed case. >> Only the affinity spreading mechanism can do >> the right thing here. > > I can definitely explore this further. > > However, IMHO we would still need a logic to prevent the devices from > creating excess vectors. Managed interrupts are preventing exactly that by pinning the interrupts and queues to one or a set of CPUs, which prevents vector exhaustion on CPU hotplug. Non-managed, yes that is and always was a problem. One of the reasons why managed interrupts exist. Thanks, tglx
On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote: > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: >> However, IMHO we would still need a logic to prevent the devices from >> creating excess vectors. > > Managed interrupts are preventing exactly that by pinning the interrupts > and queues to one or a set of CPUs, which prevents vector exhaustion on > CPU hotplug. > > Non-managed, yes that is and always was a problem. One of the reasons > why managed interrupts exist. But why is this only a problem for isolation? The very same problem exists vs. CPU hotplug and therefore hibernation. On x86 we have at max. 204 vectors available for device interrupts per CPU. So assumed the only device interrupt in use is networking then any machine which has more than 204 network interrupts (queues, aux ...) active will prevent the machine from hibernation. Aside of that it's silly to have multiple queues targeted at a single CPU in case of hotplug. And that's not a theoretical problem. Some power management schemes shut down sockets when the utilization of a system is low enough, e.g. outside of working hours. The whole point of multi-queue is to have locality so that traffic from a CPU goes through the CPU local queue. What's the point of having two or more queues on a CPU in case of hotplug? The right answer to this is to utilize managed interrupts and have according logic in your network driver to handle CPU hotplug. When a CPU goes down, then the queue which is associated to that CPU is quiesced and the interrupt core shuts down the relevant interrupt instead of moving it to an online CPU (which causes the whole vector exhaustion problem on x86). When the CPU comes online again, then the interrupt is reenabled in the core and the driver reactivates the queue. Thanks, tglx
On 10/21/20 4:25 PM, Thomas Gleixner wrote: > On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote: >> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: >>> However, IMHO we would still need a logic to prevent the devices from >>> creating excess vectors. >> Managed interrupts are preventing exactly that by pinning the interrupts >> and queues to one or a set of CPUs, which prevents vector exhaustion on >> CPU hotplug. >> >> Non-managed, yes that is and always was a problem. One of the reasons >> why managed interrupts exist. > But why is this only a problem for isolation? The very same problem > exists vs. CPU hotplug and therefore hibernation. > > On x86 we have at max. 204 vectors available for device interrupts per > CPU. So assumed the only device interrupt in use is networking then any > machine which has more than 204 network interrupts (queues, aux ...) > active will prevent the machine from hibernation. Yes, that is indeed the case. > > Aside of that it's silly to have multiple queues targeted at a single > CPU in case of hotplug. And that's not a theoretical problem. Some > power management schemes shut down sockets when the utilization of a > system is low enough, e.g. outside of working hours. > > The whole point of multi-queue is to have locality so that traffic from > a CPU goes through the CPU local queue. What's the point of having two > or more queues on a CPU in case of hotplug? > > The right answer to this is to utilize managed interrupts and have > according logic in your network driver to handle CPU hotplug. When a CPU > goes down, then the queue which is associated to that CPU is quiesced > and the interrupt core shuts down the relevant interrupt instead of > moving it to an online CPU (which causes the whole vector exhaustion > problem on x86). When the CPU comes online again, then the interrupt is > reenabled in the core and the driver reactivates the queue. IIRC then i40e does have something like that where it suspends all IRQs before hibernation and restores them when the CPU is back online. I am not particularly sure about the other drivers. This brings me to another discussion that Peter initiated that is to perform the proposed restriction without any condition for all non-managed IRQs. Something on the lines: + if (!pci_is_managed(dev)) + max_vecs = clamp(hk_cpus, min_vecs, max_vecs); I am not particularly sure about this because I am not sure what kind of performance penalty this will have on the drivers in general and if that will be acceptable at all. Any thoughts? However, this still doesn't solve the generic problem, and an ideal solution will be something that you suggested. Will it be sensible to think about having a generic API that can be consumed by all the drivers and that can do both the things you mentioned? > > Thanks, > > tglx > > >
On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote: > On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote: > > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: > >> However, IMHO we would still need a logic to prevent the devices from > >> creating excess vectors. > > > > Managed interrupts are preventing exactly that by pinning the interrupts > > and queues to one or a set of CPUs, which prevents vector exhaustion on > > CPU hotplug. > > > > Non-managed, yes that is and always was a problem. One of the reasons > > why managed interrupts exist. > > But why is this only a problem for isolation? The very same problem > exists vs. CPU hotplug and therefore hibernation. > > On x86 we have at max. 204 vectors available for device interrupts per > CPU. So assumed the only device interrupt in use is networking then any > machine which has more than 204 network interrupts (queues, aux ...) > active will prevent the machine from hibernation. > > Aside of that it's silly to have multiple queues targeted at a single > CPU in case of hotplug. And that's not a theoretical problem. Some > power management schemes shut down sockets when the utilization of a > system is low enough, e.g. outside of working hours. > > The whole point of multi-queue is to have locality so that traffic from > a CPU goes through the CPU local queue. What's the point of having two > or more queues on a CPU in case of hotplug? > > The right answer to this is to utilize managed interrupts and have > according logic in your network driver to handle CPU hotplug. When a CPU > goes down, then the queue which is associated to that CPU is quiesced > and the interrupt core shuts down the relevant interrupt instead of > moving it to an online CPU (which causes the whole vector exhaustion > problem on x86). When the CPU comes online again, then the interrupt is > reenabled in the core and the driver reactivates the queue. I think Mellanox folks made some forays into managed irqs, but I don't remember/can't find the details now. For networking the locality / queue per core does not always work, since the incoming traffic is usually spread based on a hash. Many applications perform better when network processing is done on a small subset of CPUs, and application doesn't get interrupted every 100us. So we do need extra user control here. We have a bit of a uAPI problem since people had grown to depend on IRQ == queue == NAPI to configure their systems. "The right way" out would be a proper API which allows associating queues with CPUs rather than IRQs, then we can use managed IRQs and solve many other problems. Such new API has been in the works / discussions for a while now. (Magnus keep me honest here, if you disagree the queue API solves this.)
On 10/21/2020 5:02 PM, Jakub Kicinski wrote: > On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote: >> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote: >>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: >>>> However, IMHO we would still need a logic to prevent the devices from >>>> creating excess vectors. >>> >>> Managed interrupts are preventing exactly that by pinning the interrupts >>> and queues to one or a set of CPUs, which prevents vector exhaustion on >>> CPU hotplug. >>> >>> Non-managed, yes that is and always was a problem. One of the reasons >>> why managed interrupts exist. >> >> But why is this only a problem for isolation? The very same problem >> exists vs. CPU hotplug and therefore hibernation. >> >> On x86 we have at max. 204 vectors available for device interrupts per >> CPU. So assumed the only device interrupt in use is networking then any >> machine which has more than 204 network interrupts (queues, aux ...) >> active will prevent the machine from hibernation. >> >> Aside of that it's silly to have multiple queues targeted at a single >> CPU in case of hotplug. And that's not a theoretical problem. Some >> power management schemes shut down sockets when the utilization of a >> system is low enough, e.g. outside of working hours. >> >> The whole point of multi-queue is to have locality so that traffic from >> a CPU goes through the CPU local queue. What's the point of having two >> or more queues on a CPU in case of hotplug? >> >> The right answer to this is to utilize managed interrupts and have >> according logic in your network driver to handle CPU hotplug. When a CPU >> goes down, then the queue which is associated to that CPU is quiesced >> and the interrupt core shuts down the relevant interrupt instead of >> moving it to an online CPU (which causes the whole vector exhaustion >> problem on x86). When the CPU comes online again, then the interrupt is >> reenabled in the core and the driver reactivates the queue. > > I think Mellanox folks made some forays into managed irqs, but I don't > remember/can't find the details now. > I remember looking into this a few years ago, and not getting very far either. > For networking the locality / queue per core does not always work, > since the incoming traffic is usually spread based on a hash. Many > applications perform better when network processing is done on a small > subset of CPUs, and application doesn't get interrupted every 100us. > So we do need extra user control here. > > We have a bit of a uAPI problem since people had grown to depend on > IRQ == queue == NAPI to configure their systems. "The right way" out > would be a proper API which allows associating queues with CPUs rather > than IRQs, then we can use managed IRQs and solve many other problems. > I think we (Intel) hit some of the same issues you mention. I know I personally would like to see something that lets a lot of the current driver-specific policy be moved out. I think it should be possible to significantly simplify the abstraction used by the drivers. > Such new API has been in the works / discussions for a while now. > > (Magnus keep me honest here, if you disagree the queue API solves this.) >
On Wed, Oct 21 2020 at 17:02, Jakub Kicinski wrote: > On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote: >> The right answer to this is to utilize managed interrupts and have >> according logic in your network driver to handle CPU hotplug. When a CPU >> goes down, then the queue which is associated to that CPU is quiesced >> and the interrupt core shuts down the relevant interrupt instead of >> moving it to an online CPU (which causes the whole vector exhaustion >> problem on x86). When the CPU comes online again, then the interrupt is >> reenabled in the core and the driver reactivates the queue. > > I think Mellanox folks made some forays into managed irqs, but I don't > remember/can't find the details now. > > For networking the locality / queue per core does not always work, > since the incoming traffic is usually spread based on a hash. Many That makes it problematic and is fundamentally different from block I/O. > applications perform better when network processing is done on a small > subset of CPUs, and application doesn't get interrupted every 100us. > So we do need extra user control here. Ok. > We have a bit of a uAPI problem since people had grown to depend on > IRQ == queue == NAPI to configure their systems. "The right way" out > would be a proper API which allows associating queues with CPUs rather > than IRQs, then we can use managed IRQs and solve many other problems. > > Such new API has been in the works / discussions for a while now. If there is anything which needs to be done/extended on the irq side please let me know. Thanks tglx
On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote: > On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote: > > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: > >> However, IMHO we would still need a logic to prevent the devices from > >> creating excess vectors. > > > > Managed interrupts are preventing exactly that by pinning the interrupts > > and queues to one or a set of CPUs, which prevents vector exhaustion on > > CPU hotplug. > > > > Non-managed, yes that is and always was a problem. One of the reasons > > why managed interrupts exist. > > But why is this only a problem for isolation? The very same problem > exists vs. CPU hotplug and therefore hibernation. > > On x86 we have at max. 204 vectors available for device interrupts per > CPU. So assumed the only device interrupt in use is networking then any > machine which has more than 204 network interrupts (queues, aux ...) > active will prevent the machine from hibernation. > > Aside of that it's silly to have multiple queues targeted at a single > CPU in case of hotplug. And that's not a theoretical problem. Some > power management schemes shut down sockets when the utilization of a > system is low enough, e.g. outside of working hours. Exactly. It seems the proper way to do handle this is to disable individual vectors rather than moving them. And that is needed for dynamic isolate / unisolate anyway... > The whole point of multi-queue is to have locality so that traffic from > a CPU goes through the CPU local queue. What's the point of having two > or more queues on a CPU in case of hotplug? > > The right answer to this is to utilize managed interrupts and have > according logic in your network driver to handle CPU hotplug. When a CPU > goes down, then the queue which is associated to that CPU is quiesced > and the interrupt core shuts down the relevant interrupt instead of > moving it to an online CPU (which causes the whole vector exhaustion > problem on x86). When the CPU comes online again, then the interrupt is > reenabled in the core and the driver reactivates the queue. Aha... But it would be necessary to do that from userspace (for runtime isolate/unisolate).
On 10/20/20 10:39 AM, Nitesh Narayan Lal wrote: > On 10/20/20 9:41 AM, Peter Zijlstra wrote: >> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote: >>> On 10/20/20 3:30 AM, Peter Zijlstra wrote: >>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote: >>>>>> So I think it is important to figure out what that driver really wants >>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and >>>>>> only reduce the number of CPUs, the proposed interface is wrong. >>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU. >>>> Then the patch is wrong and the interface needs changing from @min_vecs, >>>> @max_vecs to something that expresses the N*nr_cpus relation. >>> Reading Marcelo's comment again I think what is really expected is 1 >>> interrupt per non-isolated (housekeeping) CPU (not N interrupts). >> Then what is the point of them asking for N*nr_cpus when there is no >> isolation? >> >> Either everybody wants 1 interrupts per CPU and we can do the clamp >> unconditionally, in which case we should go fix this user, or they want >> multiple per cpu and we should go fix the interface. >> >> It cannot be both. > Based on my understanding I don't think this is consistent, the number > of interrupts any driver can request varies to an extent that some > consumer of this API even request just one interrupt for its use. > > This was one of the reasons why I thought of having a conditional > restriction. > > But I agree there is a lack of consistency. > Hi Peter, So based on the suggestions from you and Thomas, I think something like the following should do the job within pci_alloc_irq_vectors_affinity(): + if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus())) + max_vecs = clamp(hk_cpus, min_vecs, max_vecs); I do know that you didn't like the usage of "hk_cpus < num_online_cpus()" and to an extent I agree that it does degrade the code clarity. However, since there is a certain inconsistency in the number of vectors that drivers request through this API IMHO we will need this, otherwise we could cause an impact on the drivers even in setups that doesn't have any isolated CPUs. If you agree, I can send the next version of the patch-set.
On Thu, Oct 22 2020 at 09:28, Marcelo Tosatti wrote: > On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote: >> The right answer to this is to utilize managed interrupts and have >> according logic in your network driver to handle CPU hotplug. When a CPU >> goes down, then the queue which is associated to that CPU is quiesced >> and the interrupt core shuts down the relevant interrupt instead of >> moving it to an online CPU (which causes the whole vector exhaustion >> problem on x86). When the CPU comes online again, then the interrupt is >> reenabled in the core and the driver reactivates the queue. > > Aha... But it would be necessary to do that from userspace (for runtime > isolate/unisolate). For anything which uses managed interrupts this is a non-problem and userspace has absolutely no business with it. Isolation does not shut down queues, at least not the block multi-queue ones which are only active when I/O is issued from that isolated CPU. So transitioning out of isolation requires no action at all. Transitioning in or changing the housekeeping mask needs some trivial tweak to handle the case where there is an overlap in the cpuset of a queue (housekeeping and isolated). This is handled already for setup and affinity changes, but of course not for runtime isolation mask changes, but that's a trivial thing to do. What's more interesting is how to deal with the network problem where there is no guarantee that the "response" ends up on the same queue as the "request" which is what the block people rely on. And that problem is not really an interrupt affinity problem in the first place. Thanks, tglx
On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote: > Hi Peter, > > So based on the suggestions from you and Thomas, I think something like the > following should do the job within pci_alloc_irq_vectors_affinity(): > > + if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus())) > + max_vecs = clamp(hk_cpus, min_vecs, max_vecs); > > I do know that you didn't like the usage of "hk_cpus < num_online_cpus()" > and to an extent I agree that it does degrade the code clarity. It's not just code clarity; I simply don't understand it. It feels like a band-aid that breaks thing. At the very least it needs a ginormous (and coherent) comment that explains: - the interface - the usage - this hack > However, since there is a certain inconsistency in the number of vectors > that drivers request through this API IMHO we will need this, otherwise > we could cause an impact on the drivers even in setups that doesn't > have any isolated CPUs. So shouldn't we then fix the drivers / interface first, to get rid of this inconsistency? > If you agree, I can send the next version of the patch-set. Well, it's not just me you have to convince.
On 10/23/20 4:58 AM, Peter Zijlstra wrote: > On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote: > >> Hi Peter, >> >> So based on the suggestions from you and Thomas, I think something like the >> following should do the job within pci_alloc_irq_vectors_affinity(): >> >> + if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus())) >> + max_vecs = clamp(hk_cpus, min_vecs, max_vecs); >> >> I do know that you didn't like the usage of "hk_cpus < num_online_cpus()" >> and to an extent I agree that it does degrade the code clarity. > It's not just code clarity; I simply don't understand it. It feels like > a band-aid that breaks thing. > > At the very least it needs a ginormous (and coherent) comment that > explains: > > - the interface > - the usage > - this hack That make sense. > >> However, since there is a certain inconsistency in the number of vectors >> that drivers request through this API IMHO we will need this, otherwise >> we could cause an impact on the drivers even in setups that doesn't >> have any isolated CPUs. > So shouldn't we then fix the drivers / interface first, to get rid of > this inconsistency? > Considering we agree that excess vector is a problem that needs to be solved across all the drivers and that you are comfortable with the other three patches in the set. If I may suggest the following: - We can pick those three patches for now, as that will atleast fix a driver that is currently impacting RT workloads. Is that a fair expectation? - In the meanwhile, I will start looking into individual drivers that consume this API to find out if there is a co-relation that can be derived between the max_vecs and number of CPUs. If that exists then I can go ahead and tweak the API's max_vecs accordingly. However, if this is absolutely random then I can come up with a sane comment before this check that covers the list of items you suggested. - I also want to explore the comments made by Thomas which may take some time.
On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote: > Extend nohz_full feature set to include isolation from managed IRQS. This So you say it's for managed-irqs, the feature is actually called MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs. Also, as per Thomas' earlier points, managed-irqs are in fact perfectly fine and don't need help at at... > is required specifically for setups that only uses nohz_full and still > requires isolation for maintaining lower latency for the listed CPUs. > > Suggested-by: Frederic Weisbecker <frederic@kernel.org> > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> > --- > kernel/sched/isolation.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c > index 5a6ea03f9882..9df9598a9e39 100644 > --- a/kernel/sched/isolation.c > +++ b/kernel/sched/isolation.c > @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str) > unsigned int flags; > > flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | > - HK_FLAG_MISC | HK_FLAG_KTHREAD; > + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ; > > return housekeeping_setup(str, flags); > } > -- > 2.18.2 >
On Fri, Oct 23, 2020 at 03:25:05PM +0200, Peter Zijlstra wrote: > On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote: > > Extend nohz_full feature set to include isolation from managed IRQS. This > > So you say it's for managed-irqs, the feature is actually called > MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs. > > Also, as per Thomas' earlier points, managed-irqs are in fact perfectly > fine and don't need help at at... > > > is required specifically for setups that only uses nohz_full and still > > requires isolation for maintaining lower latency for the listed CPUs. > > > > Suggested-by: Frederic Weisbecker <frederic@kernel.org> Ah and yes there is this tag :-p So that's my bad, I really thought this thing was about managed IRQ. The problem is that I can't find a single documentation about them so I'm too clueless on that matter. Thanks. > > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> > > --- > > kernel/sched/isolation.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c > > index 5a6ea03f9882..9df9598a9e39 100644 > > --- a/kernel/sched/isolation.c > > +++ b/kernel/sched/isolation.c > > @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str) > > unsigned int flags; > > > > flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | > > - HK_FLAG_MISC | HK_FLAG_KTHREAD; > > + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ; > > > > return housekeeping_setup(str, flags); > > } > > -- > > 2.18.2 > >
On 10/23/20 9:25 AM, Peter Zijlstra wrote: > On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote: >> Extend nohz_full feature set to include isolation from managed IRQS. This > So you say it's for managed-irqs, the feature is actually called > MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs. Ah my bad! I should replace the managed IRQS with MANAGED_IRQ. I can send another version with this fixed. > > Also, as per Thomas' earlier points, managed-irqs are in fact perfectly > fine and don't need help at at... Since the introduction of "genirq, sched/isolation: Isolate from handling managed interrupts" Within irq_do_set_affinity(), it is ensured that for managed intrrupts as well, the isolated CPUs are removed from the affinity mask. Hence, IMHO before this change managed interrupts were affecting the isolated CPUs. My intent of having this change is to basically allow isolation for nohz_full CPUs even when we don't have something like isolcpus. Does that make sense? > >> is required specifically for setups that only uses nohz_full and still >> requires isolation for maintaining lower latency for the listed CPUs. >> >> Suggested-by: Frederic Weisbecker <frederic@kernel.org> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> >> --- >> kernel/sched/isolation.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c >> index 5a6ea03f9882..9df9598a9e39 100644 >> --- a/kernel/sched/isolation.c >> +++ b/kernel/sched/isolation.c >> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str) >> unsigned int flags; >> >> flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU | >> - HK_FLAG_MISC | HK_FLAG_KTHREAD; >> + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ; >> >> return housekeeping_setup(str, flags); >> } >> -- >> 2.18.2 >>
On 10/23/20 9:29 AM, Frederic Weisbecker wrote: > On Fri, Oct 23, 2020 at 03:25:05PM +0200, Peter Zijlstra wrote: >> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote: >>> Extend nohz_full feature set to include isolation from managed IRQS. This >> So you say it's for managed-irqs, the feature is actually called >> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs. >> >> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly >> fine and don't need help at at... >> >>> is required specifically for setups that only uses nohz_full and still >>> requires isolation for maintaining lower latency for the listed CPUs. >>> >>> Suggested-by: Frederic Weisbecker <frederic@kernel.org> > Ah and yes there is this tag :-p > > So that's my bad, I really thought this thing was about managed IRQ. > The problem is that I can't find a single documentation about them so I'm > too clueless on that matter. I am also confused with this terminology. So my bad for not taking care of this.
On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote: > On 10/23/20 4:58 AM, Peter Zijlstra wrote: >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote: >> So shouldn't we then fix the drivers / interface first, to get rid of >> this inconsistency? >> > Considering we agree that excess vector is a problem that needs to be > solved across all the drivers and that you are comfortable with the other > three patches in the set. If I may suggest the following: > > - We can pick those three patches for now, as that will atleast fix a > driver that is currently impacting RT workloads. Is that a fair > expectation? No. Blindly reducing the maximum vectors to the number of housekeeping CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide what the right number of interrupts for this situation is. Many of these drivers need more than queue interrupts, admin, error interrupt and some operate best with seperate RX/TX interrupts per queue. They all can "work" with a single PCI interrupt of course, but the price you pay is performance. An isolated setup, which I'm familiar with, has two housekeeping CPUs. So far I restricted the number of network queues with a module argument to two, which allocates two management interrupts for the device and two interrupts (RX/TX) per queue, i.e. a total of six. Now I reduced the number of available interrupts to two according to your hack, which makes it use one queue RX/TX combined and one management interrupt. Guess what happens? Network performance tanks to the points that it breaks a carefully crafted setup. The same applies to a device which is application specific and wants one channel including an interrupt per isolated application core. Today I can isolate 8 out of 12 CPUs and let the device create 8 channels and set one interrupt and channel affine to each isolated CPU. With your hack, I get only 4 interrupts and channels. Fail! You cannot declare that all this is perfectly fine, just because it does not matter for your particular use case. So without information from the driver which tells what the best number of interrupts is with a reduced number of CPUs, this cutoff will cause more problems than it solves. Regressions guaranteed. Managed interrupts base their interrupt allocation and spreading on information which is handed in by the individual driver and not on crude assumptions. They are not imposing restrictions on the use case. It's perfectly fine for isolated work to save a data set to disk after computation has finished and that just works with the per-cpu I/O queue which is otherwise completely silent. All isolated workers can do the same in parallel without trampling on each other toes by competing for a reduced number of queues which are affine to the housekeeper CPUs. Unfortunately network multi-queue is substantially different from block multi-queue (as I learned in this conversation), so the concept cannot be applied one-to-one to networking as is. But there are certainly part of it which can be reused. This needs a lot more thought than just these crude hacks. Especially under the aspect that there are talks about making isolation runtime switchable. Are you going to rmmod/insmod the i40e network driver to do so? That's going to work fine if you do that reconfiguration over network... Thanks, tglx
On 10/23/20 5:00 PM, Thomas Gleixner wrote: > On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote: >> On 10/23/20 4:58 AM, Peter Zijlstra wrote: >>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote: >>> So shouldn't we then fix the drivers / interface first, to get rid of >>> this inconsistency? >>> >> Considering we agree that excess vector is a problem that needs to be >> solved across all the drivers and that you are comfortable with the other >> three patches in the set. If I may suggest the following: >> >> - We can pick those three patches for now, as that will atleast fix a >> driver that is currently impacting RT workloads. Is that a fair >> expectation? > No. Blindly reducing the maximum vectors to the number of housekeeping > CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide > what the right number of interrupts for this situation is. > > Many of these drivers need more than queue interrupts, admin, error > interrupt and some operate best with seperate RX/TX interrupts per > queue. They all can "work" with a single PCI interrupt of course, but > the price you pay is performance. > > An isolated setup, which I'm familiar with, has two housekeeping > CPUs. So far I restricted the number of network queues with a module > argument to two, which allocates two management interrupts for the > device and two interrupts (RX/TX) per queue, i.e. a total of six. Does it somehow take num_online_cpus() into consideration while deciding the number of interrupts to create? > > Now I reduced the number of available interrupts to two according to > your hack, which makes it use one queue RX/TX combined and one > management interrupt. Guess what happens? Network performance tanks to > the points that it breaks a carefully crafted setup. > > The same applies to a device which is application specific and wants one > channel including an interrupt per isolated application core. Today I > can isolate 8 out of 12 CPUs and let the device create 8 channels and > set one interrupt and channel affine to each isolated CPU. With your > hack, I get only 4 interrupts and channels. Fail! > > You cannot declare that all this is perfectly fine, just because it does > not matter for your particular use case. I agree that does sound wrong. > > So without information from the driver which tells what the best number > of interrupts is with a reduced number of CPUs, this cutoff will cause > more problems than it solves. Regressions guaranteed. Indeed. I think one commonality among the drivers at the moment is the usage of num_online_cpus() to determine the vectors to create. So, maybe instead of doing this kind of restrictions in a generic level API, it will make more sense to do this on a per-device basis by replacing the number of online CPUs with the housekeeping CPUs? This is what I have done in the i40e patch. But that still sounds hackish and will impact the performance. > > Managed interrupts base their interrupt allocation and spreading on > information which is handed in by the individual driver and not on crude > assumptions. They are not imposing restrictions on the use case. Right, FWIU it is irq_do_set_affinity that prevents the spreading of managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled, isn't? > > It's perfectly fine for isolated work to save a data set to disk after > computation has finished and that just works with the per-cpu I/O queue > which is otherwise completely silent. All isolated workers can do the > same in parallel without trampling on each other toes by competing for a > reduced number of queues which are affine to the housekeeper CPUs. > > Unfortunately network multi-queue is substantially different from block > multi-queue (as I learned in this conversation), so the concept cannot > be applied one-to-one to networking as is. But there are certainly part > of it which can be reused. So this is one of the areas that I don't understand well yet and will explore now. > > This needs a lot more thought than just these crude hacks. Got it. I am indeed not in the favor of pushing a solution that has a possibility of regressing/breaking things afterward. > > Especially under the aspect that there are talks about making isolation > runtime switchable. Are you going to rmmod/insmod the i40e network > driver to do so? That's going to work fine if you do that > reconfiguration over network... That's an interesting point. However, for some of the drivers which uses something like cpu_online/possible_mask while creating its affinity mask reloading will still associate jobs to the isolated CPUs. -- Thanks Nitesh
On Mon, Oct 26 2020 at 09:35, Nitesh Narayan Lal wrote: > On 10/23/20 5:00 PM, Thomas Gleixner wrote: >> An isolated setup, which I'm familiar with, has two housekeeping >> CPUs. So far I restricted the number of network queues with a module >> argument to two, which allocates two management interrupts for the >> device and two interrupts (RX/TX) per queue, i.e. a total of six. > > Does it somehow take num_online_cpus() into consideration while deciding > the number of interrupts to create? No, I just tell it to create two queues :) >> So without information from the driver which tells what the best number >> of interrupts is with a reduced number of CPUs, this cutoff will cause >> more problems than it solves. Regressions guaranteed. > > Indeed. > I think one commonality among the drivers at the moment is the usage of > num_online_cpus() to determine the vectors to create. > > So, maybe instead of doing this kind of restrictions in a generic level > API, it will make more sense to do this on a per-device basis by replacing > the number of online CPUs with the housekeeping CPUs? > > This is what I have done in the i40e patch. > But that still sounds hackish and will impact the performance. You want an interface which allows the driver to say: I need N interrupts for general management and ideally M interrupts per queue. This is similar to the way drivers tell the core code about their requirements for managed interrupts for the spreading calculation. >> Managed interrupts base their interrupt allocation and spreading on >> information which is handed in by the individual driver and not on crude >> assumptions. They are not imposing restrictions on the use case. > > Right, FWIU it is irq_do_set_affinity that prevents the spreading of > managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled, > isn't? No. Spreading takes possible CPUs into account. HK_FLAG_MANAGED_IRQ does not influence spreading at all. It only handles the case that an interrupt is affine to more than one CPUs and the resulting affinity mask spawns both housekeeping and isolated CPUs. It then steers the interrupt to the housekeeping CPUs (as long as there is one online). Thanks, tglx
On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote: > On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote: > > On 10/23/20 4:58 AM, Peter Zijlstra wrote: > >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote: > >> So shouldn't we then fix the drivers / interface first, to get rid of > >> this inconsistency? > >> > > Considering we agree that excess vector is a problem that needs to be > > solved across all the drivers and that you are comfortable with the other > > three patches in the set. If I may suggest the following: > > > > - We can pick those three patches for now, as that will atleast fix a > > driver that is currently impacting RT workloads. Is that a fair > > expectation? > > No. Blindly reducing the maximum vectors to the number of housekeeping > CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide > what the right number of interrupts for this situation is. > > Many of these drivers need more than queue interrupts, admin, error > interrupt and some operate best with seperate RX/TX interrupts per > queue. They all can "work" with a single PCI interrupt of course, but > the price you pay is performance. > > An isolated setup, which I'm familiar with, has two housekeeping > CPUs. So far I restricted the number of network queues with a module > argument to two, which allocates two management interrupts for the > device and two interrupts (RX/TX) per queue, i.e. a total of six. > > Now I reduced the number of available interrupts to two according to > your hack, which makes it use one queue RX/TX combined and one > management interrupt. Guess what happens? Network performance tanks to > the points that it breaks a carefully crafted setup. > > The same applies to a device which is application specific and wants one > channel including an interrupt per isolated application core. Today I > can isolate 8 out of 12 CPUs and let the device create 8 channels and > set one interrupt and channel affine to each isolated CPU. With your > hack, I get only 4 interrupts and channels. Fail! Good point. > You cannot declare that all this is perfectly fine, just because it does > not matter for your particular use case. > > So without information from the driver which tells what the best number > of interrupts is with a reduced number of CPUs, this cutoff will cause > more problems than it solves. Regressions guaranteed. One might want to move from one interrupt per isolated app core to zero, or vice versa. It seems that "best number of interrupts is with reduced number of CPUs" information, is therefore in userspace, not in driver... No? > Managed interrupts base their interrupt allocation and spreading on > information which is handed in by the individual driver and not on crude > assumptions. They are not imposing restrictions on the use case. > > It's perfectly fine for isolated work to save a data set to disk after > computation has finished and that just works with the per-cpu I/O queue > which is otherwise completely silent. Userspace could only change the mask of interrupts which are not triggered by requests from the local CPU (admin, error, mgmt, etc), to avoid the vector exhaustion problem. However, there is no explicit way for userspace to know that, as far as i know. 130: 34845 0 0 0 0 0 0 0 IR-PCI-MSI 33554433-edge nvme0q1 131: 0 27062 0 0 0 0 0 0 IR-PCI-MSI 33554434-edge nvme0q2 132: 0 0 24393 0 0 0 0 0 IR-PCI-MSI 33554435-edge nvme0q3 133: 0 0 0 24313 0 0 0 0 IR-PCI-MSI 33554436-edge nvme0q4 134: 0 0 0 0 20608 0 0 0 IR-PCI-MSI 33554437-edge nvme0q5 135: 0 0 0 0 0 22163 0 0 IR-PCI-MSI 33554438-edge nvme0q6 136: 0 0 0 0 0 0 23020 0 IR-PCI-MSI 33554439-edge nvme0q7 137: 0 0 0 0 0 0 0 24285 IR-PCI-MSI 33554440-edge nvme0q8 Can that be retrieved from PCI-MSI information, or drivers have to inform this? > All isolated workers can do the > same in parallel without trampling on each other toes by competing for a > reduced number of queues which are affine to the housekeeper CPUs. > > Unfortunately network multi-queue is substantially different from block > multi-queue (as I learned in this conversation), so the concept cannot > be applied one-to-one to networking as is. But there are certainly part > of it which can be reused. > > This needs a lot more thought than just these crude hacks. > > Especially under the aspect that there are talks about making isolation > runtime switchable. Are you going to rmmod/insmod the i40e network > driver to do so? That's going to work fine if you do that > reconfiguration over network... > > Thanks, > > tglx
On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote: > On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote: >> So without information from the driver which tells what the best number >> of interrupts is with a reduced number of CPUs, this cutoff will cause >> more problems than it solves. Regressions guaranteed. > > One might want to move from one interrupt per isolated app core > to zero, or vice versa. It seems that "best number of interrupts > is with reduced number of CPUs" information, is therefore in userspace, > not in driver... How does userspace know about the driver internals? Number of management interrupts, optimal number of interrupts per queue? >> Managed interrupts base their interrupt allocation and spreading on >> information which is handed in by the individual driver and not on crude >> assumptions. They are not imposing restrictions on the use case. >> >> It's perfectly fine for isolated work to save a data set to disk after >> computation has finished and that just works with the per-cpu I/O queue >> which is otherwise completely silent. > > Userspace could only change the mask of interrupts which are not > triggered by requests from the local CPU (admin, error, mgmt, etc), > to avoid the vector exhaustion problem. > > However, there is no explicit way for userspace to know that, as far as > i know. > > 130: 34845 0 0 0 0 0 0 0 IR-PCI-MSI 33554433-edge nvme0q1 > 131: 0 27062 0 0 0 0 0 0 IR-PCI-MSI 33554434-edge nvme0q2 > 132: 0 0 24393 0 0 0 0 0 IR-PCI-MSI 33554435-edge nvme0q3 > 133: 0 0 0 24313 0 0 0 0 IR-PCI-MSI 33554436-edge nvme0q4 > 134: 0 0 0 0 20608 0 0 0 IR-PCI-MSI 33554437-edge nvme0q5 > 135: 0 0 0 0 0 22163 0 0 IR-PCI-MSI 33554438-edge nvme0q6 > 136: 0 0 0 0 0 0 23020 0 IR-PCI-MSI 33554439-edge nvme0q7 > 137: 0 0 0 0 0 0 0 24285 IR-PCI-MSI 33554440-edge nvme0q8 > > Can that be retrieved from PCI-MSI information, or drivers > have to inform this? The driver should use a different name for the admin queues. Thanks, tglx
On Mon, Oct 26, 2020 at 08:00:39PM +0100, Thomas Gleixner wrote: > On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote: > > On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote: > >> So without information from the driver which tells what the best number > >> of interrupts is with a reduced number of CPUs, this cutoff will cause > >> more problems than it solves. Regressions guaranteed. > > > > One might want to move from one interrupt per isolated app core > > to zero, or vice versa. It seems that "best number of interrupts > > is with reduced number of CPUs" information, is therefore in userspace, > > not in driver... > > How does userspace know about the driver internals? Number of management > interrupts, optimal number of interrupts per queue? > > >> Managed interrupts base their interrupt allocation and spreading on > >> information which is handed in by the individual driver and not on crude > >> assumptions. They are not imposing restrictions on the use case. > >> > >> It's perfectly fine for isolated work to save a data set to disk after > >> computation has finished and that just works with the per-cpu I/O queue > >> which is otherwise completely silent. > > > > Userspace could only change the mask of interrupts which are not > > triggered by requests from the local CPU (admin, error, mgmt, etc), > > to avoid the vector exhaustion problem. > > > > However, there is no explicit way for userspace to know that, as far as > > i know. > > > > 130: 34845 0 0 0 0 0 0 0 IR-PCI-MSI 33554433-edge nvme0q1 > > 131: 0 27062 0 0 0 0 0 0 IR-PCI-MSI 33554434-edge nvme0q2 > > 132: 0 0 24393 0 0 0 0 0 IR-PCI-MSI 33554435-edge nvme0q3 > > 133: 0 0 0 24313 0 0 0 0 IR-PCI-MSI 33554436-edge nvme0q4 > > 134: 0 0 0 0 20608 0 0 0 IR-PCI-MSI 33554437-edge nvme0q5 > > 135: 0 0 0 0 0 22163 0 0 IR-PCI-MSI 33554438-edge nvme0q6 > > 136: 0 0 0 0 0 0 23020 0 IR-PCI-MSI 33554439-edge nvme0q7 > > 137: 0 0 0 0 0 0 0 24285 IR-PCI-MSI 33554440-edge nvme0q8 > > > > Can that be retrieved from PCI-MSI information, or drivers > > have to inform this? > > The driver should use a different name for the admin queues. Works for me. Sounds more like a heuristic which can break, so documenting this as an "interface" seems appropriate.
On 10/26/2020 12:00 PM, Thomas Gleixner wrote: > On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote: >> On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote: >>> So without information from the driver which tells what the best number >>> of interrupts is with a reduced number of CPUs, this cutoff will cause >>> more problems than it solves. Regressions guaranteed. >> >> One might want to move from one interrupt per isolated app core >> to zero, or vice versa. It seems that "best number of interrupts >> is with reduced number of CPUs" information, is therefore in userspace, >> not in driver... > > How does userspace know about the driver internals? Number of management > interrupts, optimal number of interrupts per queue? > I guess this is the problem solved in part by the queue management work that would make queues a thing that userspace is aware of. Are there drivers which use more than one interrupt per queue? I know drivers have multiple management interrupts.. and I guess some drivers do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to have multiple queues for one interrupt .. I'm not sure how a single queue with multiple interrupts would work though. >>> Managed interrupts base their interrupt allocation and spreading on >>> information which is handed in by the individual driver and not on crude >>> assumptions. They are not imposing restrictions on the use case. >>> >>> It's perfectly fine for isolated work to save a data set to disk after >>> computation has finished and that just works with the per-cpu I/O queue >>> which is otherwise completely silent. >> >> Userspace could only change the mask of interrupts which are not >> triggered by requests from the local CPU (admin, error, mgmt, etc), >> to avoid the vector exhaustion problem. >> >> However, there is no explicit way for userspace to know that, as far as >> i know. >> >> 130: 34845 0 0 0 0 0 0 0 IR-PCI-MSI 33554433-edge nvme0q1 >> 131: 0 27062 0 0 0 0 0 0 IR-PCI-MSI 33554434-edge nvme0q2 >> 132: 0 0 24393 0 0 0 0 0 IR-PCI-MSI 33554435-edge nvme0q3 >> 133: 0 0 0 24313 0 0 0 0 IR-PCI-MSI 33554436-edge nvme0q4 >> 134: 0 0 0 0 20608 0 0 0 IR-PCI-MSI 33554437-edge nvme0q5 >> 135: 0 0 0 0 0 22163 0 0 IR-PCI-MSI 33554438-edge nvme0q6 >> 136: 0 0 0 0 0 0 23020 0 IR-PCI-MSI 33554439-edge nvme0q7 >> 137: 0 0 0 0 0 0 0 24285 IR-PCI-MSI 33554440-edge nvme0q8 >> >> Can that be retrieved from PCI-MSI information, or drivers >> have to inform this? > > The driver should use a different name for the admin queues. > > Thanks, > > tglx >
On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: > On 10/26/2020 12:00 PM, Thomas Gleixner wrote: >> How does userspace know about the driver internals? Number of management >> interrupts, optimal number of interrupts per queue? > > I guess this is the problem solved in part by the queue management work > that would make queues a thing that userspace is aware of. > > Are there drivers which use more than one interrupt per queue? I know > drivers have multiple management interrupts.. and I guess some drivers > do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to > have multiple queues for one interrupt .. I'm not sure how a single > queue with multiple interrupts would work though. For block there is always one interrupt per queue. Some Network drivers seem to have seperate RX and TX interrupts per queue. Thanks, tglx
On 10/26/2020 1:11 PM, Thomas Gleixner wrote: > On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: >> On 10/26/2020 12:00 PM, Thomas Gleixner wrote: >>> How does userspace know about the driver internals? Number of management >>> interrupts, optimal number of interrupts per queue? >> >> I guess this is the problem solved in part by the queue management work >> that would make queues a thing that userspace is aware of. >> >> Are there drivers which use more than one interrupt per queue? I know >> drivers have multiple management interrupts.. and I guess some drivers >> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to >> have multiple queues for one interrupt .. I'm not sure how a single >> queue with multiple interrupts would work though. > > For block there is always one interrupt per queue. Some Network drivers > seem to have seperate RX and TX interrupts per queue. > > Thanks, > > tglx > That's true when thinking of Tx and Rx as a single queue. Another way to think about it is "one rx queue" and "one tx queue" each with their own interrupt... Even if there are devices which force there to be exactly queue pairs, you could still think of them as separate entities? Hmm.
On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote: > On 10/26/2020 1:11 PM, Thomas Gleixner wrote: >> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: >>> Are there drivers which use more than one interrupt per queue? I know >>> drivers have multiple management interrupts.. and I guess some drivers >>> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to >>> have multiple queues for one interrupt .. I'm not sure how a single >>> queue with multiple interrupts would work though. >> >> For block there is always one interrupt per queue. Some Network drivers >> seem to have seperate RX and TX interrupts per queue. > That's true when thinking of Tx and Rx as a single queue. Another way to > think about it is "one rx queue" and "one tx queue" each with their own > interrupt... > > Even if there are devices which force there to be exactly queue pairs, > you could still think of them as separate entities? Interesting thought. But as Jakub explained networking queues are fundamentally different from block queues on the RX side. For block the request issued on queue X will raise the complete interrupt on queue X. For networking the TX side will raise the TX interrupt on the queue on which the packet was queued obviously or should I say hopefully. :) But incoming packets will be directed to some receive queue based on a hash or whatever crystallball logic the firmware decided to implement. Which makes this not really suitable for the managed interrupt and spreading approach which is used by block-mq. Hrm... But I still think that for curing that isolation stuff we want at least some information from the driver. Alternative solution would be to grant the allocation of interrupts and queues and have some sysfs knob to shut down queues at runtime. If that shutdown results in releasing the queue interrupt (via free_irq()) then the vector exhaustion problem goes away. Needs more thought and information (for network oblivious folks like me). Thanks, tglx
On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote: > On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote: > > On 10/26/2020 1:11 PM, Thomas Gleixner wrote: > >> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: > >>> Are there drivers which use more than one interrupt per queue? I know > >>> drivers have multiple management interrupts.. and I guess some drivers > >>> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to > >>> have multiple queues for one interrupt .. I'm not sure how a single > >>> queue with multiple interrupts would work though. > >> > >> For block there is always one interrupt per queue. Some Network drivers > >> seem to have seperate RX and TX interrupts per queue. > > That's true when thinking of Tx and Rx as a single queue. Another way to > > think about it is "one rx queue" and "one tx queue" each with their own > > interrupt... > > > > Even if there are devices which force there to be exactly queue pairs, > > you could still think of them as separate entities? > > Interesting thought. > > But as Jakub explained networking queues are fundamentally different > from block queues on the RX side. For block the request issued on queue > X will raise the complete interrupt on queue X. > > For networking the TX side will raise the TX interrupt on the queue on > which the packet was queued obviously or should I say hopefully. :) > > But incoming packets will be directed to some receive queue based on a > hash or whatever crystallball logic the firmware decided to implement. > > Which makes this not really suitable for the managed interrupt and > spreading approach which is used by block-mq. Hrm... > > But I still think that for curing that isolation stuff we want at least > some information from the driver. Alternative solution would be to grant > the allocation of interrupts and queues and have some sysfs knob to shut > down queues at runtime. If that shutdown results in releasing the queue > interrupt (via free_irq()) then the vector exhaustion problem goes away. > > Needs more thought and information (for network oblivious folks like > me). One piece of information that may be useful is that even tho the RX packets may be spread semi-randomly the user space can still control which queues are included in the mechanism. There is an indirection table in the HW which allows to weigh queues differently, or exclude selected queues from the spreading. Other mechanisms exist to filter flows onto specific queues. IOW just because a core has an queue/interrupt does not mean that interrupt will ever fire, provided its excluded from RSS. Another piece is that by default we suggest drivers allocate 8 RX queues, and online_cpus TX queues. The number of queues can be independently controlled via ethtool -L. Drivers which can't support separate queues will default to online_cpus queue pairs, and let ethtool -L only set the "combined" parameter. There are drivers which always allocate online_cpus interrupts, and then some of them will go unused if #qs < #cpus. My unpopular opinion is that for networking devices all the heuristics we may come up with are going to be a dead end. We need an explicit API to allow users placing queues on cores, and use managed IRQs for data queues. (I'm assuming that managed IRQs will let us reliably map a MSI-X vector to a core :))
On 10/26/20 5:50 PM, Thomas Gleixner wrote: > On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote: >> On 10/26/2020 1:11 PM, Thomas Gleixner wrote: >>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: >>>> Are there drivers which use more than one interrupt per queue? I know >>>> drivers have multiple management interrupts.. and I guess some drivers >>>> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to >>>> have multiple queues for one interrupt .. I'm not sure how a single >>>> queue with multiple interrupts would work though. >>> For block there is always one interrupt per queue. Some Network drivers >>> seem to have seperate RX and TX interrupts per queue. >> That's true when thinking of Tx and Rx as a single queue. Another way to >> think about it is "one rx queue" and "one tx queue" each with their own >> interrupt... >> >> Even if there are devices which force there to be exactly queue pairs, >> you could still think of them as separate entities? > Interesting thought. > > But as Jakub explained networking queues are fundamentally different > from block queues on the RX side. For block the request issued on queue > X will raise the complete interrupt on queue X. > > For networking the TX side will raise the TX interrupt on the queue on > which the packet was queued obviously or should I say hopefully. :) This is my impression as well. > But incoming packets will be directed to some receive queue based on a > hash or whatever crystallball logic the firmware decided to implement. > > Which makes this not really suitable for the managed interrupt and > spreading approach which is used by block-mq. Hrm... > > But I still think that for curing that isolation stuff we want at least > some information from the driver. Alternative solution would be to grant > the allocation of interrupts and queues and have some sysfs knob to shut > down queues at runtime. If that shutdown results in releasing the queue > interrupt (via free_irq()) then the vector exhaustion problem goes away. I think this is close to what I and Marcelo were discussing earlier today privately. I don't think there is currently a way to control the enablement/disablement of interrupts from the userspace. I think in terms of the idea we need something similar to what i40e does, that is shutdown all IRQs when CPU is suspended and restores the interrupt schema when the CPU is back online. The two key difference will be that this API needs to be generic and also needs to be exposed to the userspace through something like sysfs as you have mentioned.
On Mon, Oct 26 2020 at 15:13, Jakub Kicinski wrote: > On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote: >> But I still think that for curing that isolation stuff we want at least >> some information from the driver. Alternative solution would be to grant >> the allocation of interrupts and queues and have some sysfs knob to shut >> down queues at runtime. If that shutdown results in releasing the queue >> interrupt (via free_irq()) then the vector exhaustion problem goes away. >> >> Needs more thought and information (for network oblivious folks like >> me). > > One piece of information that may be useful is that even tho the RX > packets may be spread semi-randomly the user space can still control > which queues are included in the mechanism. There is an indirection > table in the HW which allows to weigh queues differently, or exclude > selected queues from the spreading. Other mechanisms exist to filter > flows onto specific queues. > > IOW just because a core has an queue/interrupt does not mean that > interrupt will ever fire, provided its excluded from RSS. > > Another piece is that by default we suggest drivers allocate 8 RX > queues, and online_cpus TX queues. The number of queues can be > independently controlled via ethtool -L. Drivers which can't support > separate queues will default to online_cpus queue pairs, and let > ethtool -L only set the "combined" parameter. > > There are drivers which always allocate online_cpus interrupts, > and then some of them will go unused if #qs < #cpus. Thanks for the enlightment. > My unpopular opinion is that for networking devices all the heuristics > we may come up with are going to be a dead end. I agree. Heuristics suck. > We need an explicit API to allow users placing queues on cores, and > use managed IRQs for data queues. (I'm assuming that managed IRQs will > let us reliably map a MSI-X vector to a core :)) Yes, they allow you to do that. That will need some tweaks to theway they work today (coming from the strict block mq semantics). You also need to be aware that managed irqs have also strict semantics vs. CPU hotplug. If the last CPU in the managed affinity set goes down then the interrupt is shut down by the irq core which means that you need to quiesce the associated queue before that happens. When the first CPU comes online again the interrupt is reenabled, so the queue should be able to handle it or has ensured that the device does not raise one before it is able to do so. Thanks, tglx
On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote: > On 10/26/20 5:50 PM, Thomas Gleixner wrote: >> But I still think that for curing that isolation stuff we want at least >> some information from the driver. Alternative solution would be to grant >> the allocation of interrupts and queues and have some sysfs knob to shut >> down queues at runtime. If that shutdown results in releasing the queue >> interrupt (via free_irq()) then the vector exhaustion problem goes away. > > I think this is close to what I and Marcelo were discussing earlier today > privately. > > I don't think there is currently a way to control the enablement/disablement of > interrupts from the userspace. You cannot just disable the interrupt. You need to make sure that the associated queue is shutdown or quiesced _before_ the interrupt is shut down. Thanks, tglx
On 10/26/2020 3:13 PM, Jakub Kicinski wrote: > On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote: >> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote: >>> On 10/26/2020 1:11 PM, Thomas Gleixner wrote: >>>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: >>>>> Are there drivers which use more than one interrupt per queue? I know >>>>> drivers have multiple management interrupts.. and I guess some drivers >>>>> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to >>>>> have multiple queues for one interrupt .. I'm not sure how a single >>>>> queue with multiple interrupts would work though. >>>> >>>> For block there is always one interrupt per queue. Some Network drivers >>>> seem to have seperate RX and TX interrupts per queue. >>> That's true when thinking of Tx and Rx as a single queue. Another way to >>> think about it is "one rx queue" and "one tx queue" each with their own >>> interrupt... >>> >>> Even if there are devices which force there to be exactly queue pairs, >>> you could still think of them as separate entities? >> >> Interesting thought. >> >> But as Jakub explained networking queues are fundamentally different >> from block queues on the RX side. For block the request issued on queue >> X will raise the complete interrupt on queue X. >> >> For networking the TX side will raise the TX interrupt on the queue on >> which the packet was queued obviously or should I say hopefully. :) >> >> But incoming packets will be directed to some receive queue based on a >> hash or whatever crystallball logic the firmware decided to implement. >> >> Which makes this not really suitable for the managed interrupt and >> spreading approach which is used by block-mq. Hrm... >> >> But I still think that for curing that isolation stuff we want at least >> some information from the driver. Alternative solution would be to grant >> the allocation of interrupts and queues and have some sysfs knob to shut >> down queues at runtime. If that shutdown results in releasing the queue >> interrupt (via free_irq()) then the vector exhaustion problem goes away. >> >> Needs more thought and information (for network oblivious folks like >> me). > > One piece of information that may be useful is that even tho the RX > packets may be spread semi-randomly the user space can still control > which queues are included in the mechanism. There is an indirection > table in the HW which allows to weigh queues differently, or exclude > selected queues from the spreading. Other mechanisms exist to filter > flows onto specific queues. > > IOW just because a core has an queue/interrupt does not mean that > interrupt will ever fire, provided its excluded from RSS. > > Another piece is that by default we suggest drivers allocate 8 RX > queues, and online_cpus TX queues. The number of queues can be > independently controlled via ethtool -L. Drivers which can't support > separate queues will default to online_cpus queue pairs, and let > ethtool -L only set the "combined" parameter. > I know the Intel drivers usually have defaulted to trying to maintain queue pairs. I do not believe this is technically a HW restriction, but it is heavily built into the way the drivers work today. > There are drivers which always allocate online_cpus interrupts, > and then some of them will go unused if #qs < #cpus. > > Right. > My unpopular opinion is that for networking devices all the heuristics > we may come up with are going to be a dead end. We need an explicit API > to allow users placing queues on cores, and use managed IRQs for data > queues. (I'm assuming that managed IRQs will let us reliably map a MSI-X > vector to a core :)) > I don't think it is that unpopular... This is the direction I'd like to see us go as well.
On 10/26/2020 3:49 PM, Thomas Gleixner wrote: > On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote: >> On 10/26/20 5:50 PM, Thomas Gleixner wrote: >>> But I still think that for curing that isolation stuff we want at least >>> some information from the driver. Alternative solution would be to grant >>> the allocation of interrupts and queues and have some sysfs knob to shut >>> down queues at runtime. If that shutdown results in releasing the queue >>> interrupt (via free_irq()) then the vector exhaustion problem goes away. >> >> I think this is close to what I and Marcelo were discussing earlier today >> privately. >> >> I don't think there is currently a way to control the enablement/disablement of >> interrupts from the userspace. > > You cannot just disable the interrupt. You need to make sure that the > associated queue is shutdown or quiesced _before_ the interrupt is shut > down. > > Thanks, > > tglx > Could this be handled with a callback to the driver/hw? I know Intel HW should support this type of quiesce/shutdown. Thanks, Jake
On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote: > > On 10/26/20 5:50 PM, Thomas Gleixner wrote: > > On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote: > >> On 10/26/2020 1:11 PM, Thomas Gleixner wrote: > >>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: > >>>> Are there drivers which use more than one interrupt per queue? I know > >>>> drivers have multiple management interrupts.. and I guess some drivers > >>>> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to to > >>>> have multiple queues for one interrupt .. I'm not sure how a single > >>>> queue with multiple interrupts would work though. > >>> For block there is always one interrupt per queue. Some Network drivers > >>> seem to have seperate RX and TX interrupts per queue. > >> That's true when thinking of Tx and Rx as a single queue. Another way to > >> think about it is "one rx queue" and "one tx queue" each with their own > >> interrupt... > >> > >> Even if there are devices which force there to be exactly queue pairs, > >> you could still think of them as separate entities? > > Interesting thought. > > > > But as Jakub explained networking queues are fundamentally different > > from block queues on the RX side. For block the request issued on queue > > X will raise the complete interrupt on queue X. > > > > For networking the TX side will raise the TX interrupt on the queue on > > which the packet was queued obviously or should I say hopefully. :) > > This is my impression as well. > > > But incoming packets will be directed to some receive queue based on a > > hash or whatever crystallball logic the firmware decided to implement. > > > > Which makes this not really suitable for the managed interrupt and > > spreading approach which is used by block-mq. Hrm... > > > > But I still think that for curing that isolation stuff we want at least > > some information from the driver. Alternative solution would be to grant > > the allocation of interrupts and queues and have some sysfs knob to shut > > down queues at runtime. If that shutdown results in releasing the queue > > interrupt (via free_irq()) then the vector exhaustion problem goes away. > > I think this is close to what I and Marcelo were discussing earlier today > privately. > > I don't think there is currently a way to control the enablement/disablement of > interrupts from the userspace. As long as the interrupt obeys the "trigger when request has been performed by local CPU" rule (#1) (for MSI type interrupts, where driver allocates one I/O interrupt per CPU), don't see a need for the interface. For other types of interrupts, interrupt controller should be programmed to not include the isolated CPU on its "destination CPU list". About the block VS network discussion, what we are trying to do at skb level (Paolo Abeni CC'ed, author of the suggestion) is to use RPS to avoid skbs from being queued to a CPU (on RX), and to queue skbs on housekeeping CPUs for processing (TX). However, if per-CPU interrupts are not disabled, then the (for example) network device is free to include the CPU in its list of destinations. Which would require one to say, configure RPS (or whatever mechanism is distributing interrupts). Hum, it would feel safer (rather than trust the #1 rule to be valid in all cases) to ask the driver to disable the interrupt (after shutting down queue) for that particular CPU. BTW, Thomas, software is free to configure a particular MSI-X interrupt to point to any CPU: 10.11 MESSAGE SIGNALLED INTERRUPTS The PCI Local Bus Specification, Rev 2.2 (www.pcisig.com) introduces the concept of message signalled interrupts. As the specification indicates: “Message signalled interrupts (MSI) is an optional feature that enables PCI devices to request service by writing a system-specified message to a system-specified address (PCI DWORD memory write transaction). The transaction address specifies the message destination while the transaction data specifies the message. System software is expected to initialize the message destination and message during device configuration, allocating one or more non-shared messages to each MSI capable function.” Fields in the Message Address Register are as follows: 1. Bits 31-20 — These bits contain a fixed value for interrupt messages (0FEEH). This value locates interrupts at the 1-MByte area with a base address of 4G – 18M. All accesses to this region are directed as interrupt messages. Care must to be taken to ensure that no other device claims the region as I/O space. 2. Destination ID — This field contains an 8-bit destination ID. It identifies the message’s target processor(s). The destination ID corresponds to bits 63:56 of the I/O APIC Redirection Table Entry if the IOAPIC is used to dispatch the interrupt to the processor(s). --- So taking the example where computation happens while isolated and later stored via block interface, aren't we restricting the usage scenarios by enforcing the "per-CPU queue has interrupt pointing to owner CPU" rule? > I think in terms of the idea we need something similar to what i40e does, > that is shutdown all IRQs when CPU is suspended and restores the interrupt > schema when the CPU is back online. > > The two key difference will be that this API needs to be generic and also > needs to be exposed to the userspace through something like sysfs as you > have mentioned. > > -- > Thanks > Nitesh >
On Mon, Oct 26 2020 at 16:08, Jacob Keller wrote: > On 10/26/2020 3:49 PM, Thomas Gleixner wrote: >> On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote: >>> I don't think there is currently a way to control the enablement/disablement of >>> interrupts from the userspace. >> >> You cannot just disable the interrupt. You need to make sure that the >> associated queue is shutdown or quiesced _before_ the interrupt is shut >> down. > > Could this be handled with a callback to the driver/hw? I know Intel HW > should support this type of quiesce/shutdown. We can't have a callback from the interrupt shutdown code as you have to wait for the queue to drain packets in flight. Something like this mark queue as going down (no more tx queueing) tell hardware not to route RX packets to it consume pending RX wait for already queued TX packets to be sent Look what the block people did. They have a common multi-instance hotplug state and they register each context (queue) as an instance. The hotplug core invokes the corresponding callbacks when bringing a CPU up or when shutting it down. Thanks, tglx
On Tue, Oct 27 2020 at 08:47, Marcelo Tosatti wrote: > On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote: > However, if per-CPU interrupts are not disabled, then the (for example) > network device is free to include the CPU in its list of destinations. > Which would require one to say, configure RPS (or whatever mechanism > is distributing interrupts). And why is that a problem? If that's possible then you can prevent getting RX interrupts already today. > Hum, it would feel safer (rather than trust the #1 rule to be valid > in all cases) to ask the driver to disable the interrupt (after shutting > down queue) for that particular CPU. > > BTW, Thomas, software is free to configure a particular MSI-X interrupt > to point to any CPU: > > 10.11 MESSAGE SIGNALLED INTERRUPTS I know how MSI works :) > So taking the example where computation happens while isolated and later > stored via block interface, aren't we restricting the usage scenarios > by enforcing the "per-CPU queue has interrupt pointing to owner CPU" > rule? No. For block this is the ideal configuration (think locality) and it prevents vector exhaustion. If you make these interrupts freely routable then you bring back the vector exhaustion problem right away. Now we already established that networking has different requirements, so you have to come up with a different solution for it which allows to work for all use cases. Thanks, tglx