Message ID | 20200928183529.471328-5-nitesh@redhat.com |
---|---|
State | New |
Series | isolation: limit msix vectors to housekeeping CPUs |
[to: Christoph in case he has comments, since I think he wrote this code]

On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
>
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
>
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>
>  #include "pci.h"
>
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>                                     struct irq_affinity *affd)
>  {
>          struct irq_affinity msi_default_affd = {0};
> +        unsigned int hk_cpus;
>          int nvecs = -ENOSPC;
>
> +        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +        /*
> +         * If we have isolated CPUs for use by real-time tasks, to keep the
> +         * latency overhead to a minimum, device-specific IRQ vectors are moved
> +         * to the housekeeping CPUs from the userspace by changing their
> +         * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +         * running out of IRQ vectors.
> +         */
> +        if (hk_cpus < num_online_cpus()) {
> +                if (hk_cpus < min_vecs)
> +                        max_vecs = min_vecs;
> +                else if (hk_cpus < max_vecs)
> +                        max_vecs = hk_cpus;
> +        }
> +
>          if (flags & PCI_IRQ_AFFINITY) {
>                  if (!affd)
>                          affd = &msi_default_affd;
> --
> 2.18.2
>
On Mon, Sep 28, 2020 at 04:59:31PM -0500, Bjorn Helgaas wrote:
> [to: Christoph in case he has comments, since I think he wrote this code]

I think I actually suggested this a few iterations back.

> > +        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > +
> > +        /*
> > +         * If we have isolated CPUs for use by real-time tasks, to keep the
> > +         * latency overhead to a minimum, device-specific IRQ vectors are moved
> > +         * to the housekeeping CPUs from the userspace by changing their
> > +         * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> > +         * running out of IRQ vectors.
> > +         */
> > +        if (hk_cpus < num_online_cpus()) {

I would have moved the assignment to hk_cpus below the comment and just
above the if, but that is really just a minor style preference.

Otherwise this looks good:

Acked-by: Christoph Hellwig <hch@lst.de>
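For reference, the style Christoph suggests would give the hunk roughly this
shape (a sketch only, comment body elided, logic unchanged):

        /*
         * If we have isolated CPUs for use by real-time tasks, ...
         */
        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
        if (hk_cpus < num_online_cpus()) {
                ...
        }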
On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
>
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
>
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>
>  #include "pci.h"
>
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>                                     struct irq_affinity *affd)
>  {
>          struct irq_affinity msi_default_affd = {0};
> +        unsigned int hk_cpus;
>          int nvecs = -ENOSPC;
>
> +        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +        /*
> +         * If we have isolated CPUs for use by real-time tasks, to keep the
> +         * latency overhead to a minimum, device-specific IRQ vectors are moved
> +         * to the housekeeping CPUs from the userspace by changing their
> +         * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +         * running out of IRQ vectors.
> +         */
> +        if (hk_cpus < num_online_cpus()) {
> +                if (hk_cpus < min_vecs)
> +                        max_vecs = min_vecs;
> +                else if (hk_cpus < max_vecs)
> +                        max_vecs = hk_cpus;

is that:

        max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Also, do we really need to have that conditional on hk_cpus <
num_online_cpus()? That is, why can't we do this unconditionally?

And what are the (desired) semantics vs hotplug? Using a cpumask without
excluding hotplug is racy.

> +        }
> +
>          if (flags & PCI_IRQ_AFFINITY) {
>                  if (!affd)
>                          affd = &msi_default_affd;
> --
> 2.18.2
>
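For illustration, the hunk with the clamp() suggestion applied might look
roughly like this (a sketch only; whether the hk_cpus < num_online_cpus()
guard should stay is exactly what the rest of the thread debates):

        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);

        /* Cap max_vecs at the housekeeping CPU count, but never below min_vecs. */
        if (hk_cpus < num_online_cpus())
                max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

clamp(hk_cpus, min_vecs, max_vecs) evaluates to min_vecs when hk_cpus is
below it, to max_vecs when hk_cpus is above it, and to hk_cpus otherwise,
which matches the original if/else if pair.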
On 10/16/20 8:20 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
>> If we have isolated CPUs dedicated for use by real-time tasks, we try to
>> move IRQs to housekeeping CPUs from the userspace to reduce latency
>> overhead on the isolated CPUs.
>>
>> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
>> may exceed per-CPU vector limits.
>>
>> When we have isolated CPUs, limit the number of vectors allocated by
>> pci_alloc_irq_vectors() to the minimum number required by the driver, or
>> to one per housekeeping CPU if that is larger.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/pci/msi.c | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 30ae4ffda5c1..8c156867803c 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/irqdomain.h>
>>  #include <linux/of_irq.h>
>> +#include <linux/sched/isolation.h>
>>
>>  #include "pci.h"
>>
>> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>>                                     struct irq_affinity *affd)
>>  {
>>          struct irq_affinity msi_default_affd = {0};
>> +        unsigned int hk_cpus;
>>          int nvecs = -ENOSPC;
>>
>> +        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +        /*
>> +         * If we have isolated CPUs for use by real-time tasks, to keep the
>> +         * latency overhead to a minimum, device-specific IRQ vectors are moved
>> +         * to the housekeeping CPUs from the userspace by changing their
>> +         * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> +         * running out of IRQ vectors.
>> +         */
>> +        if (hk_cpus < num_online_cpus()) {
>> +                if (hk_cpus < min_vecs)
>> +                        max_vecs = min_vecs;
>> +                else if (hk_cpus < max_vecs)
>> +                        max_vecs = hk_cpus;
> is that:
>
>         max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Yes, I think this will do.

>
> Also, do we really need to have that conditional on hk_cpus <
> num_online_cpus()? That is, why can't we do this unconditionally?

FWIU most of the drivers using this API already restrict the number of
vectors based on num_online_cpus(); if we do it unconditionally we can
unnecessarily duplicate the restriction for cases where we don't have any
isolated CPUs.

Also, different drivers seem to take different factors into consideration
along with num_online_cpus() while finding the max_vecs to request, for
example in the case of mlx5:

        MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
                MLX5_EQ_VEC_COMP_BASE

Having hk_cpus < num_online_cpus() helps us ensure that we are only
changing the behavior when we have isolated CPUs.

Does that make sense?

>
> And what are the (desired) semantics vs hotplug? Using a cpumask without
> excluding hotplug is racy.

The housekeeping_mask should still remain constant, shouldn't it?
In any case, I can double check this.

>
>> +        }
>> +
>>          if (flags & PCI_IRQ_AFFINITY) {
>>                  if (!affd)
>>                          affd = &msi_default_affd;
>> --
>> 2.18.2
>>
On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> >> +        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> >> +
> >> +        /*
> >> +         * If we have isolated CPUs for use by real-time tasks, to keep the
> >> +         * latency overhead to a minimum, device-specific IRQ vectors are moved
> >> +         * to the housekeeping CPUs from the userspace by changing their
> >> +         * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> >> +         * running out of IRQ vectors.
> >> +         */
> >> +        if (hk_cpus < num_online_cpus()) {
> >> +                if (hk_cpus < min_vecs)
> >> +                        max_vecs = min_vecs;
> >> +                else if (hk_cpus < max_vecs)
> >> +                        max_vecs = hk_cpus;
> > is that:
> >
> >         max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
>
> Yes, I think this will do.
>
> >
> > Also, do we really need to have that conditional on hk_cpus <
> > num_online_cpus()? That is, why can't we do this unconditionally?
>
> FWIU most of the drivers using this API already restrict the number of
> vectors based on num_online_cpus(); if we do it unconditionally we can
> unnecessarily duplicate the restriction for cases where we don't have any
> isolated CPUs.

Unnecessary duplication isn't really a concern here; this is a slow path.
What's important is code clarity.

> Also, different drivers seem to take different factors into consideration
> along with num_online_cpus() while finding the max_vecs to request, for
> example in the case of mlx5:
>
>         MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>                 MLX5_EQ_VEC_COMP_BASE
>
> Having hk_cpus < num_online_cpus() helps us ensure that we are only
> changing the behavior when we have isolated CPUs.
>
> Does that make sense?

That seems to want to allocate N interrupts per CPU (plus some random
static amount, which seems weird, but whatever). This patch breaks that.

So I think it is important to figure out what that driver really wants in
the nohz_full case. If it wants to retain N interrupts per CPU, and only
reduce the number of CPUs, the proposed interface is wrong.

> > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > excluding hotplug is racy.
>
> The housekeeping_mask should still remain constant, shouldn't it?
> In any case, I can double check this.

The goal is very much to have that dynamically configurable.
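A small self-contained model of the mismatch Peter describes (the numbers
and the two-vectors-per-CPU ratio are hypothetical, not mlx5's actual
values, and min_vecs is ignored for brevity):

    #include <stdio.h>

    int main(void)
    {
            unsigned int online = 16, hk_cpus = 4;
            unsigned int per_cpu = 2, base = 1;     /* driver wants N per CPU plus a base */

            /* What a per-CPU-scaling driver would ideally request for nohz_full: */
            unsigned int scaled = per_cpu * hk_cpus + base;         /* 9  */

            /* What the proposed PCI-layer cap produces from the original request: */
            unsigned int requested = per_cpu * online + base;       /* 33 */
            unsigned int capped = requested;
            if (hk_cpus < online && hk_cpus < capped)
                    capped = hk_cpus;                               /* 4  */

            printf("scaled=%u capped=%u\n", scaled, capped);
            return 0;
    }

The cap collapses the request to the housekeeping CPU count rather than
preserving the driver's per-CPU ratio, which is the interface question
Peter raises.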
On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>
> +        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +        /*
> +         * If we have isolated CPUs for use by real-time tasks, to keep the
> +         * latency overhead to a minimum, device-specific IRQ vectors are moved
> +         * to the housekeeping CPUs from the userspace by changing their
> +         * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +         * running out of IRQ vectors.
> +         */

This is not true for managed interrupts. The affinity of those interrupts
cannot be changed by user space.

> +        if (hk_cpus < num_online_cpus()) {
> +                if (hk_cpus < min_vecs)
> +                        max_vecs = min_vecs;
> +                else if (hk_cpus < max_vecs)
> +                        max_vecs = hk_cpus;
> +        }

So now with that, assume a 16 core machine (HT off for simplicity) and
17 requested interrupts (1 general, 16 queues).

Managed interrupts will allocate:

   1 general interrupt which is freely movable by user space
  16 managed interrupts for queues (one per CPU)

This allows the driver to have 16 queues, i.e. one queue per CPU. These
interrupts are only used when an application on a CPU issues I/O.

With the above change this will result in:

   1 general interrupt which is freely movable by user space
   1 managed interrupt (possible affinity to all 16 CPUs, but routed to a
     housekeeping CPU as long as there is one online)

So the device is now limited to a single queue, which also affects the
housekeeping CPUs because now they have to share a single queue. With
larger machines this gets even worse.

So no. This needs way more thought for managed interrupts and you cannot
do that at the PCI layer. Only the affinity spreading mechanism can do
the right thing here.

Thanks,

        tglx
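To make Thomas's arithmetic concrete, here is a standalone model of the
proposed check with assumed numbers (one housekeeping CPU and min_vecs = 2
are guesses for a driver that needs at least one general plus one queue
vector; they are not taken from any specific driver):

    #include <stdio.h>

    /* Mirror of the proposed hunk in pci_alloc_irq_vectors_affinity(). */
    static unsigned int cap_vecs(unsigned int hk_cpus, unsigned int online,
                                 unsigned int min_vecs, unsigned int max_vecs)
    {
            if (hk_cpus < online) {
                    if (hk_cpus < min_vecs)
                            max_vecs = min_vecs;
                    else if (hk_cpus < max_vecs)
                            max_vecs = hk_cpus;
            }
            return max_vecs;
    }

    int main(void)
    {
            /* 16-core machine, one housekeeping CPU, 17 requested vectors. */
            printf("%u\n", cap_vecs(1, 16, 2, 17)); /* prints 2: 1 general + 1 queue */
            return 0;
    }

Instead of 16 per-CPU queue vectors the device ends up with a single queue
vector, which is the regression Thomas points out.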
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 30ae4ffda5c1..8c156867803c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/irqdomain.h>
 #include <linux/of_irq.h>
+#include <linux/sched/isolation.h>
 
 #include "pci.h"
 
@@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
                                    struct irq_affinity *affd)
 {
         struct irq_affinity msi_default_affd = {0};
+        unsigned int hk_cpus;
         int nvecs = -ENOSPC;
 
+        hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
+
+        /*
+         * If we have isolated CPUs for use by real-time tasks, to keep the
+         * latency overhead to a minimum, device-specific IRQ vectors are moved
+         * to the housekeeping CPUs from the userspace by changing their
+         * affinity mask. Limit the vector usage to keep housekeeping CPUs from
+         * running out of IRQ vectors.
+         */
+        if (hk_cpus < num_online_cpus()) {
+                if (hk_cpus < min_vecs)
+                        max_vecs = min_vecs;
+                else if (hk_cpus < max_vecs)
+                        max_vecs = hk_cpus;
+        }
+
         if (flags & PCI_IRQ_AFFINITY) {
                 if (!affd)
                         affd = &msi_default_affd;
If we have isolated CPUs dedicated for use by real-time tasks, we try to
move IRQs to housekeeping CPUs from the userspace to reduce latency
overhead on the isolated CPUs.

If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
may exceed per-CPU vector limits.

When we have isolated CPUs, limit the number of vectors allocated by
pci_alloc_irq_vectors() to the minimum number required by the driver, or
to one per housekeeping CPU if that is larger.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/pci/msi.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
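For context, a typical caller of this path looks roughly like the sketch
below (the surrounding driver code, pdev and nvecs, is hypothetical);
pci_alloc_irq_vectors() is the wrapper that ends up in
pci_alloc_irq_vectors_affinity() with affd == NULL:

        /* Ask for one vector per online CPU, but accept as few as one. */
        nvecs = pci_alloc_irq_vectors(pdev, 1, num_online_cpus(),
                                      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
        if (nvecs < 0)
                return nvecs;

With the patch applied, the max_vecs passed in by such a driver would be
reduced to the housekeeping CPU count before the vectors are allocated.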